A survey of robot learning from demonstration


A survey of robot learning from demonstration
Brenna D. Argall, Sonia Chernova, Manuela Veloso, Brett Browning
Presented by Aalhad Patankar

Overview of learning from demonstration (LfD)
Learning from Demonstration: deriving a policy from examples provided by a teacher. This differs from reinforcement learning, in which a policy is derived from experience, such as exploration of different states and actions.

What is learning from demonstration (LfD)?
Policy: a mapping between world states and actions, e.g. the location of a box near the robot (world state) and moving an actuator (action).
Examples: sequences of state-action pairs recorded by some sort of teacher during demonstration.
[Diagram: the teacher provides demonstrations, from which the learner derives a policy.]

Two phases of LfD
Gathering examples: recording the example data from which a policy will be derived.
Deriving a policy: analyzing the examples to determine a policy.

Advantages of LfD
Does not require expert knowledge of domain dynamics, whose quality depends heavily on the accuracy of the world model.
Intuitive, as humans already communicate knowledge in this way.
Demonstration focuses the dataset on only the areas of the state space encountered during demonstration.

Formal definition
The world consists of states S and actions A.
Observable states Z are obtained from S through the mapping M : S -> Z.
A policy π : Z -> A selects actions from A based on the observable world states.
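
To make the notation concrete, here is a minimal, illustrative Python sketch of a policy π : Z -> A derived from demonstrated observation-action pairs. The lookup-table form, the state names, and the default fallback are assumptions for illustration, not from the survey.

```python
# Minimal sketch of pi: Z -> A as a lookup table built from demonstrated
# (observation, action) pairs. All names here are made up.

def make_policy(examples, default):
    """Derive a trivial lookup policy pi: Z -> A from demonstrated pairs."""
    table = dict(examples)            # later pairs overwrite earlier ones
    return lambda z: table.get(z, default)

# Usage: two demonstrated observation-action pairs, then policy execution.
pi = make_policy([("box_on_table", "grasp"), ("box_in_gripper", "lift")],
                 default="wait")
print(pi("box_on_table"))  # grasp
print(pi("box_on_floor"))  # wait: fallback for an undemonstrated state
```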

Design choices: demonstrator
The choice of demonstrator has a big impact on the algorithms used to derive the policy.
It can be broken down into who designs the demonstration and which body executes it, e.g. a human designer teleoperating a robot, or a robot designing and executing its own demonstration.
Human demonstrators are most commonly used.

Design choices: demonstration technique
Batch vs. interactive: whether the policy is derived after all training data is obtained (batch), or developed incrementally as data becomes available (interactive).
Problem space continuity: whether states are discretized or continuous.
Discretized example: states broken down as box on table, box held by robot, box on floor, etc.
Continuous example: for the same task, using the 3D positions of the robot's effectors and the box throughout the actions.
The continuity of the problem space has a big effect on which algorithms are used in the policy derivation stage; a small contrast is sketched below.
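
A hypothetical contrast between the two representations for the box task; the BoxState enum and the coordinate values are made up for illustration.

```python
# Discretized vs. continuous state representations of the same task.

from enum import Enum, auto

class BoxState(Enum):          # discretized problem space
    ON_TABLE = auto()
    HELD_BY_ROBOT = auto()
    ON_FLOOR = auto()

# Continuous problem space: raw 3D positions of end effector and box.
continuous_state = {"effector_xyz": (0.31, 0.02, 0.88),
                    "box_xyz": (0.30, 0.00, 0.75)}

print(BoxState.ON_TABLE, continuous_state["box_xyz"])
```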

Building the example dataset: correspondence
Because of differences between the teacher's sensors and actuators (human eyes, human joints) and the robot's sensors and actuators, a direct transfer of information from teacher to student is often difficult.
This issue, called correspondence, can be broken down into two categories:
Record mapping: correspondence between the teacher's actions and the recorded data.
Embodiment mapping: correspondence between the recorded data and the learner's execution.

Building the example dataset: correspondence
Data acquisition for LfD can be broken down into categories based on correspondence.
I(z, a) denotes the identity function (a direct mapping), while g(z, a) denotes a mapping function used to handle correspondence.
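
A toy sketch of the two kinds of mapping. The I(z, a) and g(z, a) notation follows the survey, but the joint-scaling function below is a made-up example of what a correspondence mapping might do.

```python
# Identity vs. non-trivial correspondence mappings over (z, a) pairs.

def identity(z, a):
    # I(z, a): direct mapping; data is recorded and executed as-is
    # on the learner's own body (e.g. teleoperation).
    return z, a

def human_to_robot(z, a):
    # g(z, a): hypothetical embodiment mapping from a human demonstration
    # to a robot whose joints move at half the human's scale.
    return z, tuple(0.5 * joint for joint in a)

# A demonstration recorded from a human teacher:
z, a = ("box_ahead",), (0.4, 1.2)   # observation, joint-velocity command
print(identity(z, a))               # recorded exactly as demonstrated
print(human_to_robot(z, a))         # what the learner would execute
```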

Teleoperation
A human operator controls the robot directly, acting as the teacher.
Direct record and embodiment mapping, as all recording and execution is done on the student body itself by the human operator.
E.g. a human controlling a robot's movements through remote control to teach it to find a box.

Shadowing
The robotic platform shadows the human teacher, and recordings are made from the robotic platform.
Direct embodiment mapping, because the robot's own sensors are used to record the data, but a record mapping is required between the human's actions and the robot's demonstration in the shadowing step.

Sensors on teacher
Sensors are placed directly on the teaching platform, so record correspondence issues are alleviated.
Can come with large overhead, such as specialized sensors and a customized environment.

External observation
Sensors external to the body executing the demonstration are used to record the data.
Less reliable and less precise, but comes with less overhead.

Deriving a policy: mapping function
Attempts to approximate the underlying function between states and actions and to generalize over the set of training data.
Two major categories: classification and regression.
Heavily influenced by the demonstration design choices mentioned earlier.

Mapping function: classification
Input is categorized into discrete classes, and the output is a discrete robot action.
Many algorithms, such as k-Nearest Neighbors, Gaussian Mixture Models, and Bayesian networks, are used to perform the classification, depending on the application.
Can be done for low-level robot movement (controlling a car in a simulated environment), mid-level motion primitives (teaching a robot to flip an egg), and high-level complex actions (a ball sorting task); a k-NN sketch follows.
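
A minimal sketch of k-Nearest Neighbors used as the classification mapping function. The feature vectors, action labels, and choice of k are illustrative, not from the paper.

```python
# k-NN over demonstrated (state features, action) pairs: classify a new
# observation by majority vote among its nearest demonstrated states.

import math
from collections import Counter

def knn_action(demos, z, k=3):
    """Pick the action whose demonstrated states are nearest to z."""
    by_dist = sorted(demos, key=lambda d: math.dist(d[0], z))
    votes = Counter(action for _, action in by_dist[:k])
    return votes.most_common(1)[0][0]

# Toy ball-sorting demonstrations: (ball x, ball y) -> action label.
demos = [((0.1, 0.9), "sort_left"), ((0.2, 0.8), "sort_left"),
         ((0.9, 0.2), "sort_right"), ((0.8, 0.1), "sort_right"),
         ((0.5, 0.5), "hold")]
print(knn_action(demos, (0.15, 0.85)))  # sort_left
```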

Mapping function: regression
Maps demonstration states to continuous action outputs.
Lazy learning: function approximation is done on demand, whenever a current observation needs to be mapped at run-time; this can be very computationally expensive at run-time.
At the opposite end, all function approximation is done prior to run-time, so no adjustments to the policy are made at run-time.
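
A minimal sketch of the lazy end of the spectrum, using Gaussian-weighted averaging of demonstrated actions (a standard lazy-learning technique; the data and bandwidth are illustrative). Nothing is fit ahead of time: each query is answered directly from the raw demonstrations.

```python
# Lazy regression: answer each query from the stored demonstrations,
# weighting each demonstrated action by its state's distance to the query.

import numpy as np

def lw_action(states, actions, z, bandwidth=0.25):
    """Gaussian-weighted average of demonstrated actions around query z."""
    d2 = np.sum((states - z) ** 2, axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    return (w[:, None] * actions).sum(axis=0) / w.sum()

# Demonstrations: 2D states -> 1D continuous action (e.g. wheel speed).
states = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
actions = np.array([[0.0], [0.4], [1.0]])
print(lw_action(states, actions, np.array([0.6, 0.6])))
```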

Mapping function example: ball sorting
Chernova, S. and Veloso, M. Teaching Multi-Robot Coordination using Demonstration of Communication and State Sharing. Carnegie Mellon University. International Foundation for Autonomous Agents and Multiagent Systems, 2008.

System model
A transition model is developed from demonstration data and from state-action exploration done by the robot.
A reward function is used to associate rewards with states (as in reinforcement learning).
The reward function can be user-designed (an engineered reward) or learned from demonstration data; a sketch follows.
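
A toy sketch of the system-model approach under stated assumptions: the transition model is estimated by counting observed transitions, and value iteration is run over a hand-engineered reward. The states, actions, and reward values are made up.

```python
# Estimate T(s' | s, a) from observed transitions, then plan by value
# iteration against an engineered reward.

from collections import Counter, defaultdict

# Demonstration/exploration transitions: (state, action, next state).
transitions = [("far", "drive", "near"), ("far", "drive", "far"),
               ("near", "grasp", "holding"), ("near", "drive", "near")]
reward = {"holding": 1.0}  # engineered reward: only the goal state pays

counts = defaultdict(Counter)          # empirical transition model
for s, a, s2 in transitions:
    counts[(s, a)][s2] += 1

states = {"far", "near", "holding"}
V = {s: 0.0 for s in states}
for _ in range(50):                    # value iteration, discount 0.9
    new_v = {}
    for s in states:
        qs = []
        for (s0, a), nxt in counts.items():
            if s0 != s:
                continue
            total = sum(nxt.values())
            qs.append(sum(n / total * (reward.get(s2, 0.0) + 0.9 * V[s2])
                          for s2, n in nxt.items()))
        new_v[s] = max(qs) if qs else 0.0   # absorbing/terminal state
    V = new_v
print(V)
```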

System model example: robotic goalkeeper
https://www.youtube.com/watch?v=cif2sbvy-j0

Plans
Actions are composed of pre-conditions, the state that must hold before an action can occur, and post-conditions, the state immediately after the action.
Non-state-action information, such as intentions and annotations, can be provided by the teacher to the learner in addition to the demonstration data.
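
A minimal STRIPS-like sketch of an action with pre- and post-conditions (the post-conditions split into add and delete effects). The predicate and action names are illustrative.

```python
# An action as pre-conditions plus post-conditions over a set of
# symbolic predicates describing the world state.

from dataclasses import dataclass

@dataclass
class Action:
    name: str
    pre: frozenset       # predicates that must hold before execution
    add: frozenset       # predicates made true by the action
    delete: frozenset    # predicates made false by the action

    def apply(self, state: frozenset) -> frozenset:
        assert self.pre <= state, f"preconditions of {self.name} not met"
        return (state - self.delete) | self.add

pick = Action("pick",
              pre=frozenset({"cup_on_table", "hand_empty"}),
              add=frozenset({"holding_cup"}),
              delete=frozenset({"cup_on_table", "hand_empty"}))

state = frozenset({"cup_on_table", "hand_empty"})
print(pick.apply(state))  # frozenset({'holding_cup'})
```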

Example with plans: clearing a table
Task: clearing a table.
Pre-programmed actions: pick, drop, search, etc. are available to the robot.
After demonstration, the robot learns how these actions relate to objects and states, and learns the mapping between sequences of actions and states.
Veeraraghavan, H. and Veloso, M. Teaching Sequential Tasks with Repetition through Demonstration. International Foundation for Autonomous Agents and Multiagent Systems, 2008.

Failure modes for the demonstration dataset
Sparse datasets, lacking demonstrations for some states, raise the question: what should the learner do upon encountering an undemonstrated state?
Generalize based on what was learned from the demonstrated states.
Request and acquire additional demonstrations (see the sketch below).
Poor demonstration data quality:
Sub-optimal and unsuccessful teacher demonstrations.
Demonstrations that are ambiguous in the state space.
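
A toy sketch of the "request additional demonstrations" response: act when the policy is confident, query the teacher otherwise. The classifier interface and threshold are assumptions for illustration.

```python
# Confidence-gated execution: fall back to requesting a demonstration
# when the learned policy is unsure about the current observation.

def act_or_ask(classifier, z, threshold=0.6):
    action, confidence = classifier(z)   # hypothetical (action, p) output
    if confidence < threshold:
        return "request_demonstration"   # acquire a new labeled example
    return action

# Usage with a toy classifier that is unsure far from its training data.
toy = lambda z: ("grasp", 0.9) if z == "box_on_table" else ("wait", 0.3)
print(act_or_ask(toy, "box_on_table"))   # grasp
print(act_or_ask(toy, "box_on_floor"))   # request_demonstration
```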

Future directions
Feature selection: selecting too many features is computationally expensive and can confuse the learning process, while too few features might provide insufficient data for policy inference. What is an intuitive way to select the right features?
Including temporal data: currently, most algorithms discard temporal data. Repetitive tasks become difficult to sequentialize, and actions that have no perceivable effect on the states are difficult to learn from. Temporal data could alleviate both of these issues.

Future directions
Multi-robot demonstration learning: agents could request advice from a human teacher or provide demonstrations for one another.
Refined evaluation metrics: currently, LfD projects are highly domain- and task-specific, and the field lacks a cross-domain standard for evaluating performance.

Questions?