Robot Autonomy Inverse Reinforcement Learning

16-662 Robot Autonomy Inverse Reinforcement Learning Katharina Muelling kmuelling@nrec.ri.cmu.edu NSH 4521

Last Lecture: Autonomous learning from scratch is hard (real-world exploration, reward function design). What can we do: use an effective representation, use prior knowledge, use imitation learning to create good starting points (prior knowledge), and use Dynamical System Motor Primitives to represent motor skills. "I learned to ride with RL" (pic: researchers.lille.inria.fr/~munos/).

Effective Representation of Motor Skills: Dynamical System Motor Primitives represent arbitrarily shaped smooth movements, are simple to adapt, stable and robust, and linear in the parameters w, which makes them easy to learn through imitation and reinforcement learning. They capture the shape of the movement, not its goal or intention!

Dynamical System Motor Primitives: What do we gain from this representation? A motor policy representation that performs an automatic mapping of states to actions over time, $\pi_w(g, \theta_t, \dot{\theta}_t, t, T) = a_{t+1}$. The mapping depends on the shape parameters w.
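
To make the representation concrete, here is a minimal sketch of a one-dimensional discrete DMP in Python, assuming the standard transformation/canonical-system formulation; the class name, gains, and basis-function layout are illustrative choices and are not taken from the lecture.

```python
import numpy as np

# Minimal sketch of a discrete Dynamical System Motor Primitive (one DoF).
# Gains and basis-function placement below are illustrative assumptions.
class DMP:
    def __init__(self, w, y0, g, tau=1.0, alpha=25.0, beta=6.25, alpha_x=8.0):
        self.w = np.asarray(w, dtype=float)        # shape parameters (policy is linear in w)
        self.centers = np.linspace(0.0, 1.0, len(w))
        self.widths = np.full(len(w), float(len(w)) ** 2)
        self.y0, self.g, self.tau = y0, g, tau
        self.alpha, self.beta, self.alpha_x = alpha, beta, alpha_x

    def forcing(self, x):
        # Forcing term: normalized weighted sum of Gaussian basis functions,
        # scaled by the phase x and the movement amplitude (g - y0).
        psi = np.exp(-self.widths * (x - self.centers) ** 2)
        return (psi @ self.w) / (psi.sum() + 1e-10) * x * (self.g - self.y0)

    def rollout(self, dt=0.001, T=1.0):
        y, z, x = self.y0, 0.0, 1.0
        trajectory = []
        for _ in range(int(T / dt)):
            # Transformation system: converges to the goal g, shaped by the forcing term.
            z_dot = (self.alpha * (self.beta * (self.g - y) - z) + self.forcing(x)) / self.tau
            y_dot = z / self.tau
            x_dot = -self.alpha_x * x / self.tau   # canonical system (phase variable)
            z, y, x = z + z_dot * dt, y + y_dot * dt, x + x_dot * dt
            trajectory.append(y)
        return np.array(trajectory)

# Usage: a primitive from start 0.0 to goal 1.0 with random shape parameters.
dmp = DMP(w=np.random.randn(10), y0=0.0, g=1.0)
traj = dmp.rollout()
```

Learning from demonstration then amounts to fitting w so that the rollout reproduces a recorded trajectory; reinforcement learning later perturbs that same w.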

Concept: Imitation Learning. Given a set of labeled training data (demonstrations), learn a function that maps the (observed) state to an action. Pipeline: teacher → record mapping → recording → embodiment mapping → learner. Problems: the correspondence problem, and the need to know what to imitate.

Today's Lecture: Case study: learning motor skills in ball-in-a-cup. Inverse Reinforcement Learning. Examples of Inverse Reinforcement Learning. Case study: learning strategies. Shortcomings of Inverse Reinforcement Learning.

How to Learn from Demonstrations: Behavioral cloning maps the expert demonstrations {(s_i, a_i, r_i)}_{i=1:T} directly to a control policy π(a|s) for the learner.
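
As a hedged illustration of the behavioral cloning route, the snippet below treats the demonstrations as a supervised data set and regresses actions on states; the scikit-learn regressor and the placeholder arrays are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Behavioral cloning as supervised regression from states to actions.
states = np.random.randn(1000, 4)    # placeholder demonstration states s_i
actions = np.random.randn(1000, 2)   # placeholder demonstrated actions a_i

policy = Ridge(alpha=1.0).fit(states, actions)    # deterministic pi: s -> a
predicted_action = policy.predict(states[:1])     # query the cloned policy
```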

How to Learn from Demonstrations: Two routes start from the expert demonstrations {(s_i, a_i, r_i)}_{i=1:T}. Behavioral cloning learns the control policy π directly; inverse reinforcement learning instead recovers the reward R, which, together with a dynamical model T, yields a policy via reinforcement learning or optimal control.

Learning from Demonstration Case Study: Learning motor skills from demonstration

Learning Hitting Motions in Table Tennis: Represent the motor policy as a DMP, which reduces the learning problem to finding the right trajectory weights. Initialize a good policy through demonstration, then learn through interactions with the world which DMP to associate with which state.

Case Study: Ball in a Cup. Goal: get the ball into the cup. 1) Represent the motor policy as a dynamical system motor primitive, $\ddot{\theta} \sim \pi_w(\ddot{\theta}_t \mid s_t)$. 2) Learn the initial parameters w from demonstration (mind the number of local models). 3) Perturb the parameters to change the acceleration pattern by sampling from a normal distribution and update $w' = w + \frac{E\left[\sum_{t=1}^{T} \varepsilon_t Q^\pi\right]}{E\left[\sum_{t=1}^{T} Q^\pi\right]}$. J. Kober and J. Peters, Policy Search for Motor Primitives in Robotics, NIPS 2008.
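
The update in step 3 is the episodic PoWER-style rule. Below is a minimal sketch of it, assuming a hypothetical rollout(w) helper that executes the DMP with parameters w and returns the per-step returns Q_t; noise is drawn once per rollout, as in parameter-based exploration.

```python
import numpy as np

def power_update(w, rollout, n_rollouts=20, sigma=0.1, rng=np.random.default_rng(0)):
    """One PoWER-style update: w' = w + E[sum_t eps_t Q^pi] / E[sum_t Q^pi]."""
    numerator = np.zeros_like(w)
    denominator = 0.0
    for _ in range(n_rollouts):
        eps = rng.normal(0.0, sigma, size=w.shape)   # parameter perturbation
        Q = rollout(w + eps)                          # per-step returns Q_t, shape (T,)
        numerator += eps * Q.sum()                    # eps is constant within a rollout
        denominator += Q.sum()
    return w + numerator / (denominator + 1e-10)
```

In practice the rollouts are often reweighted (e.g. keeping only the best ones), but the structure of the update is the one shown on the slide.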

Case Study: Ball in a Cup Reward J. Kober and J. Peters, Policy Search for Motor Primitives in Robotics, NIPS, 2008

Solving an MDP: Given a reward R and a dynamical model T, reinforcement learning or optimal control produces a control policy π. Behavioral cloning and inverse reinforcement learning instead start from expert demonstrations.

Imitation Learning (figure: demonstrated behavior, novel scene). Ratliff et al.: Maximum Margin Planning, 2006.

Imitation Learning (figure: demonstrated behavior, learned behavior). Ratliff et al.: Maximum Margin Planning, 2006.

Inverse Reinforcement Learning What is this robot up to?

Imitation Learning (figure: demonstrated behavior, novel scene). Ratliff et al.: Maximum Margin Planning, 2006.

Imitation Learning (figure: demonstrated behavior, learned behavior). Ratliff et al.: Maximum Margin Planning, 2006.

Inverse Reinforcement Learning (diagram: input features → learning → behavior).

Inverse Reinforcement Learning (diagram: input features → learning → reward function → RL → behavior).

Inverse Reinforcement Learning (diagram: input features → reward function → behavior; example from Ratliff et al.: Maximum Margin Planning, 2006).

Inverse Reinforcement Learning. The reinforcement learning goal: given an MDP, maximize the expected return, $\pi^* = \arg\max_\pi J(\pi)$ with $J(\pi) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid \pi\right]$. In standard RL the reward is hand-designed and observable: environment and reward feed reinforcement learning, which produces behavior.
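
For reference, a minimal sketch of estimating this expected return from sampled rollouts; the rollouts argument (a list of per-step reward sequences) is an assumed, illustrative interface.

```python
import numpy as np

def expected_return(rollouts, gamma=0.95):
    # Monte Carlo estimate of J(pi) = E[sum_t gamma^t R(s_t, a_t) | pi].
    returns = []
    for rewards in rollouts:
        discounts = gamma ** np.arange(len(rewards))
        returns.append(np.dot(discounts, rewards))
    return np.mean(returns)
```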

Inverse Reinforcement Learning. The reinforcement learning goal: given an MDP, maximize the expected return, $\pi^* = \arg\max_\pi J(\pi)$ with $J(\pi) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid \pi\right]$. Problems: the reward function defines the desired behavior, and it can be hard to define a good reward function that guides the learning process, especially when human behavior is considered.

Inverse Reinforcement Learning. Idea: if you really want to imitate, you need to find the reward function rather than the policy! A Markov Decision Process without a reward function is denoted by MDP\R. In this setting the reward, which feeds reinforcement learning to produce behavior, is the unknown to be recovered.

IRL: Basic Idea. Given an MDP\R and a set of demonstrations $D = \{\tau_n\}_{n=1}^{N}$ from an expert, find the reward function $R = \sum_{i=1}^{m} w_i f_i(s, a)$ that satisfies, for all policies π, $J(\pi_E) \ge J(\pi)$. Basic assumption: the reward function can be written as a linear combination of known reward features, $R(s, a) = \sum_{i=1}^{m} w_i f_i(s, a) = w^T f(s, a)$.

Inverse Reinforcement Learning. Idea: change the reward so that it is higher along the demonstrated behavior and lower elsewhere, so that the expert's policy π comes out as optimal.

IRL: Basic Idea. Basic assumption: the reward function can be written as a linear combination of known reward features, $R(s, a) = \sum_{i=1}^{m} w_i f_i(s, a) = w^T f(s, a)$. Rewrite the expected return as $J(\pi) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid \pi\right] = E\left[\sum_{t=0}^{\infty} \gamma^t w^T f(s_t, a_t) \mid \pi\right]$.

IRL: Basic Idea. Based on this assumption we can rewrite the expected return as $J(\pi) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid \pi\right] = E\left[\sum_{t=0}^{\infty} \gamma^t w^T f(s_t, a_t) \mid \pi\right] = w^T E\left[\sum_{t=0}^{\infty} \gamma^t f(s_t, a_t) \mid \pi\right] = w^T \mu(\pi)$, where $\mu(\pi)$ is the feature expectation (feature count).
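
A minimal sketch of estimating this feature expectation from sampled trajectories; the trajectories list and the features(s, a) callback are assumed, illustrative interfaces.

```python
import numpy as np

def feature_expectation(trajectories, features, gamma=0.95):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t f(s_t, a_t) | pi]."""
    mu = None
    for traj in trajectories:                 # traj: list of (state, action) pairs
        for t, (s, a) in enumerate(traj):
            f = (gamma ** t) * np.asarray(features(s, a))
            mu = f if mu is None else mu + f
    return mu / len(trajectories)

# With a linear reward R(s, a) = w . f(s, a), the expected return is then
# J(pi) ~= w @ feature_expectation(trajectories, features).
```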

IRL: Basic Idea. Since $J(\pi) = w^T E\left[\sum_{t=0}^{\infty} \gamma^t f(s_t, a_t) \mid \pi\right] = w^T \mu(\pi)$ and we require $J(\pi_E) \ge J(\pi)$, we must find a weight vector w such that $w^T \mu(\pi_E) \ge w^T \mu(\pi)$ for all policies π. The feature expectation (feature count) can be estimated from sample trajectories. Problems: we do not have the policy π_E, we only have some observed trajectories; reward function ambiguity, since a large class of reward functions may lead to the same optimal policy; and this assumes we can enumerate all policies.

IRL: Basic Idea. Reward function ambiguity: we need additional constraints! Much of the IRL literature focuses on solving this problem. How did Abbeel and Ng address it? Maximize the margin between the expert's feature expectation μ(π_E) and the feature expectations μ(π) of other policies.

Apprenticeship Learning via IRL. Assumptions: we can observe the state-action pairs; the agent is goal-driven and follows some optimal policy; we have access to a reinforcement learning solver; and the solver returns an optimal policy. Abbeel and Ng: Apprenticeship Learning via Inverse Reinforcement Learning, ICML 2004.

Apprenticeship Learning via IRL. Given a set of m demonstrations, compute the expected feature counts $\mu_E = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{\infty} \gamma^t f(s_t^{(i)})$. Goal: find a policy π whose performance is close to that of the expert demonstrator, $\left| E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi_E\right] - E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi\right] \right| = \left| w^T \mu_E - w^T \mu(\pi) \right| \le \|w\|_2 \, \|\mu_E - \mu(\pi)\|_2 \le 1 \cdot \varepsilon = \varepsilon$. Abbeel and Ng: Apprenticeship Learning via Inverse Reinforcement Learning, ICML 2004.

Apprenticeship Learning via IRL. Initialize: random w, and compute $\mu_0$. Algorithm: 1. Compute $t_i = \max_{w: \|w\|_2 \le 1} \min_{j < i} w^T (\mu_E - \mu_j)$, with $w_i$ being the w that realizes this maximum. 2. If $t_i \le \varepsilon$: terminate. 3. Compute $\pi_{i+1}$ using the RL solver and $R = w_i^T f$. 4. Compute the new $\mu_{i+1}$ and go back to step 1. Abbeel and Ng: Apprenticeship Learning via Inverse Reinforcement Learning, ICML 2004.
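
As a hedged sketch of this loop, the code below implements the simpler projection variant of Abbeel and Ng's algorithm, which replaces the max-min step with a geometric projection; rl_solver(w) (returns an optimal policy for the reward R = w·f) and estimate_mu(pi) (returns that policy's feature expectation) are assumed, hypothetical helpers.

```python
import numpy as np

def apprenticeship_learning(mu_E, mu_0, rl_solver, estimate_mu, eps=1e-2, max_iter=100):
    mu_bar = np.asarray(mu_0, dtype=float)     # projected feature expectation
    pi = None
    for _ in range(max_iter):
        w = mu_E - mu_bar                       # current reward weights
        if np.linalg.norm(w) <= eps:            # expert's feature counts are matched
            break
        pi = rl_solver(w)                       # optimal policy for R = w . f
        mu = estimate_mu(pi)                    # its feature expectation
        d = mu - mu_bar
        # Project mu_E onto the line from mu_bar towards mu.
        mu_bar = mu_bar + ((d @ (mu_E - mu_bar)) / (d @ d)) * d
    return w, pi
```

On termination, a mixture of the generated policies matches the expert's feature counts to within eps, which by the bound on the previous slide also bounds the difference in returns.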

Inverse Reinforcement Learning Examples: route planning (Ratliff et al., 2006), parking lot navigation (Abbeel et al., 2008), quadruped locomotion (Kolter et al., 2008).

Inverse Reinforcement Learning Examples: pedestrian prediction (Ziebart et al., 2009), activity forecasting (Kitani et al., 2012).

Case Study: Table Tennis Can we learn higher level strategies with inverse reinforcement learning?

How can we learn a manipulation task? Learning strategies: learning strategic elements from demonstrations using Inverse Reinforcement Learning. Learning movements: learning motor skills from demonstration, and learning how to select and generalize motor primitives. System diagram: state s → supervisory system → augmented state → motion generation (joint values) → execution (motor torques u) → action, with the teacher providing the learning signal for the policy.

How can we represent such a strategy? Representing the strategy: Markov Decision Process (S,A,T,R)

Finding a Reward Function for Table Tennis. Coming back to the table tennis example: can we find a reward function from which we can generate a higher-level strategy? Problems in the table tennis experiment: we do not have a perfect dynamical model, and we cannot compute all possible policies π. We test three model-free IRL methods: two model-free versions of max-margin IRL (P. Abbeel and A. Ng, Apprenticeship Learning via Inverse Reinforcement Learning, ICML 2004) and model-free relative entropy IRL (Boularias et al., Relative Entropy Inverse Reinforcement Learning, AISTATS 2011).

Finding a Reward Function for Table Tennis: model-free maximum margin. Use additional trajectories of non-optimal strategies and solve $\max_w \sum_{\tau \in D} \sum_{t=1}^{T} \left[ J_E(s_t, w) - J_{N_k}(s_t, w) \right] - \lambda \|w\|^2$, where $N_k$ denotes the most similar state from the non-optimal trajectories and $J(s_1, w) = \frac{1}{H} \sum_{i=1}^{H} w^T f(s_i, a_i)$. The horizon is set to H = 3, which corresponds to planning two steps ahead.
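
A sketch of this objective under stated assumptions: trajectories are given as lists of per-step feature vectors f(s_i, a_i), the "most similar state" is matched by nearest neighbor in that feature space, and since the objective is linear in w apart from the λ‖w‖² term, its maximizer has a closed form. The helper names and the matching rule are illustrative, not the exact setup used in the experiments.

```python
import numpy as np

def max_margin_weights(expert_trajs, nonopt_trajs, lam=1.0, H=3):
    """Maximize sum_tau sum_t [J_E(s_t, w) - J_Nk(s_t, w)] - lam * ||w||^2.

    Each trajectory is a list of per-step feature vectors f(s_i, a_i);
    J(s_t, w) = (1/H) * sum over the next H steps of w . f(s_i, a_i).
    The maximizer of a linear term g.w minus lam*||w||^2 is w = g / (2*lam).
    """
    dim = len(expert_trajs[0][0])
    g = np.zeros(dim)
    nonopt_states = [(traj, t) for traj in nonopt_trajs for t in range(len(traj))]
    for traj in expert_trajs:
        for t in range(len(traj)):
            # Most similar non-optimal state: nearest neighbor in feature space (assumption).
            traj_n, t_n = min(nonopt_states,
                              key=lambda st: np.linalg.norm(traj[t] - st[0][st[1]]))
            phi_E = np.mean(traj[t:t + H], axis=0)        # expert horizon feature average
            phi_N = np.mean(traj_n[t_n:t_n + H], axis=0)  # matched non-optimal average
            g += phi_E - phi_N
    return g / (2.0 * lam)
```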

Experimental Setup. We need many non-optimal and/or random trajectories: how can we generate them? What and how should we record? Pilot studies.

Experimental Setup. Subjects: 5 naïve players, 2 skilled players, and 1 permanent (skilled) opponent. Experiments: 1) 10 minutes of cooperative table tennis; 2) a semi-competitive game (cooperative opponent, competitive subject); 3) a competitive game. K. Muelling et al., 2014.

IRL for Table Tennis. Reward features that describe the world: table preferences, distance to the edge (δ_t), distance to the opponent (δ_o), moving direction of the opponent (v_o), ball velocity (v_b), ball orientation (θ_y, θ_z), elbow proximity (δ_elbow), and smash.

IRL for Table Tennis What do you think? Which features are important?

What did the system learn? Expert preferences: forehands are avoided, backhands are preferred, the ball is played flat and cross-court towards the backhand area, and the distance between ball and opponent is increased.

Main Findings. A possible strategy that distinguishes expert from non-expert players (states s_{T-2}, s_{T-1}). Planning ahead: the expert plans up to two steps ahead!

Evaluation. The approach is able to distinguish between the skill of the players on the strategic level and between different playing styles.

Inverse Reinforcement Learning: problems. It needs a dynamics model, it needs an RL solver or planner, and it depends on the hand-designed features.

Summary: Imitation Learning. Learning from demonstration is a great tool to initiate learning and to make learning on real robots possible. Representing movements with DMPs allows movements to be learned efficiently from demonstration and refined through self-improvement. When learning from demonstration, keep in mind: what you want to learn, whether it is possible to map the human demonstration to the robot learner, and whether such a mapping makes sense. There are different ways to learn from demonstration.

Summary: Inverse Reinforcement Learning vs. Behavioral Cloning. The reward function defines the underlying behavior! Can we recover the reward function from demonstrations? Apprenticeship learning: with IRL, can we find a policy that is at least as good as the demonstrated one? Alternatively, can we learn the policy directly? Behavioral cloning formulates this as a supervised learning problem: 1) fix a policy class, 2) find a suitable machine learning method, 3) learn the policy directly from demonstrations.