16-662 Robot Autonomy: Inverse Reinforcement Learning
Katharina Muelling (kmuelling@nrec.ri.cmu.edu, NSH 4521)
Last Lecture
- Autonomous learning from scratch is hard: real-world exploration, reward function design.
- What can we do: effective representation, use prior knowledge.
- Imitation learning for creating good starting points (prior knowledge).
- Dynamical System Motor Primitives to represent motor skills.
[Figure: "I learned to ride with RL" -- pic: researchers.lille.inria.fr/~munos/]
Effective Representation of Motor Skills
Dynamical System Motor Primitives:
- Arbitrarily shaped smooth movements
- Simple to adapt
- Stable and robust
- Linear in parameters w: easy to learn through imitation and reinforcement learning
- They encode shape, not goal or intention!
Dynamical System Motor Primitives
What do we gain from this representation?
- A motor policy representation that performs an automatic mapping of states to actions over time:
  \pi_w(g, \theta_t, \dot{\theta}_t, t, T) = a_{t+1}
- The mapping depends on the shape parameters w (see the sketch below).
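To make the mapping concrete, here is a minimal sketch of one integration step of a discrete DMP transformation system. The gain values, the radial-basis-function forcing term, and all function and parameter names are illustrative assumptions, not the exact formulation from the lecture:

```python
import numpy as np

def dmp_step(theta, dtheta, g, x, w, centers, widths,
             alpha=25.0, beta=6.25, alpha_x=8.0, tau=1.0, dt=0.01):
    """One Euler step of a discrete DMP transformation system.

    The forcing term is a normalized RBF network, linear in the
    shape parameters w -- which is what makes DMPs easy to fit
    from a single demonstration.
    """
    # Phase variable x decays from 1 to 0, replacing explicit time.
    psi = np.exp(-widths * (x - centers) ** 2)      # RBF activations
    f = x * (psi @ w) / (psi.sum() + 1e-10)         # forcing term, linear in w
    # Spring-damper pulls toward the goal g; f shapes the trajectory.
    ddtheta = (alpha * (beta * (g - theta) - dtheta) + f) / tau
    dtheta = dtheta + dt * ddtheta
    theta = theta + dt * dtheta
    x = x + dt * (-alpha_x * x / tau)
    return theta, dtheta, x
```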
Concept of Imitation Learning
Imitation learning: Given a set of labeled training data (demonstrations), learn a function that maps the (observed) state to an action.
[Diagram: Teacher -> Recording -> record mapping -> embodiment mapping -> Learner]
Problems:
- Correspondence problem
- Need to know what to imitate
Today's Lecture
- Case study: Learning motor skills in ball-in-a-cup
- Inverse reinforcement learning
- Examples of inverse reinforcement learning
- Case study: Learning strategies
- Shortcomings of inverse reinforcement learning
How to Learn from Demonstrations
[Diagram: expert demonstration {(s_i, a_i, r_i)}_{i=1:t} -> learner -> control policy \pi(a \mid s), via behavioral cloning]
How to Learn from Demonstrations
[Diagram: from the expert demonstration {(s_i, a_i, r_i)}_{i=1:t}, the learner can recover either a control policy \pi directly (behavioral cloning) or a reward R (inverse reinforcement learning); given a reward R and a dynamics model T, reinforcement learning / optimal control yield a control policy \pi]
Learning from Demonstration
Case study: Learning motor skills from demonstration
Learning Hitting Motions in Table Tennis
- Represent the motor policy as a DMP: this reduces the learning problem to finding the right trajectory weights.
- Initialize a good policy through demonstration.
- Learn through interactions with the world which DMP to associate with which state.
Case Study: Ball in a Cup
Goal: Get the ball into the cup.
1) Represent the motor policy as a dynamical system motor primitive: \ddot{\theta}_t \sim \pi_w(\ddot{\theta}_t \mid s_t)
2) Learn the initial parameters w from demonstration (mind the number of local models).
3) Perturb the parameters to change the acceleration pattern by sampling from a normal distribution, and update (sketched below):
   w' = w + E[\sum_{t=1}^T \varepsilon_t Q^\pi] / E[\sum_{t=1}^T Q^\pi]
J. Kober and J. Peters, Policy Search for Motor Primitives in Robotics, NIPS, 2008
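Step 3 is the PoWER-style update from the cited Kober and Peters paper: the parameter change is a Q-weighted average of the exploration noise. A minimal sketch, assuming rollouts were collected with per-time-step noise added to w; the array shapes and names are my assumptions:

```python
import numpy as np

def power_update(w, rollouts):
    """One PoWER-style update of the DMP shape parameters (sketch).

    rollouts: list of (eps, q) pairs from the sampled perturbations,
    where eps has shape (T, len(w)) -- the exploration noise added at
    each time step -- and q has shape (T,) with the state-action
    values Q^pi accumulated along the rollout.
    """
    num = np.zeros_like(w)
    den = 1e-10
    for eps, q in rollouts:
        num += (q[:, None] * eps).sum(axis=0)  # E[ sum_t eps_t Q^pi ]
        den += q.sum()                         # E[ sum_t Q^pi ]
    return w + num / den
```

In practice the expectations are often approximated using only the best recent rollouts (importance weighting), which keeps the update stable on a real robot.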
Case Study: Ball in a Cup
[Figure: the reward function used for the ball-in-a-cup task]
J. Kober and J. Peters, Policy Search for Motor Primitives in Robotics, NIPS, 2008
Solving an MDP
[Diagram: expert demonstration -> behavioral cloning -> control policy \pi; expert demonstration -> inverse reinforcement learning -> reward R; reward R + dynamics model T -> reinforcement learning / optimal control -> control policy \pi]
Imitation Learning
[Figure: demonstrated behavior and a novel scene]
Ratliff et al.: Maximum Margin Planning, 2006
Imitation Learning
[Figure: demonstrated behavior vs. learned behavior]
Ratliff et al.: Maximum Margin Planning, 2006
Inverse Reinforcement Learning
What is this robot up to?
Inverse Reinforcement Learning
[Diagram: input features and observed behavior -> learning -> reward function; reward function -> RL -> behavior]
Inverse Reinforcement Learning
[Figure: from input features to a reward function to behavior]
Ratliff et al.: Maximum Margin Planning, 2006
Inverse Reinforcement Learning
Reinforcement learning goal: Given an MDP, maximize the expected return
  \pi^* = argmax_\pi J(\pi),  J(\pi) = E[\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid \pi]
[Diagram: environment with a hand-designed, observable reward -> reinforcement learning -> behavior]
Inverse Reinforcement Learning
Reinforcement learning goal: Given an MDP, maximize the expected return (see the snippet after this list)
  \pi^* = argmax_\pi J(\pi),  J(\pi) = E[\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid \pi]
Problems:
- The reward function defines the desired behavior.
- It can be hard to define a good reward function that guides the learning process, especially when human behavior is considered.
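As a concrete reading of the formula, a tiny sketch of the return of a single rollout; averaging it over many sampled rollouts of \pi gives a Monte Carlo estimate of J(\pi). The function name is mine:

```python
def discounted_return(rewards, gamma=0.95):
    """Return of one rollout: sum_t gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Averaging over sampled rollouts estimates the expected return:
# J_hat = np.mean([discounted_return(r) for r in sampled_reward_sequences])
```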
Inverse Reinforcement Learning
Idea: If you really want to imitate, you need to find the reward function rather than the policy!
A Markov decision process without a reward function is denoted by MDP\R.
[Diagram: environment + recovered reward -> reinforcement learning -> behavior]
IRL: Basic Idea
Given an MDP\R and a set of demonstrations D = \{\tau_n\}_{n=1}^N from an expert, find the reward function R = \sum_{i=1}^m w_i f_i(s, a) that satisfies
  J(\pi_E) \ge J(\pi) for all policies \pi.
Basic assumption: the reward function can be written as a linear combination of known reward features:
  R(s, a) = \sum_{i=1}^m w_i f_i(s, a) = w^T f(s, a)
Inverse Reinforcement Learning
[Figure: Idea -- change the reward so it is higher along the demonstrated path \pi and lower elsewhere]
IRL: Basic Idea
Based on the linearity assumption R(s, a) = w^T f(s, a), we can rewrite the expected return as:
  J(\pi) = E[\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid \pi]
         = E[\sum_{t=0}^\infty \gamma^t w^T f(s_t, a_t) \mid \pi]
         = w^T E[\sum_{t=0}^\infty \gamma^t f(s_t, a_t) \mid \pi]
         = w^T \mu(\pi)
where \mu(\pi) is the feature expectation (feature count) of policy \pi.
IRL: Basic Idea
With J(\pi) = w^T \mu(\pi), find a weight vector w such that
  w^T \mu(\pi_E) \ge w^T \mu(\pi) for all \pi.
The feature expectations \mu(\pi) can be estimated from sample trajectories (see the sketch after this list).
Problems:
- We do not have the expert policy \pi_E, only some observed trajectories.
- Reward function ambiguity: a large class of reward functions may lead to the same optimal policy.
- Assumes we can enumerate all policies.
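A minimal sketch of the Monte Carlo estimate of the feature expectations; the same function serves both the expert demonstrations (giving \mu_E) and rollouts of a candidate policy (giving \mu(\pi)). The trajectory format and feature_fn are assumptions:

```python
import numpy as np

def feature_expectations(trajectories, feature_fn, gamma=0.95):
    """Estimate mu(pi) = E[ sum_t gamma^t f(s_t, a_t) | pi ].

    trajectories: list of [(s_0, a_0), (s_1, a_1), ...], sampled
    either from the expert or from rollouts of a candidate policy.
    feature_fn(s, a) returns the feature vector f(s, a).
    """
    mus = []
    for traj in trajectories:
        mu = sum(gamma ** t * feature_fn(s, a)
                 for t, (s, a) in enumerate(traj))
        mus.append(mu)
    return np.mean(mus, axis=0)

# With a linear reward R = w^T f, the return is just a dot product:
# J_hat = w @ feature_expectations(rollouts, feature_fn)
```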
IRL: Basic Idea
Reward function ambiguity: we need additional constraints! Much of the IRL literature focuses on solving this problem.
How did Abbeel and Ng address the problem? Maximize the difference (the margin) between the expert's feature expectations \mu(\pi_E) and those of all other policies \mu(\pi).
Apprenticeship Learning via IRL
Assumptions:
- We can observe the state-action pairs.
- The agent is goal-driven and follows some optimal policy.
- We have access to a reinforcement learning solver that returns an optimal policy.
Abbeel and Ng: Apprenticeship Learning via Inverse Reinforcement Learning, ICML 2004
Apprenticeship Learning via IRL
Given a set of m demonstrations, compute the expected feature counts:
  \hat{\mu}_E = \frac{1}{m} \sum_{i=1}^m \sum_{t=0}^\infty \gamma^t f(s_t^{(i)})
Goal: Find a policy \pi whose performance is close to that of the expert demonstrator:
  | E[\sum_{t=0}^\infty \gamma^t R(s_t) \mid \pi_E] - E[\sum_{t=0}^\infty \gamma^t R(s_t) \mid \pi] |
  = | w^T \mu_E - w^T \mu(\pi) |
  \le \|w\|_2 \, \|\mu_E - \mu(\pi)\|_2
  \le 1 \cdot \varepsilon = \varepsilon
Abbeel and Ng: Apprenticeship Learning via Inverse Reinforcement Learning, ICML 2004
Apprenticeship Learning via IRL
Initialize: random w; compute \mu_0.
Algorithm (a code sketch follows below):
1. Compute t_i = max_{w: \|w\|_2 \le 1} min_{j < i} w^T (\mu_E - \mu_j), with w_i being the w that realizes this maximum.
2. If t_i \le \varepsilon: terminate.
3. Compute \pi_{i+1} using the RL solver with reward R = w_i^T f.
4. Compute the new feature expectations \mu_{i+1} and go to step 1.
Abbeel and Ng: Apprenticeship Learning via Inverse Reinforcement Learning, ICML 2004
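Step 1 as stated is a quadratic program (the max-margin formulation). The sketch below instead uses the simpler projection variant from the same Abbeel and Ng paper, which avoids an explicit QP solver; rl_solver and estimate_mu are assumed to be supplied by the user:

```python
import numpy as np

def apprenticeship_learning(mu_E, mu_0, rl_solver, estimate_mu,
                            eps=1e-3, max_iter=50):
    """Abbeel & Ng (2004), projection variant (sketch).

    rl_solver(w)    -> optimal policy for reward R = w^T f
    estimate_mu(pi) -> feature expectations of pi from rollouts
    """
    mu_bar = mu_0                 # projection of mu_E onto the hull so far
    for _ in range(max_iter):
        w = mu_E - mu_bar         # step 1: weight vector (projection form)
        t = np.linalg.norm(w)     # margin t_i
        if t <= eps:              # step 2: expert is (almost) matched
            break
        pi = rl_solver(w)         # step 3: RL with reward R = w^T f
        mu = estimate_mu(pi)      # step 4: new feature expectations
        d = mu - mu_bar           # project mu_E onto segment [mu_bar, mu]
        mu_bar = mu_bar + (d @ (mu_E - mu_bar)) / (d @ d) * d
    return w
```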
Inverse Reinforcement Learning: Examples
- Route planning (Ratliff et al., 2006)
- Parking lot navigation (Abbeel et al., 2008)
- Quadruped locomotion (Kolter et al., 2008)
Inverse Reinforcement Learning: Examples
- Pedestrian prediction (Ziebart et al., 2009)
- Activity forecasting (Kitani et al., 2012)
Case Study: Table Tennis
Can we learn higher-level strategies with inverse reinforcement learning?
How can we learn a manipulation task?
- Learning strategies: learning strategic elements from demonstrations using inverse reinforcement learning.
- Learning movements: learning motor skills from demonstration; learning how to select and generalize motor primitives.
[Diagram: state s -> supervisory system -> augmented state s -> motion generation -> joint values -> execution -> motor torques u -> action; the teacher provides a learning signal to the policy]
How can we represent such a strategy?
Representing the strategy: a Markov decision process (S, A, T, R); a minimal sketch follows below.
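For concreteness, the MDP tuple might be held in a structure like the one below; the field types and names are my assumptions, chosen to highlight that R is exactly the part IRL must recover:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class MDP:
    states: Sequence        # S: state space
    actions: Sequence       # A: action space
    transition: Callable    # T(s, a) -> distribution over next states
    reward: Callable        # R(s, a) -> float (unknown in an MDP\R)
```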
Finding a Reward Function for Table Tennis
Coming back to the table tennis example: Can we find a reward function from which we can generate a higher-level strategy?
Problems in the table tennis experiment:
- We do not have a perfect dynamical model.
- We cannot compute all possible policies \pi.
Testing three model-free IRL methods:
- Two model-free versions of max-margin IRL (P. Abbeel and A. Ng, Apprenticeship learning via inverse reinforcement learning, ICML 2004)
- Model-free relative entropy IRL (Boularias et al., Relative entropy inverse reinforcement learning, AISTATS 2011)
Finding a Reward Function for Table Tennis
Model-free maximum margin, using additional trajectories of non-optimal strategies:
  max_w \sum_{\tau \in D} \sum_{t=1}^T [ J_E(s_t, w) - J_{N_k}(s_t, w) ] - \lambda \|w\|_2
where N_k denotes the most similar state among the non-optimal trajectories, and the empirical value over a horizon H is
  J(s_1, w) = \frac{1}{H} \sum_{i=1}^H w^T f(s_i, a_i)
Setting the horizon H = 3 corresponds to planning two steps ahead (see the sketch below).
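A minimal sketch of the horizon-H empirical value J(s_1, w), assuming states_actions is the sequence of (state, action) pairs starting at s_1 and feature_fn computes f(s, a); both names are mine:

```python
import numpy as np

def horizon_value(states_actions, w, feature_fn, H=3):
    """Empirical value J(s_1, w) over a short horizon H.

    Averages the linear reward w^T f(s_i, a_i) over the next H
    state-action pairs; H = 3 corresponds to planning two
    strokes ahead.
    """
    window = states_actions[:H]
    return np.mean([w @ feature_fn(s, a) for s, a in window])
```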
Experimental Setup
We need many non-optimal and/or random trajectories:
- How can we generate them?
- What and how do we record?
- Pilot studies
Experimental Setup
Subjects:
- 5 naïve players
- 2 skilled players
- 1 permanent opponent (skilled)
Experiments:
1) 10 min of cooperative table tennis
2) Semi-competitive game (cooperative opponent, competitive subject)
3) Competitive game
K. Muelling et al., 2014
IRL for Table Tennis
Reward features that describe the world:
- Table preferences
- Distance to the edge (\delta_t)
- Distance to the opponent (\delta_o)
- Moving direction of the opponent (v_o)
- Ball velocity (v_b)
- Ball orientation (\theta_y, \theta_z)
- Proximity to the elbow (\delta_{elbow})
- Smash
A sketch of such a feature vector follows below.
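For illustration, the feature vector might be assembled as below; all dictionary keys are hypothetical stand-ins for the quantities listed above, not the actual definitions from Muelling et al., 2014:

```python
import numpy as np

def table_tennis_features(state):
    """Hand-designed reward features for one ball bounce (sketch)."""
    return np.array([
        state["dist_to_edge"],       # delta_t: distance to the table edge
        state["dist_to_opponent"],   # delta_o: distance to the opponent
        state["opponent_velocity"],  # v_o: opponent's moving direction
        state["ball_speed"],         # v_b: ball velocity
        state["ball_angle_y"],       # theta_y: ball orientation
        state["ball_angle_z"],       # theta_z: ball orientation
        state["dist_to_elbow"],      # delta_elbow: proximity to the elbow
        float(state["is_smash"]),    # smash indicator
        # table-preference features (preferred bounce regions) omitted
    ])
```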
IRL for Table Tennis
What do you think? Which features are important?
What Did the System Learn?
Expert preferences:
- Forehands are avoided; backhands are preferred.
- Play the ball flat and cross towards the backhand area.
- Increase the distance between ball and opponent.
Main Findings
Possible strategies that distinguish expert and non-expert players (considering states s_{T-2}, s_{T-1}):
Planning ahead: the expert plans up to two steps ahead!
Evaluation
The method is able to distinguish between:
- Skills of the players on the strategic level
- Different playing styles
Inverse Reinforcement Learning: Problems
- Needs a dynamics model.
- Needs an RL solver or planner.
- Depends on the hand-designed features.
Summary
Imitation learning:
- Learning from demonstration is a great tool to initialize learning and to make learning on real robots possible.
- Representing movements with DMPs allows us to efficiently learn movements from demonstration and through self-improvement.
- When learning from demonstration, keep in mind: What do you want to learn? Is it possible to map the human demonstration to the robot learner? Does it make sense to map the human demonstration to the robot?
- There are different ways to learn from demonstration.
Summary
Inverse reinforcement learning vs. behavioral cloning:
- The reward function defines the underlying behavior! Can we recover the reward function from demonstrations?
- Apprenticeship learning: Can we find a policy that is at least as good as the demonstrated one using IRL?
- Can we directly learn the policy? Formulated as a supervised learning problem: 1) fix a policy class, 2) find a suitable ML method, 3) learn the policy directly from demonstrations.