Continuous reinforcement learning in cognitive robotics
Igor Farkaš, CNC research group
Department of Applied Informatics / Centre for Cognitive Science, FMFI, Comenius University in Bratislava
AI seminar, 3.12.2012, Bratislava
Talk outline
Brief introduction to reinforcement learning
RL for continuous spaces: CACLA (van Hasselt & Wiering, 2007)
Master's theses:
  Integration of motor actions and language (T. Malík, 2011)
  Learning to reach and grasp objects (L. Zdechovan, 2012)
    Extended version in: Frontiers in Neurorobotics (2012)
    ŠVK laureate, Rector's Award, ACM SPY gallery of the best theses
  Learning defense manoeuvres in fencing (J. Blanář, 2011)
    ŠVK laureate, national ŠVOČ round (3rd place, Applied Informatics)
Learning paradigms in machine learning
supervised (with a teacher)
unsupervised (self-organized)
reinforcement learning (partial feedback)
Reward and value function
RL was developed for discrete spaces (states, actions)
Agent: maximize the long-term reward, with discount factor γ ∈ (0,1) (future rewards are valued less):
  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ...
Value function: V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }
Optimal policy: V*(s) = max_π V^π(s)
Bellman equation (value iteration): V*(s) = r(s) + γ max_a Σ_{s'} T(s,a,s') V*(s')
model-based, dynamic programming (not RL yet)
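As an illustration of the value-iteration idea above, here is a minimal sketch on a hypothetical random 3-state, 2-action MDP; all sizes, rewards and transition probabilities are illustrative, not taken from the talk.

```python
import numpy as np

# Minimal value iteration following the Bellman backup above:
# V(s) = r(s) + gamma * max_a sum_s' T(s,a,s') V(s')
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s'] sums to 1 over s'
r = np.array([0.0, 0.0, 1.0])                                     # reward depends on the state only

V = np.zeros(n_states)
for _ in range(1000):
    V_new = r + gamma * (T @ V).max(axis=1)   # Bellman backup for every state at once
    if np.max(np.abs(V_new - V)) < 1e-8:      # stop once the values have stabilised
        break
    V = V_new
print(V)                                      # approximates V*(s) for the three states
```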
Active RL control task
exploration enabled (non-greedy behavior), model-free
Control task: choosing the best actions
Learning state-action Q-values; V(s) = max_a Q(s,a)
Optimal policy: Q*(s,a) = max_π Q^π(s,a)
Q-learning update (off-policy, converges faster):
  Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α [r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)]
SARSA update (on-policy):
  Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α [r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)]
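The two tabular updates above transcribe almost literally into code; the table sizes and the sampled transition below are hypothetical.

```python
import numpy as np

# One tabular Q-learning step (off-policy) and one SARSA step (on-policy),
# directly following the two update rules above.
def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    td_target = r + gamma * np.max(Q[s_next])      # bootstrap with the greedy next action
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_step(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    td_target = r + gamma * Q[s_next, a_next]      # bootstrap with the action actually taken
    Q[s, a] += alpha * (td_target - Q[s, a])

Q = np.zeros((5, 3))                               # 5 states x 3 actions, illustrative sizes
q_learning_step(Q, s=0, a=1, r=1.0, s_next=2)
sarsa_step(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```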
Actor-critic architecture for RL
Problem formalization: S = states, A = actions, R = rewards, T = transitions
The agent in state s(t) emits an action a; the environment returns the next state s(t+1) and a reward
Agent's goal: maximize the long-term reward (from the environment)
TD (temporal difference) learning uses the error δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
Actor chooses actions
Critic estimates the (expected future) reward in visited states
Exploration vs. exploitation
Continuous A-C Learning Automaton, CACLA (van Hasselt & Wiering, 2007)
Architecture: state vector input → hidden-layer representation → actor and critic outputs
for t = 0, 1, 2, ... do
  choose action a_t ~ Ac_t(s_t), using exploration
  perform action a_t, observe reward r_{t+1} and next state s_{t+1}
  update critic: V_{t+1}(s_t) ← r_{t+1} + γ V_t(s_{t+1})
  if V_{t+1}(s_t) > V_t(s_t) then
    update actor's parameters such that Ac_{t+1}(s_t) → a_t
  end if
end for
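For illustration, a minimal Python sketch of this loop, assuming linear actor/critic approximators over a feature vector phi(s) and a generic env with reset()/step(); these are stand-ins, since the theses used MLPs and the iCub simulator.

```python
import numpy as np

# Hedged sketch of one CACLA episode: Gaussian exploration around the actor's
# output; critic moves toward the TD target; actor moves toward the explored
# action only when that action turned out better than expected (delta > 0).
def cacla_episode(env, W_actor, w_critic, phi, alpha=0.01, beta=0.01,
                  gamma=0.95, sigma=0.1, rng=np.random.default_rng(0)):
    s = env.reset()
    done = False
    while not done:
        x = phi(s)
        a_mean = W_actor @ x                                  # actor's deterministic action
        a = a_mean + rng.normal(0.0, sigma, a_mean.shape)     # Gaussian exploration
        s_next, r, done = env.step(a)
        x_next = phi(s_next)
        v, v_next = w_critic @ x, w_critic @ x_next
        target = r + (0.0 if done else gamma * v_next)
        delta = target - v
        w_critic += alpha * delta * x                         # critic update toward the TD target
        if delta > 0:                                         # explored action was better than expected
            W_actor += beta * np.outer(a - a_mean, x)         # move actor's output toward that action
        s = s_next
    return W_actor, w_critic
```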
Example 1: Learning object-directed actions (MSc thesis of T. Malík, 2011)
Simulated robot iCub learns object-directed motor actions; experimental design inspired by Sugita & Tani (2005)
Sensorimotor coordination involved
iCub: faithful simulation of physics, 3-year-old child, 52 DoF
Link to language: grounding the meaning of words
Integration of action learning and language
Module 1: target object localizer
Image processing using OpenCV (see the sketch below)
Input for a multi-layer perceptron
Trained with a teacher (supervised learning)
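The slide only names the module's role, so the following is a hedged sketch of how OpenCV preprocessing might provide a teacher signal for the supervised MLP, e.g. the centroid of a colour-segmented object; the HSV range and function are illustrative, not the thesis's actual pipeline.

```python
import cv2
import numpy as np

# Hedged sketch: find the target object's centroid in image coordinates by
# colour segmentation; the HSV range for the target colour is illustrative.
def object_centroid(bgr_image, hsv_low=(40, 80, 80), hsv_high=(80, 255, 255)):
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(hsv_low, dtype=np.uint8),
                       np.array(hsv_high, dtype=np.uint8))
    m = cv2.moments(mask)
    if m["m00"] == 0:                                   # target colour not found
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])   # (x, y) centroid

# The (normalized) centroid could then serve as the target output when training
# the multi-layer perceptron that maps the visual input plus the cue to the
# target object's location.
```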
Module 2: Learning action execution
Evolution of reward after training (plot)
The agent learned 3 actions (POINT, TOUCH, PUSH) toward the target object at different locations (left, middle, right)
Module 3: executed action naming
Example 1: summary
Agent successfully learned to look at the target position given a cue about object shape or color (module 1)
Agent learned to execute actions (given the action name and target position) via interaction with the environment (module 2)
Agent learned to name executed actions (module 3)
What's missing?
Link between execution and observation (mirror system)
Test for scaling up
Grasping and manipulation to be added
Example 2: iCub learns to reach and grasp
Objects of various sizes and orientations, at various positions
Reaching and grasping modules with A-C architecture (MLPs) trained separately
Right arm used
Reward based on visual, haptic and pressure information
Reaching: Actor architecture
Critic has the same input vector
n = 9 neurons per dimension
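One plausible reading of "n = 9 neurons per dimension" is a population (coarse) code of each continuous input variable; the sketch below shows such an encoding with Gaussian tuning curves, but whether the thesis used exactly this scheme is an assumption.

```python
import numpy as np

# Hedged sketch of a population code: each continuous input dimension is
# represented by n = 9 neurons with Gaussian tuning curves spread over [lo, hi].
def encode_dimension(value, lo, hi, n=9, width=None):
    centers = np.linspace(lo, hi, n)
    if width is None:
        width = (hi - lo) / (n - 1)        # neighbouring tuning curves overlap
    return np.exp(-0.5 * ((value - centers) / width) ** 2)

# Example: a joint angle of 0.3 rad in the range [-1, 1] becomes a 9-dimensional
# activity vector; concatenating such vectors over all state dimensions would
# give the actor's (and critic's) input.
print(encode_dimension(0.3, -1.0, 1.0))
```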
Reaching: Reward function design
CACLA modifications tested
Two modifications were tried to speed up training
The actor learns:
  in original CACLA if: V_{t+1}(s_t) > V_t(s_t)
  in modified CACLA if: …
  in reward CACLA if: …
  (V_t(s_{t+1}) ~ after the explored action)
Assumption: the reward is available at each step during the episode
Grasping: Actor architecture
Critic has the same input vector
Grasp features (Oztop & Arbib, 2002)
Grasping: our state vector
Grasping: reward function design
Results: Reaching
Results: Grasping
Example 2: Summary
iCub learned to reach and grasp objects of various sizes, orientations and positions with certain accuracy
3 types of grasping learned: power, side, precision (roughly in this order)
We compared 3 versions of the CACLA algorithm for reaching: final performance was roughly the same
Final performance is also quite robust w.r.t. some model parameters (learning rate, exploration degree)
(Appropriate) reward drives learning
Further improvements should be possible
Example 3: Fencing
The trainer attacks the agent with a sword, using a limited repertoire of preprogrammed actions (with variations possible)
The agent learned to defend itself
The agent uses CACLA for learning
4 DoF of the arm are used
3 types of attack: from the left, from the middle, from the right
Model architecture
CACLA and its modification
Trajectory generation for the trainer: Bezier curves (see the sketch below)
Reward design: 2 major components: (1) distance between the swords, (2) distance from the defender's body
Original CACLA: problem with the condition for adapting the actor (too weak), 20% testing error
Modification MCACLA: both actions compared, mental simulation involved
Improved the actor's performance (up to 100%)
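To illustrate the trajectory generation mentioned above, here is a minimal sketch of evaluating a cubic Bezier curve between a start and a target point; the control points and sampling below are illustrative, not the thesis's actual parameters.

```python
import numpy as np

# Cubic Bezier evaluation: p0..p3 are 3-D control points, t in [0, 1].
# Varying the inner control points p1, p2 yields variations of the attack.
def bezier(p0, p1, p2, p3, t):
    t = np.asarray(t)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# Example: sample 50 points of a hypothetical thrust from the trainer's guard
# position toward the defender's torso (coordinates are made up).
p0, p3 = np.array([0.0, 0.4, 1.0]), np.array([0.0, 0.0, 0.2])
p1, p2 = np.array([0.1, 0.5, 0.7]), np.array([-0.1, 0.1, 0.4])
trajectory = bezier(p0, p1, p2, p3, np.linspace(0.0, 1.0, 50))
```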
Modified CACLA
Original and modified CACLA
Performance comparison
Conclusion
CACLA: a new algorithm for RL in continuous spaces
cognitively plausible, rather slow; improvements should be possible
Reward design is an important feature
Thank you for your attention.
Thanks to my former students: Tomáš Malík, Lukáš Zdechovan, Jaroslav Blanář