Neural Dynamics and Reinforcement Learning Presented By: Matthew Luciw DFT SUMMER SCHOOL, 2013 IDSIA Istituto Dalle Molle di Studi sull'Intelligenza Artificiale
IDSIA Lugano, Switzerland www.idsia.ch Our Lab's Director: Juergen Schmidhuber Cognitive robotics, robot learning, universal search and learning algorithms, Kolmogorov complexity, algorithmic probability, Speed Prior, minimal description length, generalization and data compression, recurrent neural networks, financial forecasting with low-complexity nets, independent component analysis, low-complexity codes, reinforcement learning in partially observable environments, adaptive subgoal generation, multiagent learning, artificial evolution, probabilistic program evolution, automatic music composition, metalearning, self-modifying policies, Gödel machines, low-complexity art, theories of interestingness and beauty
Motivation How do we learn sequences of behavior, to achieve goals, in the DFT framework? How are these sequences learned from delayed rewards? How do sequences of these behaviors emerge as an agent autonomously explores its environment?
The Reinforcement Learning Loop [Diagram: the agent, in state/situation s_t with reward r_t, emits action a_t; the environment returns the new situation s_{t+1} and reward r_{t+1}]
Reinforcement Learning Basics Agent in situation s_t chooses action a_t Outcome: new situation s_{t+1} Agent perceives situation s_{t+1} and reward r_{t+1} Policy: the law of how the agent acts Reinforcement learning is both improving the policy and selecting actions to produce the experience stream (history) s_0, a_0, s_1, r_1, a_1, s_2, r_2, a_2, ... Goal: produce a history that maximizes the sum of the r_i
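A minimal sketch of this interaction loop in Python; `env`, `policy`, and their methods are hypothetical stand-ins used only for illustration, not part of the DFT architecture:

```python
def run_episode(env, policy, max_steps=100):
    """Generate an experience stream s0, a0, r1, s1, a1, r2, ... and sum the rewards."""
    history = []
    total_reward = 0.0
    s = env.reset()                     # initial situation s_0
    for _ in range(max_steps):
        a = policy(s)                   # choose action a_t from the current policy
        s_next, r, done = env.step(a)   # environment returns s_{t+1} and r_{t+1}
        history.append((s, a, r))
        total_reward += r
        s = s_next
        if done:
            break
    return history, total_reward
```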
Reinforcement Learning: What We Need 1. What is the learner's internal state? e.g., state values, state-action values Needs states and actions to be defined 2. How does the agent sense the world state? Sensors? Features? 3. How are possible actions evaluated? e.g., state-action values, or a one-step state predictor and state values 4. How are possible actions chosen? policy, exploration method 5. How are the actions executed? e.g., low-level controllers 6. How is the internal state updated? e.g., value iteration, Q-learning, SARSA
IDSIA Elementary Behaviors for RL (Example: Find Color) [Figure: sensory input (color hue vs. pixel column), perceptual field activity, perceptual field output, preshape, intention nodes (current intention: Green), CoS node, motor field, heading direction, motors] EBs cover #2 (how does the agent sense the world state?) and #5 (how are the actions executed?)
Learner's Internal State: Value Function The value function predicts reward: it estimates the total future reward given a course of action State values: $V^\pi(s) = E_\pi[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s]$ (γ is a discount factor) State-action values: $Q^\pi(s,a) = E_\pi[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a]$ The learner estimates these values from experience
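The quantity both value functions estimate is the discounted return. A minimal Python sketch of that sum for a finite stream of observed rewards (function name and arguments are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted sum of future rewards: sum_k gamma**k * r_{t+k+1}.
    This is the quantity that V(s) and Q(s, a) are trained to predict."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: a reward of 1 arrives two steps in the future, with gamma = 0.9
print(discounted_return([0.0, 0.0, 1.0]))   # 0.81
```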
Elementary Behaviors can function as states for an RL system: they discretize the continuous world
Behavior Chaining Functionally, a deterministic state transition Let's add multiple outcome EBs, and (possibly multiple) ways to select one of them
Adaptive Value Nodes for Policy Learning IDSIA Intention Nodes CoS Nodes Value Nodes Perceptual Field Output Greedy policy execution becomes: for the previously completed EB, select the intention of the next most valuable EB; this encodes a sequence of EBs
Adaptive Value Nodes for Policy Learning IDSIA Intention Nodes CoS Nodes Value Nodes Perceptual Field Output This covers #1 (what is the internal state of the learner?) and #3 (how are possible actions evaluated?)
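A minimal sketch of the greedy selection rule just described, not the neural-dynamic implementation: the value nodes are assumed here to be stored as a hypothetical table W[prev_eb, next_eb] of strengths from CoS nodes to intention nodes.

```python
import numpy as np

def select_next_eb(W, prev_eb):
    """Greedy policy over value nodes: given the EB whose CoS just fired,
    activate the intention of the next EB with the highest value node."""
    return int(np.argmax(W[prev_eb]))

# Example with 3 EBs: after EB 0 completes, EB 2 is its most valuable successor
W = np.array([[0.0, 0.2, 0.9],
              [0.1, 0.0, 0.3],
              [0.5, 0.4, 0.0]])
print(select_next_eb(W, prev_eb=0))   # 2
```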
Learning the Policy IDSIA In RL, we are generally trying to learn an optimal policy If we know the dynamics of the environment and the reward function (together: the model), we can use dynamic programming to obtain the optimal policy Dynamics of the environment: $P(s' \mid s, a)$ Reward function: $R(s, a, s')$
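For illustration only (not part of the DFT system), a minimal tabular value-iteration sketch, assuming the model is given as NumPy arrays P[s, a, s'] and R[s, a, s']:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, n_iter=100):
    """Dynamic programming with a known model.
    P[s, a, s'] = transition probability, R[s, a, s'] = expected reward.
    Returns state values and the greedy policy."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_iter):
        # Q[s, a] = sum_s' P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * V)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)
```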
Temporal Difference Learning IDSIA We can learn an optimal policy without learning the model, using model-free methods These learn directly (on-line) from experience Update the estimate of V(s) after visiting state s: $V(s) \leftarrow V(s) + \alpha [ r + \gamma V(s') - V(s) ]$
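A minimal sketch of that TD(0) update for a tabular value estimate (the dictionary-based V and the step size alpha are illustrative choices):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One model-free TD(0) update after observing the transition (s, r, s_next)."""
    td_error = r + gamma * V[s_next] - V[s]   # how wrong was the current estimate?
    V[s] += alpha * td_error                  # nudge V(s) toward the better target
    return td_error

# Example: a two-state chain where s=0 leads to s=1 with reward 1
V = {0: 0.0, 1: 0.0}
print(td0_update(V, s=0, r=1.0, s_next=1))   # TD error = 1.0, V[0] becomes 0.1
```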
Our DFT TD-Learning Algorithm DN-SARSA(λ) combines: a process description of DFT, to allow operation in real-time, continuous environments, with the RL algorithm SARSA(λ), to enable the agent to learn sequences of behaviors that lead to reward This deals with #6 (how is the internal state updated?)
SARSA(λ) TD Algorithm DN-SARSA(λ): Dynamic Neural-SARSA(λ)
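For reference, a minimal tabular SARSA(λ) update with replacing eligibility traces, i.e., the standard algorithm the slide names rather than its neural-dynamic counterpart; Q and E are assumed to be NumPy arrays indexed by state and action.

```python
import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.9, lam=0.8):
    """One tabular SARSA(lambda) update after observing (s, a, r, s_next, a_next)."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]   # TD error
    E[s, a] = 1.0                                     # replacing eligibility trace
    Q += alpha * delta * E                            # credit all recently visited pairs
    E *= gamma * lam                                  # decay every trace
    return delta

# Example setup: 4 states (EBs) and 4 actions
n_states, n_actions = 4, 4
Q = np.zeros((n_states, n_actions))
E = np.zeros((n_states, n_actions))
```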
DN-SARSA(λ) Architecture The value opposition field is where the TD-error calculation lives Eligibility trace: this particular implementation uses Item and Order working memory (Sohrob) Transient pulse cells do state-transition signaling and memory of the last state-action
IDSIA Epuck in a Color Sequence Learning Task Four EBs: 1. Find Blue, 2. Find Red, 3. Find Yellow, 4. Find Purple (plus Explore) [Plots: cumulative reward vs. time step and average TD error vs. error measurements during exploration]
The Eligibility Trace is Important for sequence learning
~~ The Tree of Life ~~ [Diagram: a tree of all possible histories, branching over the actions A, B, C, D at every step, from the start α to the end Ω]
Somewhere, A Reward [Diagram: one branch of the tree, ending ... → C → D, yields +100]
What Caused It? [Diagram: two branches both end in ... → C → D, but one yields +100 and the other +0]
Memory Capacity Can Matter! [Diagram: a history A → B → C → D ending in +100]
Memory Capacity Can Matter! [Diagram: the steps A → B → C → D leading to +100 are REINFORCED]
Memory Capacity Can Matter! [Diagram: A → B → C → D (+100) is REINFORCED; B → A → C → D (+0) is not (NOPE), so credit must reach back further than the shared C → D ending]
Grid World Analogy
The Eligibility Trace is Essential for our system But the length of the sequences it can learn is limited Note: if a sequence is very long, you couldn't learn it either
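A rough, back-of-the-envelope way to see the limit (the cutoff value here is arbitrary): a state-action visited k steps before the reward carries a trace of (gamma * lambda)**k, so credit stops reaching back once that factor falls below some floor.

```python
# How far back can the eligibility trace assign credit?
gamma, lam, floor = 0.9, 0.8, 1e-3
k = 0
while (gamma * lam) ** k > floor:
    k += 1
print(f"credit effectively reaches back about {k} steps")   # ~22 for these values
```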
IDSIA Epuck in a Color Sequence Learning Task Four EBs: 1. Find Blue, 2. Find Red, 3. Find Yellow, 4. Find Purple (plus Explore) [Plots: cumulative reward vs. time step and average TD error vs. error measurements during exploration]
One Last Thing: Policy Iteration IDSIA More than TD value updates are needed to achieve an optimal policy TD value updates constitute policy evaluation: prediction of the return for some policy But we will only learn the values of the policy through which the agent is sampling the state-action space Policy improvement: change the policy to increase the predicted return We need to interleave policy evaluation and policy improvement to reach an optimal policy Epsilon-greedy: more random exploration early, (hopefully) mostly exploitation later
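A minimal sketch of epsilon-greedy action selection over state-action values; the annealing schedule shown is an illustrative choice, not the one used in the experiments.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the current value estimates."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniformly random action
    return int(np.argmax(Q[s]))                # exploit: greedy action

# Example: anneal epsilon so exploration dominates early, exploitation later
rng = np.random.default_rng(0)
Q = np.zeros((4, 4))
for episode in range(1000):
    epsilon = max(0.05, 1.0 - episode / 500)   # linear decay to a small floor
    a = epsilon_greedy(Q, s=0, epsilon=epsilon, rng=rng)
```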
IDSIA Should have used e-greedy!!! [Plots: (b) cumulative reward vs. time step during the explore and exploit phases; (c) time when the correct sequence was learned, per run (runs 1-13; sequence finding was difficult in run 6); (d) average TD error vs. error measurements, explore then exploit; (e) cumulative reward vs. time step during exploitation]
Dynamics of Behavioral Transitions
Simulated Environment: Exploration Video See Demo Material
Sequence Learned Transferred to THE REAL WORLD Video See Demo Material
Different Agent+Environment
Start Possible Transitions [Diagram: transitions among the EBs Search (A), Approach (B), Grab (C), Transport (D), Drop (E), and FAIL] Reward only if the sequence A → B → C → D → E is completed
Goal Sequence Video See Demo Material
Learning the Sequence (Now with E-Greedy!) Video See Demo Material
Exploration Mishaps Video See Demo Material
Exploration Mishaps Video See Demo Material
Exploration Mishaps Video See Demo Material
Nao Experiment Boris Duran, Gauss Lee, Robert Lowe
Motivation Dynamic Field Theory Behavioral Organization in DFT SARSA / DN-SARSA Conclusion
Video See Demo Material
Video See Demo Material
Conclusions Reinforcement Learning can enable Neural Dynamics models to autonomously learn rewarding behavioral sequences There are some limitations of the current method
References
Kazerounian*, S., Luciw*, M., Richter, M., Sandamirskaya, Y. (2013). Autonomous Reinforcement of Behavioral Sequences in Neural Dynamics. Proceedings of the International Joint Conference on Neural Networks (IJCNN).
Duran, B., Lee, G., Lowe, R. (2013). Learning a DFT-Based Sequence with Reinforcement Learning: A NAO Implementation. PALADYN Journal of Behavioral Robotics.
Sandamirskaya, Y., Richter, M., Schöner, G. (2011). A Neural-Dynamic Architecture for Behavioral Organization of an Embodied Agent. Proceedings of the International Conference on Development and Learning (ICDL).
Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Material from: http://www.inf.ed.ac.uk/teaching/courses/rl/slides/