Neural Dynamics and Reinforcement Learning

1 Neural Dynamics and Reinforcement Learning. Presented by: Matthew Luciw. DFT Summer School, 2013. IDSIA (Istituto Dalle Molle di Studi sull'Intelligenza Artificiale)

2 IDSIA, Lugano, Switzerland. Our lab's director: Juergen Schmidhuber. Research topics: cognitive robotics, robot learning, universal search and learning algorithms, Kolmogorov complexity, algorithmic probability, Speed Prior, minimal description length, generalization and data compression, recurrent neural networks, financial forecasting with low-complexity nets, independent component analysis, low-complexity codes, reinforcement learning in partially observable environments, adaptive subgoal generation, multiagent learning, artificial evolution, probabilistic program evolution, automatic music composition, metalearning, self-modifying policies, Gödel machines, low-complexity art, theories of interestingness and beauty.

3 Motivation How do we learn sequences of behavior, to achieve goals, in the DFT framework? How are these sequences learned from delayed rewards? How do sequences of these behaviors emerge as an agent autonomously explores its environment?

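4 Reinforcement Environment [Diagram: agent-environment loop. The AGENT receives the state / situation s_t and the reward r_t and emits the action a_t; the ENVIRONMENT returns the reward r_{t+1} and the next state s_{t+1}.]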

5 Reinforcement Learning Basics. The agent in situation s_t chooses action a_t. Outcome: a new situation s_{t+1}. The agent perceives situation s_{t+1} and reward r_{t+1}. Policy: the law of how the agent acts. Reinforcement learning is both improving the policy and selecting actions to provide the experience stream (history) s_0 a_0 s_1 r_1 a_1 s_2 r_2 a_2 ... Goal: produce a history that maximizes the sum of the r_i.
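The interaction loop on this slide can be sketched in a few lines of Python. This is only an illustration: the five-state corridor, the `step` function, and the reward of 1 at the last state are assumptions made for the example and do not come from the talk. The history is recorded in the slide's order s_0 a_0 s_1 r_1 a_1 s_2 r_2 ...

```python
import random

GOAL = 4  # hypothetical corridor with states 0..4; state 4 is rewarding


def step(state, action):
    """Assumed toy environment: action is -1 (left) or +1 (right)."""
    next_state = max(0, min(GOAL, state + action))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def random_policy(state, rng):
    """A policy maps the current situation to an action."""
    return rng.choice([-1, 1])


def run_episode(policy, max_steps=50, seed=0):
    """Roll out one history and return it with the sum of rewards."""
    rng = random.Random(seed)
    state = 0
    history = [state]  # s_0
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state, rng)
        state, reward = step(state, action)
        history += [action, state, reward]  # a_t, s_{t+1}, r_{t+1}
        total_reward += reward
        if state == GOAL:
            break
    return history, total_reward


if __name__ == "__main__":
    history, total = run_episode(random_policy)
    print(history, total)
```

A policy that always moves right reaches the goal in four steps; the random policy usually takes longer, which is the gap that learning is meant to close.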

6 Reinforcement Learning: What We Need
1. What is the learner's internal state? E.g., state values, state-action values. Needs states and actions to be defined.
2. How does the agent sense the world state? Sensors? Features?
3. How are possible actions evaluated? E.g., state-action values, or a one-step state predictor and state values.
4. How are possible actions chosen? Policy; exploration method.
5. How are the actions executed? E.g., low-level controllers.
6. How is the internal state updated? E.g., value iteration, Q-learning, SARSA.

7 Elementary Behaviors for RL (Example: Find Color) [Figure: architecture with Sensory Input (color hue by pixel column), Perceptual Field Activity, Perceptual Field Output, Preshape, Intention Nodes (current intention: Green), CoS node, Motor Field, Heading Direction, and Motors.] EBs cover #2 (how does the agent sense the world state?) and #5 (how are the actions executed?).

8 Learner's Internal State: Value Function. The value function predicts reward: it estimates the total future reward given a course of action. State values V(s) (γ is a discount factor). State-action values Q(s, a). The learner estimates values from experience.
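The equations on this slide did not survive in the transcript. For orientation, the standard definitions from Sutton and Barto (1998), which the deck cites, are given below; it is an assumption that the slide showed these forms and this notation.

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1} \;\middle|\; s_t = s \right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1} \;\middle|\; s_t = s,\ a_t = a \right],
\qquad 0 \le \gamma \le 1 .
```

Here π is the policy being followed, and a smaller γ weights near-term rewards more heavily than distant ones.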

9 Elementary Behaviors can function as States for an RL system: they discretize the continuous world.

10 Behavior Chaining. Functionally, a deterministic state transition. Let's add multiple outcome EBs, and (possibly multiple) ways to select one of them.

11 Adaptive Value Nodes for Policy Learning [Figure labels: Intention Nodes, CoS Nodes, Value Nodes, Perceptual Field Output.] Greedy policy execution becomes: for the previously completed EB, select the intention of the next most valuable EB. This encodes a sequence of EBs.
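As a rough discrete analogy to the greedy rule on this slide, the sketch below reads a sequence out of a table of learned values. It is not the neural-dynamic implementation: the dictionary `values`, the EB names, and the numbers are hypothetical, and the real system uses value nodes coupled to intention and CoS nodes rather than a lookup table.

```python
def greedy_chain(values, start, max_len=10):
    """Follow the most valuable next EB after each completed EB.

    values[(prev_eb, next_eb)] is the (hypothetical) learned value of
    activating next_eb's intention once prev_eb has been completed.
    """
    sequence = [start]
    current = start
    for _ in range(max_len - 1):
        candidates = {nxt: v for (prev, nxt), v in values.items() if prev == current}
        if not candidates:
            break  # no learned successor: the chain ends here
        current = max(candidates, key=candidates.get)
        sequence.append(current)
    return sequence


if __name__ == "__main__":
    # Made-up values for illustration only.
    values = {
        ("find_blue", "find_red"): 0.9,
        ("find_blue", "find_yellow"): 0.1,
        ("find_red", "find_yellow"): 0.8,
        ("find_red", "find_blue"): 0.2,
        ("find_yellow", "find_purple"): 0.7,
    }
    print(greedy_chain(values, "find_blue"))
```

With these made-up numbers the chain is find_blue, find_red, find_yellow, find_purple, which shows how a value table implicitly encodes a sequence.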

12 Adaptive Value Nodes for Policy Learning [Figure labels: Intention Nodes, CoS Nodes, Value Nodes, Perceptual Field Output.] This covers #1 (what is the internal state of the learner?) and #3 (how are possible actions evaluated?).

13 Learning the Policy. In RL, we're generally trying to learn an optimal policy. If we know the dynamics of the environment and the reward function (together: the model), we can use dynamic programming to get it. [The equations for the environment dynamics and the reward function were not preserved in this transcript.]
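To make the dynamic-programming case concrete, here is a minimal value-iteration sketch for a known model. The talk does not say which dynamic-programming method was meant, so value iteration is an assumption, as are the dictionary layout of `P` and `R` and the three-state toy model.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Compute optimal state values and a greedy policy from a known model.

    P[s][a] is a list of (probability, next_state) pairs (the dynamics);
    R[s][a] is the expected immediate reward. States with no actions are terminal.
    """
    def backup(s, a, V):
        return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])

    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:
                continue  # terminal state keeps value 0
            best = max(backup(s, a, V) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: max(P[s], key=lambda a: backup(s, a, V)) for s in P if P[s]}
    return V, policy


if __name__ == "__main__":
    # Hypothetical chain 0 -> 1 -> 2, where reaching terminal state 2 pays 1.
    P = {
        0: {"stay": [(1.0, 0)], "go": [(1.0, 1)]},
        1: {"stay": [(1.0, 1)], "go": [(1.0, 2)]},
        2: {},
    }
    R = {
        0: {"stay": 0.0, "go": 0.0},
        1: {"stay": 0.0, "go": 1.0},
        2: {},
    }
    print(value_iteration(P, R))
```

The point of the contrast with the following slides is that this procedure needs `P` and `R` up front, which an autonomously exploring robot typically does not have.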

14 Temporal Difference Learning. We can learn an optimal policy without learning the model, using model-free methods. These learn directly (on-line) from experience. Update the estimate of V(s) after visiting state s.
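The update equation is missing from the transcript. The textbook TD(0) rule from Sutton and Barto (1998) is V(s) ← V(s) + α [r + γ V(s') − V(s)], and the sketch below implements that rule; it is an assumption that the slide showed this exact form, and the step size `alpha` and the dictionary representation are choices made for the example.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to the value table V after leaving state s.

    Unseen states are treated as having value 0. Returns the TD error.
    """
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error


if __name__ == "__main__":
    V = {}
    # One observed transition: state "a" -> state "b" with reward 1.
    print(td0_update(V, "a", 1.0, "b"), V)
```

No model of the environment appears anywhere in the update: only the sampled reward and the current estimate for the next state are used.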

15 Our DFT TD-Learning Algorithm: DN-SARSA(λ). It combines a process description of DFT, to allow operation in real-time, continuous environments, with the RL algorithm SARSA(λ), to enable the agent to learn sequences of behaviors that lead to reward. Deals with #6 (how is the internal state updated?).

16 SARSA(λ): TD algorithm. DN-SARSA(λ): Dynamic Neural-SARSA(λ).
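For reference, a tabular SARSA(λ) with accumulating eligibility traces might look like the sketch below. This is the discrete textbook algorithm, not DN-SARSA(λ) itself, which realizes the same ingredients with neural dynamics. The corridor environment, the hyperparameters, and the ε-greedy action selection are all assumptions made for the example.

```python
import random
from collections import defaultdict

ACTIONS = (-1, 1)  # left, right
GOAL = 4           # hypothetical corridor with states 0..4


def step(state, action):
    """Assumed toy environment: reward 1 on reaching GOAL, which ends the episode."""
    nxt = max(0, min(GOAL, state + action))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def epsilon_greedy(Q, state, epsilon, rng):
    """Random action with probability epsilon, else a best action (ties broken randomly)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])


def sarsa_lambda(episodes=200, alpha=0.1, gamma=0.9, lam=0.8, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)      # state-action values
    for _ in range(episodes):
        E = defaultdict(float)  # eligibility traces, reset every episode
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            s2, r, done = step(s, a)
            a2 = epsilon_greedy(Q, s2, epsilon, rng)
            target = r if done else r + gamma * Q[(s2, a2)]
            delta = target - Q[(s, a)]       # TD error
            E[(s, a)] += 1.0                 # accumulating trace
            for key in list(E):
                Q[key] += alpha * delta * E[key]  # credit recently visited pairs
                E[key] *= gamma * lam             # decay all traces
            s, a = s2, a2
    return Q


if __name__ == "__main__":
    Q = sarsa_lambda()
    for s in range(GOAL):
        print(s, round(Q[(s, -1)], 3), round(Q[(s, 1)], 3))
```

The trace table `E` is what lets a single reward at the end of a sequence update the earlier state-action pairs that led to it, rather than only the last one.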

17 DN-SARSA(λ) Architecture. The value opposition field is where the TD-error calculation lives. Eligibility trace: this particular implementation uses Item and Order working memory (Sohrob). Transient pulse cells do state-transition signaling and keep a memory of the last state-action pair.

18 E-puck in a Color Sequence Learning Task. Four EBs: 1. Find Blue, 2. Find Red, 3. Find Yellow, 4. Find Purple. [Figure: plots of average TD error (error measurements) and cumulative reward against time step, with explore phases marked; panels (a) and (b).]

19 The Eligibility Trace is Important for sequence learning

20 ~~ The Tree of Life ~~ [Figure: a tree rooted at α that branches into the actions A, B, C, D at every step; its leaves Ω are all possible histories.]

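21 Somewhere, A Reward [Figure: a history ending in C, D yields +100!]

22 What Caused It? [Figure: one history ending in C, D yields +100!; another ending in C, D yields +0.]

23 Memory Capacity Can Matter! [Figure: the history A, B, C, D yields +100!]

24 Memory Capacity Can Matter! [Figure: the history A, B, C, D yields +100! and its early steps A, B are marked REINFORCED.]

25 Memory Capacity Can Matter! [Figure: A, B followed by C, D yields +100! and is marked REINFORCED; B, A followed by C, D yields +0 and is marked NOPE.]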

26 Grid World Analogy

27 Grid World Analogy

28 Grid World Analogy

29 The Eligibility Trace is Essential for our system. But the length of the sequences it can learn is limited. Note: if a sequence is very long, you couldn't learn it either.

30 E-puck in a Color Sequence Learning Task. Four EBs: 1. Find Blue, 2. Find Red, 3. Find Yellow, 4. Find Purple. [Figure: plots of average TD error (error measurements) and cumulative reward against time step, with explore phases marked; panels (a) and (b).]

31 One Last Thing: Policy Iteration. More than TD value updates are needed to achieve an optimal policy. TD value updating constitutes policy evaluation: prediction of return for some policy. But we will only learn the values of the policy through which the agent is sampling the state-action space. Policy improvement: change the policy to increase the prediction of return. We need to interleave policy evaluation and policy improvement to get an optimal policy. Epsilon-greedy: more random exploration early, (hopefully) mostly exploitation later.
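One common way to get "more exploration early, mostly exploitation later" is to decay ε over time. The sketch below uses a linear schedule; the schedule shape and the parameters `eps_start`, `eps_end`, and `decay_steps` are assumptions for illustration, since the talk does not specify how ε was set.

```python
import random


def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps steps."""
    frac = min(1.0, step / decay_steps)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(values, epsilon, rng):
    """Pick a random option with probability epsilon, else the most valuable one.

    values maps each candidate (for example, a next EB) to its current value estimate.
    """
    if rng.random() < epsilon:
        return rng.choice(list(values))
    return max(values, key=values.get)


if __name__ == "__main__":
    rng = random.Random(0)
    values = {"A": 0.1, "B": 0.9}  # made-up value estimates
    for t in (0, 500, 1000):
        eps = epsilon_at(t)
        print(t, round(eps, 3), epsilon_greedy(values, eps, rng))
```

Because the greedy choice always follows the current value estimates, every value update also changes the policy slightly, which is the interleaving of evaluation and improvement described on the slide.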

32 Should have used e-greedy!!! [Figure: panels (a) to (e) showing average TD error (error measurements) and cumulative reward against time step (S * 32, x 10^4) for explore and exploit phases, the time at which the correct sequence was learned for each run, and sequence-finding difficulty (run 6).]

33 Dynamics of Behavioral Transitions

34 Simulated Environment: Exploration Video See Demo Material

35 Sequence Learned Transferred to THE REAL WORLD Video See Demo Material

36 Different Agent+Environment

37 [Figure: diagram of possible transitions from Start among the behaviors Search (A), Approach (B), Grab (C), Transport (D), Drop (E), and FAIL.] Reward if A -> B -> C -> D -> E.

38 Goal Sequence Video See Demo Material

39 Learning the Sequence (Now with E-Greedy!) Video See Demo Material

40 Exploration Mishaps Video See Demo Material

41 Exploration Mishaps Video See Demo Material

42 Exploration Mishaps Video See Demo Material

43 Nao Experiment Boris Duran, Gauss Lee, Robert Lowe

44 Outline: Motivation; Dynamic Field Theory; Behavioral Organization in DFT; SARSA / DN-SARSA; Conclusion

45 Video See Demo Material

46 Video See Demo Material

47 Conclusions. Reinforcement Learning can enable Neural Dynamics models to autonomously learn rewarding behavioral sequences. There are some limitations of the current method.

48 References
Kazerounian*, S., Luciw*, M., Richter, M., Sandamirskaya, Y. (2013). Autonomous Reinforcement of Behavioral Sequences in Neural Dynamics. Proceedings of the International Joint Conference on Neural Networks (IJCNN).
Duran, B., Lee, G., Lowe, R. (2013). Learning a DFT-Based Sequence with Reinforcement Learning: A NAO Implementation. PALADYN Journal of Behavioral Robotics.
Sandamirskaya, Y., Richter, M., Schöner, G. (2011). A Neural-Dynamic Architecture for Behavioral Organization of an Embodied Agent. Proceedings of the International Conference on Development and Learning (ICDL).
Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. Material from:
