Neural Dynamics and Reinforcement Learning

Neural Dynamics and Reinforcement Learning. Presented by: Matthew Luciw, DFT Summer School, 2013. IDSIA, Istituto Dalle Molle di Studi sull'Intelligenza Artificiale

IDSIA, Lugano, Switzerland, www.idsia.ch. Our lab's director: Juergen Schmidhuber. Research topics: cognitive robotics, robot learning, universal search and learning algorithms, Kolmogorov complexity, algorithmic probability, Speed Prior, minimal description length, generalization and data compression, recurrent neural networks, financial forecasting with low-complexity nets, independent component analysis, low-complexity codes, reinforcement learning in partially observable environments, adaptive subgoal generation, multiagent learning, artificial evolution, probabilistic program evolution, automatic music composition, metalearning, self-modifying policies, Gödel machines, low-complexity art, theories of interestingness and beauty.

Motivation. How do we learn sequences of behavior to achieve goals in the DFT framework? How are these sequences learned from delayed rewards? How do sequences of these behaviors emerge as an agent autonomously explores its environment?

Reinforcement Environment. [Diagram: the agent-environment loop. The AGENT, in state/situation s_t and receiving reward r_t, emits action a_t; the ENVIRONMENT returns the new state s_{t+1} and reward r_{t+1}.]

Reinforcement Learning Basics. The agent in situation s_t chooses action a_t. Outcome: new situation s_{t+1}. The agent perceives situation s_{t+1} and reward r_{t+1}. Policy: the law of how the agent acts. Reinforcement learning is both improving the policy and selecting actions to produce the experience stream (history) s_0, a_0, s_1, r_1, a_1, s_2, r_2, a_2, ... Goal: produce a history that maximizes the sum of the r_i.

Reinforcement Learning: What We Need
1. What is the learner's internal state? e.g., state values, state-action values. Requires states and actions to be defined.
2. How does the agent sense the world state? Sensors? Features?
3. How are possible actions evaluated? e.g., state-action values, or a one-step state predictor plus state values.
4. How are possible actions chosen? Policy, exploration method.
5. How are the actions executed? e.g., low-level controllers.
6. How is the internal state updated? e.g., value iteration, Q-learning, SARSA.
(A minimal sketch of these six components follows below.)
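To make the six questions concrete, here is a minimal, hypothetical tabular-agent sketch in Python. The class and method names are illustrative, not from the talk, and the update shown is the plain SARSA rule.

import random
from collections import defaultdict

class TabularAgent:
    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.actions = actions                       # available discrete actions
        self.q = defaultdict(float)                  # 1. internal state: Q[(state, action)]
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def sense(self, observation):
        return observation                           # 2. sensing: observation used directly as the state

    def evaluate(self, state):
        return {a: self.q[(state, a)] for a in self.actions}   # 3. evaluate candidate actions

    def choose(self, state):
        if random.random() < self.epsilon:           # 4. choose: epsilon-greedy policy
            return random.choice(self.actions)
        values = self.evaluate(state)
        return max(values, key=values.get)

    # 5. execution of the chosen action is left to low-level controllers / the environment

    def update(self, s, a, r, s2, a2):
        # 6. internal-state update (SARSA rule): move Q(s,a) toward r + gamma*Q(s',a')
        td_error = r + self.gamma * self.q[(s2, a2)] - self.q[(s, a)]
        self.q[(s, a)] += self.alpha * td_error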

Elementary Behaviors for RL (Example: Find Color). [Architecture diagram, labels only: sensory input (color hue vs. pixel column), perceptual field activity and output, intention nodes with preshape (current intention: green), motor field, heading direction, motors, CoS node.] EBs cover #2, how does the agent sense the world state?, and #5, how are the actions executed?

Learner's Internal State: the Value Function. The value function predicts reward: it estimates the total future reward given a course of action. State values (γ is a discount factor); state-action values (standard definitions below). The learner estimates these values from experience.
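The formulas on this slide were images and did not survive transcription; for reference, these are the standard discounted state-value and state-action-value definitions (as in Sutton and Barto, 1998):

\[
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k}\,r_{t+k+1}\,\middle|\,s_t=s\right],
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k}\,r_{t+k+1}\,\middle|\,s_t=s,\ a_t=a\right].
\]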

Elementary Behaviors can function as states for an RL system: they discretize the continuous world.

Behavior Chaining. Functionally, a deterministic state transition. Let's add multiple outcome EBs, and (possibly multiple) ways to select one of them.

Adaptive Value Nodes for Policy Learning. [Diagram labels: intention nodes, CoS nodes, value nodes, perceptual field output.] Greedy policy execution becomes: given the previously completed EB, select the intention of the most valuable next EB; this encodes a sequence of EBs.

Adaptive Value Nodes for Policy Learning. [Diagram labels: intention nodes, CoS nodes, value nodes, perceptual field output.] This covers #1, what is the internal state of the learner?, and #3, how are possible actions evaluated?

Learning the Policy. In RL, we're generally trying to learn an optimal policy. If we know the dynamics of the environment and the reward function (together: the model), we can use dynamic programming to obtain it. Dynamics of environment and reward function: see the standard definitions below.
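The two formulas on this slide were images; presumably they are the standard one-step model definitions (notation as in Sutton and Barto, 1998):

\[
\mathcal{P}^{a}_{ss'} = \Pr\{\,s_{t+1}=s' \mid s_t=s,\ a_t=a\,\},
\qquad
\mathcal{R}^{a}_{ss'} = \mathbb{E}\{\,r_{t+1} \mid s_t=s,\ a_t=a,\ s_{t+1}=s'\,\}.
\]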

Temporal Difference Learning. With model-free methods we can learn an optimal policy without learning the model. These methods learn directly (on-line) from experience: update the estimate of V(s) after visiting state s, as sketched below.
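The update rule itself was not transcribed; the standard TD(0) rule, with learning rate α and the TD error in brackets, is:

\[
V(s_t) \leftarrow V(s_t) + \alpha\,\big[\,r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\,\big].
\]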

Our DFT TD-Learning Algorithm. DN-SARSA(λ) combines a process description of DFT, to allow operation in real-time, continuous environments, with the RL algorithm SARSA(λ), to enable the agent to learn sequences of behaviors that lead to reward. Deals with #6, how is the internal state updated?

SARSA(λ) TD Algorithm → DN-SARSA(λ): Dynamic Neural SARSA(λ). (A sketch of the underlying tabular SARSA(λ) algorithm follows.)
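As a reference for the tabular algorithm that DN-SARSA(λ) realizes with neural dynamics, here is a hypothetical Python sketch of SARSA(λ) with replacing eligibility traces; the environment interface (env.reset, env.step) is assumed for illustration and is not from the talk.

import random
from collections import defaultdict

def sarsa_lambda(env, actions, episodes=100, alpha=0.1, gamma=0.9,
                 lam=0.8, epsilon=0.1):
    q = defaultdict(float)              # Q[(state, action)]

    def policy(state):
        # epsilon-greedy selection over current value estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q[(state, a)])

    for _ in range(episodes):
        e = defaultdict(float)          # eligibility traces, reset each episode
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = policy(s2) if not done else None
            target = r + (gamma * q[(s2, a2)] if not done else 0.0)
            delta = target - q[(s, a)]              # TD error
            e[(s, a)] = 1.0                         # replacing trace for the visited pair
            for key in list(e):
                q[key] += alpha * delta * e[key]    # credit recently visited state-actions
                e[key] *= gamma * lam               # decay all traces
            s, a = s2, a2
    return q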

DN-SARSA(λ) Architecture. The value opposition field is where the TD-error calculation lives. Eligibility trace: this particular implementation uses Item-and-Order working memory (Sohrob). Transient pulse cells do state-transition signaling and keep memory of the last state-action pair.

Epuck in a Color Sequence Learning Task. Four EBs: 1. Find Blue, 2. Find Red, 3. Find Yellow, 4. Find Purple. [Figure: cumulative reward vs. time step and average TD error vs. error measurements; an Explore phase is marked.]

The Eligibility Trace is Important for sequence learning

~~ The Tree of Life ~~ [Diagram: a tree branching from α through repeated choices among A, B, C, D, spanning all possible histories down to Ω.]

Somewhere, a Reward. [Diagram: one history ending ... → C → D receives +100.]

What Caused It? [Diagram: two histories, both ending ... → C → D; one receives +100, the other +0.]

Memory Capacity Can Matter! [Diagram: the history A → B → C → D ends in +100.]

Memory Capacity Can Matter! [Diagram: the history A → B → C → D → +100, marked REINFORCED.]

Memory Capacity Can Matter! [Diagram: the history A → B → C → D → +100 is marked REINFORCED; the history B → A → C → D → +0 is marked NOPE.]

Grid World Analogy

Grid World Analogy

Grid World Analogy

The Eligibility Trace is Essential for our system, but the length of the sequences it can learn is limited. Note: if a sequence is very long, you couldn't learn it either.

Epuck in a Color Sequence Learning Task. Four EBs: 1. Find Blue, 2. Find Red, 3. Find Yellow, 4. Find Purple. [Figure: cumulative reward vs. time step and average TD error vs. error measurements; an Explore phase is marked.]

One Last Thing: Policy Iteration. More than TD value updates are needed to reach an optimal policy. TD value updates constitute policy evaluation: prediction of return for some policy. But we only learn the values of the policy through which the agent is sampling the state-action space. Policy improvement: change the policy to increase the predicted return. We need to interleave policy evaluation and policy improvement to converge to an optimal policy. Epsilon-greedy: more random exploration early, (hopefully) mostly exploitation later, as sketched below.
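A minimal sketch of epsilon-greedy selection with an annealed epsilon, matching the "more exploration early, mostly exploitation later" idea; the schedule parameters and the q/actions names follow the earlier sketches and are purely illustrative.

import random

def epsilon_schedule(step, eps_start=1.0, eps_end=0.05, decay_steps=5000):
    # Linearly anneal epsilon from eps_start to eps_end over decay_steps.
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q, state, actions, step):
    # With probability epsilon pick a random action, otherwise the greedy one.
    if random.random() < epsilon_schedule(step):
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])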

Should have used e-greedy! [Results figure: (b) cumulative reward vs. time step, with Explore and Exploit phases marked; (c) time when the correct sequence was learned vs. run #; (d) average TD error vs. error measurements, including the sequence-finding difficulty of run 6; (e) cumulative reward vs. time step during the Exploit phase.]

Dynamics of Behavioral Transitions

Simulated Environment: Exploration Video See Demo Material

Sequence Learned Transferred to THE REAL WORLD Video See Demo Material

Different Agent+Environment

Start / Possible Transitions. [Transition diagram: behaviors Search (A), Approach (B), Grab (C), Transport (D), Drop (E), plus a FAIL outcome. Reward is given if the sequence A → B → C → D → E is completed.]

Goal Sequence Video See Demo Material

Learning the Sequence (Now with E-Greedy!) Video See Demo Material

Exploration Mishaps Video See Demo Material

Exploration Mishaps Video See Demo Material

Exploration Mishaps Video See Demo Material

Nao Experiment Boris Duran, Gauss Lee, Robert Lowe

Motivation; Dynamic Field Theory; Behavioral Organization in DFT; SARSA / DN-SARSA; Conclusion

Video See Demo Material

Video See Demo Material

Conclusions. Reinforcement learning can enable neural dynamics models to autonomously learn rewarding behavioral sequences. There are some limitations of the current method.

References
Kazerounian, S.*, Luciw, M.*, Richter, M., Sandamirskaya, Y. (2013). Autonomous Reinforcement of Behavioral Sequences in Neural Dynamics. Proceedings of the International Joint Conference on Neural Networks (IJCNN).
Duran, B., Lee, G., Lowe, R. (2013). Learning a DFT-Based Sequence with Reinforcement Learning: A NAO Implementation. Paladyn, Journal of Behavioral Robotics.
Sandamirskaya, Y., Richter, M., Schöner, G. (2011). A Neural-Dynamic Architecture for Behavioral Organization of an Embodied Agent. Proceedings of the International Conference on Development and Learning (ICDL).
Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Additional material from: http://www.inf.ed.ac.uk/teaching/courses/rl/slides/