Continuous reinforcement learning in cognitive robotics


Continuous reinforcement learning in cognitive robotics
Igor Farkaš, CNC research group, Department of Applied Informatics / Centre for Cognitive Science, FMFI, Comenius University in Bratislava
AI seminar, 3.12.2012, Bratislava

Talk outline
- Brief introduction to reinforcement learning
- RL for continuous spaces: CACLA (van Hasselt & Wiering, 2007)
- Master theses:
  - Integration of motor actions and language (T. Malík, 2011)
  - Learning to reach and grasp objects (L. Zdechovan, 2012)
  - Extended version in: Frontiers in Neurorobotics (2012)
  - ŠVK laureate, Rector's Award, ACM Spy gallery of the best theses
  - Learning defense manoeuvres in fencing (J. Blanář, 2011)
  - ŠVK laureate, national ŠVOČ competition (3rd place, Applied Informatics)

Learning paradigms in machine learning
- supervised (with a teacher)
- unsupervised (self-organized)
- reinforcement learning (partial feedback)

Reward and value function
RL was developed for discrete spaces (states, actions). The agent's goal is to maximize the long-term reward
$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$,
where $\gamma \in (0,1)$ is the discount factor (future rewards are valued less).
Value function: $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$
Optimal policy: $V^*(s) = \max_\pi V^\pi(s)$
Bellman equation (value iteration): $V^*(s) = r(s) + \gamma \max_a \sum_{s'} T(s,a,s') V^*(s')$
Model-based, dynamic programming (not RL yet)
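
As a concrete illustration of the value-iteration form of the Bellman equation above, here is a minimal tabular sketch assuming a known transition model T and reward vector r (the array names and toy setup are illustrative, not from the talk):

    import numpy as np

    # Minimal tabular value iteration (illustrative sketch, not the talk's code).
    # T[s, a, s'] = transition probability, r[s] = reward, gamma = discount factor.
    def value_iteration(T, r, gamma=0.9, tol=1e-6):
        n_states, n_actions, _ = T.shape
        V = np.zeros(n_states)
        while True:
            # Bellman backup: V(s) = r(s) + gamma * max_a sum_s' T(s,a,s') V(s')
            V_new = r + gamma * np.max(T @ V, axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new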

Active RL
Control task: exploration enabled (non-greedy behaviour), model-free; the control task is choosing the best actions.
Learning state-action Q-values: $V^*(s) = \max_a Q^*(s,a)$
Optimal policy: $Q^*(s,a) = \max_\pi Q^\pi(s,a)$
Q-learning update (off-policy, converges faster):
$Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \alpha\,[r_{t+1} + \gamma \max_a Q_t(s_{t+1},a) - Q_t(s_t,a_t)]$
SARSA update (on-policy):
$Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \alpha\,[r_{t+1} + \gamma Q_t(s_{t+1},a_{t+1}) - Q_t(s_t,a_t)]$
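
The two updates differ only in how the next-state value is bootstrapped; a minimal tabular sketch (parameter names are illustrative):

    import numpy as np

    # Tabular Q-learning and SARSA updates from the slide (illustrative sketch).
    # Q is an (n_states, n_actions) array; alpha = learning rate, gamma = discount.
    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
        # off-policy: bootstrap from the greedy action in the next state
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
        # on-policy: bootstrap from the action actually taken in the next state
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])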

Actor-critic architecture for RL
Problem formalization: S = states, A = actions, R = rewards, T = transitions: [s(t), a] -> s(t+1)
Agent's goal: maximize the long-term reward (from the environment)
TD (temporal difference) learning uses the TD error
The actor chooses actions; the critic estimates the reward (value) in visited states
Exploration vs. exploitation
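
The TD error is the quantity shared by actor and critic; a one-line sketch of its computation (function names are illustrative):

    # TD(0) error used by the actor-critic architecture (illustrative sketch).
    # critic_V(s) returns the critic's current value estimate of state s.
    def td_error(critic_V, s, r_next, s_next, gamma=0.95):
        # delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
        return r_next + gamma * critic_V(s_next) - critic_V(s)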

Continuous A-C Learning Automaton, CACLA (van Hasselt & Wiering, 2007)
Architecture (figure): state vector input, hidden-layer representation, actor and critic outputs.

    for t = 0, 1, 2, ... do
        choose action a_t from Ac_t(s_t), using exploration
        perform action a_t, get to the next state s_{t+1}
        update critic towards the target: V_{t+1}(s_t) <- r_{t+1} + gamma * V_t(s_{t+1})
        if V_{t+1}(s_t) > V_t(s_t) then
            update the actor's parameters such that Ac_{t+1}(s_t) moves towards a_t
        end if
    end for
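
A compact sketch of this loop with linear function approximators and Gaussian exploration follows; the theses used MLPs for actor and critic, so the linear form, shapes and parameters here are illustrative assumptions:

    import numpy as np

    # Minimal CACLA sketch: linear actor and critic, Gaussian exploration.
    # Illustrative only; the actual models in the theses were MLPs.
    class CACLA:
        def __init__(self, state_dim, action_dim, alpha=0.01, beta=0.01,
                     gamma=0.95, sigma=0.1):
            self.W_actor = np.zeros((action_dim, state_dim))   # actor: a = W_actor @ s
            self.w_critic = np.zeros(state_dim)                # critic: V(s) = w_critic @ s
            self.alpha, self.beta, self.gamma, self.sigma = alpha, beta, gamma, sigma

        def act(self, s):
            # Gaussian exploration around the actor's deterministic output
            return self.W_actor @ s + np.random.normal(0.0, self.sigma, self.W_actor.shape[0])

        def update(self, s, a, r, s_next):
            v_s = self.w_critic @ s
            target = r + self.gamma * (self.w_critic @ s_next)
            # critic moves towards the TD target
            self.w_critic += self.alpha * (target - v_s) * s
            # actor is updated only if the explored action was better than expected
            if target > v_s:
                self.W_actor += self.beta * np.outer(a - self.W_actor @ s, s)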

Example 1: Learning object-directed actions (MSc thesis of T. Malík, 2011)
The simulated iCub robot learns object-directed motor actions; the experimental design was inspired by Sugita & Tani (2005)
Sensorimotor coordination involved
iCub: a faithful physics simulation, body of a 3-year-old child, 52 DoF
Link to language: grounding the meaning of words

Integration of action learning and language 9

Module 1: target object localizer
Image processing using OpenCV
Input for a multi-layer perceptron
Trained with a teacher (supervised)
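
A rough sketch of such a pipeline, assuming colour-based segmentation in OpenCV feeding a small supervised MLP (scikit-learn); the thresholds, features and network used in the thesis may differ:

    import cv2
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Illustrative target-object localizer: colour segmentation (OpenCV) producing
    # simple blob features that a supervised MLP maps to a target position.
    def object_features(bgr_image, lower_hsv=(0, 120, 70), upper_hsv=(10, 255, 255)):
        hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, np.array(lower_hsv), np.array(upper_hsv))
        m = cv2.moments(mask)
        if m["m00"] == 0:
            return np.zeros(3)                                 # object not visible
        return np.array([m["m10"] / m["m00"], m["m01"] / m["m00"], m["m00"]])

    # Supervised ("with teacher") training: features -> target position
    localizer = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000)
    # localizer.fit(np.stack([object_features(img) for img in images]), target_positions)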

Module 2: learning action execution
Figure: evolution of the reward after training
The agent learned 3 actions (POINT, TOUCH, PUSH) towards the target object at different locations (left, middle, right)

Module 3: executed action naming 12

Example 1: summary
The agent successfully learned to look at the target position given a cue about the object's shape or colour (module 1)
The agent learned to execute actions (given by the action name and target position) via interaction with the environment (module 2)
The agent learned to name executed actions (module 3)
What's missing?
- link between execution and observation (mirror system)
- test for scaling up
- grasping and manipulation to be added

Example 2: iCub learns to reach and grasp
Objects of various sizes and orientations, at various positions
Reaching and grasping modules with the A-C architecture (MLPs), trained separately
Right arm used
Reward based on visual, haptic and pressure information

Reaching: Actor architecture
Critic has the same input vector
n = 9 neurons per dimension
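
One common way to obtain a fixed number of neurons per input dimension is Gaussian population (coarse) coding; the sketch below illustrates that idea for n = 9 units per dimension, as an assumption about what the slide refers to:

    import numpy as np

    # Gaussian population coding: each scalar input dimension is represented by
    # n equally spaced radial-basis units (an assumption about "n = 9 neurons
    # per dimension"; shown for illustration only).
    def population_code(x, low, high, n=9, width=None):
        centers = np.linspace(low, high, n)
        if width is None:
            width = (high - low) / (n - 1)
        return np.exp(-0.5 * ((x - centers) / width) ** 2)

    # Example: encode a joint angle of 0.3 rad within a working range of (-1, 1)
    # activations = population_code(0.3, low=-1.0, high=1.0, n=9)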

Reaching: Reward function design 16

CACLA modifications tested
Two modifications were tried to speed up training. Each variant uses a different condition for updating the actor:
- original CACLA: if V_{t+1}(s_t) > V_t(s_t)
- modified CACLA: a condition on V_t(s_{t+1}) after the explored action
- reward CACLA: a condition based on the reward
Assumption: the reward is available at each step during the episode.

Grasping: Actor architecture Critic has the same input vector 18

Grasp features (Oztop & Arbib, 2002) 19

Grasping: our state vector 20

Grasping: reward function design 21

Results: Reaching 22

Results: Grasping 23

Example 2: Summary
iCub learned to reach and grasp objects of various sizes, orientations and positions with a certain accuracy.
3 types of grasping were learned: power, side, precision (roughly in this order)
We compared 3 versions of the CACLA algorithm for reaching: final performance was roughly the same.
Final performance is also quite robust w.r.t. some model parameters (learning rate, degree of exploration).
(Appropriate) reward drives learning.
Further improvements should be possible.

Example 3: Fencing
The trainer attacks the agent with a sword, using a limited repertoire of preprogrammed actions (with possible variations)
The agent learned to defend itself
The agent uses CACLA for learning
4 DoF of the arm are used

3 types of attack: from the left, from the middle, from the right

Model architecture 27

CACLA and its modification
Trajectory generation for the trainer: Bezier curves
Reward design, 2 major components: 1 - distance between the swords; distance from the defender's body
Original CACLA: problem with the condition for adapting the actor (too weak); 20% testing error
Modification MCACLA: both actions are compared, mental simulation involved; improved the actor's performance (up to 100%)
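
The trainer's attack trajectories can be generated with cubic Bezier curves; a small sketch follows (the curve degree and control points are illustrative assumptions, not the thesis' actual parameters):

    import numpy as np

    # Cubic Bezier curve for generating smooth sword trajectories
    # (illustrative sketch; the control points below are made up).
    def bezier(p0, p1, p2, p3, n_points=50):
        t = np.linspace(0.0, 1.0, n_points)[:, None]
        return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
                + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

    # Example: a 3D sword-tip trajectory from a start point to a target point
    # traj = bezier(np.array([0.3, -0.2, 0.5]), np.array([0.4, 0.0, 0.7]),
    #               np.array([0.5, 0.2, 0.7]), np.array([0.6, 0.3, 0.5]))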

Modified CACLA 29

Original and modified CACLA 30

Performance comparison 31

Conclusion
CACLA: a new algorithm for RL in continuous spaces
Cognitively plausible, rather slow; improvements should be possible
Reward design is an important feature
Thank you for your attention.
Thanks to my former students: Tomáš Malík, Lukáš Zdechovan, Jaroslav Blanář