Deep reinforcement learning


Function approximation
So far, we've assumed a lookup table representation for the utility function U(s) or the action-utility function Q(s,a). This does not work if the state space is very large or continuous. Alternative idea: approximate the utilities or Q values using parametric functions and automatically learn the parameters:
$V(s) \approx \hat{V}(s;w), \quad Q(s,a) \approx \hat{Q}(s,a;w)$

Deep Q learning
Train a deep neural network to output Q values. Source: D. Silver
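As a concrete illustration of such a network (not part of the original slides), here is a minimal PyTorch sketch that maps a state vector to one Q value per action; the class name, layer sizes, and the state_dim/num_actions parameters are assumptions for illustration.

import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to a vector of Q values, one entry per action."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),   # Q(s, a; w) for every action a
        )

    def forward(self, state):
        return self.net(state)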

Deep Q learning
Regular TD update: nudge Q(s,a) towards the target:
$Q(s,a) \leftarrow Q(s,a) + \alpha\big(R(s) + \gamma \max_{a'} Q(s',a') - Q(s,a)\big)$
Deep Q learning: encourage the estimate to match the target by minimizing the squared error:
$L(w) = \big(R(s) + \gamma \max_{a'} Q(s',a';w) - Q(s,a;w)\big)^2$
where the first term inside the parentheses is the target and the second is the estimate.

Deep Q learning
Regular TD update: nudge Q(s,a) towards the target:
$Q(s,a) \leftarrow Q(s,a) + \alpha\big(R(s) + \gamma \max_{a'} Q(s',a') - Q(s,a)\big)$
Deep Q learning: encourage the estimate (second term) to match the target (first term) by minimizing the squared error:
$L(w) = \big(R(s) + \gamma \max_{a'} Q(s',a';w) - Q(s,a;w)\big)^2$
Compare to supervised learning:
$L(w) = \big(y - f(x;w)\big)^2$
Key difference: the target in Q learning is also moving!

Online Q learning algorithm
Observe experience (s, a, s').
Compute target: $y = R(s) + \gamma \max_{a'} Q(s',a';w)$
Update weights to reduce the error: $L = \big(y - Q(s,a;w)\big)^2$
Gradient (dropping the constant factor of 2): $\nabla_w L = \big(Q(s,a;w) - y\big)\,\nabla_w Q(s,a;w)$
Weight update: $w \leftarrow w - \alpha \nabla_w L$
This is called stochastic gradient descent (SGD).
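A minimal sketch of one such online update, assuming a differentiable network like the QNetwork sketch above; the target is detached so that only the estimate is differentiated, matching the gradient on this slide. All names (q_net, optimizer) are illustrative.

import torch

def online_q_update(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    """One online Q-learning step: reduce (y - Q(s,a;w))^2 by stochastic gradient descent."""
    with torch.no_grad():                       # target y is treated as a constant
        y = r + gamma * q_net(s_next).max()
    q_sa = q_net(s)[a]                          # current estimate Q(s, a; w)
    loss = (y - q_sa) ** 2
    optimizer.zero_grad()
    loss.backward()                             # gradient flows only through the estimate
    optimizer.step()                            # w <- w - alpha * grad_w L
    return loss.item()

Here s and s_next would be 1-D state tensors, a an integer action index, and optimizer something like torch.optim.SGD(q_net.parameters(), lr=alpha).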

Dealing with training instability
Challenges:
Target values are not fixed.
Successive experiences are correlated and dependent on the policy.
The policy may change rapidly with slight changes to parameters, leading to drastic changes in the data distribution.
Solutions:
Freeze the target Q network.
Use experience replay.
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015
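One simple way to realize the "freeze the target Q network" idea (a sketch, not the exact implementation from the paper): keep a separate copy of the Q network for computing targets and refresh it from the online network only every so often. The names and the sync interval are illustrative.

import copy

def make_target_net(q_net):
    """Create a frozen copy of the online Q network, used only to compute targets."""
    return copy.deepcopy(q_net)

def maybe_sync(q_net, target_net, step, sync_every=1000):
    """Copy the online weights into the target network every `sync_every` steps."""
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())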

Experience replay
At each time step:
Take action $a_t$ according to an epsilon-greedy policy.
Store experience $(s_t, a_t, r_{t+1}, s_{t+1})$ in the replay memory buffer.
Randomly sample a mini-batch of experiences from the buffer.
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015

Experience replay
At each time step:
Take action $a_t$ according to an epsilon-greedy policy.
Store experience $(s_t, a_t, r_{t+1}, s_{t+1})$ in the replay memory buffer.
Randomly sample a mini-batch of experiences from the buffer.
Perform an update to reduce the objective function
$E_{s,a,s'}\Big[\big(R(s) + \gamma \max_{a'} Q(s',a';w^-) - Q(s,a;w)\big)^2\Big]$
Keep the parameters $w^-$ of the target network fixed, updating them only every once in a while.
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015
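A sketch of the replay memory and a mini-batch update against the frozen target network, reusing the illustrative names from the sketches above; the buffer size, batch size, and the per-example loop are simplifications.

import random
from collections import deque
import torch

replay_buffer = deque(maxlen=100_000)           # replay memory

def store(s, a, r, s_next):
    replay_buffer.append((s, a, r, s_next))

def replay_update(q_net, target_net, optimizer, batch_size=32, gamma=0.99):
    """Sample a random mini-batch and minimize the squared TD error against the frozen target."""
    batch = random.sample(replay_buffer, batch_size)
    loss = 0.0
    for s, a, r, s_next in batch:               # looped for clarity; real code would vectorize
        with torch.no_grad():
            y = r + gamma * target_net(s_next).max()
        loss = loss + (y - q_net(s)[a]) ** 2
    loss = loss / batch_size
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()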

Deep Q learning in Atari Mnih et al. Human-level control through deep reinforcement learning, Nature 2015

Deep Q learning in Atari
End-to-end learning of Q(s,a) from pixels s.
Output is Q(s,a) for 18 joystick/button configurations: $Q(s,a_1), Q(s,a_2), \dots, Q(s,a_{18})$.
Reward is the change in score for that step.
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015

Deep Q learning in Atari
Input state s is a stack of raw pixels from the last 4 frames.
Network architecture and hyperparameters are fixed across all games.
Mnih et al. Human-level control through deep reinforcement learning, Nature 2015
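For concreteness, here is a convolutional Q network in the spirit of the Nature 2015 architecture (a stack of 4 grayscale 84x84 frames in, one Q value per action out). The exact layer sizes are quoted from memory of the paper and should be treated as approximate.

import torch.nn as nn

class AtariQNet(nn.Module):
    """Convolutional Q network over a stack of 4 grayscale 84x84 frames."""
    def __init__(self, num_actions=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),        # one Q value per joystick/button configuration
        )

    def forward(self, frames):                  # frames: (batch, 4, 84, 84)
        return self.head(self.features(frames))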

Deep Q learning in Atari Mnih et al. Human-level control through deep reinforcement learning, Nature 2015

Breakout demo https://www.youtube.com/watch?v=tmpftpjtdgg

Policy gradient methods
Learning the policy directly can be much simpler than learning Q values. We can train a neural network to output stochastic policies, i.e., probabilities of taking each action in a given state.
Softmax policy:
$\pi(s,a;u) = \dfrac{\exp\big(f(s,a;u)\big)}{\sum_{a'} \exp\big(f(s,a';u)\big)}$
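A minimal sketch of such a softmax policy: a small network produces a score f(s,a;u) for each action and a softmax turns the scores into probabilities; the sample method also returns log pi(s,a;u), which the policy-gradient updates below need. Names and sizes are illustrative.

import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """pi(s, a; u) = exp(f(s,a;u)) / sum_a' exp(f(s,a';u))"""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),     # scores f(s, a; u)
        )

    def forward(self, state):
        return torch.softmax(self.f(state), dim=-1)

    def sample(self, state):
        probs = self.forward(state)
        a = torch.multinomial(probs, 1).item()  # draw an action from pi(. | s)
        return a, torch.log(probs[a])           # keep log pi(s, a; u) for the gradient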

Actor-critic algorithm
Define the objective function as the total discounted reward:
$J(u) = E\big[R_1 + \gamma R_2 + \gamma^2 R_3 + \dots\big]$
The gradient for a stochastic policy is given by
$\nabla_u J = E\big[\nabla_u \log \pi(s,a;u)\, Q^{\pi}(s,a;w)\big]$
Actor network update: $u \leftarrow u + \alpha \nabla_u J$
Critic network update: use Q learning (following the actor's policy).
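A sketch of one actor-critic step under these formulas: the critic Q(s,a;w) takes a TD step along the actor's own trajectory, and the actor takes a gradient step on log pi weighted by the critic's value. All names are illustrative, and the critic is assumed to return one Q value per action (like the QNetwork sketch above); logp is the log-probability returned by something like SoftmaxPolicy.sample, which carries the gradient with respect to u.

import torch

def actor_critic_step(critic, pi_opt, q_opt, s, a, logp, r, s_next, a_next, gamma=0.99):
    """One on-policy actor-critic update (sketch)."""
    # Critic: TD update of Q(s,a;w) following the actor's policy.
    with torch.no_grad():
        target = r + gamma * critic(s_next)[a_next]
    td_loss = (target - critic(s)[a]) ** 2
    q_opt.zero_grad(); td_loss.backward(); q_opt.step()

    # Actor: ascend E[ grad_u log pi(s,a;u) * Q^pi(s,a;w) ] by minimizing the negative.
    actor_loss = -logp * critic(s)[a].detach()
    pi_opt.zero_grad(); actor_loss.backward(); pi_opt.step()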

Advantage actor-critic
The raw Q value is less meaningful than whether the reward is better or worse than what you expect to get. Introduce an advantage function that subtracts a baseline from all Q values:
$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$
Estimate V using a value network; the advantage can then be estimated as $R(s) + \gamma V^{\pi}(s') - V^{\pi}(s)$.
Advantage actor-critic update:
$\nabla_u J = E\big[\nabla_u \log \pi(s,a;u)\, A^{\pi}(s,a;w)\big]$
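A sketch of the advantage variant: a value network V(s;w) serves as the baseline, the advantage is estimated as R(s) + gamma V(s') - V(s), and the actor weights log pi by this advantage instead of the raw Q value. Names are illustrative.

import torch

def advantage_actor_critic_step(value_net, pi_opt, v_opt, s, logp, r, s_next, gamma=0.99):
    """One advantage actor-critic update (sketch)."""
    v_s = value_net(s)
    with torch.no_grad():
        target = r + gamma * value_net(s_next)   # R(s) + gamma * V(s')
    advantage = target - v_s                     # A ~ R(s) + gamma V(s') - V(s)

    # Critic: regress V(s) toward the bootstrapped target.
    v_loss = advantage.pow(2).mean()
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # Actor: ascend E[ grad_u log pi(s,a;u) * A(s,a) ].
    pi_loss = -logp * advantage.detach()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()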

Asynchronous advantage actor-critic (A3C)
Multiple agents (Agent 1 through Agent n) each collect their own experience and perform local updates to the value network V and the policy π, asynchronously updating the shared global parameters.
Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016
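A rough sketch of the asynchronous scheme: several worker threads each run their own environment and apply advantage actor-critic updates directly to the shared networks. This simplifies A3C (the paper's workers keep local copies of the networks and accumulate gradients over short rollouts); the names, the classic gym-style env API, and the reuse of the sketches above are all assumptions.

import threading
import torch

def a3c_worker(policy, value_net, make_env, num_steps, gamma=0.99):
    """One worker: gathers its own experience and updates the shared policy/value networks."""
    env = make_env()                             # each worker gets its own environment
    pi_opt = torch.optim.RMSprop(policy.parameters(), lr=1e-4)
    v_opt = torch.optim.RMSprop(value_net.parameters(), lr=1e-4)
    s = torch.as_tensor(env.reset(), dtype=torch.float32)
    for _ in range(num_steps):
        a, logp = policy.sample(s)
        s_next, r, done, _ = env.step(a)
        s_next = torch.as_tensor(s_next, dtype=torch.float32)
        advantage_actor_critic_step(value_net, pi_opt, v_opt, s, logp, r, s_next, gamma)
        s = torch.as_tensor(env.reset(), dtype=torch.float32) if done else s_next

def launch_workers(policy, value_net, make_env, n_workers=8, num_steps=100_000):
    """Start n workers that update the shared parameters asynchronously."""
    threads = [threading.Thread(target=a3c_worker, args=(policy, value_net, make_env, num_steps))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    return threads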

Asynchronous advantage actor-critic (A3C) Mean and median human-normalized scores over 57 Atari games Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

Asynchronous advantage actor-critic (A3C) TORCS car racing simulation video Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

Asynchronous advantage actor-critic (A3C) Motor control tasks video Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. ICML 2016

Playing Go
Go is a known (and deterministic) environment. Therefore, learning to play Go involves solving a known MDP. Key challenges: huge state and action space, long sequences, sparse rewards.

Review: AlphaGo
Policy network: initialized by supervised training on a large number of human games.
Value network: trained to predict the outcome of the game based on self-play.
The networks are used to guide Monte Carlo tree search (MCTS).
D. Silver et al., Mastering the Game of Go with Deep Neural Networks and Tree Search, Nature 529, January 2016

AlphaGo Zero
A fancier architecture (deep residual networks).
No hand-crafted features at all.
A single network predicts both value and policy.
The network is trained entirely by self-play, starting with random moves.
MCTS is used inside the reinforcement learning loop, not outside.
D. Silver et al., Mastering the Game of Go without Human Knowledge, Nature 550, October 2017
https://deepmind.com/blog/alphago-zero-learning-scratch/

AlphaGo Zero
Given a position, the neural network outputs both move probabilities P and a value V (probability of winning).
In each position, MCTS is conducted to return refined move probabilities π and the game winner Z.
The neural network parameters are updated to make P and V better match π and Z.
Reminiscent of policy iteration: self-play with MCTS is policy evaluation; updating the network towards the MCTS output is policy improvement.
D. Silver et al., Mastering the Game of Go without Human Knowledge, Nature 550, October 2017
https://deepmind.com/blog/alphago-zero-learning-scratch/
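The update above can be written as a single loss that pulls V toward the game winner Z and P toward the MCTS move probabilities π (the paper also adds an L2 regularization term). A minimal sketch; `net(position)` returning move logits and a scalar value is an assumption for illustration.

import torch.nn.functional as F

def alphago_zero_loss(net, position, mcts_pi, z):
    """Make the network's (P, V) match the MCTS probabilities pi and the game winner Z."""
    logits, v = net(position)                   # move logits and value estimate
    value_loss = (z - v) ** 2                   # V should predict the winner Z
    policy_loss = -(mcts_pi * F.log_softmax(logits, dim=-1)).sum()  # cross-entropy with pi
    return value_loss + policy_loss             # plus weight decay in the paper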

AlphaGo Zero D. Silver et al., Mastering the Game of Go without Human Knowledge, Nature 550, October 2017 https://deepmind.com/blog/alphago-zero-learning-scratch/

AlphaGo Zero
It's also more efficient than older engines!
D. Silver et al., Mastering the Game of Go without Human Knowledge, Nature 550, October 2017 https://deepmind.com/blog/alphago-zero-learning-scratch/

Imitation learning
In some applications, you cannot bootstrap yourself from random policies:
High-dimensional state and action spaces where most random trajectories fail miserably.
Expensive to evaluate policies in the physical world, especially in cases of failure.
Solution: learn to imitate sample trajectories or demonstrations. This is also helpful when there is no natural reward formulation.
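The simplest instance of this idea (not specific to the paper below) is behavior cloning: treat the demonstrations as a supervised dataset of (state, expert action) pairs and fit a stochastic policy by maximum likelihood. A minimal sketch with illustrative names, assuming a policy like the SoftmaxPolicy above.

import torch

def behavior_cloning(policy, demos, epochs=10, lr=1e-3):
    """Fit a stochastic policy to expert (state, action) pairs by maximum likelihood."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for s, expert_a in demos:               # demos: iterable of (state tensor, action index)
            log_probs = torch.log(policy(s))    # log pi(. | s)
            loss = -log_probs[expert_a]         # maximize log pi(expert_a | s)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy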

Learning visuomotor policies
Underlying state x: true object position, robot configuration.
Observations o: image pixels.
Two-part approach:
Learn a guiding policy π(a|x) using trajectory-centric RL and control techniques.
Learn a visuomotor policy π(a|o) by imitating π(a|x).
S. Levine et al. End-to-end training of deep visuomotor policies. JMLR 2016

Learning visuomotor policies Neural network architecture S. Levine et al. End-to-end training of deep visuomotor policies. JMLR 2016

Learning visuomotor policies Overview video, training video S. Levine et al. End-to-end training of deep visuomotor policies. JMLR 2016

Summary
Deep Q learning
Policy gradient methods: actor-critic, advantage actor-critic, A3C
Policy iteration for AlphaGo
Imitation learning for visuomotor policies