Reinforcement Learning. Fei-Fei Li & Justin Johnson & Serena Yeung. Lecture 14-1


Lecture 14: Reinforcement Learning Lecture 14-1

Administrative Grades: - Midterm grades released last night, see Piazza for more information and statistics - A2 and milestone grades scheduled for later this week Lecture 14-2

Administrative Projects: - All teams must register their project, see Piazza for registration form - Tiny ImageNet evaluation server is online Lecture 14-3

Administrative Survey: - Please fill out the course survey! - Link on Piazza or https://goo.gl/forms/eqpvw7ipjqapsdkb2 Lecture 14-4

So far Supervised Learning Data: (x, y) x is data, y is label Cat Goal: Learn a function to map x -> y Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc. Classification This image is CC0 public domain Lecture 14-5

So far Unsupervised Learning Data: x Just data, no labels! 1-d density estimation Goal: Learn some underlying hidden structure of the data Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc. 2-d density estimation 2-d density images left and right are CC0 public domain Lecture 14-6

Today: Reinforcement Learning Problems involving an agent interacting with an environment, which provides numeric reward signals Goal: Learn how to take actions in order to maximize reward Lecture 14-7

Overview - What is Reinforcement Learning? - Markov Decision Processes - Q-Learning - Policy Gradients Lecture 14-8

Reinforcement Learning Agent Environment Lecture 14-9

Reinforcement Learning Agent State st Environment Lecture 14-10

Reinforcement Learning Agent State st Action at Environment Lecture 14-11

Reinforcement Learning Agent State st Reward rt Action at Environment Lecture 14-12

Reinforcement Learning Agent State st Reward rt Next state st+1 Action at Environment Lecture 14-13

Cart-Pole Problem Objective: Balance a pole on top of a movable cart State: angle, angular speed, position, horizontal velocity Action: horizontal force applied on the cart Reward: 1 at each time step if the pole is upright This image is CC0 public domain Lecture 14-14
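As a concrete sketch of this agent-environment loop for Cart-Pole (not from the slides; it assumes the classic OpenAI Gym interface, where reset() returns an observation and step() returns a 4-tuple):

import gym

# Minimal agent-environment loop for Cart-Pole under a random policy.
# Assumes the classic (pre-0.26) Gym API; newer gymnasium versions return
# (obs, info) from reset() and a 5-tuple from step().
env = gym.make("CartPole-v0")
state = env.reset()              # [cart position, cart velocity, pole angle, pole angular velocity]
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()             # random action: push cart left or right
    state, reward, done, info = env.step(action)   # reward is +1 for every step the pole stays up
    total_reward += reward
print("episode return:", total_reward)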

Robot Locomotion Objective: Make the robot move forward State: Angle and position of the joints Action: Torques applied on joints Reward: 1 at each time step if upright + forward movement Lecture 14-15

Atari Games Objective: Complete the game with the highest score State: Raw pixel inputs of the game state Action: Game controls e.g. Left, Right, Up, Down Reward: Score increase/decrease at each time step Lecture 14-16

Go Objective: Win the game! State: Position of all pieces Action: Where to put the next piece down Reward: 1 if win at the end of the game, 0 otherwise This image is CC0 public domain Lecture 14-17

How can we mathematically formalize the RL problem? Agent State st Reward rt Next state st+1 Action at Environment Lecture 14-18

Markov Decision Process - Mathematical formulation of the RL problem Markov property: Current state completely characterises the state of the world Defined by (S, A, R, P, γ): S: set of possible states, A: set of possible actions, R: distribution of reward given (state, action) pair, P: transition probability, i.e. distribution over next state given (state, action) pair, γ: discount factor Lecture 14-19

Markov Decision Process - At time step t=0, environment samples initial state s0 ~ p(s0) Then, for t=0 until done: - Agent selects action at - Environment samples reward rt ~ R(· | st, at) - Environment samples next state st+1 ~ P(· | st, at) - Agent receives reward rt and next state st+1 - A policy π is a function from S to A that specifies what action to take in each state Objective: find policy π* that maximizes the cumulative discounted reward Σ_{t≥0} γ^t rt Lecture 14-20

A simple MDP: Grid World states; actions = { 1. right, 2. left, 3. up, 4. down } Set a negative reward for each transition (e.g. r = -1) Objective: reach one of the terminal states (greyed out) in the least number of actions Lecture 14-21

A simple MDP: Grid World Random Policy Optimal Policy Lecture 14-22

The optimal policy π* We want to find the optimal policy π* that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probabilities, ...)? Lecture 14-23

The optimal policy π* We want to find the optimal policy π* that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probabilities, ...)? Maximize the expected sum of rewards! Formally: π* = arg max_π E[ Σ_{t≥0} γ^t rt | π ], with s0 ~ p(s0), at ~ π(·|st), st+1 ~ p(·|st, at) Lecture 14-24

Definitions: Value function and Q-value function Following a policy π produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, ... Lecture 14-25

Definitions: Value function and Q-value function Following a policy π produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, ... How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s: V^π(s) = E[ Σ_{t≥0} γ^t rt | s0 = s, π ] Lecture 14-26

Definitions: Value function and Q-value function Following a policy π produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, ... How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s: V^π(s) = E[ Σ_{t≥0} γ^t rt | s0 = s, π ] How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy: Q^π(s,a) = E[ Σ_{t≥0} γ^t rt | s0 = s, a0 = a, π ] Lecture 14-27

Bellman equation The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair: Q*(s,a) = max_π E[ Σ_{t≥0} γ^t rt | s0 = s, a0 = a, π ] Lecture 14-28

Bellman equation The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair: Q*(s,a) = max_π E[ Σ_{t≥0} γ^t rt | s0 = s, a0 = a, π ] Q* satisfies the following Bellman equation: Q*(s,a) = E_{s'}[ r + γ max_{a'} Q*(s',a') | s, a ] Intuition: if the optimal state-action values for the next time-step Q*(s',a') are known, then the optimal strategy is to take the action that maximizes the expected value of r + γ Q*(s',a') Lecture 14-29

Bellman equation The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair: Q*(s,a) = max_π E[ Σ_{t≥0} γ^t rt | s0 = s, a0 = a, π ] Q* satisfies the following Bellman equation: Q*(s,a) = E_{s'}[ r + γ max_{a'} Q*(s',a') | s, a ] Intuition: if the optimal state-action values for the next time-step Q*(s',a') are known, then the optimal strategy is to take the action that maximizes the expected value of r + γ Q*(s',a') The optimal policy π* corresponds to taking the best action in any state as specified by Q* Lecture 14-30

Solving for the optimal policy Value iteration algorithm: Use the Bellman equation as an iterative update: Q_{i+1}(s,a) = E[ r + γ max_{a'} Q_i(s',a') | s, a ] Q_i will converge to Q* as i -> infinity Lecture 14-31

Solving for the optimal policy Value iteration algorithm: Use the Bellman equation as an iterative update: Q_{i+1}(s,a) = E[ r + γ max_{a'} Q_i(s',a') | s, a ] Q_i will converge to Q* as i -> infinity What's the problem with this? Lecture 14-32

Solving for the optimal policy Value iteration algorithm: Use the Bellman equation as an iterative update: Q_{i+1}(s,a) = E[ r + γ max_{a'} Q_i(s',a') | s, a ] Q_i will converge to Q* as i -> infinity What's the problem with this? Not scalable. Must compute Q(s,a) for every state-action pair. If the state is e.g. the current game state's pixels, it is computationally infeasible to compute for the entire state space! Lecture 14-33

Solving for the optimal policy Value iteration algorithm: Use the Bellman equation as an iterative update: Q_{i+1}(s,a) = E[ r + γ max_{a'} Q_i(s',a') | s, a ] Q_i will converge to Q* as i -> infinity What's the problem with this? Not scalable. Must compute Q(s,a) for every state-action pair. If the state is e.g. the current game state's pixels, it is computationally infeasible to compute for the entire state space! Solution: use a function approximator to estimate Q(s,a). E.g. a neural network! Lecture 14-34
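To make the scalability problem concrete, here is a minimal tabular value-iteration sketch for a made-up toy MDP (the reward table R and transition table P below are placeholders, not from the lecture). The table Q has one entry per (state, action) pair, which is exactly what becomes infeasible when the state is raw game pixels:

import numpy as np

# Toy MDP: 3 states, 2 actions (numbers are made up for illustration).
n_states, n_actions, gamma = 3, 2, 0.9
R = np.array([[0.0, 1.0], [0.0, 0.0], [5.0, 0.0]])                       # R[s, a]: expected reward
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s']: transition probs

Q = np.zeros((n_states, n_actions))
for i in range(1000):
    # Bellman backup: Q_{i+1}(s,a) = E[ r + gamma * max_{a'} Q_i(s',a') ]
    Q_next = R + gamma * P @ Q.max(axis=1)
    if np.abs(Q_next - Q).max() < 1e-8:
        break
    Q = Q_next
print(Q)   # converges to Q*, but only because the whole table fits in memory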

Solving for the optimal policy: Q-learning Q-learning: Use a function approximator to estimate the action-value function: Q(s,a;θ) ≈ Q*(s,a) Lecture 14-35

Solving for the optimal policy: Q-learning Q-learning: Use a function approximator to estimate the action-value function: Q(s,a;θ) ≈ Q*(s,a) If the function approximator is a deep neural network => deep Q-learning! Lecture 14-36

Solving for the optimal policy: Q-learning Q-learning: Use a function approximator to estimate the action-value function: Q(s,a;θ) ≈ Q*(s,a), where θ are the function parameters (weights) If the function approximator is a deep neural network => deep Q-learning! Lecture 14-37

Solving for the optimal policy: Q-learning Remember: want to find a Q-function that satisfies the Bellman equation: Q*(s,a) = E_{s'}[ r + γ max_{a'} Q*(s',a') | s, a ] Lecture 14-38

Solving for the optimal policy: Q-learning Remember: want to find a Q-function that satisfies the Bellman equation: Q*(s,a) = E_{s'}[ r + γ max_{a'} Q*(s',a') | s, a ] Forward Pass Loss function: L_i(θ_i) = E_{s,a}[ (y_i - Q(s,a;θ_i))^2 ], where y_i = E_{s'}[ r + γ max_{a'} Q(s',a';θ_{i-1}) | s, a ] Lecture 14-39

Solving for the optimal policy: Q-learning Remember: want to find a Q-function that satisfies the Bellman equation: Q*(s,a) = E_{s'}[ r + γ max_{a'} Q*(s',a') | s, a ] Forward Pass Loss function: L_i(θ_i) = E_{s,a}[ (y_i - Q(s,a;θ_i))^2 ], where y_i = E_{s'}[ r + γ max_{a'} Q(s',a';θ_{i-1}) | s, a ] Backward Pass Gradient update (with respect to Q-function parameters θ): ∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (r + γ max_{a'} Q(s',a';θ_{i-1}) - Q(s,a;θ_i)) ∇_{θ_i} Q(s,a;θ_i) ] Lecture 14-40

Solving for the optimal policy: Q-learning Remember: want to find a Q-function that satisfies the Bellman equation: Q*(s,a) = E_{s'}[ r + γ max_{a'} Q*(s',a') | s, a ] Forward Pass Loss function: L_i(θ_i) = E_{s,a}[ (y_i - Q(s,a;θ_i))^2 ], where y_i = E_{s'}[ r + γ max_{a'} Q(s',a';θ_{i-1}) | s, a ] Backward Pass Gradient update (with respect to Q-function parameters θ): ∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (r + γ max_{a'} Q(s',a';θ_{i-1}) - Q(s,a;θ_i)) ∇_{θ_i} Q(s,a;θ_i) ] Iteratively try to make the Q-value close to the target value (y_i) it should have, if the Q-function corresponds to the optimal Q* (and optimal policy π*) Lecture 14-41
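A minimal PyTorch-style sketch of this forward/backward pass (q_net, target_net, and the batch layout are assumptions for illustration; target_net holds the old parameters θ_{i-1}):

import torch
import torch.nn.functional as F

def q_learning_loss(q_net, target_net, batch, gamma=0.99):
    # batch: (s, a, r, s_next, done) as tensors; a is int64, done is 0/1 float
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; theta_i)
    with torch.no_grad():                                      # target y_i uses old weights theta_{i-1}
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)                                 # L_i = E[(y_i - Q(s,a;theta_i))^2]

Calling loss.backward() and optimizer.step() on this loss performs the gradient update shown above.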

[Mnih et al. NIPS Workshop 2013; Nature 2015] Case Study: Playing Atari Games Objective: Complete the game with the highest score State: Raw pixel inputs of the game state Action: Game controls e.g. Left, Right, Up, Down Reward: Score increase/decrease at each time step Lecture 14-42

[Mnih et al. NIPS Workshop 2013; Nature 2015] Q-network Architecture: Q(s,a;θ): neural network with weights θ FC-4 (Q-values) FC-256 32 4x4 conv, stride 2 16 8x8 conv, stride 4 Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping) Lecture 14-43

[Mnih et al. NIPS Workshop 2013; Nature 2015] Q-network Architecture: Q(s,a;θ): neural network with weights θ FC-4 (Q-values) FC-256 32 4x4 conv, stride 2 16 8x8 conv, stride 4 Input: state st Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping) Lecture 14-44

[Mnih et al. NIPS Workshop 2013; Nature 2015] Q-network Architecture: Q(s,a;θ): neural network with weights θ FC-4 (Q-values) FC-256 32 4x4 conv, stride 2 Familiar conv layers, FC layer 16 8x8 conv, stride 4 Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping) Lecture 14-45

[Mnih et al. NIPS Workshop 2013; Nature 2015] Q-network Architecture: Q(s,a;θ): neural network with weights θ FC-4 (Q-values) FC-256 32 4x4 conv, stride 2 Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4) 16 8x8 conv, stride 4 Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping) Lecture 14-46

[Mnih et al. NIPS Workshop 2013; Nature 2015] Q-network Architecture: Q(s,a;θ): neural network with weights θ FC-4 (Q-values) FC-256 32 4x4 conv, stride 2 Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4) 16 8x8 conv, stride 4 Number of actions between 4-18 depending on Atari game Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping) Lecture 14-47

[Mnih et al. NIPS Workshop 2013; Nature 2015] Q-network Architecture: Q(s,a;θ): neural network with weights θ FC-4 (Q-values) FC-256 A single feedforward pass to compute Q-values for all actions from the current state => efficient! 32 4x4 conv, stride 2 Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4) 16 8x8 conv, stride 4 Number of actions between 4-18 depending on Atari game Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping) Lecture 14-48
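A sketch of this Q-network in PyTorch (layer sizes follow the slide; the ReLU activations and the flattened size of 32*9*9 follow the usual DQN setup and are assumptions here):

import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # input: 84x84x4 stack of frames
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 4x4 conv, stride 2
            nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # FC-4: Q(st,a1), ..., Q(st,a4)
        )

    def forward(self, x):                                # x: (batch, 4, 84, 84)
        return self.head(self.features(x))               # one pass -> Q-values for all actions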

[Mnih et al. NIPS Workshop 2013; Nature 2015] Training the Q-network: Loss function (from before) Remember: want to find a Q-function that satisfies the Bellman equation: Q*(s,a) = E_{s'}[ r + γ max_{a'} Q*(s',a') | s, a ] Forward Pass Loss function: L_i(θ_i) = E_{s,a}[ (y_i - Q(s,a;θ_i))^2 ], where y_i = E_{s'}[ r + γ max_{a'} Q(s',a';θ_{i-1}) | s, a ] Backward Pass Gradient update (with respect to Q-function parameters θ): ∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (r + γ max_{a'} Q(s',a';θ_{i-1}) - Q(s,a;θ_i)) ∇_{θ_i} Q(s,a;θ_i) ] Iteratively try to make the Q-value close to the target value (y_i) it should have, if the Q-function corresponds to the optimal Q* (and optimal policy π*) Lecture 14-49

[Mnih et al. NIPS Workshop 2013; Nature 2015] Training the Q-network: Experience Replay Learning from batches of consecutive samples is problematic: - Samples are correlated => inefficient learning - Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops Lecture 14-50

[Mnih et al. NIPS Workshop 2013; Nature 2015] Training the Q-network: Experience Replay Learning from batches of consecutive samples is problematic: - Samples are correlated => inefficient learning - Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops Address these problems using experience replay - Continually update a replay memory table of transitions (st, at, rt, st+1) as game (experience) episodes are played - Train Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples Lecture 14-51

[Mnih et al. NIPS Workshop 2013; Nature 2015] Training the Q-network: Experience Replay Learning from batches of consecutive samples is problematic: - Samples are correlated => inefficient learning - Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops Address these problems using experience replay - Continually update a replay memory table of transitions (st, at, rt, st+1) as game (experience) episodes are played - Train Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples Each transition can also contribute to multiple weight updates => greater data efficiency Lecture 14-52
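A minimal replay-memory sketch (capacity and batch size are arbitrary placeholders, not values from the lecture):

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)              # oldest transitions are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))       # store transition (st, at, rt, st+1)

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)     # random minibatch breaks sample correlation

    def __len__(self):
        return len(self.buffer)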

[Mnih et al. NIPS Workshop 2013; Nature 2015] Putting it together: Deep Q-Learning with Experience Replay Lecture 14-53

[Mnih et al. NIPS Workshop 2013; Nature 2015] Putting it together: Deep Q-Learning with Experience Replay Initialize replay memory, Q-network Lecture 14-54

[Mnih et al. NIPS Workshop 2013; Nature 2015] Putting it together: Deep Q-Learning with Experience Replay Play M episodes (full games) Lecture 14-55

[Mnih et al. NIPS Workshop 2013; Nature 2015] Putting it together: Deep Q-Learning with Experience Replay Initialize state (starting game screen pixels) at the beginning of each episode Lecture 14-56

[Mnih et al. NIPS Workshop 2013; Nature 2015] Putting it together: Deep Q-Learning with Experience Replay For each timestep t of the game Lecture 14-57

[Mnih et al. NIPS Workshop 2013; Nature 2015] Putting it together: Deep Q-Learning with Experience Replay With small probability, select a random action (explore), otherwise select greedy action from current policy Lecture 14-58

[Mnih et al. NIPS Workshop 2013; Nature 2015] Putting it together: Deep Q-Learning with Experience Replay Take the action (at), and observe the reward rt and next state st+1 Lecture 14-59

[Mnih et al. NIPS Workshop 2013; Nature 2015] Putting it together: Deep Q-Learning with Experience Replay Store transition in replay memory Lecture 14-60

[Mnih et al. NIPS Workshop 2013; Nature 2015] Putting it together: Deep Q-Learning with Experience Replay Experience Replay: Sample a random minibatch of transitions from replay memory and perform a gradient descent step Lecture 14-61
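Putting this loop into pseudocode-style Python (a rough sketch only: env follows the classic Gym interface, to_tensors is a hypothetical helper that stacks a sampled minibatch into tensors, and q_learning_loss / ReplayMemory are the sketches above; hyperparameters are placeholders):

import random
import torch

def train_dqn(env, q_net, optimizer, num_episodes=100, epsilon=0.1, gamma=0.99):
    memory = ReplayMemory()
    for episode in range(num_episodes):                        # play M episodes (full games)
        s = env.reset()                                        # initial state: starting game screen
        done = False
        while not done:                                        # for each timestep t
            if random.random() < epsilon:                      # with small probability explore...
                a = env.action_space.sample()
            else:                                              # ...otherwise act greedily w.r.t. Q
                with torch.no_grad():
                    q = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
                    a = q.argmax(dim=1).item()
            s_next, r, done, _ = env.step(a)                   # observe reward rt and next state st+1
            memory.push(s, a, r, s_next, done)                 # store transition in replay memory
            s = s_next
            if len(memory) >= 32:                              # sample a random minibatch and
                batch = to_tensors(memory.sample(32))          # take one gradient descent step
                loss = q_learning_loss(q_net, q_net, batch, gamma)   # 2013 version: no separate target net
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()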

https://www.youtube.com/watch?v=v1eynij0rnk Video by Károly Zsolnai-Fehér. Reproduced with permission. Lecture 14-62

Policy Gradients What is a problem with Q-learning? The Q-function can be very complicated! Example: a robot grasping an object has a very high-dimensional state => hard to learn exact value of every (state, action) pair Lecture 14-63

Policy Gradients What is a problem with Q-learning? The Q-function can be very complicated! Example: a robot grasping an object has a very high-dimensional state => hard to learn exact value of every (state, action) pair But the policy can be much simpler: just close your hand Can we learn a policy directly, e.g. finding the best policy from a collection of policies? Lecture 14-64

Policy Gradients Formally, let's define a class of parametrized policies: Π = {π_θ, θ ∈ R^m} For each policy, define its value: J(θ) = E[ Σ_{t≥0} γ^t rt | π_θ ] Lecture 14-65

Policy Gradients Formally, let's define a class of parametrized policies: Π = {π_θ, θ ∈ R^m} For each policy, define its value: J(θ) = E[ Σ_{t≥0} γ^t rt | π_θ ] We want to find the optimal policy θ* = arg max_θ J(θ) How can we do this? Lecture 14-66

Policy Gradients Formally, let's define a class of parametrized policies: Π = {π_θ, θ ∈ R^m} For each policy, define its value: J(θ) = E[ Σ_{t≥0} γ^t rt | π_θ ] We want to find the optimal policy θ* = arg max_θ J(θ) How can we do this? Gradient ascent on policy parameters! Lecture 14-67
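As a concrete example of a parametrized policy π_θ(a|s), here is a small softmax policy network in PyTorch (the state/action dimensions and hidden size are placeholders, not from the lecture):

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    # pi_theta(a | s): theta are the weights of this network
    def __init__(self, state_dim=4, n_actions=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))   # distribution over actions

Gradient ascent on J(θ) is then ordinary stochastic optimization of -J(θ), e.g. with torch.optim.Adam(policy.parameters()).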

REINFORCE algorithm Mathematically, we can write: J(θ) = E_{τ~p(τ;θ)}[ r(τ) ] = ∫ r(τ) p(τ;θ) dτ where r(τ) is the reward of a trajectory τ = (s0, a0, r0, s1, ...) Lecture 14-68

REINFORCE algorithm Expected reward: J(θ) = E_{τ~p(τ;θ)}[ r(τ) ] = ∫ r(τ) p(τ;θ) dτ Lecture 14-69

REINFORCE algorithm Expected reward: J(θ) = E_{τ~p(τ;θ)}[ r(τ) ] = ∫ r(τ) p(τ;θ) dτ Now let's differentiate this: ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ;θ) dτ Lecture 14-70

REINFORCE algorithm Expected reward: J(θ) = E_{τ~p(τ;θ)}[ r(τ) ] = ∫ r(τ) p(τ;θ) dτ Now let's differentiate this: ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ;θ) dτ Intractable! Gradient of an expectation is problematic when p depends on θ Lecture 14-71

REINFORCE algorithm Expected reward: J(θ) = E_{τ~p(τ;θ)}[ r(τ) ] = ∫ r(τ) p(τ;θ) dτ Now let's differentiate this: ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ;θ) dτ Intractable! Gradient of an expectation is problematic when p depends on θ However, we can use a nice trick: ∇_θ p(τ;θ) = p(τ;θ) ∇_θ p(τ;θ) / p(τ;θ) = p(τ;θ) ∇_θ log p(τ;θ) Lecture 14-72

REINFORCE algorithm Expected reward: J(θ) = E_{τ~p(τ;θ)}[ r(τ) ] = ∫ r(τ) p(τ;θ) dτ Now let's differentiate this: ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ;θ) dτ Intractable! Gradient of an expectation is problematic when p depends on θ However, we can use a nice trick: ∇_θ p(τ;θ) = p(τ;θ) ∇_θ p(τ;θ) / p(τ;θ) = p(τ;θ) ∇_θ log p(τ;θ) If we inject this back: ∇_θ J(θ) = ∫ r(τ) ∇_θ log p(τ;θ) p(τ;θ) dτ = E_{τ~p(τ;θ)}[ r(τ) ∇_θ log p(τ;θ) ] Can estimate with Monte Carlo sampling Lecture 14-73

REINFORCE algorithm Can we compute those quantities without knowing the transition probabilities? We have: p(τ;θ) = Π_{t≥0} p(s_{t+1}|st, at) π_θ(at|st) Lecture 14-74

REINFORCE algorithm Can we compute those quantities without knowing the transition probabilities? We have: p(τ;θ) = Π_{t≥0} p(s_{t+1}|st, at) π_θ(at|st) Thus: log p(τ;θ) = Σ_{t≥0} [ log p(s_{t+1}|st, at) + log π_θ(at|st) ] Lecture 14-75

REINFORCE algorithm Can we compute those quantities without knowing the transition probabilities? We have: p(τ;θ) = Π_{t≥0} p(s_{t+1}|st, at) π_θ(at|st) Thus: log p(τ;θ) = Σ_{t≥0} [ log p(s_{t+1}|st, at) + log π_θ(at|st) ] And when differentiating: ∇_θ log p(τ;θ) = Σ_{t≥0} ∇_θ log π_θ(at|st) Doesn't depend on transition probabilities! Lecture 14-76

REINFORCE algorithm Can we compute those quantities without knowing the transition probabilities? We have: p(τ;θ) = Π_{t≥0} p(s_{t+1}|st, at) π_θ(at|st) Thus: log p(τ;θ) = Σ_{t≥0} [ log p(s_{t+1}|st, at) + log π_θ(at|st) ] And when differentiating: ∇_θ log p(τ;θ) = Σ_{t≥0} ∇_θ log π_θ(at|st) Doesn't depend on transition probabilities! Therefore when sampling a trajectory τ, we can estimate ∇_θ J(θ) with Σ_{t≥0} r(τ) ∇_θ log π_θ(at|st) Lecture 14-77
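A minimal REINFORCE update for one sampled trajectory in PyTorch (policy is a network like the one sketched above; states, actions, and rewards come from rolling out π_θ and are assumed to already be tensors/lists):

import torch

def reinforce_update(policy, optimizer, states, actions, rewards):
    # states: float tensor (T, state_dim); actions: int64 tensor (T,); rewards: list of rt
    dist = policy(states)                        # pi_theta(. | st) for every timestep
    log_probs = dist.log_prob(actions)           # log pi_theta(at | st)
    r_tau = sum(rewards)                         # total reward r(tau) of the trajectory
    # Monte Carlo estimate: grad J(theta) ~ sum_t r(tau) * grad log pi_theta(at|st);
    # maximizing J(theta) = minimizing -J(theta), hence the minus sign.
    loss = -(r_tau * log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()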

Intuition Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(at|st) Interpretation: - If r(τ) is high, push up the probabilities of the actions seen - If r(τ) is low, push down the probabilities of the actions seen Lecture 14-78

Intuition Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(at|st) Interpretation: - If r(τ) is high, push up the probabilities of the actions seen - If r(τ) is low, push down the probabilities of the actions seen Might seem simplistic to say that if a trajectory is good then all its actions were good. But in expectation, it averages out! Lecture 14-79

Intuition Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(at|st) Interpretation: - If r(τ) is high, push up the probabilities of the actions seen - If r(τ) is low, push down the probabilities of the actions seen Might seem simplistic to say that if a trajectory is good then all its actions were good. But in expectation, it averages out! However, this also suffers from high variance because credit assignment is really hard. Can we help the estimator? Lecture 14-80

Variance reduction Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(at|st) Lecture 14-81

Variance reduction Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(at|st) First idea: Push up probabilities of an action seen, only by the cumulative future reward from that state: ∇_θ J(θ) ≈ Σ_{t≥0} ( Σ_{t'≥t} r_{t'} ) ∇_θ log π_θ(at|st) Lecture 14-82

Variance reduction Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(at|st) First idea: Push up probabilities of an action seen, only by the cumulative future reward from that state: ∇_θ J(θ) ≈ Σ_{t≥0} ( Σ_{t'≥t} r_{t'} ) ∇_θ log π_θ(at|st) Second idea: Use a discount factor γ to ignore delayed effects: ∇_θ J(θ) ≈ Σ_{t≥0} ( Σ_{t'≥t} γ^{t'-t} r_{t'} ) ∇_θ log π_θ(at|st) Lecture 14-83

Variance reduction: Baseline Problem: The raw value of a trajectory isn't necessarily meaningful. For example, if rewards are all positive, you keep pushing up probabilities of actions. What is important then? Whether a reward is better or worse than what you expect to get Idea: Introduce a baseline function dependent on the state. Concretely, the estimator is now: ∇_θ J(θ) ≈ Σ_{t≥0} ( Σ_{t'≥t} γ^{t'-t} r_{t'} - b(st) ) ∇_θ log π_θ(at|st) Lecture 14-84
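A small sketch of these variance-reduced weights: discounted rewards-to-go with a (here constant) baseline subtracted. The baseline value below is a placeholder, e.g. a moving average of returns as discussed on the next slide:

import numpy as np

def rewards_to_go(rewards, gamma=0.99, baseline=0.0):
    # weights[t] = sum_{t' >= t} gamma^(t'-t) * r_{t'}  -  b(st)
    weights = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        weights[t] = running - baseline
    return weights

# Even with all-positive rewards, steps that do worse than the baseline get negative weight:
print(rewards_to_go([1.0, 1.0, 1.0], gamma=0.9, baseline=2.0))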

How to choose the baseline? A simple baseline: constant moving average of rewards experienced so far from all trajectories Lecture 14-85

How to choose the baseline? A simple baseline: constant moving average of rewards experienced so far from all trajectories Variance reduction techniques seen so far are typically used in Vanilla REINFORCE Lecture 14-86

How to choose the baseline? A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? Lecture 14-87

How to choose the baseline? A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? A: Q-function and value function! Lecture 14-88

How to choose the baseline? A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? A: Q-function and value function! Intuitively, we are happy with an action at in a state st if Q^π(st, at) - V^π(st) is large. On the contrary, we are unhappy with an action if it's small. Lecture 14-89

How to choose the baseline? A better baseline: Want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state. Q: What does this remind you of? A: Q-function and value function! Intuitively, we are happy with an action at in a state st if Q^π(st, at) - V^π(st) is large. On the contrary, we are unhappy with an action if it's small. Using this, we get the estimator: ∇_θ J(θ) ≈ Σ_{t≥0} ( Q^π(st, at) - V^π(st) ) ∇_θ log π_θ(at|st) Lecture 14-90

Actor-Critic Algorithm Problem: we don't know Q and V. Can we learn them? Yes, using Q-learning! We can combine Policy Gradients and Q-learning by training both an actor (the policy) and a critic (the Q-function). - The actor decides which action to take, and the critic tells the actor how good its action was and how it should adjust - This also alleviates the task of the critic, as it only has to learn the values of (state, action) pairs generated by the policy - Can also incorporate Q-learning tricks, e.g. experience replay - Remark: we can define the advantage function A^π(s,a) = Q^π(s,a) - V^π(s), which measures how much an action was better than expected Lecture 14-91

Actor-Critic Algorithm Initialize policy parameters θ, critic parameters φ For iteration = 1, 2, ... do: Sample m trajectories under the current policy For i = 1, ..., m do: For t = 1, ..., T do: ... End for Lecture 14-92
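A minimal sketch of one actor-critic update from a single sampled trajectory (a common variant that uses a state-value critic V_φ(s) and the return-minus-value advantage; the slide's critic is a Q-function, but the structure is the same; actor, critic, and their optimizers are assumed to exist):

import torch
import torch.nn.functional as F

def actor_critic_update(actor, critic, actor_opt, critic_opt,
                        states, actions, rewards, gamma=0.99):
    returns, running = [], 0.0
    for r in reversed(rewards):                    # discounted rewards-to-go
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.tensor(returns)

    values = critic(states).squeeze(-1)            # V_phi(st)
    advantages = returns - values.detach()         # how much better than expected

    log_probs = actor(states).log_prob(actions)    # log pi_theta(at | st)
    actor_loss = -(advantages * log_probs).sum()   # push up actions with positive advantage
    critic_loss = F.mse_loss(values, returns)      # regress V_phi toward observed returns

    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()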

REINFORCE in action: Recurrent Attention Model (RAM) Objective: Image Classification Take a sequence of glimpses selectively focusing on regions of the image, to predict class - Inspiration from human perception and eye movements - Saves computational resources => scalability - Able to ignore clutter / irrelevant parts of image State: Glimpses seen so far Action: (x,y) coordinates (center of glimpse) of where to look next in image Reward: 1 at the final timestep if image correctly classified, 0 otherwise glimpse [Mnih et al. 2014] Lecture 14-93

REINFORCE in action: Recurrent Attention Model (RAM) Objective: Image Classification Take a sequence of glimpses selectively focusing on regions of the image, to predict class - Inspiration from human perception and eye movements - Saves computational resources => scalability - Able to ignore clutter / irrelevant parts of image State: Glimpses seen so far Action: (x,y) coordinates (center of glimpse) of where to look next in image Reward: 1 at the final timestep if image correctly classified, 0 otherwise glimpse Glimpsing is a non-differentiable operation => learn policy for how to take glimpse actions using REINFORCE Given state of glimpses seen so far, use RNN to model the state and output next action [Mnih et al. 2014] Lecture 14-94

REINFORCE in action: Recurrent Attention Model (RAM) (Figure: an RNN is unrolled over the input image; at each step it outputs the next glimpse location (x1, y1), (x2, y2), ..., (x5, y5), and after the final glimpse a softmax over classes produces the prediction, e.g. y=2) [Mnih et al. 2014] Lecture 14-95 to 14-99

REINFORCE in action: Recurrent Attention Model (RAM) Has also been used in many other tasks including fine-grained image recognition, image captioning, and visual question-answering! [Mnih et al. 2014] Lecture 14-100

More policy gradients: AlphaGo Overview: - Mix of supervised learning and reinforcement learning - Mix of old methods (Monte Carlo Tree Search) and recent ones (deep RL) How to beat the Go world champion: - Featurize the board (stone color, move legality, bias, ...) - Initialize policy network with supervised training from professional Go games, then continue training using policy gradient (play against itself from random previous iterations, +1 / -1 reward for winning / losing) - Also learn value network (critic) - Finally, combine policy and value networks in a Monte Carlo Tree Search algorithm to select actions by lookahead search [Silver et al., Nature 2016] This image is CC0 public domain Lecture 14-101

Summary - Policy gradients: very general but suffer from high variance, so they require a lot of samples. Challenge: sample-efficiency - Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration - Guarantees: - Policy Gradients: Converges to a local optimum of J(θ), often good enough! - Q-learning: Zero guarantees since you are approximating the Bellman equation with a complicated function approximator Lecture 14-102

Next Time Guest Lecture: Song Han - Energy-efficient deep learning - Deep learning hardware - Model compression - Embedded systems - And more... Lecture 14-103