Deep Learning Mohammad Ali Keyvanrad Lecture 19: Deep Reinforcement Learning
OUTLINE
Introduction
Reinforcement Learning examples
Mathematical formulation of the RL problem
Deep Q-learning
Deep Q-learning examples
1/3/2018 M.A Keyvanrad Deep Learning (Lecture17-Deep RL)
Introduction: Supervised Learning
Training info = desired (target) outputs
Inputs → Supervised Learning System → Outputs
Error = (target output - actual output)
Introduction: Unsupervised Learning
Inputs → Unsupervised Learning System → Outputs
Uses a measure of similarity, likelihood, etc.
Introduction: Reinforcement Learning
Training info = evaluations (rewards / penalties)
Inputs → RL System → Outputs (actions)
Reinforcement Learning examples
Reinforcement Learning examples: Cart-Pole Problem
Reinforcement Learning examples: Robot Locomotion
Reinforcement Learning examples: Atari Games
Reinforcement Learning examples: the agent-environment architecture (from the UCL Course on RL by David Silver)
Mathematical formulation of the RL problem
The basic reinforcement learning problem is modeled as a Markov decision process (MDP).
Markov property: the current state completely characterizes the state of the world.
1. S: a set of environment and agent states
2. A: a set of actions of the agent
3. R: $R_a(s, s')$ is the immediate reward after transitioning from $s$ to $s'$ under action $a$
4. P: $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability of transitioning from state $s$ to state $s'$ under action $a$
5. $\gamma$: discount factor, representing the difference in importance between future rewards and present rewards
Mathematical formulation of the RL problem: Algorithm
At time step $t = 0$, the environment samples the initial state $s_0 \sim p(s_0)$.
Then, for $t = 0$ until done:
- Agent selects action $a_t$
- Environment samples reward $r_t \sim R(\cdot \mid s_t, a_t)$
- Environment samples next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
- Agent receives reward $r_t$ and next state $s_{t+1}$
A policy $\pi$ is a function from $S$ to $A$ that specifies what action to take in each state.
Objective: find the policy $\pi^*$ that maximizes the cumulative discounted reward $\sum_{t \ge 0} \gamma^t r_t$.
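The agent-environment loop above can be sketched directly in Python. The two-state MDP below is a made-up toy (not from the slides), with deterministic dynamics for clarity:

```python
# Toy 2-state MDP (hypothetical, for illustration only).
# States: 0 and 1. Actions: 0 ("stay"), 1 ("move to state 1").
# Reward: +1 whenever the next state is 1, else 0.
def step(state, action):
    """Environment samples reward r_t and next state s_{t+1}."""
    next_state = 1 if action == 1 else state
    reward = 1.0 if next_state == 1 else 0.0
    return reward, next_state

def run_episode(policy, gamma=0.9, horizon=10):
    """Agent follows a policy; returns the cumulative discounted reward."""
    state = 0                                # s_0 ~ p(s_0) (here: always 0)
    total = 0.0
    for t in range(horizon):
        action = policy(state)               # agent selects a_t
        reward, state = step(state, action)  # environment samples r_t, s_{t+1}
        total += (gamma ** t) * reward       # accumulate gamma^t * r_t
    return total

# A policy maps states to actions; this one always moves toward state 1.
always_move = lambda s: 1
```

With no discounting ($\gamma = 1$), the always-move policy collects one unit of reward per step, so the return equals the horizon.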
Mathematical formulation of the RL problem: A simple MDP, Grid World
Mathematical formulation of the RL problem
We want to find the optimal policy $\pi^*$ that maximizes the sum of rewards.
How do we handle the randomness (initial state, transition probabilities, ...)? Maximize the expected sum of rewards!
Formally: $\pi^* = \arg\max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi\right]$, with $s_0 \sim p(s_0)$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.
Mathematical formulation of the RL problem
Following a policy produces sample trajectories (or paths) $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$
How good is a state? The value function at state $s$ is the expected cumulative reward from following the policy from state $s$: $V^\pi(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$
How good is a state-action pair? The Q-value function at state $s$ and action $a$ is the expected cumulative reward from taking action $a$ in state $s$ and then following the policy: $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
Mathematical formulation of the RL problem: Bellman equation
The optimal Q-value function $Q^*$ is the maximum expected cumulative reward achievable from a given (state, action) pair: $Q^*(s, a) = \max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
$Q^*$ satisfies the following Bellman equation: $Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$
Mathematical formulation of the RL problem: Solving for the optimal policy
Use the Bellman equation as an iterative update: $Q_{i+1}(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q_i(s', a') \mid s, a\right]$
$Q_i$ will converge to $Q^*$ as $i \to \infty$.
What's the problem with this? Not scalable: we must compute $Q(s, a)$ for every state-action pair. If the state is, e.g., the current game's pixels, it is computationally infeasible to compute for the entire state space!
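The iterative Bellman update can be run exactly when the MDP is small enough to tabulate. The two-state example below is illustrative (not from the slides), with deterministic dynamics so the expectation drops out:

```python
# Tabular Q-value iteration on a toy 2-state, 2-action MDP:
# Q_{i+1}(s, a) = r(s, a) + gamma * max_a' Q_i(s', a')
gamma = 0.9
n_states, n_actions = 2, 2

def transition(s, a):
    """Deterministic dynamics: action 1 moves to state 1, action 0 stays.
    Reward +1 whenever the next state is 1."""
    s2 = 1 if a == 1 else s
    r = 1.0 if s2 == 1 else 0.0
    return r, s2

Q = [[0.0] * n_actions for _ in range(n_states)]
for i in range(200):                      # Q_i -> Q* as i -> infinity
    Q_new = [[0.0] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            r, s2 = transition(s, a)
            Q_new[s][a] = r + gamma * max(Q[s2])   # Bellman backup
    Q = Q_new
```

Here the fixed point is $Q^*(0, 1) = 1/(1-\gamma) = 10$. This brute-force sweep works only because the toy problem has 2 states; for pixel-valued states the table would be astronomically large, which is exactly the scalability problem noted above.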
Mathematical formulation of the RL problem
Think about the Breakout game
State: screen pixels
Image size: 84 x 84 (resized)
4 consecutive images
Grayscale with 256 gray levels
Solution: use a function approximator to estimate $Q(s, a)$, e.g. a neural network!
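The state construction above (four consecutive 84 x 84 grayscale frames) can be sketched with NumPy. The random frames below stand in for real game screens, and the channel-mean grayscale conversion is a simplified stand-in for the paper's full crop-and-resize preprocessing:

```python
import numpy as np

def preprocess(frame_rgb):
    """Convert an RGB frame to 84x84 grayscale with 256 levels.
    Simplified: assumes the frame is already 84x84; real DQN also
    crops and resizes the raw 210x160 Atari screen."""
    return frame_rgb.mean(axis=-1).astype(np.uint8)

# Four placeholder frames standing in for consecutive game screens.
frames = [preprocess(np.random.randint(0, 256, (84, 84, 3)))
          for _ in range(4)]
state = np.stack(frames, axis=0)   # shape (4, 84, 84): the network input
```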
Deep Q-learning
Use a function (with parameters) to approximate the Q-function: $Q(s, a; \theta) \approx Q^*(s, a)$
Deep Q-learning: Q-learning where the function approximator is a deep neural network
Function parameters: the network weights $\theta$
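A minimal sketch of such a function approximator, assuming a small fully connected network with illustrative layer sizes (the actual DQN uses a convolutional network over frame stacks):

```python
import numpy as np

# Q(s, a; theta): one hidden ReLU layer, one output per action.
# Sizes below are arbitrary illustration values, not from the paper.
rng = np.random.default_rng(0)
n_inputs, n_hidden, n_actions = 8, 16, 4
theta = {
    "W1": rng.normal(0, 0.1, (n_inputs, n_hidden)),
    "b1": np.zeros(n_hidden),
    "W2": rng.normal(0, 0.1, (n_hidden, n_actions)),
    "b2": np.zeros(n_actions),
}

def q_values(state, theta):
    """Forward pass returning a vector of Q-values, one per action."""
    h = np.maximum(0.0, state @ theta["W1"] + theta["b1"])  # ReLU layer
    return h @ theta["W2"] + theta["b2"]

greedy_action = int(np.argmax(q_values(np.ones(n_inputs), theta)))
```

Outputting one Q-value per action lets a single forward pass score every action, so acting greedily is just an argmax over the output vector.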
Experience Replay
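The figure for this slide is not preserved, so here is a minimal sketch of the standard experience-replay idea from the DQN paper: store transitions $(s_t, a_t, r_t, s_{t+1})$ in a buffer and train on random minibatches, which breaks the strong correlation between consecutive samples and reuses each transition many times:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions; oldest entries are evicted."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        """Uniform random minibatch, decorrelating consecutive samples."""
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(150):               # exceeds capacity: oldest 50 are dropped
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)
```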
Fixed Target Q-Network
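A sketch of the fixed-target idea: the target network used in $r + \gamma \max_{a'} Q(s', a'; \theta^-)$ is a frozen copy of the online network, refreshed only every $C$ steps, so the regression target does not chase a moving network. Parameters are plain dicts here purely for illustration:

```python
import copy

online = {"w": 0.0}                 # online network parameters (theta)
target = copy.deepcopy(online)      # frozen target parameters (theta^-)
C = 4                               # update period (illustrative value)

for step in range(10):
    online["w"] += 1.0              # stand-in for one gradient step
    if (step + 1) % C == 0:
        target = copy.deepcopy(online)   # periodic hard copy every C steps
```

Between copies the target stays fixed, which is what stabilizes the bootstrapped regression.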
Reward / Value Range
DQN clips the reward to [-1, +1].
This prevents Q-values from becoming too large and ensures gradients are well-conditioned across games with very different score scales.
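The clipping rule above is a one-liner:

```python
def clip_reward(r):
    """Clip a raw game reward to [-1, +1], as in the DQN paper."""
    return max(-1.0, min(1.0, r))
```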
Deep Q-learning: Stable Deep RL
Train the Deep Q-Network
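The figures for these training slides are not preserved; the following is a minimal sketch of one DQN-style training step, assuming a linear Q-function $Q(s, \cdot) = W s$ in place of the deep network and illustrative sizes. The per-sample loss is $(y - Q(s, a; \theta))^2$ with target $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_actions = 6, 3        # illustrative sizes
gamma, lr = 0.99, 0.01
W = rng.normal(0, 0.1, (n_actions, n_features))   # online parameters theta
W_target = W.copy()                               # frozen target theta^-

def train_step(batch):
    """One SGD pass over a minibatch of (s, a, r, s2, done) transitions."""
    for s, a, r, s2, done in batch:
        q_next = 0.0 if done else float(np.max(W_target @ s2))
        y = r + gamma * q_next                    # TD target (fixed W_target)
        td_error = y - float(W[a] @ s)            # y - Q(s, a; theta)
        W[a] += lr * td_error * s                 # gradient step on squared loss

# Placeholder minibatch; a real loop would sample it from the replay buffer.
batch = [(rng.normal(size=n_features), int(rng.integers(0, n_actions)),
          1.0, rng.normal(size=n_features), False) for _ in range(8)]
train_step(batch)
```

A full training loop would also act epsilon-greedily in the environment, push transitions into the replay buffer, and periodically copy `W` into `W_target`, combining the three stabilization tricks above.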
Deep Q-learning examples
Deep Q-learning examples: a visualization of the learned action-value function on the game Pong.
Deep Q-learning examples
Google's DeepMind used a deep learning technique to teach a computer to play Atari games: it had control of the keyboard while watching the score, and its goal was to maximize the score.
Beating people in dozens of computer games
A computer program playing Doom using only raw pixel data.
References
- Stanford Convolutional Neural Networks for Visual Recognition course, Lecture 14.
- Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
- Bowen Xu, "Human-level control through deep reinforcement learning," Vehicle Intelligence Lab.