Deep Learning Mohammad Ali Keyvanrad Lecture 19: Deep Reinforcement Learning
OUTLINE
Introduction
Reinforcement Learning examples
Mathematical formulation of the RL problem
Deep Q-learning
Deep Q-learning examples
1/3/2018 M.A Keyvanrad Deep Learning (Lecture17-Deep RL)
Introduction: Supervised Learning
Training info = desired (target) outputs
Inputs → Supervised Learning System → Outputs
Error = (target output - actual output)
Introduction: Unsupervised Learning
Inputs → Unsupervised Learning System → Outputs
Uses a measure of similarity, likelihood, etc.
Introduction: Reinforcement Learning
Training info = evaluations (rewards / penalties)
Inputs → RL System → Outputs (actions)
Reinforcement Learning examples
Reinforcement Learning examples: Cart-Pole Problem
Reinforcement Learning examples: Robot Locomotion
Reinforcement Learning examples: Atari Games
Reinforcement Learning examples: the agent-environment architecture (from the UCL Course on RL by David Silver)
Mathematical formulation of the RL problem
The basic reinforcement learning problem is modeled as a Markov decision process (MDP).
Markov property: the current state completely characterizes the state of the world.
1. S: a set of environment and agent states
2. A: a set of actions of the agent
3. R: $R_a(s, s')$ is the immediate reward after transitioning from $s$ to $s'$ under action $a$
4. P: $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability of transitioning from state $s$ to state $s'$ under action $a$
5. $\gamma$: discount factor, representing the difference in importance between future rewards and present rewards
Mathematical formulation of the RL problem: Algorithm
At time step $t = 0$, the environment samples the initial state $s_0 \sim p(s_0)$.
Then, for $t = 0$ until done:
- Agent selects action $a_t$
- Environment samples reward $r_t \sim R(\cdot \mid s_t, a_t)$
- Environment samples next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
- Agent receives reward $r_t$ and next state $s_{t+1}$
A policy $\pi$ is a function from $S$ to $A$ that specifies what action to take in each state.
Objective: find the policy $\pi^*$ that maximizes the cumulative discounted reward $\sum_{t \ge 0} \gamma^t r_t$.
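The agent-environment loop above can be sketched directly in Python. The two-state MDP below is a made-up toy (not from the slides), with deterministic dynamics for clarity:

```python
# Toy 2-state MDP (hypothetical, for illustration only).
# States: 0 and 1. Actions: 0 ("stay"), 1 ("move to state 1").
# Reward: +1 whenever the next state is 1, else 0.
def step(state, action):
    """Environment samples reward r_t and next state s_{t+1}."""
    next_state = 1 if action == 1 else state
    reward = 1.0 if next_state == 1 else 0.0
    return reward, next_state

def run_episode(policy, gamma=0.9, horizon=10):
    """Agent follows a policy; returns the cumulative discounted reward."""
    state = 0                                # s_0 ~ p(s_0) (here: always 0)
    total = 0.0
    for t in range(horizon):
        action = policy(state)               # agent selects a_t
        reward, state = step(state, action)  # environment samples r_t, s_{t+1}
        total += (gamma ** t) * reward       # accumulate gamma^t * r_t
    return total

# A policy maps states to actions; this one always moves toward state 1.
always_move = lambda s: 1
```

With no discounting ($\gamma = 1$), the always-move policy collects one unit of reward per step, so the return equals the horizon.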
Mathematical formulation of the RL problem: A simple MDP, Grid World
Mathematical formulation of the RL problem
We want to find the optimal policy $\pi^*$ that maximizes the sum of rewards.
How do we handle the randomness (initial state, transition probabilities, ...)? Maximize the expected sum of rewards!
Formally: $\pi^* = \arg\max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi\right]$, with $s_0 \sim p(s_0)$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.
Mathematical formulation of the RL problem
Following a policy produces sample trajectories (or paths) $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$
How good is a state? The value function at state $s$ is the expected cumulative reward from following the policy from state $s$: $V^\pi(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$
How good is a state-action pair? The Q-value function at state $s$ and action $a$ is the expected cumulative reward from taking action $a$ in state $s$ and then following the policy: $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
Mathematical formulation of the RL problem: Bellman equation
The optimal Q-value function $Q^*$ is the maximum expected cumulative reward achievable from a given (state, action) pair: $Q^*(s, a) = \max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
$Q^*$ satisfies the following Bellman equation: $Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$
Mathematical formulation of the RL problem: Solving for the optimal policy
Use the Bellman equation as an iterative update: $Q_{i+1}(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q_i(s', a') \mid s, a\right]$
$Q_i$ will converge to $Q^*$ as $i \to \infty$.
What's the problem with this? Not scalable: we must compute $Q(s, a)$ for every state-action pair. If the state is, e.g., the current game's pixels, it is computationally infeasible to compute for the entire state space!
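The iterative Bellman update can be run exactly when the MDP is small enough to tabulate. The two-state example below is illustrative (not from the slides), with deterministic dynamics so the expectation drops out:

```python
# Tabular Q-value iteration on a toy 2-state, 2-action MDP:
# Q_{i+1}(s, a) = r(s, a) + gamma * max_a' Q_i(s', a')
gamma = 0.9
n_states, n_actions = 2, 2

def transition(s, a):
    """Deterministic dynamics: action 1 moves to state 1, action 0 stays.
    Reward +1 whenever the next state is 1."""
    s2 = 1 if a == 1 else s
    r = 1.0 if s2 == 1 else 0.0
    return r, s2

Q = [[0.0] * n_actions for _ in range(n_states)]
for i in range(200):                      # Q_i -> Q* as i -> infinity
    Q_new = [[0.0] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            r, s2 = transition(s, a)
            Q_new[s][a] = r + gamma * max(Q[s2])   # Bellman backup
    Q = Q_new
```

Here the fixed point is $Q^*(0, 1) = 1/(1-\gamma) = 10$. This brute-force sweep works only because the toy problem has 2 states; for pixel-valued states the table would be astronomically large, which is exactly the scalability problem noted above.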
Mathematical formulation of the RL problem
Think about the Breakout game
State: screen pixels
Image size: 84 x 84 (resized)
4 consecutive images
Grayscale with 256 gray levels
Solution: use a function approximator to estimate $Q(s, a)$, e.g. a neural network!
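The state construction above (four consecutive 84 x 84 grayscale frames) can be sketched with NumPy. The random frames below stand in for real game screens, and the channel-mean grayscale conversion is a simplified stand-in for the paper's full crop-and-resize preprocessing:

```python
import numpy as np

def preprocess(frame_rgb):
    """Convert an RGB frame to 84x84 grayscale with 256 levels.
    Simplified: assumes the frame is already 84x84; real DQN also
    crops and resizes the raw 210x160 Atari screen."""
    return frame_rgb.mean(axis=-1).astype(np.uint8)

# Four placeholder frames standing in for consecutive game screens.
frames = [preprocess(np.random.randint(0, 256, (84, 84, 3)))
          for _ in range(4)]
state = np.stack(frames, axis=0)   # shape (4, 84, 84): the network input
```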
Deep Q-learning
Use a function (with parameters) to approximate the Q-function: $Q(s, a; \theta) \approx Q^*(s, a)$
Deep Q-learning: Q-learning where the function approximator is a deep neural network
Function parameters: the network weights $\theta$
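A minimal sketch of such a function approximator, assuming a small fully connected network with illustrative layer sizes (the actual DQN uses a convolutional network over frame stacks):

```python
import numpy as np

# Q(s, a; theta): one hidden ReLU layer, one output per action.
# Sizes below are arbitrary illustration values, not from the paper.
rng = np.random.default_rng(0)
n_inputs, n_hidden, n_actions = 8, 16, 4
theta = {
    "W1": rng.normal(0, 0.1, (n_inputs, n_hidden)),
    "b1": np.zeros(n_hidden),
    "W2": rng.normal(0, 0.1, (n_hidden, n_actions)),
    "b2": np.zeros(n_actions),
}

def q_values(state, theta):
    """Forward pass returning a vector of Q-values, one per action."""
    h = np.maximum(0.0, state @ theta["W1"] + theta["b1"])  # ReLU layer
    return h @ theta["W2"] + theta["b2"]

greedy_action = int(np.argmax(q_values(np.ones(n_inputs), theta)))
```

Outputting one Q-value per action lets a single forward pass score every action, so acting greedily is just an argmax over the output vector.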
Experience Replay
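The figure for this slide is not preserved, so here is a minimal sketch of the standard experience-replay idea from the DQN paper: store transitions $(s_t, a_t, r_t, s_{t+1})$ in a buffer and train on random minibatches, which breaks the strong correlation between consecutive samples and reuses each transition many times:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions; oldest entries are evicted."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        """Uniform random minibatch, decorrelating consecutive samples."""
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(150):               # exceeds capacity: oldest 50 are dropped
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)
```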
Fixed Target Q-Network
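A sketch of the fixed-target idea: the target network used in $r + \gamma \max_{a'} Q(s', a'; \theta^-)$ is a frozen copy of the online network, refreshed only every $C$ steps, so the regression target does not chase a moving network. Parameters are plain dicts here purely for illustration:

```python
import copy

online = {"w": 0.0}                 # online network parameters (theta)
target = copy.deepcopy(online)      # frozen target parameters (theta^-)
C = 4                               # update period (illustrative value)

for step in range(10):
    online["w"] += 1.0              # stand-in for one gradient step
    if (step + 1) % C == 0:
        target = copy.deepcopy(online)   # periodic hard copy every C steps
```

Between copies the target stays fixed, which is what stabilizes the bootstrapped regression.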
Reward / Value Range
DQN clips the reward to [-1, +1].
This prevents Q-values from becoming too large and ensures gradients are well-conditioned across games with very different score scales.
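The clipping rule above is a one-liner:

```python
def clip_reward(r):
    """Clip a raw game reward to [-1, +1], as in the DQN paper."""
    return max(-1.0, min(1.0, r))
```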
Deep Q-learning: Stable Deep RL
Train the Deep Q-Network
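The figures for these training slides are not preserved; the following is a minimal sketch of one DQN-style training step, assuming a linear Q-function $Q(s, \cdot) = W s$ in place of the deep network and illustrative sizes. The per-sample loss is $(y - Q(s, a; \theta))^2$ with target $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_actions = 6, 3        # illustrative sizes
gamma, lr = 0.99, 0.01
W = rng.normal(0, 0.1, (n_actions, n_features))   # online parameters theta
W_target = W.copy()                               # frozen target theta^-

def train_step(batch):
    """One SGD pass over a minibatch of (s, a, r, s2, done) transitions."""
    for s, a, r, s2, done in batch:
        q_next = 0.0 if done else float(np.max(W_target @ s2))
        y = r + gamma * q_next                    # TD target (fixed W_target)
        td_error = y - float(W[a] @ s)            # y - Q(s, a; theta)
        W[a] += lr * td_error * s                 # gradient step on squared loss

# Placeholder minibatch; a real loop would sample it from the replay buffer.
batch = [(rng.normal(size=n_features), int(rng.integers(0, n_actions)),
          1.0, rng.normal(size=n_features), False) for _ in range(8)]
train_step(batch)
```

A full training loop would also act epsilon-greedily in the environment, push transitions into the replay buffer, and periodically copy `W` into `W_target`, combining the three stabilization tricks above.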
Deep Q-learning examples
Deep Q-learning examples: a visualization of the learned action-value function on the game Pong.
Deep Q-learning examples
Google's DeepMind used a deep learning technique to teach a computer to play Atari games: it had control of the keyboard while watching the score, and its goal was to maximize the score.
Beating people in dozens of computer games
A computer program playing Doom using only raw pixel data.
References
- Stanford Convolutional Neural Networks for Visual Recognition course, Lecture 14.
- Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
- Bowen Xu, "Human-level control through deep reinforcement learning," Vehicle Intelligence Lab.