A Brief Introduction to Reinforcement Learning
Jingwei Zhang
zhang@informatik.uni-freiburg.de
Outline
- Characteristics of Reinforcement Learning (RL)
- Components of RL (MDP, value, policy, Bellman)
- Planning (policy iteration, value iteration)
- Model-free Prediction (MC, TD)
- Model-free Control (Q-Learning)
- Deep Reinforcement Learning (DQN)
Characteristics of RL: SL vs. RL
Supervised Learning
- i.i.d. data
- direct and strong supervision (label: what is the right thing to do)
- instantaneous feedback
Reinforcement Learning
- sequential, non-i.i.d. data
- no supervisor, only a reward signal (rule: what you did was good or bad)
- delayed feedback
Components of RL: MDP
- A general framework for sequential decision making
- An MDP is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:
  - $\mathcal{S}$: states
  - $\mathcal{A}$: actions
  - $\mathcal{P}$: transition probabilities, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
  - $\mathcal{R}$: reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
  - $\gamma$: discount factor, $\gamma \in [0, 1]$
- Markov property: the future is independent of the past given the present
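To make the tuple concrete, here is a minimal sketch (not from the slides; all names and numbers are illustrative) of a tiny two-state MDP encoded as NumPy arrays:

```python
import numpy as np

# Two states (0, 1), two actions (0 = "stay", 1 = "move"); all illustrative.
n_states, n_actions = 2, 2
gamma = 0.9  # discount factor

# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward
P = np.zeros((n_states, n_actions, n_states))
P[0, 0, 0] = 1.0   # staying in state 0 keeps you there
P[0, 1, 1] = 1.0   # moving from state 0 reaches state 1
P[1, :, 1] = 1.0   # state 1 is absorbing under both actions
R = np.array([[0.0, 1.0],   # rewards for (s=0, a=stay) and (s=0, a=move)
              [0.0, 0.0]])  # state 1 yields no reward
```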
Components of RL: Policy & Return & Value
- Policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
- Return: $G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
- State-value function: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- Action-value function: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
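As a quick sanity check on the return definition, a short worked computation (reward values chosen arbitrarily for illustration):

```python
# Discounted return for a three-step reward sequence (illustrative numbers).
gamma = 0.9
rewards = [1.0, 0.0, 2.0]                      # R_{t+1}, R_{t+2}, R_{t+3}
G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)                                     # 1 + 0.9*0 + 0.81*2 = 2.62
```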
Components of RL: Bellman Equations
Bellman Expectation Equation:
$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_\pi(s') \right)$$
Components of RL: Bellman Equations
Bellman Optimality Equation:
$$v_*(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_*(s') \right)$$
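The slide states the optimality equation for state values only; for completeness, the standard action-value counterpart (textbook material, not on the original slide) is:
$$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \max_{a'} q_*(s', a')$$
This is the form that Q-learning, later in the deck, approximates from samples.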
Components of RL: Prediction vs. Control
- Prediction: given a policy, evaluate how much reward can be collected by following that policy
- Control: find an optimal policy that maximizes the cumulative future reward
Components of RL: Planning vs. Learning
Planning
- the underlying MDP is known
- the agent only needs to perform computations on the given model
- dynamic programming (policy iteration, value iteration)
Learning
- the underlying MDP is initially unknown
- the agent needs to interact with the environment
- model-free (learn a value function / policy) or model-based (learn a model, then plan on it)
Planning: Dynamic Programming
- Applied when optimal solutions can be decomposed into subproblems
- For prediction (iterative policy evaluation):
  Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and policy $\pi$; Output: value function $v_\pi$
- For control (policy iteration, value iteration):
  Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$; Output: optimal value function $v_*$ and optimal policy $\pi_*$
Planning: Iterative Policy Evaluation
- Iterative application of the Bellman expectation backup: $v_1 \to v_2 \to \ldots \to v_\pi$
$$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_k(s') \right)$$
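A minimal sketch of this backup in code, reusing the toy `P` and `R` arrays from the MDP example above (illustrative, not the lecture's code):

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    # P[s, a, s']: transition probabilities; R[s, a]: expected rewards;
    # policy[s, a] = pi(a|s). Sweep the Bellman expectation backup until
    # the value function stops changing.
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v             # q[s, a] = R + gamma * E[v(s')]
        v_new = (policy * q).sum(axis=1)  # average over pi(a|s)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```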
Planning: Policy Iteration
- Evaluate the given policy $\pi$ to obtain $v_\pi$
- Obtain an improved policy by acting greedily: $\pi' = \text{greedy}(v_\pi)$
Planning: Value Iteration
- Iterative application of the Bellman optimality backup: $v_1 \to v_2 \to \ldots \to v_*$
$$v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_k(s') \right)$$
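The same toy arrays support a sketch of value iteration; compared with policy evaluation, only the max over actions changes (again illustrative, not from the slides):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    # Sweep the Bellman optimality backup until convergence, then read
    # off a greedy policy from the final action values.
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v   # q[s, a]
        v_new = q.max(axis=1)   # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)  # optimal values, greedy policy
        v = v_new
```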
Planning: Synchronous DP Algorithms (summary-table slide)
Recap: Planning vs. Learning
- Planning: the underlying MDP is known; the agent computes on the given model (dynamic programming: policy iteration, value iteration)
- Learning: the underlying MDP is initially unknown; the agent must interact with the environment (model-free: learn a value function / policy; model-based: learn a model, then plan on it)
Model-free Prediction: MC vs. TD
Monte Carlo (MC) Learning
- learns from complete trajectories; no bootstrapping
- estimates values from sample returns: the empirical mean return
Temporal Difference (TD) Learning
- learns from incomplete episodes by bootstrapping: the remainder of the trajectory is replaced by our current estimate
- updates a guess towards a guess
Model-free Prediction: MC
- Goal: learn $v_\pi$ from episodes of experience under policy $\pi$
- Recall: the return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
- Recall: the value function is the expected return: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- MC policy evaluation (every-visit MC) uses the empirical mean return instead of the expected return
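A hedged sketch of every-visit MC prediction; `sample_episode` is a placeholder for rolling out one episode under the target policy (none of these names come from the slides):

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, n_episodes=1000, gamma=0.9):
    # sample_episode() is assumed to return [(state, reward), ...] where
    # reward is R_{t+1}, the reward received after leaving that state.
    returns_sum, returns_cnt = defaultdict(float), defaultdict(int)
    for _ in range(n_episodes):
        G = 0.0
        # walk the episode backwards so G accumulates the discounted return
        for state, reward in reversed(sample_episode()):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_cnt[state] += 1
    # empirical mean return per visited state
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```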
Model-free Prediction: MC → TD
- Goal: learn $v_\pi$ from episodes of experience under policy $\pi$
- MC updates $V(S_t)$ towards the actual return $G_t$:
  $V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$
- TD updates $V(S_t)$ towards the estimated return $R_{t+1} + \gamma V(S_{t+1})$:
  $V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$
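The TD(0) update fits in a few lines; a minimal sketch with illustrative names (`V` is a dict or array of state values):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # One TD(0) backup: move V[s] towards the bootstrapped target
    # r + gamma * V[s_next] instead of waiting for the episode to end.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error
```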
Model-free Prediction: MC vs. TD, the driving-home example (figure)
Model-free Prediction: MC Backup (backup-diagram figure)
Model-free Prediction: TD Backup (backup-diagram figure)
Model-free Prediction: DP Backup (backup-diagram figure)
Model-free Prediction: Unified View (figure)
Model-free Control: Why model-free?
- The MDP is unknown, but experience can be sampled
- The MDP is known, but is too big to use directly, except by sampling from it
Recap: Policy Iteration
- Evaluate the given policy $\pi$ to obtain $v_\pi$
- Obtain an improved policy by acting greedily: $\pi' = \text{greedy}(v_\pi)$
Model-free Control: Generalized Policy Iteration
- Evaluate the given policy to obtain its value function
- Obtain an improved policy by acting greedily: $\pi' = \text{greedy}(v_\pi)$
Model-free Control: V → Q
- Greedy policy improvement over $V(s)$ requires a model of the MDP:
  $\pi'(s) = \arg\max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V(s') \right)$
- Greedy policy improvement over $Q(s, a)$ is model-free:
  $\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$
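The slides do not spell out the exploration rule, but model-free control typically pairs greedy improvement with ε-greedy action selection so that every action keeps being sampled. A minimal sketch (names illustrative):

```python
import numpy as np

def epsilon_greedy(Q, s, eps=0.1, rng=np.random.default_rng()):
    # Q has shape (n_states, n_actions): explore with probability eps,
    # otherwise act greedily with respect to the current Q estimates.
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))
```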
Model-free Control: SARSA
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$$
Model-free Control: Q-Learning
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right)$$
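Side by side in code, the two updates differ only in the bootstrap target; a sketch assuming `Q` is a NumPy array indexed as `Q[state, action]`:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: the target uses the action a_next the agent actually takes.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy: the target uses the best action in s_next, independent
    # of what the behavior policy does there.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```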
Model-free Control: SARSA vs. Q-Learning (comparison slide)
Model-free Control: DP vs. TD (comparison slide)
Deep Reinforcement Learning: Why?
- So far we represented the value function by a lookup table:
  - every state s has an entry V(s)
  - every state-action pair (s, a) has an entry Q(s, a)
- Problem with large MDPs:
  - too many states and/or actions to store in memory
  - too slow to learn the value of each state individually
Deep Reinforcement Learning: How?
- Use deep networks to represent:
  - the value function (value-based methods)
  - the policy (policy-based methods)
  - the model (model-based methods)
- Optimize the value function / policy / model end-to-end
Deep Reinforcement Learning: Q-Learning → DQN (two figure slides)
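Since the two slides above are figures, here is a hedged sketch of the core DQN ideas from the cited Nature paper: the Q-learning target is computed by a periodically-synced target network, and training runs on minibatches drawn from an experience replay buffer. All dimensions and hyperparameters below are placeholders:

```python
import random
from collections import deque

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99  # illustrative sizes

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # target starts as a copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)  # buffer of (s, a, r, s_next, done) tuples

def dqn_update(batch_size=32):
    # One gradient step on a replayed minibatch; the frozen target network
    # supplies the bootstrapped target, decoupling it from the online net.
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # every C environment steps: target_net.load_state_dict(q_net.state_dict())
```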
Deep Reinforcement Learning: AI = RL + DL
Reinforcement Learning (RL)
- a general-purpose framework for decision making
- learn policies to maximize future reward
Deep Learning (DL)
- a general-purpose framework for representation learning
- given an objective, learn the representation required to achieve it
DRL: a single agent that can solve any human-level task
- RL defines the objective
- DL gives the mechanism
- RL + DL = general intelligence
Some Recommendations
- Reinforcement Learning lectures by David Silver on YouTube
- Reinforcement Learning: An Introduction, Richard Sutton and Andrew Barto, 2nd edition
- DQN Nature paper: Human-level Control Through Deep Reinforcement Learning
- Flappy Bird:
  - Tabular RL: https://github.com/sarvagyavaish/flappybirdrl
  - Deep RL: https://github.com/songrotek/drl-flappybird
- Many third-party implementations; search GitHub for "deep reinforcement learning", "dqn", "a3c"
- My implementations in PyTorch: https://github.com/jingweiz/pytorch-rl