A Brief Introduction to Reinforcement Learning. Jingwei Zhang

Outline
- Characteristics of Reinforcement Learning (RL)
- Components of RL (MDP, value, policy, Bellman)
- Planning (policy iteration, value iteration)
- Model-free Prediction (MC, TD)
- Model-free Control (Q-Learning)
- Deep Reinforcement Learning (DQN)

Characteristics of RL: SL vs RL

Supervised Learning
- i.i.d. data
- direct and strong supervision (label: what is the right thing to do)
- instantaneous feedback

Reinforcement Learning
- sequential, non-i.i.d. data
- no supervisor, only a reward signal (rule: what you did is good or bad)
- delayed feedback

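Components of RL: MDP

A general framework for sequential decision making.

An MDP is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:
- $\mathcal{S}$: states
- $\mathcal{A}$: actions
- $\mathcal{P}$: transition probability, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $\mathcal{R}$: reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma$: discount factor, $\gamma \in [0, 1]$

Markov property: the future is independent of the past given the present.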
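As a rough illustration of the tuple above, a small MDP can be written down as plain Python dictionaries. The two states, two actions, probabilities, and rewards below are hypothetical values chosen only for this sketch, and `step` is an illustrative helper name, not part of the slides.

```python
import random

# Hypothetical two-state MDP, invented for illustration only.
# P[s][a] is a list of (probability, next_state) pairs.
# R[s][a] is the expected immediate reward for taking a in s.
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]
P = {
    "cool": {"slow": [(1.0, "cool")], "fast": [(0.5, "cool"), (0.5, "hot")]},
    "hot": {"slow": [(0.5, "cool"), (0.5, "hot")], "fast": [(1.0, "hot")]},
}
R = {
    "cool": {"slow": 1.0, "fast": 2.0},
    "hot": {"slow": 1.0, "fast": -10.0},
}
GAMMA = 0.9  # discount factor in [0, 1]


def step(state, action, rng):
    """Sample S_{t+1} from P[state][action] and return (next_state, reward)."""
    probs = [p for p, _ in P[state][action]]
    nexts = [s for _, s in P[state][action]]
    next_state = rng.choices(nexts, weights=probs, k=1)[0]
    return next_state, R[state][action]


rng = random.Random(0)
print(step("cool", "fast", rng))
```

The next state depends only on the current state and action, which is how the Markov property shows up in code: `step` never looks at earlier history.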

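Components of RL: Policy & Return & Value

Policy:
$$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$

Return:
$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

State-value function:
$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

Action-value function:
$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$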
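The return can be computed for a finite reward sequence with a short backward pass. This is a minimal sketch; the function name `discounted_return` and the example rewards are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Work backwards so that each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```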

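Components of RL: Bellman Equations

Bellman Expectation Equation
$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$
$$\Rightarrow \quad v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_\pi(s') \right)$$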
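The step from the definition of the value function to the Bellman expectation equation is short; one way to spell it out, using only the return and value definitions from the earlier slides, is:

```latex
\begin{align*}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \\
    &= R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \dots \right) \\
    &= R_{t+1} + \gamma G_{t+1}, \\
v_\pi(s) &= \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] \\
         &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\
         &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right].
\end{align*}
```

The last line uses the law of iterated expectations together with the Markov property: given $S_{t+1}$, the expected value of $G_{t+1}$ is $v_\pi(S_{t+1})$.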

Components of RL: Bellman Equations

Bellman Optimality Equation
$$v_*(s) = \max_{a} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_*(s') \right)$$

Components of RL: Prediction vs Control

Prediction
- given a policy, evaluate how much reward you can get by following that policy

Control
- find an optimal policy that maximizes the cumulative future reward

Components of RL: Planning vs Learning

Planning
- the underlying MDP is known
- the agent only needs to perform computations on the given model
- dynamic programming (policy iteration, value iteration)

Learning
- the underlying MDP is initially unknown
- the agent needs to interact with the environment
- model-free (learn value / policy) or model-based (learn a model, plan on it)

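Planning: Dynamic Programming

Applied when optimal solutions can be decomposed into subproblems.

For prediction (iterative policy evaluation):
- Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and policy $\pi$
- Output: value function $v_\pi$

For control (policy iteration, value iteration):
- Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- Output: optimal value function $v_*$ and optimal policy $\pi_*$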

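Planning: Iterative Policy Evaluation

Iterative application of the Bellman expectation backup:
$$v_1 \rightarrow v_2 \rightarrow \dots \rightarrow v_\pi$$
$$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_k(s') \right)$$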
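The backup above can be sketched in a few lines of Python. The two-state MDP, the uniform random policy, and the fixed number of sweeps are assumptions made for this example; a practical implementation would more likely stop on a convergence threshold.

```python
# Hypothetical deterministic two-state MDP (illustration only).
# P[s][a] is a list of (probability, next_state); R[s][a] is the expected reward.
STATES = ["A", "B"]
ACTIONS = ["stay", "move"]
P = {
    "A": {"stay": [(1.0, "A")], "move": [(1.0, "B")]},
    "B": {"stay": [(1.0, "B")], "move": [(1.0, "A")]},
}
R = {"A": {"stay": 0.0, "move": 0.0}, "B": {"stay": 1.0, "move": 0.0}}


def policy_evaluation(policy, gamma, sweeps=200):
    """Apply the Bellman expectation backup synchronously for a fixed number of sweeps."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        # Build v_{k+1} entirely from v_k (synchronous backup).
        v = {
            s: sum(
                policy[s][a]
                * (R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a]))
                for a in ACTIONS
            )
            for s in STATES
        }
    return v


# Uniform random policy: pi(a | s) = 0.5 for both actions.
uniform = {s: {a: 0.5 for a in ACTIONS} for s in STATES}
print(policy_evaluation(uniform, gamma=0.9))
```

For this toy problem the fixed point can be checked by hand: solving the two linear Bellman equations gives v(A) = 2.25 and v(B) = 2.75.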

Planning: Policy Iteration
- Evaluate the given policy $\pi$ and get $v_\pi$
- Get an improved policy by acting greedily: $\pi' = \text{greedy}(v_\pi)$
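The evaluate-then-improve loop might look like the following sketch. It assumes deterministic policies, a hypothetical two-state MDP, and a fixed number of evaluation sweeps; the function names are illustrative.

```python
# Hypothetical deterministic two-state MDP (illustration only).
STATES = ["A", "B"]
ACTIONS = ["stay", "move"]
P = {
    "A": {"stay": [(1.0, "A")], "move": [(1.0, "B")]},
    "B": {"stay": [(1.0, "B")], "move": [(1.0, "A")]},
}
R = {"A": {"stay": 0.0, "move": 0.0}, "B": {"stay": 1.0, "move": 0.0}}


def backup(s, a, v, gamma):
    """One-step lookahead: R^a_s + gamma * sum_s' P^a_ss' v(s')."""
    return R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])


def evaluate(policy, gamma, sweeps=500):
    """Iterative policy evaluation for a deterministic policy {state: action}."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: backup(s, policy[s], v, gamma) for s in STATES}
    return v


def greedy(v, gamma):
    """Policy improvement: pick the action with the best one-step lookahead."""
    return {s: max(ACTIONS, key=lambda a: backup(s, a, v, gamma)) for s in STATES}


def policy_iteration(gamma):
    policy = {s: ACTIONS[0] for s in STATES}  # arbitrary starting policy
    while True:
        v = evaluate(policy, gamma)
        improved = greedy(v, gamma)
        if improved == policy:  # policy is stable, so it is greedy w.r.t. its own value
            return policy, v
        policy = improved


policy, v = policy_iteration(gamma=0.9)
print(policy, v)
```

On this toy problem the loop stops after two evaluations, with the policy "move" in A and "stay" in B.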

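Planning: Value Iteration

Iterative application of the Bellman optimality backup:
$$v_1 \rightarrow v_2 \rightarrow \dots \rightarrow v_*$$
$$v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_k(s') \right)$$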
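A minimal sketch of this backup, again on a hypothetical two-state MDP. The stopping tolerance `tol` is an assumption of the example, not something the slides specify.

```python
# Hypothetical deterministic two-state MDP (illustration only).
STATES = ["A", "B"]
ACTIONS = ["stay", "move"]
P = {
    "A": {"stay": [(1.0, "A")], "move": [(1.0, "B")]},
    "B": {"stay": [(1.0, "B")], "move": [(1.0, "A")]},
}
R = {"A": {"stay": 0.0, "move": 0.0}, "B": {"stay": 1.0, "move": 0.0}}


def value_iteration(gamma, tol=1e-10):
    """Repeat the Bellman optimality backup until the largest change is below tol."""
    v = {s: 0.0 for s in STATES}
    while True:
        new_v = {
            s: max(
                R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
                for a in ACTIONS
            )
            for s in STATES
        }
        if max(abs(new_v[s] - v[s]) for s in STATES) < tol:
            return new_v
        v = new_v


print(value_iteration(gamma=0.9))
```

Unlike policy iteration, no explicit policy is kept during the sweeps; a greedy policy can be read off the final values afterwards.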

Planning: Synchronous DP Algorithms [table not recovered]

Model-free Prediction: MC vs TD

Monte Carlo Learning
- learns from complete trajectories, no bootstrapping
- estimates values by looking at sample returns (the empirical mean return)

Temporal Difference Learning
- learns from incomplete episodes by bootstrapping, substituting the remainder of the trajectory with our estimate
- updates a guess towards a guess

Model-free Prediction: MC

Goal: learn $v_\pi$ from episodes of experience under policy $\pi$.

Recall: the return is the total discounted reward:
$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Recall: the value function is the expected return:
$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

MC policy evaluation (every-visit MC): uses the empirical mean return instead of the expected return.
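Every-visit MC evaluation can be sketched as averaging the observed return over every occurrence of a state. The episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) and the sample episodes are assumptions made for this example.

```python
from collections import defaultdict


def every_visit_mc(episodes, gamma):
    """Estimate V(s) as the mean return observed over every visit to s.

    Each episode is a list of (state, reward) pairs, where reward is
    R_{t+1}, the reward received after leaving state S_t.
    """
    total = defaultdict(float)  # sum of returns seen from each state
    count = defaultdict(int)    # number of visits to each state
    for episode in episodes:
        g = 0.0
        # Walk backwards so the return accumulates as G_t = R_{t+1} + gamma * G_{t+1}.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            total[state] += g
            count[state] += 1
    return {s: total[s] / count[s] for s in total}


# Hypothetical complete episodes, invented for illustration.
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("B", 1.0)],
    [("A", 0.0), ("A", 0.0), ("B", 1.0)],
]
V = every_visit_mc(episodes, gamma=0.5)
print(V)
```

Because the return of a visit is only known once the episode ends, this estimator needs complete trajectories.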

Model-free Prediction: MC -> TD

Goal: learn $v_\pi$ from episodes of experience under policy $\pi$.

MC: updates $V(S_t)$ towards the actual return $G_t$:
$$V(S_t) \leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right)$$

TD: updates $V(S_t)$ towards the estimated return $R_{t+1} + \gamma V(S_{t+1})$:
$$V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)$$
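Side by side, the two updates differ only in their target. The sketch below applies each once to a hypothetical value table; the state names and numbers are made up for illustration.

```python
def mc_update(V, state, g, alpha):
    """Move V(S_t) towards the actual return G_t."""
    V[state] += alpha * (g - V[state])


def td0_update(V, state, reward, next_state, alpha, gamma):
    """Move V(S_t) towards the bootstrapped target R_{t+1} + gamma * V(S_{t+1})."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])


# Hypothetical value table.
V = {"A": 0.0, "B": 1.0}

# TD can update right after one transition A -> B with reward 0.
td0_update(V, "A", reward=0.0, next_state="B", alpha=0.5, gamma=0.9)

# MC has to wait until the episode ends and the return from B (here 2.0) is known.
mc_update(V, "B", g=2.0, alpha=0.5)

print(V)  # V["A"] is about 0.45, V["B"] is 1.5
```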

Model-free Prediction: MC vs TD, Driving Home example [figure not recovered]

Model-free Prediction: MC Backup [figure not recovered]

Model-free Prediction: TD Backup [figure not recovered]

Model-free Prediction: DP Backup [figure not recovered]

Model-free Prediction: Unified View [figure not recovered]

Model-free Control: Why model-free?
- The MDP is unknown, but experience can be sampled
- The MDP is known, but is too big to use except to sample from it

Model-free Control: Generalized Policy Iteration
- Evaluate the given policy $\pi$ and get $v_\pi$
- Get an improved policy by acting greedily: $\pi' = \text{greedy}(v_\pi)$

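Model-free Control: V -> Q

Greedy policy improvement over $V(s)$ requires a model of the MDP:
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V(s') \right)$$

Greedy policy improvement over $Q(s, a)$ is model-free:
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$$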
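The model-free case is easy to see in code: improving over Q only needs a table lookup, with no transition or reward model. The table values below are hypothetical.

```python
def greedy_from_q(Q, states, actions):
    """Return pi'(s) = argmax_a Q(s, a) using only the action-value table."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}


# Hypothetical action-value table for a two-state problem.
Q = {
    ("A", "stay"): 8.1,
    ("A", "move"): 9.0,
    ("B", "stay"): 10.0,
    ("B", "move"): 8.1,
}
policy = greedy_from_q(Q, ["A", "B"], ["stay", "move"])
print(policy)  # {'A': 'move', 'B': 'stay'}
```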

Model-free Control: SARSA
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$$
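A single SARSA step, sketched on a hypothetical table. The target uses the action `a2` that the agent actually takes in the next state, which is what makes the update on-policy.

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    """On-policy TD control: bootstrap from the action a2 actually chosen in s2."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])


# Hypothetical action-value table.
Q = {("A", "move"): 0.0, ("B", "stay"): 2.0, ("B", "move"): 5.0}

# Transition: in A take "move", receive reward 1, land in B, then choose "stay".
sarsa_update(Q, "A", "move", r=1.0, s2="B", a2="stay", alpha=0.5, gamma=0.9)
print(Q[("A", "move")])  # 0 + 0.5 * (1 + 0.9 * 2 - 0), about 1.4
```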

Model-free Control: Q-Learning
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right)$$
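A tabular Q-learning loop built around this update could look like the sketch below. The deterministic two-state environment, the uniformly random behaviour policy, and the step count are all assumptions chosen to keep the example small; they are not taken from the slides.

```python
import random

# Hypothetical deterministic two-state environment (illustration only).
NEXT = {("A", "stay"): "A", ("A", "move"): "B", ("B", "stay"): "B", ("B", "move"): "A"}
REWARD = {("A", "stay"): 0.0, ("A", "move"): 0.0, ("B", "stay"): 1.0, ("B", "move"): 0.0}
ACTIONS = ["stay", "move"]


def q_learning(steps, alpha, gamma, seed=0):
    """Off-policy TD control: act randomly, but bootstrap from the greedy action."""
    rng = random.Random(seed)
    Q = {sa: 0.0 for sa in NEXT}
    s = "A"
    for _ in range(steps):
        a = rng.choice(ACTIONS)  # behaviour policy: uniformly random (assumption)
        r, s2 = REWARD[(s, a)], NEXT[(s, a)]
        best_next = max(Q[(s2, a2)] for a2 in ACTIONS)  # max over a' of Q(S', a')
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
    return Q


Q = q_learning(steps=50000, alpha=0.1, gamma=0.9)
for sa in sorted(Q):
    print(sa, round(Q[sa], 3))
```

The agent never queries `NEXT` or `REWARD` for anything except the transition it just experienced, so the learner itself stays model-free. Because the target takes the maximum over next actions, the learned values track the greedy policy even though the behaviour is random.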

Model-free Control: SARSA vs Q-Learning [figure not recovered]

Model-free Control: DP vs TD [figure not recovered]

Deep Reinforcement Learning: Why?

So far we represented the value function by a lookup table:
- every state s has an entry V(s)
- every state-action pair (s, a) has an entry Q(s, a)

Problem with large MDPs:
- too many states and/or actions to store in memory
- too slow to learn the value of each state individually

Deep Reinforcement Learning: How to?

Use deep networks to represent:
- value function (value-based methods)
- policy (policy-based methods)
- model (model-based methods)

Optimize value function / policy / model end-to-end.
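To make the move from a lookup table to a parametric value function concrete, here is a deliberately simplified sketch that uses a linear approximator instead of a deep network. It is not DQN: the feature vectors, the function names, and the single update shown are assumptions for illustration only.

```python
def q_value(w, phi):
    """Linear action value Q(s, a; w) = w . phi(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def semi_gradient_q_update(w, phi_sa, reward, next_phis, alpha, gamma):
    """One semi-gradient Q-learning step on the weight vector w.

    next_phis holds the feature vector phi(s', a') for each action a' in s'.
    """
    target = reward + gamma * max(q_value(w, phi) for phi in next_phis)
    td_error = target - q_value(w, phi_sa)
    # For a linear model, the gradient of Q with respect to w is phi(s, a).
    return [wi + alpha * td_error * xi for wi, xi in zip(w, phi_sa)]


# Hypothetical 3-dimensional features, invented for illustration.
w = [0.0, 0.0, 0.0]
phi_sa = [1.0, 0.0, 0.5]
next_phis = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w = semi_gradient_q_update(w, phi_sa, reward=1.0, next_phis=next_phis, alpha=0.1, gamma=0.9)
print(w)  # approximately [0.1, 0.0, 0.05]
```

The same update shape carries over when the dot product is replaced by a neural network: the weights are shared across states, so one update also changes the estimates for other states with similar features, which is what removes the need for one table entry per state.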

Deep Reinforcement Learning: Q-learning -> DQN [figure not recovered]

Deep Reinforcement Learning: AI = RL + DL

Reinforcement Learning (RL)
- a general-purpose framework for decision making
- learn policies to maximize future reward

Deep Learning (DL)
- a general-purpose framework for representation learning
- given an objective, learn the representation that is required to achieve the objective

DRL: a single agent which can solve any human-level task
- RL defines the objective
- DL gives the mechanism
- RL + DL = general intelligence

Some Recommendations
- Reinforcement Learning lectures by David Silver on YouTube
- Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2nd Edition
- DQN Nature paper: Human-level Control Through Deep Reinforcement Learning
- Flappy Bird: Tabular RL: Deep RL:
- Many third-party implementations: search for "deep reinforcement learning", "dqn", "a3c" on GitHub
- My implementations in PyTorch:
