An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING. Andrew G. Barto. Department of Computer Science, University of Massachusetts Amherst
1 An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING Andrew G. Barto Department of Computer Science University of Massachusetts Amherst UPF Lecture 2 Autonomous Learning Laboratory, Department of Computer Science
2 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL A. G. Barto, Barcelona Lectures, April. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press.
3 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Dimensions of Reinforcement Learning
4 Lecture 2, Part 1: Dynamic Programming Objectives of this part: Overview of a collection of classical solution methods for MDPs known as Dynamic Programming (DP) Show how DP can be used to compute value functions, and hence, optimal policies Discuss efficiency and utility of DP
5 Policy Evaluation Policy Evaluation: for a given policy π, compute the state-value function V^π. Recall the state-value function for policy π:

V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }

Bellman equation for V^π:

V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

a system of |S| simultaneous linear equations
6 Iterative Methods

V_0 → V_1 → ... → V_k → V_{k+1} → ... → V^π

A sweep consists of applying a backup operation to each state. A full policy-evaluation backup:

V_{k+1}(s) ← Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]
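As a concrete illustration, iterative policy evaluation can be sketched in a few lines of Python. The two-state MDP and uniform policy below are hypothetical, invented just for this example; the sweep applies the full backup to each state in turn and stops when the largest change falls below a threshold.

```python
gamma = 0.9

# Hypothetical toy MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
# Uniform random policy pi(s, a).
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}

V = {s: 0.0 for s in P}
for sweep in range(1000):               # repeat sweeps until convergence
    delta = 0.0
    for s in P:                         # one full backup per state = a sweep
        v = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                               for p, s2, r in P[s][a])
                for a in P[s])
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < 1e-10:
        break
```

Updating V in place during the sweep (rather than from a frozen copy of V_k) also converges to V^π and is how DP sweeps are usually implemented.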
7 A Small Gridworld An undiscounted episodic task Nonterminal states: 1, 2, ..., 14; one terminal state (shown twice as shaded squares) Actions that would take the agent off the grid leave the state unchanged Reward is -1 on all transitions until the terminal state is reached
8 Iterative Policy Evaluation for the Small Gridworld π = equiprobable random action choices
9 Policy Improvement Suppose we have computed V^π for a deterministic policy π. For a given state s, would it be better to take an action a ≠ π(s)? The value of taking a in state s is:

Q^π(s,a) = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a } = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

It is better to switch to action a for state s if and only if Q^π(s,a) > V^π(s)
10 Policy Improvement (cont.) Do this for all states to get a new policy π' that is greedy with respect to V^π:

π'(s) = argmax_a Q^π(s,a) = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Then V^{π'} ≥ V^π
11 Policy Improvement (cont.) What if V^{π'} = V^π? That is, for all s ∈ S:

V^{π'}(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^{π'}(s') ]

But this is the Bellman optimality equation. So V^{π'} = V*, and both π and π' are optimal policies.
12 Policy Iteration

π_0 → V^{π_0} → π_1 → V^{π_1} → ... → π* → V* → π*

alternating policy evaluation with policy improvement (greedification)
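The evaluate-then-greedify alternation can be sketched directly in Python. The deterministic two-state MDP below is hypothetical; evaluation runs the iterative backup for the fixed policy, and improvement greedifies with respect to the resulting V, stopping when the policy no longer changes.

```python
gamma = 0.9

# Hypothetical deterministic 2-state, 2-action MDP: P[s][a] = (next_state, reward).
P = {0: {0: (1, 0.0), 1: (0, 1.0)},
     1: {0: (0, 0.0), 1: (1, 2.0)}}

policy = {0: 0, 1: 0}
while True:
    # Policy evaluation: iterate the backup for the current fixed policy.
    V = {s: 0.0 for s in P}
    for _ in range(2000):
        for s in P:
            s2, r = P[s][policy[s]]
            V[s] = r + gamma * V[s2]
    # Policy improvement: greedify with respect to V.
    stable = True
    for s in P:
        best = max(P[s], key=lambda a: P[s][a][1] + gamma * V[P[s][a][0]])
        if best != policy[s]:
            policy[s] = best
            stable = False
    if stable:
        break
```

Because the policy strictly improves until it is greedy with respect to its own value function, the loop terminates after finitely many rounds on any finite MDP.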
13 Value Iteration Recall the full policy-evaluation backup:

V_{k+1}(s) ← Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]

Here is the full value-iteration backup:

V_{k+1}(s) ← max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]
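The value-iteration backup replaces the average over the policy with a max over actions. A minimal sketch, again on a hypothetical deterministic two-state MDP:

```python
gamma = 0.9

# Hypothetical deterministic 2-state, 2-action MDP: P[s][a] = (next_state, reward).
P = {0: {0: (1, 0.0), 1: (0, 1.0)},
     1: {0: (0, 0.0), 1: (1, 2.0)}}

V = {s: 0.0 for s in P}
for _ in range(1000):
    for s in P:
        # Full value-iteration backup: max over actions instead of
        # averaging under a fixed policy.
        V[s] = max(r + gamma * V[s2] for s2, r in P[s].values())
```

A greedy policy read off the converged V is optimal; no separate evaluation phase is needed.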
14 Asynchronous DP All the DP methods described so far require exhaustive sweeps of the entire state set. Asynchronous DP does not use sweeps. Instead it works like this: repeat until a convergence criterion is met: pick a state at random and apply the appropriate backup. Still needs lots of computation, but does not get locked into hopelessly long sweeps. Can you select states to back up intelligently? YES: an agent's experience can act as a guide.
15 Efficiency of DP Finding an optimal policy is polynomial in the number of states. BUT the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called the "curse of dimensionality"). In practice, classical DP can be applied to problems with a few million states. Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation. It is surprisingly easy to come up with MDPs for which DP methods are not practical.
16 Summary Policy evaluation: backups without a max Policy improvement: form a greedy policy, if only locally Policy iteration: alternate the above two processes Value iteration: backups with a max Full backups (to be contrasted later with sample backups) Asynchronous DP: a way to avoid exhaustive sweeps Bootstrapping: updating estimates based on other estimates
17 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL
18 Lecture 2, Part 2: Simple Monte Carlo Methods Simple Monte Carlo methods learn from complete sample returns Only defined for episodic tasks Simple Monte Carlo methods learn directly from experience On-line: No model necessary Simulated: No need for a full model
19 (First-visit) Monte Carlo policy evaluation
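First-visit MC policy evaluation can be sketched in a few lines, assuming a hypothetical corridor task invented for the example: states 0 and 1 nonterminal, 2 terminal; from 1, right terminates and left goes to 0; from 0, right goes to 1 and left stays put; reward -1 per step, undiscounted, uniform random policy. Traversing each episode backwards and letting earlier visits overwrite later ones keeps exactly the first-visit return for each state.

```python
import random
random.seed(0)

def episode():
    # Generate one episode under the uniform random policy.
    s, traj = 1, []
    while s != 2:
        traj.append(s)
        s = max(0, s + random.choice([-1, 1]))
    return traj

returns = {0: [], 1: []}
for _ in range(5000):
    traj = episode()
    G, first = 0, {}
    for s in reversed(traj):
        G -= 1                  # reward is -1 on every step, gamma = 1
        first[s] = G            # earlier visits overwrite: keeps the first-visit G
    for s, G in first.items():
        returns[s].append(G)

V = {s: sum(g) / len(g) for s, g in returns.items() if g}
```

For this chain the true values are V(1) = -4 and V(0) = -6, and the sample averages converge there as episodes accumulate.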
20 Backup diagram for Simple Monte Carlo Entire episode included Only one choice at each state (unlike DP) MC does not bootstrap Time required to estimate one state does not depend on the total number of states
21 Monte Carlo Estimation of Action Values (Q) Monte Carlo is most useful when a model is not available Q^π(s,a) - average return starting from state s and action a following π Also converges asymptotically if every state-action pair is visited infinitely often We are really interested in estimates of V* and Q*, i.e., Monte Carlo Control
22 Learning about π while following π'
23 Summary MC has several advantages over DP: Can learn directly from interaction with environment No need for full models No need to learn about ALL states Less harm by Markovian violations (later in book) MC methods provide an alternate policy evaluation process One issue to watch for: maintaining sufficient exploration exploring starts, soft policies No bootstrapping (as opposed to DP) Estimating values for one policy while behaving according to another policy: importance sampling
24 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Simple Monte Carlo methods Dynamic Programming Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL
25 Lecture 2, Part 3: Temporal Difference Learning Objectives of this part: Introduce Temporal Difference (TD) learning Focus first on policy evaluation, or prediction, methods Then extend to control methods
26 TD Prediction Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π. Recall the simple (every-visit) Monte Carlo method:

V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

target: the actual return after time t. The simplest TD method, TD(0):

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

target: an estimate of the return
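The TD(0) update can be applied on-line, one transition at a time. A minimal sketch, assuming the same kind of hypothetical corridor task (states 0 and 1 nonterminal, 2 terminal; from 1, right terminates and left goes to 0; from 0, right goes to 1 and left stays put; reward -1 per step, gamma = 1, uniform random policy):

```python
import random
random.seed(1)

alpha, gamma = 0.01, 1.0
V = {0: 0.0, 1: 0.0, 2: 0.0}            # V(terminal) stays 0

for _ in range(50000):
    s = 1
    while s != 2:
        s2 = max(0, s + random.choice([-1, 1]))
        V[s] += alpha * (-1 + gamma * V[s2] - V[s])   # TD(0) backup
        s = s2
```

Unlike MC, each update uses only the immediate reward and the current estimate of the next state's value, so learning proceeds before the episode ends; the estimates settle near the true values V(1) = -4 and V(0) = -6.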
27 Simple Monte Carlo

V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]   (constant-α MC)

where R_t is the actual return following state s_t.

[backup diagram: the update backs up the entire remainder of the sampled episode, all the way to termination at T]
28 cf. Dynamic Programming

V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }

[backup diagram: one full backup over all possible actions and next states s_{t+1}, with reward r_{t+1}]
29 Simplest TD Method

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

[backup diagram: one sampled transition from s_t to s_{t+1} with reward r_{t+1}]
30 TD Bootstraps and Samples Bootstrapping: update involves an estimate MC does not bootstrap DP bootstraps TD bootstraps Sampling: update does not involve an expected value MC samples DP does not sample TD samples
31 Advantages of TD Learning TD methods do not require a model of the environment, only experience TD, but not MC, methods can be fully incremental You can learn before knowing the final outcome Less memory Less peak computation You can learn without the final outcome From incomplete sequences Both MC and TD converge (under certain assumptions), but which is faster/better?
32 Random Walk Example equiprobable transitions, α = 0.1 Values learned by TD(0) after various numbers of episodes
33 TD and MC on the Random Walk Data averaged over 100 sequences of episodes
34 You are the Predictor Suppose you observe the following 8 episodes: A, 0, B, 0 B, 1 B, 1 B, 1 B, 1 B, 1 B, 1 B, 0 V(A)? V(B)?
35 You are the Predictor V(A)?
36 You are the Predictor The prediction that best matches the training data is V(A) = 0. This minimizes the mean-square error on the training set. This is what a Monte Carlo method gets. If we consider the sequentiality of the problem, then we would set V(A) = 0.75. This is correct for the maximum-likelihood estimate of a Markov model generating the data, i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?). This is called the certainty-equivalence estimate. This is what TD(0) gets.
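The two answers can be checked numerically from the eight episodes listed on the previous slide:

```python
# The eight observed episodes, as (state, reward) step sequences.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Monte Carlo / batch estimate: average the observed returns from each state.
returns = {"A": [], "B": []}
for ep in episodes:
    rewards = [r for _, r in ep]
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[i:]))
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}

# Certainty-equivalence estimate: fit the maximum-likelihood Markov model
# (A goes to B with reward 0 every time; B terminates with reward 1 in six
# of eight cases) and compute what that model predicts.
V_ce = {"B": 6 / 8}
V_ce["A"] = 0 + V_ce["A" and "B"]
```

The MC estimate gives V(A) = 0 (A's single observed return), while the certainty-equivalence estimate gives V(A) = 0.75 because A always led to B, whose estimated value is 6/8.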
37 Learning an Action-Value Function Estimate Q^π for the current behavior policy π. After every transition from a nonterminal state s_t, do this:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
38 Sarsa: On-Policy TD Control Turn this into a control method by always updating the policy to be greedy with respect to the current estimate. The quintuple of events (s, a, r, s', a') used by the update gives the algorithm its name.
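A minimal Sarsa sketch, assuming a hypothetical chain task invented for the example: states 0..3, state 3 terminal; entering state 3 yields +1, and every other step costs -0.1. The behavior policy is ε-greedy with respect to the current Q, and the update uses the action actually taken next (on-policy).

```python
import random
random.seed(2)

alpha, gamma, eps = 0.5, 1.0, 0.1
Q = {(s, a): 0.0 for s in range(3) for a in (-1, 1)}

def eps_greedy(s):
    if random.random() < eps:
        return random.choice([-1, 1])
    return max((-1, 1), key=lambda a: Q[(s, a)])

for _ in range(500):
    s = 0
    a = eps_greedy(s)
    while s != 3:
        s2 = min(3, max(0, s + a))
        r = 1.0 if s2 == 3 else -0.1
        if s2 == 3:
            # Q(terminal, .) = 0, so the target is just r
            Q[(s, a)] += alpha * (r - Q[(s, a)])
            s = s2
        else:
            a2 = eps_greedy(s2)
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
```

After a few hundred episodes the greedy action everywhere is "right", and Q(2, right) settles at the terminal reward of 1.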
39 Q-Learning: Off-Policy TD Control One-step Q-learning:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
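A one-step Q-learning sketch on the same kind of hypothetical chain (states 0..3, state 3 terminal with +1 on entry, -0.1 per other step). Behavior is ε-greedy, but the target uses the max over next actions regardless of what is actually done next, which is what makes it off-policy:

```python
import random
random.seed(3)

alpha, gamma, eps = 0.5, 1.0, 0.1
Q = {(s, a): 0.0 for s in range(3) for a in (-1, 1)}

for _ in range(500):
    s = 0
    while s != 3:
        if random.random() < eps:
            a = random.choice([-1, 1])
        else:
            a = max((-1, 1), key=lambda x: Q[(s, x)])
        s2 = min(3, max(0, s + a))
        r = 1.0 if s2 == 3 else -0.1
        max_next = 0.0 if s2 == 3 else max(Q[(s2, -1)], Q[(s2, 1)])
        Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])
        s = s2
```

Because the target ignores exploratory next actions, the estimates converge to the optimal action values: 1.0, 0.9, and 0.8 for moving right from states 2, 1, and 0 respectively.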
40 Cliffwalking ε-greedy, ε = 0.1
41 Actor-Critic Architecture [diagram: the Actor emits Actions to the Environment; the Environment returns Situations or States; a Primary Critic provides Primary Rewards; an Adaptive Critic converts these into Effective Rewards (involving values) for the Actor]
42 Actor-Critic Methods Explicit representation of policy as well as value function Minimal computation to select actions Can learn an explicit stochastic policy Can put constraints on policies Appealing as psychological and neural models
43 Actor-Critic Details The TD error is used to evaluate actions:

δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

If actions are determined by preferences p(s, a) as follows:

π_t(s, a) = Pr{ a_t = a | s_t = s } = e^{p(s,a)} / Σ_b e^{p(s,b)}

then you can update the preferences like this:

p(s_t, a_t) ← p(s_t, a_t) + β δ_t
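These two updates can be sketched together, assuming a hypothetical chain task (states 0..3, state 3 terminal with +1 on entry, -0.1 per other step). The critic learns V by TD(0); the actor adjusts softmax action preferences p(s, a) by the same TD error, as in the preference update above:

```python
import math, random
random.seed(4)

alpha_v, beta, gamma = 0.1, 0.1, 1.0
V = {s: 0.0 for s in range(3)}
p = {(s, a): 0.0 for s in range(3) for a in (-1, 1)}

def sample_action(s):
    # Softmax (Gibbs) policy over the two actions' preferences.
    w = {a: math.exp(p[(s, a)]) for a in (-1, 1)}
    z = w[-1] + w[1]
    return -1 if random.random() < w[-1] / z else 1

for _ in range(3000):
    s = 0
    while s != 3:
        a = sample_action(s)
        s2 = min(3, max(0, s + a))
        r = 1.0 if s2 == 3 else -0.1
        v_next = 0.0 if s2 == 3 else V[s2]
        delta = r + gamma * v_next - V[s]   # TD error
        V[s] += alpha_v * delta             # critic update
        p[(s, a)] += beta * delta           # actor update
        s = s2
```

Actions that lead to better-than-expected outcomes get positive TD errors and so rising preferences; here the rightward action ends up preferred in every state.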
44 Afterstates Usually, a state-value function evaluates states in which the agent can take an action. But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe. Why is this useful? What is this in general?
45 Summary TD prediction Introduced one-step tabular model-free TD methods Extend prediction to control by employing some form of GPI On-policy control: Sarsa Off-policy control: Q-learning These methods bootstrap and sample, combining aspects of DP and MC methods
46 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL
47 Lecture 2, Part 4: Unified Perspective
48 n-step TD Prediction Idea: Look farther into the future when you do a TD backup (1, 2, 3, ..., n steps)
49 Mathematics of n-step TD Prediction Monte Carlo:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... + γ^{T−t−1} r_T

TD (one-step return):

R_t^{(1)} = r_{t+1} + γ V_t(s_{t+1})

Use V to estimate the remaining return. n-step TD. The 2-step return:

R_t^{(2)} = r_{t+1} + γ r_{t+2} + γ² V_t(s_{t+2})

The n-step return:

R_t^{(n)} = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})
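The family of returns above can be computed from a recorded trajectory. The rewards and value estimates below are hypothetical numbers invented for the example; past the end of the episode the n-step return is just the complete (Monte Carlo) return.

```python
gamma = 0.5

# A recorded 3-step episode fragment: rewards r_{t+1}, r_{t+2}, r_{t+3}
# after time t, and current estimates V_t(s_{t+1}), V_t(s_{t+2}), V_t(s_{t+3}).
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.25, 0.125]

def n_step_return(n):
    T = len(rewards)
    n = min(n, T)                       # truncate at the end of the episode
    G = sum(gamma**k * rewards[k] for k in range(n))
    if n < T:
        G += gamma**n * values[n - 1]   # bootstrap from V_t(s_{t+n})
    return G
```

With these numbers, the 1-step return is 1.25, the 2-step return 1.0625, and from n = 3 on the return equals the full MC return 1.5.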
50 Learning with n-step Backups Backup (on-line or off-line):

ΔV_t(s_t) = α [ R_t^{(n)} − V_t(s_t) ]

Error reduction property of n-step returns:

max_s | E_π{ R_t^{(n)} | s_t = s } − V^π(s) | ≤ γ^n max_s | V(s) − V^π(s) |

The left side is the maximum error using the n-step return; the right side is γ^n times the maximum error using V. Using this, you can show that n-step methods converge.
51 Random Walk Examples How does 2-step TD work here? How about 3-step TD?
52 A Larger Example Task: 19-state random walk Do you think there is an optimal n (for everything)?
53 Averaging n-step Returns n-step methods were introduced mainly as a way to understand TD(λ). Idea: back up an average of several returns, e.g., half of the 2-step return and half of the 4-step return:

R_t^{avg} = (1/2) R_t^{(2)} + (1/2) R_t^{(4)}

This is one backup, called a complex backup: draw each component and label it with the weight for that component.
54 Forward View of TD(λ) TD(λ) is a method for averaging all n-step backups, weighting the n-step return by λ^{n−1}. The λ-return:

R_t^λ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R_t^{(n)}

Backup using the λ-return:

ΔV_t(s_t) = α [ R_t^λ − V_t(s_t) ]
55 λ-return Weighting Function
56 Relation to TD(0) and MC The λ-return can be rewritten as:

R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t

where the sum covers the returns until termination and the last term carries all the weight after termination. If λ = 1, you get MC:

R_t^λ = (1 − 1) Σ_{n=1}^{T−t−1} 1^{n−1} R_t^{(n)} + 1^{T−t−1} R_t = R_t

If λ = 0, you get TD(0):

R_t^λ = (1 − 0) Σ_{n=1}^{T−t−1} 0^{n−1} R_t^{(n)} + 0^{T−t−1} R_t = R_t^{(1)}
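The finite rewriting, and both limiting cases, can be checked numerically on a hypothetical recorded episode fragment (the same kind of invented rewards and value estimates used for n-step returns):

```python
gamma, T = 0.5, 3

rewards = [1.0, 0.0, 2.0]               # r_{t+1}, r_{t+2}, r_{t+3}
values = [0.5, 0.25, 0.125]             # V_t(s_{t+1}), V_t(s_{t+2}), V_t(s_{t+3})

def n_step_return(n):
    n = min(n, T)
    G = sum(gamma**k * rewards[k] for k in range(n))
    if n < T:
        G += gamma**n * values[n - 1]
    return G

def lambda_return(lam):
    # Finite sum of weighted n-step returns until termination, plus the
    # weight lam^(T-t-1) on the complete return after termination.
    G = (1 - lam) * sum(lam**(n - 1) * n_step_return(n)
                        for n in range(1, T))
    return G + lam**(T - 1) * n_step_return(T)
```

Setting λ = 0 recovers the one-step TD target, λ = 1 recovers the Monte Carlo return, and intermediate λ blends all the n-step returns.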
57 Forward View of TD(λ) Look forward from each state to determine its update from future states and rewards:
58 λ-return on the Random Walk Same 19-state random walk as before Why do you think intermediate values of λ are best?
59 Backward View of TD(λ) The forward view was for theory; the backward view is for mechanism. A new variable called the eligibility trace: e_t(s) ∈ ℝ⁺. On each step, decay all traces by γλ and increment the trace for the current state by 1 (an accumulating trace):

e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1    if s = s_t
60 Backward View

δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)

Shout δ_t backwards over time. The strength of your voice decreases with temporal distance by γλ.
61 On-line Tabular TD(λ)

Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s'
        δ ← r + γ V(s') − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + α δ e(s)
            e(s) ← γλ e(s)
        s ← s'
    Until s is terminal
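The tabular algorithm translates almost line for line into Python. This sketch assumes a hypothetical 5-state random walk invented for the example: states 0..4, with 0 and 4 terminal, reward +1 only for exiting right, gamma = 1, equiprobable steps, start in state 2.

```python
import random
random.seed(5)

alpha, gamma, lam = 0.02, 1.0, 0.9
V = {s: 0.0 for s in range(5)}

for _ in range(20000):
    e = {s: 0.0 for s in range(5)}      # traces start at zero each episode
    s = 2
    while s not in (0, 4):
        s2 = s + random.choice([-1, 1])
        r = 1.0 if s2 == 4 else 0.0
        v_next = 0.0 if s2 in (0, 4) else V[s2]
        delta = r + gamma * v_next - V[s]
        e[s] += 1.0                     # accumulate the visited state's trace
        for x in range(5):
            V[x] += alpha * delta * e[x]
            e[x] *= gamma * lam         # decay all traces by gamma * lambda
        s = s2
```

Each TD error is broadcast to all recently visited states in proportion to their traces; the estimates settle near the true exit-right probabilities 0.25, 0.5, 0.75.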
62 Relation of the Backward View to MC and TD(0) Using the update rule

ΔV_t(s) = α δ_t e_t(s)

as before, if you set λ to 0 you get TD(0); if you set λ to 1 you get MC, but in a better way: TD(1) can be applied to continuing tasks, and it works incrementally and on-line (instead of waiting until the end of the episode).
63 Forward View = Backward View The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating. The book shows that the total updates over an episode are equal:

Σ_{t=0}^{T−1} ΔV_t^{TD}(s) = Σ_{t=0}^{T−1} α I_{s s_t} Σ_{k=t}^{T−1} (γλ)^{k−t} δ_k = Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{s s_t}

(backward updates on the left, forward updates on the right, with I_{s s_t} = 1 if s = s_t and 0 otherwise; the algebra is shown in the book). On-line updating with small α is similar.
64 Control: Sarsa(λ) Save eligibility traces for state-action pairs instead of just states:

e_t(s, a) = γλ e_{t−1}(s, a) + 1   if s = s_t and a = a_t
e_t(s, a) = γλ e_{t−1}(s, a)       otherwise

Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)

δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)
65 Sarsa(λ) Algorithm

Initialize Q(s, a) arbitrarily and e(s, a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using the policy derived from Q (e.g., ε-greedy)
        δ ← r + γ Q(s', a') − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + α δ e(s, a)
            e(s, a) ← γλ e(s, a)
        s ← s'; a ← a'
    Until s is terminal
66 Sarsa(λ) Gridworld Example With one trial, the agent has much more information about how to get to the goal (not necessarily the best way) Can considerably accelerate learning
67 Conclusions Eligibility traces provide an efficient, incremental way to combine MC and TD Includes advantages of MC (can deal with lack of Markov property) Includes advantages of TD (using TD error, bootstrapping) Can significantly speed learning Does have a cost in computation
68 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL
69 TD Error

δ_t = r_t + γ V_t − V_{t−1}

[figure: for regular predictors of reward over an interval, the value V and TD error δ early in learning, after learning is complete, and when the reward r is omitted]
70 Dopamine Neurons and TD Error W. Schultz et al., Université de Fribourg
71 Dopamine-Modulated Synaptic Plasticity
72 Basal Ganglia as Adaptive Critic Architecture Houk, Adams, & Barto, 1995
73 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL
More informationAn OO Framework for building Intelligence and Learning properties in Software Agents
An OO Framework for building Intelligence and Learning properties in Software Agents José A. R. P. Sardinha, Ruy L. Milidiú, Carlos J. P. Lucena, Patrick Paranhos Abstract Software agents are defined as
More informationAre You Ready? Simplify Fractions
SKILL 10 Simplify Fractions Teaching Skill 10 Objective Write a fraction in simplest form. Review the definition of simplest form with students. Ask: Is 3 written in simplest form? Why 7 or why not? (Yes,
More informationReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology
ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology Tiancheng Zhao CMU-LTI-16-006 Language Technologies Institute School of Computer Science Carnegie Mellon
More informationIntroduction to Simulation
Introduction to Simulation Spring 2010 Dr. Louis Luangkesorn University of Pittsburgh January 19, 2010 Dr. Louis Luangkesorn ( University of Pittsburgh ) Introduction to Simulation January 19, 2010 1 /
More informationLearning Methods for Fuzzy Systems
Learning Methods for Fuzzy Systems Rudolf Kruse and Andreas Nürnberger Department of Computer Science, University of Magdeburg Universitätsplatz, D-396 Magdeburg, Germany Phone : +49.39.67.876, Fax : +49.39.67.8
More informationLahore University of Management Sciences. FINN 321 Econometrics Fall Semester 2017
Instructor Syed Zahid Ali Room No. 247 Economics Wing First Floor Office Hours Email szahid@lums.edu.pk Telephone Ext. 8074 Secretary/TA TA Office Hours Course URL (if any) Suraj.lums.edu.pk FINN 321 Econometrics
More informationAI Agent for Ice Hockey Atari 2600
AI Agent for Ice Hockey Atari 2600 Emman Kabaghe (emmank@stanford.edu) Rajarshi Roy (rroy@stanford.edu) 1 Introduction In the reinforcement learning (RL) problem an agent autonomously learns a behavior
More informationWhile you are waiting... socrative.com, room number SIMLANG2016
While you are waiting... socrative.com, room number SIMLANG2016 Simulating Language Lecture 4: When will optimal signalling evolve? Simon Kirby simon@ling.ed.ac.uk T H E U N I V E R S I T Y O H F R G E
More informationQuickStroke: An Incremental On-line Chinese Handwriting Recognition System
QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents
More informationHow long did... Who did... Where was... When did... How did... Which did...
(Past Tense) Who did... Where was... How long did... When did... How did... 1 2 How were... What did... Which did... What time did... Where did... What were... Where were... Why did... Who was... How many
More informationMeasurement. When Smaller Is Better. Activity:
Measurement Activity: TEKS: When Smaller Is Better (6.8) Measurement. The student solves application problems involving estimation and measurement of length, area, time, temperature, volume, weight, and
More informationOn the Combined Behavior of Autonomous Resource Management Agents
On the Combined Behavior of Autonomous Resource Management Agents Siri Fagernes 1 and Alva L. Couch 2 1 Faculty of Engineering Oslo University College Oslo, Norway siri.fagernes@iu.hio.no 2 Computer Science
More informationLearning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models
Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za
More informationAgents and environments. Intelligent Agents. Reminders. Vacuum-cleaner world. Outline. A vacuum-cleaner agent. Chapter 2 Actuators
s and environments Percepts Intelligent s? Chapter 2 Actions s include humans, robots, softbots, thermostats, etc. The agent function maps from percept histories to actions: f : P A The agent program runs
More informationRover Races Grades: 3-5 Prep Time: ~45 Minutes Lesson Time: ~105 minutes
Rover Races Grades: 3-5 Prep Time: ~45 Minutes Lesson Time: ~105 minutes WHAT STUDENTS DO: Establishing Communication Procedures Following Curiosity on Mars often means roving to places with interesting
More informationRadius STEM Readiness TM
Curriculum Guide Radius STEM Readiness TM While today s teens are surrounded by technology, we face a stark and imminent shortage of graduates pursuing careers in Science, Technology, Engineering, and
More informationCOMPUTATIONAL COMPLEXITY OF LEFT-ASSOCIATIVE GRAMMAR
COMPUTATIONAL COMPLEXITY OF LEFT-ASSOCIATIVE GRAMMAR ROLAND HAUSSER Institut für Deutsche Philologie Ludwig-Maximilians Universität München München, West Germany 1. CHOICE OF A PRIMITIVE OPERATION The
More informationJulia Smith. Effective Classroom Approaches to.
Julia Smith @tessmaths Effective Classroom Approaches to GCSE Maths resits julia.smith@writtle.ac.uk Agenda The context of GCSE resit in a post-16 setting An overview of the new GCSE Key features of a
More informationChapter 2. Intelligent Agents. Outline. Agents and environments. Rationality. PEAS (Performance measure, Environment, Actuators, Sensors)
Intelligent Agents Chapter 2 1 Outline Agents and environments Rationality PEAS (Performance measure, Environment, Actuators, Sensors) Agent types 2 Agents and environments sensors environment percepts
More informationBMBF Project ROBUKOM: Robust Communication Networks
BMBF Project ROBUKOM: Robust Communication Networks Arie M.C.A. Koster Christoph Helmberg Andreas Bley Martin Grötschel Thomas Bauschert supported by BMBF grant 03MS616A: ROBUKOM Robust Communication Networks,
More informationInformatics 2A: Language Complexity and the. Inf2A: Chomsky Hierarchy
Informatics 2A: Language Complexity and the Chomsky Hierarchy September 28, 2010 Starter 1 Is there a finite state machine that recognises all those strings s from the alphabet {a, b} where the difference
More informationACTL5103 Stochastic Modelling For Actuaries. Course Outline Semester 2, 2014
UNSW Australia Business School School of Risk and Actuarial Studies ACTL5103 Stochastic Modelling For Actuaries Course Outline Semester 2, 2014 Part A: Course-Specific Information Please consult Part B
More informationLearning to Schedule Straight-Line Code
Learning to Schedule Straight-Line Code Eliot Moss, Paul Utgoff, John Cavazos Doina Precup, Darko Stefanović Dept. of Comp. Sci., Univ. of Mass. Amherst, MA 01003 Carla Brodley, David Scheeff Sch. of Elec.
More informationOPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS
OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7,
More informationLaboratorio di Intelligenza Artificiale e Robotica
Laboratorio di Intelligenza Artificiale e Robotica A.A. 2008-2009 Outline 2 Machine Learning Unsupervised Learning Supervised Learning Reinforcement Learning Genetic Algorithms Genetics-Based Machine Learning
More informationAdaptive Generation in Dialogue Systems Using Dynamic User Modeling
Adaptive Generation in Dialogue Systems Using Dynamic User Modeling Srinivasan Janarthanam Heriot-Watt University Oliver Lemon Heriot-Watt University We address the problem of dynamically modeling and
More informationStopping rules for sequential trials in high-dimensional data
Stopping rules for sequential trials in high-dimensional data Sonja Zehetmayer, Alexandra Graf, and Martin Posch Center for Medical Statistics, Informatics and Intelligent Systems Medical University of
More informationSystem Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks
System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering
More informationAP Calculus AB. Nevada Academic Standards that are assessable at the local level only.
Calculus AB Priority Keys Aligned with Nevada Standards MA I MI L S MA represents a Major content area. Any concept labeled MA is something of central importance to the entire class/curriculum; it is a
More informationEvolutive Neural Net Fuzzy Filtering: Basic Description
Journal of Intelligent Learning Systems and Applications, 2010, 2: 12-18 doi:10.4236/jilsa.2010.21002 Published Online February 2010 (http://www.scirp.org/journal/jilsa) Evolutive Neural Net Fuzzy Filtering:
More information(Sub)Gradient Descent
(Sub)Gradient Descent CMSC 422 MARINE CARPUAT marine@cs.umd.edu Figures credit: Piyush Rai Logistics Midterm is on Thursday 3/24 during class time closed book/internet/etc, one page of notes. will include
More informationENME 605 Advanced Control Systems, Fall 2015 Department of Mechanical Engineering
ENME 605 Advanced Control Systems, Fall 2015 Department of Mechanical Engineering Lecture Details Instructor Course Objectives Tuesday and Thursday, 4:00 pm to 5:15 pm Information Technology and Engineering
More informationStatewide Framework Document for:
Statewide Framework Document for: 270301 Standards may be added to this document prior to submission, but may not be removed from the framework to meet state credit equivalency requirements. Performance
More informationThe Evolution of Random Phenomena
The Evolution of Random Phenomena A Look at Markov Chains Glen Wang glenw@uchicago.edu Splash! Chicago: Winter Cascade 2012 Lecture 1: What is Randomness? What is randomness? Can you think of some examples
More informationA Process-Model Account of Task Interruption and Resumption: When Does Encoding of the Problem State Occur?
A Process-Model Account of Task Interruption and Resumption: When Does Encoding of the Problem State Occur? Dario D. Salvucci Drexel University Philadelphia, PA Christopher A. Monk George Mason University
More informationAcquiring Competence from Performance Data
Acquiring Competence from Performance Data Online learnability of OT and HG with simulated annealing Tamás Biró ACLC, University of Amsterdam (UvA) Computational Linguistics in the Netherlands, February
More informationIntelligent Agents. Chapter 2. Chapter 2 1
Intelligent Agents Chapter 2 Chapter 2 1 Outline Agents and environments Rationality PEAS (Performance measure, Environment, Actuators, Sensors) Environment types The structure of agents Chapter 2 2 Agents
More informationSelf Study Report Computer Science
Computer Science undergraduate students have access to undergraduate teaching, and general computing facilities in three buildings. Two large classrooms are housed in the Davis Centre, which hold about
More informationNCEO Technical Report 27
Home About Publications Special Topics Presentations State Policies Accommodations Bibliography Teleconferences Tools Related Sites Interpreting Trends in the Performance of Special Education Students
More informationUniversity of Victoria School of Exercise Science, Physical and Health Education EPHE 245 MOTOR LEARNING. Calendar Description Units: 1.
University of Victoria School of Exercise Science, Physical and Health Education EPHE 245 MOTOR LEARNING Calendar Description Units: 1.5 Hours: 3-2 Neural and cognitive processes underlying human skilled
More informationDublin City Schools Mathematics Graded Course of Study GRADE 4
I. Content Standard: Number, Number Sense and Operations Standard Students demonstrate number sense, including an understanding of number systems and reasonable estimates using paper and pencil, technology-supported
More informationModeling user preferences and norms in context-aware systems
Modeling user preferences and norms in context-aware systems Jonas Nilsson, Cecilia Lindmark Jonas Nilsson, Cecilia Lindmark VT 2016 Bachelor's thesis for Computer Science, 15 hp Supervisor: Juan Carlos
More informationCorrective Feedback and Persistent Learning for Information Extraction
Corrective Feedback and Persistent Learning for Information Extraction Aron Culotta a, Trausti Kristjansson b, Andrew McCallum a, Paul Viola c a Dept. of Computer Science, University of Massachusetts,
More informationMathematics subject curriculum
Mathematics subject curriculum Dette er ei omsetjing av den fastsette læreplanteksten. Læreplanen er fastsett på Nynorsk Established as a Regulation by the Ministry of Education and Research on 24 June
More informationAGENDA LEARNING THEORIES LEARNING THEORIES. Advanced Learning Theories 2/22/2016
AGENDA Advanced Learning Theories Alejandra J. Magana, Ph.D. admagana@purdue.edu Introduction to Learning Theories Role of Learning Theories and Frameworks Learning Design Research Design Dual Coding Theory
More informationErkki Mäkinen State change languages as homomorphic images of Szilard languages
Erkki Mäkinen State change languages as homomorphic images of Szilard languages UNIVERSITY OF TAMPERE SCHOOL OF INFORMATION SCIENCES REPORTS IN INFORMATION SCIENCES 48 TAMPERE 2016 UNIVERSITY OF TAMPERE
More informationExtending Place Value with Whole Numbers to 1,000,000
Grade 4 Mathematics, Quarter 1, Unit 1.1 Extending Place Value with Whole Numbers to 1,000,000 Overview Number of Instructional Days: 10 (1 day = 45 minutes) Content to Be Learned Recognize that a digit
More informationAlgebra 1, Quarter 3, Unit 3.1. Line of Best Fit. Overview
Algebra 1, Quarter 3, Unit 3.1 Line of Best Fit Overview Number of instructional days 6 (1 day assessment) (1 day = 45 minutes) Content to be learned Analyze scatter plots and construct the line of best
More informationLearning and Transferring Relational Instance-Based Policies
Learning and Transferring Relational Instance-Based Policies Rocío García-Durán, Fernando Fernández y Daniel Borrajo Universidad Carlos III de Madrid Avda de la Universidad 30, 28911-Leganés (Madrid),
More informationBAYESIAN ANALYSIS OF INTERLEAVED LEARNING AND RESPONSE BIAS IN BEHAVIORAL EXPERIMENTS
Page 1 of 42 Articles in PresS. J Neurophysiol (December 20, 2006). doi:10.1152/jn.00946.2006 BAYESIAN ANALYSIS OF INTERLEAVED LEARNING AND RESPONSE BIAS IN BEHAVIORAL EXPERIMENTS Anne C. Smith 1*, Sylvia
More informationAccelerated Learning Online. Course Outline
Accelerated Learning Online Course Outline Course Description The purpose of this course is to make the advances in the field of brain research more accessible to educators. The techniques and strategies
More informationAGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS
AGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS 1 CALIFORNIA CONTENT STANDARDS: Chapter 1 ALGEBRA AND WHOLE NUMBERS Algebra and Functions 1.4 Students use algebraic
More informationSARDNET: A Self-Organizing Feature Map for Sequences
SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu
More informationXXII BrainStorming Day
UNIVERSITA DEGLI STUDI DI CATANIA FACOLTA DI INGEGNERIA PhD course in Electronics, Automation and Control of Complex Systems - XXV Cycle DIPARTIMENTO DI INGEGNERIA ELETTRICA ELETTRONICA E INFORMATICA XXII
More informationarxiv: v1 [math.at] 10 Jan 2016
THE ALGEBRAIC ATIYAH-HIRZEBRUCH SPECTRAL SEQUENCE OF REAL PROJECTIVE SPECTRA arxiv:1601.02185v1 [math.at] 10 Jan 2016 GUOZHEN WANG AND ZHOULI XU Abstract. In this note, we use Curtis s algorithm and the
More informationA Stochastic Model for the Vocabulary Explosion
Words Known A Stochastic Model for the Vocabulary Explosion Colleen C. Mitchell (colleen-mitchell@uiowa.edu) Department of Mathematics, 225E MLH Iowa City, IA 52242 USA Bob McMurray (bob-mcmurray@uiowa.edu)
More informationSurprise-Based Learning for Autonomous Systems
Surprise-Based Learning for Autonomous Systems Nadeesha Ranasinghe and Wei-Min Shen ABSTRACT Dealing with unexpected situations is a key challenge faced by autonomous robots. This paper describes a promising
More informationVisit us at:
White Paper Integrating Six Sigma and Software Testing Process for Removal of Wastage & Optimizing Resource Utilization 24 October 2013 With resources working for extended hours and in a pressurized environment,
More informationExecutive Guide to Simulation for Health
Executive Guide to Simulation for Health Simulation is used by Healthcare and Human Service organizations across the World to improve their systems of care and reduce costs. Simulation offers evidence
More information1. Answer the questions below on the Lesson Planning Response Document.
Module for Lateral Entry Teachers Lesson Planning Introductory Information about Understanding by Design (UbD) (Sources: Wiggins, G. & McTighte, J. (2005). Understanding by design. Alexandria, VA: ASCD.;
More informationLaboratorio di Intelligenza Artificiale e Robotica
Laboratorio di Intelligenza Artificiale e Robotica A.A. 2008-2009 Outline 2 Machine Learning Unsupervised Learning Supervised Learning Reinforcement Learning Genetic Algorithms Genetics-Based Machine Learning
More informationAn investigation of imitation learning algorithms for structured prediction
JMLR: Workshop and Conference Proceedings 24:143 153, 2012 10th European Workshop on Reinforcement Learning An investigation of imitation learning algorithms for structured prediction Andreas Vlachos Computer
More informationTesting A Moving Target: How Do We Test Machine Learning Systems? Peter Varhol Technology Strategy Research, USA
Testing A Moving Target: How Do We Test Machine Learning Systems? Peter Varhol Technology Strategy Research, USA Testing a Moving Target How Do We Test Machine Learning Systems? Peter Varhol, Technology
More informationAlgebra 2- Semester 2 Review
Name Block Date Algebra 2- Semester 2 Review Non-Calculator 5.4 1. Consider the function f x 1 x 2. a) Describe the transformation of the graph of y 1 x. b) Identify the asymptotes. c) What is the domain
More informationDiscriminative Learning of Beam-Search Heuristics for Planning
Discriminative Learning of Beam-Search Heuristics for Planning Yuehua Xu School of EECS Oregon State University Corvallis,OR 97331 xuyu@eecs.oregonstate.edu Alan Fern School of EECS Oregon State University
More informationEVOLVING POLICIES TO SOLVE THE RUBIK S CUBE: EXPERIMENTS WITH IDEAL AND APPROXIMATE PERFORMANCE FUNCTIONS
EVOLVING POLICIES TO SOLVE THE RUBIK S CUBE: EXPERIMENTS WITH IDEAL AND APPROXIMATE PERFORMANCE FUNCTIONS by Robert Smith Submitted in partial fulfillment of the requirements for the degree of Master of
More informationChallenges in Deep Reinforcement Learning. Sergey Levine UC Berkeley
Challenges in Deep Reinforcement Learning Sergey Levine UC Berkeley Discuss some recent work in deep reinforcement learning Present a few major challenges Show some of our recent work toward tackling
More informationAccelerated Learning Course Outline
Accelerated Learning Course Outline Course Description The purpose of this course is to make the advances in the field of brain research more accessible to educators. The techniques and strategies of Accelerated
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview
More informationIntroduction to Ensemble Learning Featuring Successes in the Netflix Prize Competition
Introduction to Ensemble Learning Featuring Successes in the Netflix Prize Competition Todd Holloway Two Lecture Series for B551 November 20 & 27, 2007 Indiana University Outline Introduction Bias and
More informationCurriculum Design Project with Virtual Manipulatives. Gwenanne Salkind. George Mason University EDCI 856. Dr. Patricia Moyer-Packenham
Curriculum Design Project with Virtual Manipulatives Gwenanne Salkind George Mason University EDCI 856 Dr. Patricia Moyer-Packenham Spring 2006 Curriculum Design Project with Virtual Manipulatives Table
More informationPaper Reference. Edexcel GCSE Mathematics (Linear) 1380 Paper 1 (Non-Calculator) Foundation Tier. Monday 6 June 2011 Afternoon Time: 1 hour 30 minutes
Centre No. Candidate No. Paper Reference 1 3 8 0 1 F Paper Reference(s) 1380/1F Edexcel GCSE Mathematics (Linear) 1380 Paper 1 (Non-Calculator) Foundation Tier Monday 6 June 2011 Afternoon Time: 1 hour
More information