An Introduction to Computational Reinforcement Learning
Andrew G. Barto
Department of Computer Science, University of Massachusetts Amherst
UPF Lecture 2
Autonomous Learning Laboratory, Department of Computer Science
The Overall Plan
Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes
Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience
Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL
A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998
Lecture 2, Part 1: Dynamic Programming
Objectives of this part:
Overview of a collection of classical solution methods for MDPs known as Dynamic Programming (DP)
Show how DP can be used to compute value functions, and hence, optimal policies
Discuss the efficiency and utility of DP
Policy Evaluation
Policy evaluation: for a given policy π, compute the state-value function V^π.
Recall the state-value function for policy π:
V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }
Bellman equation for V^π:
V^π(s) = Σ_a π(s,a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
This is a system of |S| simultaneous linear equations.
Iterative Methods
V_0 → V_1 → … → V_k → V_{k+1} → … → V^π, where each arrow is a sweep.
A sweep consists of applying a backup operation to each state.
A full policy-evaluation backup:
V_{k+1}(s) ← Σ_a π(s,a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V_k(s′) ]
A Small Gridworld
An undiscounted episodic task
Nonterminal states: 1, 2, ..., 14; one terminal state (shown twice, as the shaded squares)
Actions that would take the agent off the grid leave the state unchanged
Reward is −1 on every transition until the terminal state is reached
Iterative Policy Evaluation for the Small Gridworld
π = random (uniform) action choices
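As a concrete check, here is a minimal sketch of iterative policy evaluation on the 4×4 gridworld above, under the equiprobable random policy (the grid encoding, helper names, and convergence threshold are my own assumptions; the two shaded terminal squares are treated as states 0 and 15, both held at value 0).

```python
import numpy as np

def step(s, a):
    """Deterministic gridworld transition; off-grid moves leave the state unchanged."""
    r, c = divmod(s, 4)
    dr, dc = a
    nr, nc = r + dr, c + dc
    if 0 <= nr < 4 and 0 <= nc < 4:
        s = nr * 4 + nc
    return s, -1.0                      # reward is -1 on every transition

ACTIONS = ((-1, 0), (1, 0), (0, -1), (0, 1))   # up, down, left, right

def policy_evaluation(theta=1e-6):
    V = np.zeros(16)
    while True:
        delta = 0.0
        for s in range(16):
            if s in (0, 15):            # terminal corners stay at 0
                continue
            # full backup under the uniform random policy, gamma = 1
            v = sum(0.25 * (r + V[s2]) for s2, r in (step(s, a) for a in ACTIONS))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

V = policy_evaluation()
print(np.round(V.reshape(4, 4), 1))     # e.g. V(1) converges to about -14
```

The in-place (Gauss–Seidel style) sweeps converge to the same fixed point as the Jacobi-style backup on the slide.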
Policy Improvement
Suppose we have computed V^π for a deterministic policy π.
For a given state s, would it be better to do an action a ≠ π(s)?
The value of doing a in state s is:
Q^π(s,a) = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a } = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
It is better to switch to action a for state s if and only if Q^π(s,a) > V^π(s).
Policy Improvement Cont.
Do this for all states to get a new policy π′ that is greedy with respect to V^π:
π′(s) = argmax_a Q^π(s,a) = argmax_a Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
Then V^{π′} ≥ V^π.
Policy Improvement Cont.
What if V^{π′} = V^π? I.e., for all s ∈ S,
V^π(s) = max_a Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]?
But this is the Bellman optimality equation.
So V^{π′} = V^π = V*, and both π and π′ are optimal policies.
Policy Iteration
π_0 → V^{π_0} → π_1 → V^{π_1} → … → π* → V* → π*
Each step alternates policy evaluation with policy improvement (greedification).
Value Iteration
Recall the full policy-evaluation backup:
V_{k+1}(s) ← Σ_a π(s,a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V_k(s′) ]
Here is the full value-iteration backup:
V_{k+1}(s) ← max_a Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V_k(s′) ]
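The value-iteration backup can be sketched on the same 4×4 gridworld (layout and threshold again my own assumptions). Because the task is a shortest-path problem with reward −1 per step, the optimal values are minus the number of steps to the nearest terminal corner.

```python
import numpy as np

ACTIONS = ((-1, 0), (1, 0), (0, -1), (0, 1))
TERMINALS = (0, 15)

def value_iteration(theta=1e-6):
    V = np.zeros(16)
    while True:
        delta = 0.0
        for s in range(16):
            if s in TERMINALS:
                continue
            r, c = divmod(s, 4)
            best = -np.inf
            for dr, dc in ACTIONS:
                nr, nc = r + dr, c + dc
                # off-grid moves leave the state unchanged
                s2 = nr * 4 + nc if 0 <= nr < 4 and 0 <= nc < 4 else s
                best = max(best, -1.0 + V[s2])   # reward -1, gamma = 1
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

V = value_iteration()
print(np.round(V.reshape(4, 4)))   # minus the steps to a terminal corner
```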
Asynchronous DP
All the DP methods described so far require exhaustive sweeps of the entire state set.
Asynchronous DP does not use sweeps. Instead it works like this:
Repeat until a convergence criterion is met: pick a state at random and apply the appropriate backup.
Still needs lots of computation, but does not get locked into hopelessly long sweeps.
Can you select states to back up intelligently? YES: an agent's experience can act as a guide.
Efficiency of DP
Finding an optimal policy is polynomial in the number of states...
BUT, the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
In practice, classical DP can be applied to problems with a few million states.
Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation.
It is surprisingly easy to come up with MDPs for which DP methods are not practical.
Summary
Policy evaluation: backups without a max
Policy improvement: form a greedy policy, if only locally
Policy iteration: alternate the above two processes
Value iteration: backups with a max
Full backups (to be contrasted later with sample backups)
Asynchronous DP: a way to avoid exhaustive sweeps
Bootstrapping: updating estimates based on other estimates
Lecture 2, Part 2: Simple Monte Carlo Methods
Simple Monte Carlo methods learn from complete sample returns
Only defined for episodic tasks
Simple Monte Carlo methods learn directly from experience
On-line: no model necessary
Simulated: no need for a full model
(First-visit) Monte Carlo policy evaluation
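A minimal sketch of first-visit MC policy evaluation, here on the classic five-state random walk (states 1..5, terminals 0 and 6, reward +1 only on exiting to the right, equiprobable policy; the task choice and episode count are my own assumptions). The true values are 1/6, 2/6, ..., 5/6.

```python
import random

random.seed(0)

def episode():
    """One episode of the random walk starting from the center state 3."""
    s, traj = 3, []
    while s not in (0, 6):
        traj.append(s)
        s += random.choice((-1, 1))
    return traj, (1.0 if s == 6 else 0.0)   # undiscounted return = final reward

returns = {s: [] for s in range(1, 6)}
for _ in range(5000):
    traj, G = episode()
    for s in set(traj):                      # first visit only: count each state once
        returns[s].append(G)

# V(s) = average of the returns observed after first visits to s
V = {s: sum(rs) / len(rs) for s, rs in returns.items()}
print({s: round(v, 2) for s, v in V.items()})
```

Since the return is the same from every time step within an episode (no discounting, single terminal reward), averaging over first visits is all that is needed here.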
Backup Diagram for Simple Monte Carlo
The entire episode is included
Only one choice at each state (unlike DP)
MC does not bootstrap
Time required to estimate one state does not depend on the total number of states
Monte Carlo Estimation of Action Values (Q)
Monte Carlo is most useful when a model is not available
Q^π(s,a): average return starting from state s and action a, following π
Also converges asymptotically if every state-action pair is visited infinitely often
We are really interested in estimates of V* and Q*, i.e., Monte Carlo Control
Learning about π while following π′
Summary
MC has several advantages over DP:
Can learn directly from interaction with the environment
No need for full models
No need to learn about ALL states
Less harmed by violations of the Markov property (later in the book)
MC methods provide an alternate policy evaluation process
One issue to watch for: maintaining sufficient exploration (exploring starts, soft policies)
No bootstrapping (as opposed to DP)
Estimating values for one policy while behaving according to another policy: importance sampling
Lecture 2, Part 3: Temporal Difference Learning
Objectives of this part:
Introduce Temporal Difference (TD) learning
Focus first on policy evaluation, or prediction, methods
Then extend to control methods
TD Prediction
Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.
Recall the simple (every-visit) Monte Carlo method:
V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]
target: the actual return after time t
The simplest TD method, TD(0):
V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
target: an estimate of the return
Simple Monte Carlo
V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]   (constant-α MC)
where R_t is the actual return following state s_t.
[Backup diagram: the full sampled trajectory from s_t down to the terminal state T]
cf. Dynamic Programming
V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }
[Backup diagram: one full-width expected step from s_t over r_{t+1} to all possible successors s_{t+1}]
Simplest TD Method
V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
[Backup diagram: one sampled step from s_t over r_{t+1} to s_{t+1}]
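The TD(0) update above can be sketched on the same five-state random walk used for Monte Carlo (my choice of step size and episode count; values initialized to the conventional 0.5):

```python
import random

random.seed(0)

# TD(0): V(s) <- V(s) + alpha [ r + gamma V(s') - V(s) ]
# Terminals 0 and 6 have value 0; reward +1 only on the transition into state 6.
V = {s: 0.5 for s in range(1, 6)}
V[0] = V[6] = 0.0
alpha, gamma = 0.05, 1.0

for _ in range(2000):
    s = 3
    while s not in (0, 6):
        s2 = s + random.choice((-1, 1))
        r = 1.0 if s2 == 6 else 0.0
        V[s] += alpha * (r + gamma * V[s2] - V[s])   # update on every step
        s = s2

print([round(V[s], 2) for s in range(1, 6)])   # approaches 1/6, 2/6, ..., 5/6
```

Unlike the MC version, each state's estimate moves immediately after every transition, before the episode's outcome is known.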
TD Bootstraps and Samples
Bootstrapping: the update involves an estimate
MC does not bootstrap; DP bootstraps; TD bootstraps
Sampling: the update does not involve an expected value
MC samples; DP does not sample; TD samples
Advantages of TD Learning
TD methods do not require a model of the environment, only experience
TD, but not MC, methods can be fully incremental
You can learn before knowing the final outcome: less memory, less peak computation
You can learn without the final outcome, from incomplete sequences
Both MC and TD converge (under certain assumptions), but which is faster/better?
Random Walk Example
Equiprobable transitions, α = 0.1
Values learned by TD(0) after various numbers of episodes
TD and MC on the Random Walk
Data averaged over 100 sequences of episodes
You are the Predictor
Suppose you observe the following 8 episodes:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
V(A)? V(B)?
You are the Predictor
V(A)?
You are the Predictor
The prediction that best matches the training data is V(A) = 0
This minimizes the mean-square error on the training set
This is what a Monte Carlo method gets
If we consider the sequentiality of the problem, then we would set V(A) = 0.75
This is correct for the maximum-likelihood estimate of a Markov model generating the data
i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?)
This is called the certainty-equivalence estimate
This is what TD(0) gets
Learning an Action-Value Function
Estimate Q^π for the current behavior policy π.
After every transition from a nonterminal state s_t, do this:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate.
The update uses the quintuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}), which gives the algorithm its name.
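A minimal sketch of tabular Sarsa on a small task of my own invention: a one-dimensional corridor with states 0..5, start at 0, goal at 5, actions left/right, and reward −1 per step, with an ε-greedy behavior policy.

```python
import random

random.seed(0)

N, EPS, ALPHA, GAMMA = 6, 0.1, 0.5, 1.0
Q = {(s, a): 0.0 for s in range(N) for a in (-1, 1)}   # actions: -1 left, +1 right

def eps_greedy(s):
    if random.random() < EPS:
        return random.choice((-1, 1))
    return max((-1, 1), key=lambda a: Q[(s, a)])

for _ in range(500):
    s = 0
    a = eps_greedy(s)
    while s != N - 1:
        s2 = min(max(s + a, 0), N - 1)       # walls clamp the move
        r = -1.0
        if s2 == N - 1:                       # terminal: Q(s', a') taken as 0
            Q[(s, a)] += ALPHA * (r - Q[(s, a)])
        else:
            a2 = eps_greedy(s2)               # choose a' with the SAME policy (on-policy)
            Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])
            a = a2
        s = s2

# the learned greedy policy should point right in every nonterminal state
print([max((-1, 1), key=lambda a: Q[(s, a)]) for s in range(N - 1)])
```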
Q-Learning: Off-Policy TD Control
One-step Q-learning:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
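For contrast, here is one-step Q-learning on the same style of corridor task (states 0..5, goal 5, reward −1 per step; the task is my own illustrative assumption). The behavior is ε-greedy, but the update bootstraps from max_a Q(s′, a), the greedy target policy, hence off-policy.

```python
import random

random.seed(1)

N, EPS, ALPHA, GAMMA = 6, 0.1, 0.5, 1.0
Q = {(s, a): 0.0 for s in range(N) for a in (-1, 1)}

for _ in range(500):
    s = 0
    while s != N - 1:
        a = (random.choice((-1, 1)) if random.random() < EPS
             else max((-1, 1), key=lambda x: Q[(s, x)]))
        s2 = min(max(s + a, 0), N - 1)
        # target bootstraps from the greedy value of s' (0 if s' is terminal)
        target = -1.0 + (0.0 if s2 == N - 1 else GAMMA * max(Q[(s2, -1)], Q[(s2, 1)]))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

# optimal action values here are Q*(s, right) = -(N - 1 - s)
print([round(Q[(s, 1)], 1) for s in range(N - 1)])
```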
Cliffwalking
ε-greedy, ε = 0.1
Actor-Critic Architecture
[Diagram: the Environment sends situations or states to the Actor and primary rewards to the Primary Critic; the Adaptive Critic produces effective rewards (involving values) that train the Actor, which sends actions back to the Environment]
Actor-Critic Methods
Explicit representation of policy as well as value function
Minimal computation to select actions
Can learn an explicit stochastic policy
Can put constraints on policies
Appealing as psychological and neural models
Actor-Critic Details
The TD error is used to evaluate actions:
δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
If actions are determined by preferences p(s,a) as follows:
π_t(s,a) = Pr{ a_t = a | s_t = s } = e^{p(s,a)} / Σ_b e^{p(s,b)}
then you can update the preferences like this:
p(s_t, a_t) ← p(s_t, a_t) + β δ_t
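A tiny sketch of this actor-critic update on a one-state, two-action bandit-like task of my own invention: action 0 pays 0, action 1 pays 1, and each episode ends immediately (so the critic's TD error reduces to r − V).

```python
import math
import random

random.seed(0)

p = [0.0, 0.0]          # actor: action preferences p(s, a) for the single state
V = 0.0                 # critic: value estimate of the single state
alpha, beta = 0.1, 0.1  # critic and actor step sizes

def softmax(prefs):
    e = [math.exp(x) for x in prefs]
    z = sum(e)
    return [x / z for x in e]

for _ in range(2000):
    probs = softmax(p)
    a = 0 if random.random() < probs[0] else 1
    r = float(a)                 # action 1 pays 1, action 0 pays 0
    delta = r - V                # next state is terminal, so gamma * V(s') = 0
    V += alpha * delta           # critic update
    p[a] += beta * delta         # actor update: TD error reinforces the action taken

print(round(softmax(p)[1], 2))   # probability of the better action grows toward 1
```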
Afterstates
Usually, a state-value function evaluates states in which the agent can take an action.
But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.
Why is this useful? What is this in general?
Summary
TD prediction
Introduced one-step tabular model-free TD methods
Extend prediction to control by employing some form of GPI
On-policy control: Sarsa
Off-policy control: Q-learning
These methods bootstrap and sample, combining aspects of DP and MC methods
Lecture 2, Part 4: Unified Perspective
n-step TD Prediction
Idea: look farther into the future when you do a TD backup (1, 2, 3, …, n steps)
Mathematics of n-step TD Prediction
Monte Carlo: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … + γ^{T−t−1} r_T
TD: R_t^{(1)} = r_{t+1} + γ V_t(s_{t+1})
Use V to estimate the remaining return
n-step TD:
2-step return: R_t^{(2)} = r_{t+1} + γ r_{t+2} + γ² V_t(s_{t+2})
n-step return: R_t^{(n)} = r_{t+1} + γ r_{t+2} + … + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})
Learning with n-step Backups
Backup (on-line or off-line): ΔV_t(s_t) = α [ R_t^{(n)} − V_t(s_t) ]
Error-reduction property of n-step returns:
max_s | E_π{ R_t^{(n)} | s_t = s } − V^π(s) | ≤ γ^n max_s | V(s) − V^π(s) |
(maximum error using the n-step return vs. maximum error using V)
Using this, you can show that n-step methods converge.
Random Walk Examples
How does 2-step TD work here? How about 3-step TD?
A Larger Example
Task: 19-state random walk
Do you think there is an optimal n (for everything)?
Averaging n-step Returns
n-step methods were introduced mainly to help with understanding TD(λ)
Idea: back up an average of several returns
e.g., back up half of the 2-step return and half of the 4-step return:
R_t^{avg} = ½ R_t^{(2)} + ½ R_t^{(4)}
This is called a complex backup; to draw one, draw each component and label it with its weight
Forward View of TD(λ)
TD(λ) is a method for averaging all n-step backups, weighting the n-step backup by λ^{n−1}
λ-return: R_t^λ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R_t^{(n)}
Backup using the λ-return: ΔV_t(s_t) = α [ R_t^λ − V_t(s_t) ]
λ-return Weighting Function
Relation to TD(0) and MC
The λ-return can be rewritten as:
R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t
(the sum covers the n-step returns until termination; the final term applies after termination)
If λ = 1, you get MC:
R_t^λ = (1 − 1) Σ_{n=1}^{T−t−1} 1^{n−1} R_t^{(n)} + 1^{T−t−1} R_t = R_t
If λ = 0, you get TD(0):
R_t^λ = (1 − 0) Σ_{n=1}^{T−t−1} 0^{n−1} R_t^{(n)} + 0^{T−t−1} R_t = R_t^{(1)}
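These limits can be checked numerically on a fixed episode. The rewards and value estimates below are hypothetical data chosen only for illustration (γ = 1, a 3-step episode, so T = 3).

```python
gamma = 1.0
rewards = [1.0, 0.0, 2.0]    # r_{t+1}, r_{t+2}, r_{t+3}; the episode ends at T = 3
values = [0.5, 0.25]         # V(s_{t+1}), V(s_{t+2}); no value needed past termination

def n_step_return(n):
    """R_t^{(n)}: up to n rewards plus a bootstrapped value (none past termination)."""
    G = sum(gamma**k * rewards[k] for k in range(min(n, len(rewards))))
    if n < len(rewards):
        G += gamma**n * values[n - 1]
    return G

def lam_return(lam):
    """R_t^lambda: (1 - lam) sum of weighted n-step returns, plus the tail term."""
    T = len(rewards)
    G = (1 - lam) * sum(lam**(n - 1) * n_step_return(n) for n in range(1, T))
    return G + lam**(T - 1) * n_step_return(T)

print(lam_return(0.0), lam_return(1.0))   # 1.5 (the TD(0) target) and 3.0 (the MC return)
```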
Forward View of TD(λ)
Look forward from each state to determine its update from future states and rewards.
λ-return on the Random Walk
Same 19-state random walk as before
Why do you think intermediate values of λ are best?
Backward View of TD(λ)
The forward view was for theory; the backward view is for mechanism
New variable called the eligibility trace: e_t(s) ∈ ℝ⁺
On each step, decay all traces by γλ and increment the trace for the current state by 1 (an accumulating trace):
e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1    if s = s_t
Backward View
δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)
Shout δ_t backwards over time
The strength of your voice decreases with temporal distance by γλ
On-line Tabular TD(λ)
Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a; observe reward r and next state s′
        δ ← r + γ V(s′) − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + α δ e(s)
            e(s) ← γλ e(s)
        s ← s′
    Until s is terminal
Relation of the Backward View to MC & TD(0)
Using the update rule ΔV_t(s) = α δ_t e_t(s):
As before, if you set λ to 0, you get TD(0)
If you set λ to 1, you get MC, but in a better way:
Can apply TD(1) to continuing tasks
Works incrementally and on-line (instead of waiting until the end of the episode)
Forward View = Backward View
The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating. The book shows:
Σ_{t=0}^{T−1} ΔV_t^{TD}(s) = Σ_{t=0}^{T−1} α I_{ss_t} Σ_{k=t}^{T−1} (γλ)^{k−t} δ_k = Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{ss_t}
(backward updates on the left, forward updates on the right; I_{ss_t} is the indicator that s = s_t; the algebra is shown in the book)
On-line updating with small α is similar.
Control: Sarsa(λ)
Save eligibility traces for state-action pairs instead of just states:
e_t(s,a) = γλ e_{t−1}(s,a) + 1    if s = s_t and a = a_t
e_t(s,a) = γλ e_{t−1}(s,a)        otherwise
Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)
δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)
Sarsa(λ) Algorithm
Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a; observe r, s′
        Choose a′ from s′ using the policy derived from Q (e.g., ε-greedy)
        δ ← r + γ Q(s′, a′) − Q(s, a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + α δ e(s,a)
            e(s,a) ← γλ e(s,a)
        s ← s′; a ← a′
    Until s is terminal
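A sketch of this algorithm with accumulating traces on a short corridor of my own invention (states 0..4, goal at 4, actions left/right, reward −1 per step, ε-greedy behavior):

```python
import random

random.seed(0)

N, ALPHA, GAMMA, LAM, EPS = 5, 0.3, 1.0, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N) for a in (-1, 1)}

def eps_greedy(s):
    if random.random() < EPS:
        return random.choice((-1, 1))
    return max((-1, 1), key=lambda a: Q[(s, a)])

for _ in range(300):
    e = {k: 0.0 for k in Q}               # traces cleared at the start of each episode
    s, a = 0, eps_greedy(0)
    while s != N - 1:
        s2 = min(max(s + a, 0), N - 1)
        a2 = eps_greedy(s2) if s2 != N - 1 else 1   # a' unused when s' is terminal
        q_next = 0.0 if s2 == N - 1 else Q[(s2, a2)]
        delta = -1.0 + GAMMA * q_next - Q[(s, a)]
        e[(s, a)] += 1.0
        for k in Q:                        # update every pair in proportion to its trace
            Q[k] += ALPHA * delta * e[k]
            e[k] *= GAMMA * LAM
        s, a = s2, a2

print([max((-1, 1), key=lambda a: Q[(s, a)]) for s in range(N - 1)])
```

The traces let one rewarding step update every recently visited state-action pair at once, which is what accelerates learning in the gridworld example that follows.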
Sarsa(λ) Gridworld Example
With one trial, the agent has much more information about how to get to the goal (though not necessarily the best way)
Can considerably accelerate learning
Conclusions
Eligibility traces provide an efficient, incremental way to combine MC and TD
Includes advantages of MC (can deal with lack of the Markov property)
Includes advantages of TD (uses the TD error; bootstrapping)
Can significantly speed learning
Does have a cost in computation
TD Error
δ_t = r_t + V_t − V_{t−1}
[Figure: traces of V, δ, and r for regular predictors of reward over an interval, shown early in learning and after learning is complete; on a trial with the reward omitted, δ goes negative at the expected reward time]
Dopamine Neurons and TD Error
W. Schultz et al., Université de Fribourg
Dopamine Modulated Synaptic Plasticity
Basal Ganglia as Adaptive Critic Architecture
Houk, Adams, & Barto, 1995