An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING
Andrew G. Barto
Autonomous Learning Laboratory, Department of Computer Science, University of Massachusetts Amherst
UPF Lecture 2

The Overall Plan
Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes
Lecture 2: Dynamic Programming; Simple Monte Carlo methods; Temporal Difference methods; A unified perspective; Connections to neuroscience
Lecture 3: Function approximation; Model-based methods; Abstraction and hierarchy; Intrinsically motivated RL
(A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.)

Lecture 2, Part 1: Dynamic Programming
Objectives of this part:
Overview of a collection of classical solution methods for MDPs known as Dynamic Programming (DP)
Show how DP can be used to compute value functions, and hence, optimal policies
Discuss the efficiency and utility of DP

Policy Evaluation
Policy evaluation: for a given policy $\pi$, compute the state-value function $V^\pi$.
Recall the state-value function for policy $\pi$:
$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right\}$
Bellman equation for $V^\pi$:
$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$
a system of $|S|$ simultaneous linear equations

Iterative Methods
$V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^\pi$
A sweep consists of applying a backup operation to each state.
A full policy-evaluation backup:
$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V_k(s')\right]$
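The backup above translates directly into a short program. The following is a minimal sketch (not the slides' own code), assuming the MDP is given as NumPy arrays P[s, a, s'] (transition probabilities), R[s, a, s'] (expected rewards), and pi[s, a] (action probabilities); all of these names are illustrative.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation by repeated full sweeps."""
    S, A = P.shape[0], P.shape[1]
    V = np.zeros(S)
    while True:
        delta = 0.0
        for s in range(S):                      # one sweep over all states
            v_new = sum(pi[s, a] * np.sum(P[s, a] * (R[s, a] + gamma * V))
                        for a in range(A))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                        # in-place (Gauss-Seidel style) update
        if delta < theta:                       # stop when a sweep changes V very little
            return V
```

Updating V[s] in place, rather than from a frozen copy of the previous sweep, usually converges somewhat faster and is the variant most implementations use.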

A Small Gridworld
An undiscounted, episodic task
Nonterminal states: 1, 2, ..., 14; one terminal state (shown twice as shaded squares)
Actions that would take the agent off the grid leave the state unchanged
Reward is -1 on all transitions until the terminal state is reached

Iterative Policy Evaluation for the Small Gridworld
$\pi$ = random (uniform) action choices

Policy Improvement
Suppose we have computed $V^\pi$ for a deterministic policy $\pi$.
For a given state $s$, would it be better to take an action $a \neq \pi(s)$?
The value of doing $a$ in state $s$ is:
$Q^\pi(s,a) = E\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\} = \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$
It is better to switch to action $a$ for state $s$ if and only if $Q^\pi(s,a) > V^\pi(s)$.

Policy Improvement (cont.)
Do this for all states to get a new policy $\pi'$ that is greedy with respect to $V^\pi$:
$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$
Then $V^{\pi'} \geq V^\pi$.
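Greedification is just a one-step lookahead through the model. A minimal sketch in the same illustrative array notation as the policy-evaluation code above:

```python
import numpy as np

def greedy_policy(P, R, V, gamma=0.9):
    """Return a deterministic policy (one action index per state) greedy w.r.t. V."""
    S, A, _ = P.shape
    Q = np.empty((S, A))
    for s in range(S):
        for a in range(A):
            # one-step lookahead: Q(s,a) = sum over s' of P * (R + gamma * V(s'))
            Q[s, a] = np.sum(P[s, a] * (R[s, a] + gamma * V))
    return np.argmax(Q, axis=1)
```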

Policy Improvement (cont.)
What if $V^{\pi'} = V^\pi$? That is, for all $s \in S$,
$V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^{\pi'}(s')\right]$?
But this is the Bellman optimality equation, so $V^{\pi'} = V^*$ and both $\pi$ and $\pi'$ are optimal policies.

Policy Iteration
$\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*$
Each $\pi \to V$ step is policy evaluation; each $V \to \pi$ step is policy improvement (greedification).
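Alternating the two previous sketches gives policy iteration. The loop below is a hypothetical composition of the policy_evaluation and greedy_policy helpers defined above, not code from the slides.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and greedification until the policy is stable."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                 # start from an arbitrary policy
    while True:
        pi_probs = np.eye(A)[pi]                # one-hot pi[s, a] for the evaluator
        V = policy_evaluation(P, R, pi_probs, gamma)
        new_pi = greedy_policy(P, R, V, gamma)  # improvement step
        if np.array_equal(new_pi, pi):          # policy stable -> optimal
            return pi, V
        pi = new_pi
```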

Value Iteration
Recall the full policy-evaluation backup:
$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V_k(s')\right]$
Here is the full value-iteration backup:
$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V_k(s')\right]$
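Value iteration replaces the weighted sum over actions with a max inside each sweep. A minimal sketch in the same assumed array notation (greedy_policy is the helper sketched earlier):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Sweep with the max-backup until V stops changing, then extract a greedy policy."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        delta = 0.0
        for s in range(S):
            # max over actions of the one-step lookahead
            v_new = max(np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(A))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    return greedy_policy(P, R, V, gamma), V
```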

Asynchronous DP
All the DP methods described so far require exhaustive sweeps of the entire state set.
Asynchronous DP does not use sweeps. Instead it works like this:
Repeat until a convergence criterion is met: pick a state at random and apply the appropriate backup.
Still needs lots of computation, but does not get locked into hopelessly long sweeps.
Can you select states to back up intelligently? YES: an agent's experience can act as a guide.

Efficiency of DP
Finding an optimal policy is polynomial in the number of states...
BUT the number of states is often astronomical, e.g., it often grows exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
In practice, classical DP can be applied to problems with a few million states.
Asynchronous DP can be applied to larger problems and is well suited to parallel computation.
It is surprisingly easy to come up with MDPs for which DP methods are not practical.

Summary
Policy evaluation: backups without a max
Policy improvement: form a greedy policy, if only locally
Policy iteration: alternate the above two processes
Value iteration: backups with a max
Full backups (to be contrasted later with sample backups)
Asynchronous DP: a way to avoid exhaustive sweeps
Bootstrapping: updating estimates based on other estimates

Lecture 2, Part 2: Simple Monte Carlo Methods
Simple Monte Carlo methods learn from complete sample returns
Only defined for episodic tasks
Simple Monte Carlo methods learn directly from experience
On-line: no model necessary
Simulated: no need for a full model

(First-visit) Monte Carlo policy evaluation
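The slide's algorithm box can be sketched as follows, assuming a generate_episode() function (an illustrative name, not from the slides) that returns the list of (state, reward) pairs obtained by following the policy being evaluated, with each reward being the one received on leaving that state.

```python
from collections import defaultdict

def first_visit_mc(generate_episode, gamma=1.0, n_episodes=1000):
    """First-visit Monte Carlo policy evaluation (tabular)."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode = generate_episode()
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)        # index of the first visit to s
        G = 0.0
        for t in reversed(range(len(episode))): # accumulate returns backwards
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:             # average only first-visit returns
                returns_sum[s] += G
                returns_cnt[s] += 1
                V[s] = returns_sum[s] / returns_cnt[s]
    return V
```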

Backup Diagram for Simple Monte Carlo
Entire episode included
Only one choice at each state (unlike DP)
MC does not bootstrap
Time required to estimate one state does not depend on the total number of states

Monte Carlo Estimation of Action Values (Q)
Monte Carlo is most useful when a model is not available
$Q^\pi(s,a)$: average return starting from state $s$ and action $a$, then following $\pi$
Also converges asymptotically if every state-action pair is visited infinitely often
We are really interested in estimates of $V^*$ and $Q^*$, i.e., Monte Carlo control

Learning about $\pi$ while following $\pi'$

Summary
MC has several advantages over DP:
Can learn directly from interaction with the environment
No need for full models
No need to learn about ALL states
Less harm from violations of the Markov property (later in the book)
MC methods provide an alternate policy evaluation process
One issue to watch for: maintaining sufficient exploration (exploring starts, soft policies)
No bootstrapping (as opposed to DP)
Estimating values for one policy while behaving according to another: importance sampling

Lecture 2, Part 3: Temporal Difference Learning
Objectives of this part:
Introduce Temporal Difference (TD) learning
Focus first on policy evaluation, or prediction, methods
Then extend to control methods

TD Prediction
Policy evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^\pi$.
Recall the simple (every-visit) Monte Carlo method:
$V(s_t) \leftarrow V(s_t) + \alpha\left[R_t - V(s_t)\right]$
target: the actual return after time $t$
The simplest TD method, TD(0):
$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$
target: an estimate of the return

Simple Monte Carlo
$V(s_t) \leftarrow V(s_t) + \alpha\left[R_t - V(s_t)\right]$   (constant-$\alpha$ MC)
where $R_t$ is the actual return following state $s_t$.

cf. Dynamic Programming
$V(s_t) \leftarrow E_\pi\{r_{t+1} + \gamma V(s_{t+1})\}$

Simplest TD Method
$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$
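The TD(0) update fits in a few lines. A minimal prediction sketch, assuming a Gym-style environment with reset() and step(action) returning (next_state, reward, done, info), and a policy(state) function; these interfaces are assumptions, not part of the slides.

```python
from collections import defaultdict

def td0_prediction(env, policy, alpha=0.1, gamma=1.0, n_episodes=1000):
    """Tabular TD(0) policy evaluation."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done, _ = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])     # move V(s) toward the bootstrapped target
            s = s_next
    return V
```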

TD Bootstraps and Samples
Bootstrapping: the update involves an estimate
MC does not bootstrap; DP bootstraps; TD bootstraps
Sampling: the update does not involve an expected value
MC samples; DP does not sample; TD samples

Advantages of TD Learning
TD methods do not require a model of the environment, only experience
TD, but not MC, methods can be fully incremental
You can learn before knowing the final outcome: less memory, less peak computation
You can learn without the final outcome, from incomplete sequences
Both MC and TD converge (under certain assumptions), but which is faster/better?

Random Walk Example
Equiprobable transitions, $\alpha = 0.1$
Values learned by TD(0) after various numbers of episodes

TD and MC on the Random Walk
Data averaged over 100 sequences of episodes

You are the Predictor
Suppose you observe the following 8 episodes:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
V(A)? V(B)?

You are the Predictor
V(A)?

You are the Predictor
The prediction that best matches the training data is V(A) = 0
This minimizes the mean-squared error on the training set
This is what a Monte Carlo method gets
If we consider the sequentiality of the problem, then we would set V(A) = 0.75
This is correct for the maximum-likelihood estimate of a Markov model generating the data,
i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?)
This is called the certainty-equivalence estimate
This is what TD(0) gets

Learning an Action-Value Function
Estimate $Q^\pi$ for the current behavior policy $\pi$.
After every transition from a nonterminal state $s_t$, do this:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$
If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$.

Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate.
The update uses the quintuple $(s, a, r, s', a')$, which gives the method its name.
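A minimal tabular Sarsa sketch with an ε-greedy behavior policy, using the same assumed Gym-style environment interface as the TD(0) sketch above:

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, alpha=0.5, gamma=1.0, epsilon=0.1, n_episodes=500):
    """Tabular Sarsa: on-policy TD control with an epsilon-greedy policy."""
    Q = defaultdict(float)                      # Q[(state, action)]

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = eps_greedy(s2)
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2                       # on-policy: bootstrap from the action actually taken
    return Q
```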

Q-Learning: Off-Policy TD Control
One-step Q-learning:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right]$
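The only change from the Sarsa sketch is the target: Q-learning bootstraps from the greedy (max) action value regardless of the action the behavior policy will actually take next. A minimal sketch under the same assumed interfaces:

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, alpha=0.5, gamma=1.0, epsilon=0.1, n_episodes=500):
    """Tabular one-step Q-learning: off-policy TD control."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:       # epsilon-greedy behavior policy
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[(s, x)])
            s2, r, done, _ = env.step(a)
            best_next = 0.0 if done else max(Q[(s2, x)] for x in range(n_actions))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # max-backup target
            s = s2
    return Q
```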

Cliffwalking
$\epsilon$-greedy, $\epsilon = 0.1$

Actor-Critic Architecture
(Block diagram: Environment, Actor, Primary Critic, and Adaptive Critic; labels include Actions, Situations or States, Primary Rewards, and Effective Rewards (involves values).)

Actor-Critic Methods
Explicit representation of the policy as well as the value function
Minimal computation needed to select actions
Can learn an explicit stochastic policy
Can put constraints on policies
Appealing as psychological and neural models

Actor-Critic Details
The TD error is used to evaluate actions:
$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
If actions are determined by preferences $p(s,a)$ as follows:
$\pi_t(s,a) = \Pr\{a_t = a \mid s_t = s\} = \frac{e^{p(s,a)}}{\sum_b e^{p(s,b)}}$
then you can update the preferences like this:
$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$
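Putting the critic (TD(0) on V) and the actor (a softmax over preferences, nudged by the TD error) together gives a small tabular actor-critic. A sketch under the same assumed Gym-style interface; alpha is the critic's step size and beta the actor's, following the slide's notation.

```python
import math
import random
from collections import defaultdict

def actor_critic(env, n_actions, alpha=0.1, beta=0.1, gamma=1.0, n_episodes=500):
    """Tabular actor-critic with softmax action preferences."""
    V = defaultdict(float)                      # critic: state values
    p = defaultdict(float)                      # actor: preferences p[(s, a)]

    def softmax_action(s):
        prefs = [p[(s, a)] for a in range(n_actions)]
        m = max(prefs)
        weights = [math.exp(q - m) for q in prefs]   # subtract max for numerical stability
        return random.choices(range(n_actions), weights=weights)[0]

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = softmax_action(s)
            s2, r, done, _ = env.step(a)
            delta = r + (0.0 if done else gamma * V[s2]) - V[s]   # TD error
            V[s] += alpha * delta                                  # critic update
            p[(s, a)] += beta * delta                              # actor update
            s = s2
    return p, V
```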

Afterstates
Usually, a state-value function evaluates states in which the agent can take an action.
But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.
Why is this useful? What is this in general?

Summary
TD prediction
Introduced one-step, tabular, model-free TD methods
Extend prediction to control by employing some form of GPI
On-policy control: Sarsa
Off-policy control: Q-learning
These methods bootstrap and sample, combining aspects of DP and MC methods

Lecture 2, Part 4: Unified Perspective

n-step TD Prediction
Idea: look farther into the future when you do a TD backup (1, 2, 3, ..., n steps)

Mathematics of n-step TD Prediction
Monte Carlo: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$
TD: $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
Use $V$ to estimate the remaining return.
n-step TD:
2-step return: $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$
n-step return: $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$
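A small helper that computes the n-step return from a recorded episode. This is an illustrative sketch, not the slides' code; rewards[k] is assumed to hold r_{k+1}, the reward received on leaving states[k], and V is a dict of current state-value estimates.

```python
def n_step_return(rewards, V, states, t, n, gamma=1.0):
    """Compute R_t^(n): n discounted rewards plus a bootstrapped tail from V."""
    T = len(rewards)                            # episode length
    horizon = min(t + n, T)
    G = 0.0
    for k in range(t, horizon):                 # discounted sum of up to n rewards
        G += (gamma ** (k - t)) * rewards[k]
    if horizon < T:                             # bootstrap unless the episode has ended
        G += (gamma ** n) * V[states[horizon]]
    return G
```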

Learning with n-step Backups
Backup (on-line or off-line): $\Delta V_t(s_t) = \alpha\left[R_t^{(n)} - V_t(s_t)\right]$
Error-reduction property of n-step returns:
$\max_s \left| E_\pi\{R_t^{(n)} \mid s_t = s\} - V^\pi(s) \right| \le \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$
(maximum error using the n-step return vs. maximum error using $V$)
Using this, you can show that n-step methods converge.

Random Walk Examples
How does 2-step TD work here? How about 3-step TD?

A Larger Example
Task: 19-state random walk
Do you think there is an optimal n (for everything)?

Averaging n-step Returns
n-step methods were introduced to help in understanding TD(λ)
Idea: back up an average of several returns, e.g., half of the 2-step return and half of the 4-step return:
$R_t^{avg} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$
This is one backup, called a complex backup: draw each component and label it with the weight for that component.

Forward View of TD(λ)
TD(λ) is a method for averaging all n-step backups, weighted by $\lambda^{n-1}$ (time since visitation)
λ-return: $R_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
Backup using the λ-return: $\Delta V_t(s_t) = \alpha\left[R_t^\lambda - V_t(s_t)\right]$
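For an episodic task the λ-return can be computed directly from the n-step returns. A sketch that reuses the hypothetical n_step_return helper above; the weight left over past termination goes to the complete (Monte Carlo) return, matching the rewritten form given on the "Relation to TD(0) and MC" slide below.

```python
def lambda_return(rewards, V, states, t, lam, gamma=1.0):
    """Compute R_t^lambda by weighting n-step returns with (1 - lam) * lam**(n-1)."""
    T = len(rewards)
    G_lambda = 0.0
    for n in range(1, T - t):                   # n-step returns that still bootstrap from V
        G_lambda += (1 - lam) * (lam ** (n - 1)) * n_step_return(rewards, V, states, t, n, gamma)
    # all remaining weight, lam**(T-t-1), goes to the full return
    G_lambda += (lam ** (T - t - 1)) * n_step_return(rewards, V, states, t, T - t, gamma)
    return G_lambda
```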

λ-return Weighting Function

Relation to TD(0) and MC
The λ-return can be rewritten as:
$R_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$
(the sum covers the returns computed until termination; the last term is the weight given to the complete return after termination)
If λ = 1, you get MC:
$R_t^\lambda = (1-1) \sum_{n=1}^{T-t-1} 1^{n-1} R_t^{(n)} + 1^{T-t-1} R_t = R_t$
If λ = 0, you get TD(0):
$R_t^\lambda = (1-0) \sum_{n=1}^{T-t-1} 0^{n-1} R_t^{(n)} + 0^{T-t-1} R_t = R_t^{(1)}$

Forward View of TD(λ)
Look forward from each state to determine the update from future states and rewards.

λ-return on the Random Walk
Same 19-state random walk as before
Why do you think intermediate values of λ are best?

Backward View of TD(λ)
The forward view was for theory; the backward view is for mechanism.
New variable called the eligibility trace: $e_t(s) \in \Re^+$
On each step, decay all traces by $\gamma\lambda$ and increment the trace for the current state by 1 (accumulating trace):
$e_t(s) = \gamma\lambda e_{t-1}(s)$ if $s \ne s_t$, and $e_t(s) = \gamma\lambda e_{t-1}(s) + 1$ if $s = s_t$

Backward View
$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$
Shout $\delta_t$ backwards over time.
The strength of your voice decreases with temporal distance by $\gamma\lambda$.

On-line Tabular TD(λ)
Initialize V(s) arbitrarily and e(s) = 0, for all s in S
Repeat (for each episode):
  Initialize s
  Repeat (for each step of episode):
    a ← action given by π for s
    Take action a; observe reward r and next state s'
    δ ← r + γV(s') - V(s)
    e(s) ← e(s) + 1
    For all s:
      V(s) ← V(s) + αδe(s)
      e(s) ← γλe(s)
    s ← s'
  Until s is terminal
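The same algorithm in Python, as a minimal sketch under the assumed Gym-style environment and policy interfaces used in the earlier sketches:

```python
from collections import defaultdict

def td_lambda(env, policy, lam=0.8, alpha=0.1, gamma=1.0, n_episodes=500):
    """On-line tabular TD(lambda) with accumulating eligibility traces."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        e = defaultdict(float)                  # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s2, r, done, _ = env.step(a)
            delta = r + (0.0 if done else gamma * V[s2]) - V[s]
            e[s] += 1.0                         # accumulating trace for the current state
            for state in list(e.keys()):        # broadcast delta to every eligible state
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam         # then decay every trace
            s = s2
    return V
```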

Relation of the Backward View to MC and TD(0)
Using the update rule $\Delta V_t(s) = \alpha \delta_t e_t(s)$:
As before, if you set λ to 0, you get TD(0)
If you set λ to 1, you get MC, but in a better way:
Can apply TD(1) to continuing tasks
Works incrementally and on-line (instead of waiting until the end of the episode)

Forward View = Backward View
The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating.
The book shows:
$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) I_{ss_t} = \sum_{t=0}^{T-1} \alpha I_{ss_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k = \sum_{t=0}^{T-1} \Delta V_t^\lambda(s_t) I_{ss_t}$
(backward updates on the left, forward updates on the right; the algebra is shown in the book)
On-line updating with small α is similar.

Control: Sarsa(λ)
Save eligibility traces for state-action pairs instead of just states:
$e_t(s,a) = \gamma\lambda e_{t-1}(s,a) + 1$ if $s = s_t$ and $a = a_t$, and $e_t(s,a) = \gamma\lambda e_{t-1}(s,a)$ otherwise
$Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$
$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$

Sarsa(λ) Algorithm
Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
  Initialize s, a
  Repeat (for each step of episode):
    Take action a; observe r, s'
    Choose a' from s' using the policy derived from Q (e.g., ε-greedy)
    δ ← r + γQ(s',a') - Q(s,a)
    e(s,a) ← e(s,a) + 1
    For all s, a:
      Q(s,a) ← Q(s,a) + αδe(s,a)
      e(s,a) ← γλe(s,a)
    s ← s'; a ← a'
  Until s is terminal
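A minimal Python sketch of Sarsa(λ) with accumulating traces, again assuming the Gym-style environment interface used in the earlier sketches:

```python
import random
from collections import defaultdict

def sarsa_lambda(env, n_actions, lam=0.9, alpha=0.5, gamma=1.0,
                 epsilon=0.1, n_episodes=500):
    """Tabular Sarsa(lambda): on-policy control with eligibility traces."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        e = defaultdict(float)                  # traces over (state, action) pairs
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = eps_greedy(s2)
            delta = r + (0.0 if done else gamma * Q[(s2, a2)]) - Q[(s, a)]
            e[(s, a)] += 1.0
            for key in list(e.keys()):          # update all eligible pairs, then decay
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam
            s, a = s2, a2
    return Q
```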

Sarsa(λ) Gridworld Example
With one trial, the agent has much more information about how to get to the goal (though not necessarily the best way)
Can considerably accelerate learning

Conclusions
Eligibility traces provide an efficient, incremental way to combine MC and TD
Includes advantages of MC (can deal with lack of the Markov property)
Includes advantages of TD (uses the TD error; bootstrapping)
Can significantly speed learning
Does have a cost in computation

TD Error
$\delta_t = r_t + V_t - V_{t-1}$
(Figure: traces of V, δ, and r for regular predictors of z over this interval, shown early in learning, with learning complete, and with r omitted.)

Dopamine Neurons and TD Error
W. Schultz et al., Université de Fribourg

Dopamine-Modulated Synaptic Plasticity

Basal Ganglia as Adaptive Critic Architecture
Houk, Adams, & Barto, 1995
