Monte Carlo is important in practice

Monte Carlo is important in practice. Absolutely. When there are just a few possibilities to value, out of a large state space, Monte Carlo is a big win: Backgammon, Go.

Chapter 6: Temporal Difference Learning (from R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction). Objectives of this chapter: introduce Temporal Difference (TD) learning; focus first on policy evaluation, or prediction, methods; then extend to control methods.

TD Prediction. Policy evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^\pi$. Recall the simple every-visit Monte Carlo method: $V(s_t) \leftarrow V(s_t) + \alpha\,[R_t - V(s_t)]$, where the target $R_t$ is the actual return after time $t$. The simplest TD method, TD(0): $V(s_t) \leftarrow V(s_t) + \alpha\,[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$, where the target is an estimate of the return.
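
For concreteness, here is a minimal tabular sketch of the TD(0) update above, in Python. The environment interface (`env.reset()`, `env.step(action)` returning `(next_state, reward, done)`), the `policy(state)` callable, and the step-size values are illustrative assumptions, not part of the slides.

```python
# Tabular TD(0) policy evaluation: a minimal sketch under assumed interfaces.
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    V = defaultdict(float)  # state-value estimates, default 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])  # estimated return
            V[s] += alpha * (target - V[s])                    # TD(0) update
            s = s_next
    return V
```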

Simple Monte Carlo: $V(s_t) \leftarrow V(s_t) + \alpha\,[R_t - V(s_t)]$, where $R_t$ is the actual return following state $s_t$.

Simplest TD Method: $V(s_t) \leftarrow V(s_t) + \alpha\,[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$.

cf. Dynamic Programming: $V(s_t) \leftarrow E_\pi\{\,r_{t+1} + \gamma V(s_{t+1})\,\}$.

TD methods bootstrap and sample. Bootstrapping: the update involves an estimate. MC does not bootstrap; DP bootstraps; TD bootstraps. Sampling: the update does not involve an expected value. MC samples; DP does not sample; TD samples.

Example: Driving Home

State                 Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
leaving office                 0                   30                      30
reach car, raining             5                   35                      40
exit highway                  20                   15                      35
behind truck                  30                   10                      40
home street                   40                    3                      43
arrive home                   43                    0                      43

(Elapsed time on each leg: 5, 15, 10, 10, and 3 minutes.)

Driving Home. Changes recommended by Monte Carlo methods ($\alpha = 1$) versus changes recommended by TD methods ($\alpha = 1$).
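
As a small worked check of the driving-home example, the snippet below computes the changes each method would recommend with $\alpha = 1$: Monte Carlo moves every prediction toward the actual total time (43 minutes), while TD(0) moves each prediction toward the next one. The numbers come straight from the table above; the script itself is only illustrative.

```python
# MC vs TD(0) changes for the driving-home predictions, with alpha = 1.
states = ["leaving office", "reach car, raining", "exit highway",
          "behind truck", "home street"]
predicted_total = [30, 40, 35, 40, 43, 43]  # includes the final "arrive home" prediction
actual_total = 43

for i, s in enumerate(states):
    mc_change = actual_total - predicted_total[i]             # toward the actual return
    td_change = predicted_total[i + 1] - predicted_total[i]   # toward the next estimate
    print(f"{s:20s}  MC: {mc_change:+3d}   TD(0): {td_change:+3d}")
```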

Advantages of TD Learning. TD methods do not require a model of the environment, only experience. TD, but not MC, methods can be fully incremental: you can learn before knowing the final outcome (less memory, less peak computation), and you can learn without the final outcome (from incomplete sequences). Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

Random Walk Example. Values learned by TD(0) after various numbers of episodes.

TD and MC on the Random Walk. Data averaged over 100 sequences of episodes.
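
The random-walk comparison can be reproduced with a few lines of Python. The sketch below assumes the standard five-state walk (start in the center state, reward 1 for exiting on the right, 0 on the left, true values 1/6 through 5/6); the parameter values and function names are illustrative, not taken from the slides.

```python
# Minimal sketch of the 5-state random walk: TD(0) vs constant-alpha MC.
import random

TRUE_V = [i / 6 for i in range(1, 6)]  # true values of states A..E

def run_episode():
    """Return the visited nonterminal states (indices 0..4) and the terminal reward."""
    s, visited = 2, []              # start in the center state C
    while True:
        visited.append(s)
        s += random.choice((-1, 1))
        if s < 0:
            return visited, 0.0     # exited on the left
        if s > 4:
            return visited, 1.0     # exited on the right

def estimate(method, episodes=100, alpha=0.1):
    V = [0.5] * 5                   # initial guess
    for _ in range(episodes):
        visited, reward = run_episode()
        if method == "MC":
            for s in visited:                      # every-visit constant-alpha MC
                V[s] += alpha * (reward - V[s])    # undiscounted return == terminal reward
        else:                                      # TD(0), applied step by step
            for i, s in enumerate(visited):
                last = (i + 1 == len(visited))
                r = reward if last else 0.0
                v_next = 0.0 if last else V[visited[i + 1]]
                V[s] += alpha * (r + v_next - V[s])
    return V

print("true: ", [round(v, 2) for v in TRUE_V])
print("TD(0):", [round(v, 2) for v in estimate("TD")])
print("MC:   ", [round(v, 2) for v in estimate("MC")])
```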

Optimality of TD(0). Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence. Compute updates according to TD(0), but only update estimates after each complete pass through the data. For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small $\alpha$. Constant-$\alpha$ MC also converges under these conditions, but to a different answer!

Random Walk under Batch Updating. After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All repeated 100 times.

You are the Predictor. Suppose you observe the following 8 episodes:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
V(A)? V(B)?

You are the Predictor. V(A)?

You are the Predictor. The prediction that best matches the training data is V(A) = 0. This minimizes the mean-squared error on the training set; this is what a batch Monte Carlo method gets. If we consider the sequentiality of the problem, then we would set V(A) = 0.75. This is correct for the maximum-likelihood estimate of a Markov model generating the data, i.e., if we fit a Markov model, assume it is exactly correct, and then compute what it predicts (how?). This is called the certainty-equivalence estimate. This is what TD(0) gets.
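
The two answers can be checked directly from the eight episodes; the small script below is only illustrative, not part of the slides.

```python
# Batch MC vs certainty-equivalence for the eight episodes above.
# Each episode is a list of (state, reward) pairs.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch Monte Carlo: V(A) = mean (undiscounted) return observed from A.
returns_from_A = [sum(r for _, r in ep) for ep in episodes if ep[0][0] == "A"]
print("batch MC              V(A) =", sum(returns_from_A) / len(returns_from_A))  # 0.0

# Certainty-equivalence: fit the maximum-likelihood Markov model and solve it.
# Every A transitions to B with reward 0; B terminates with reward 1 in 6 of 8 visits.
V_B = 6 / 8
V_A = 0 + V_B
print("certainty-equivalence V(A) =", V_A)                                         # 0.75
```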

Learning an Action-Value Function. Estimate $Q^\pi$ for the current behavior policy $\pi$. After every transition from a nonterminal state $s_t$, do this: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]$. If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$.

Sarsa: On-Policy TD Control. Turn this into a control method by always updating the policy to be greedy with respect to the current estimate, as in the sketch below.
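
A minimal tabular Sarsa sketch with ε-greedy action selection follows. The environment interface (`env.reset()`, `env.step()` returning `(next_state, reward, done)`), `n_actions`, and the parameter values are assumptions for illustration; the update itself is the action-value rule on the previous slide.

```python
# Tabular Sarsa with epsilon-greedy action selection: a minimal sketch.
import random
from collections import defaultdict

def sarsa(env, n_actions, num_episodes, alpha=0.1, gamma=1.0, eps=0.1):
    Q = defaultdict(float)  # Q[(state, action)], default 0

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                Q[(s, a)] += alpha * (r - Q[(s, a)])          # Q(terminal, .) = 0
            else:
                a_next = eps_greedy(s_next)
                target = r + gamma * Q[(s_next, a_next)]
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s, a = s_next, a_next
    return Q
```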

Windy Gridworld. Undiscounted, episodic, reward = -1 until the goal is reached.

Results of Sarsa on the Windy Gridworld.

Q-Learning: Off-Policy TD Control. One-step Q-learning: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$.
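
The corresponding one-step Q-learning update, reusing the same hypothetical `Q` table and environment conventions as the Sarsa sketch above, might look like this:

```python
# One-step Q-learning update (off-policy): a minimal sketch.
def q_learning_step(Q, s, a, r, s_next, done, n_actions, alpha=0.1, gamma=1.0):
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in range(n_actions))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```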

Cliffwalking. ε-greedy, ε = 0.1.

The Book.
Part I: The Problem — Introduction; Evaluative Feedback; The Reinforcement Learning Problem.
Part II: Elementary Solution Methods — Dynamic Programming; Monte Carlo Methods; Temporal Difference Learning.
Part III: A Unified View — Eligibility Traces; Generalization and Function Approximation; Planning and Learning; Dimensions of Reinforcement Learning; Case Studies.

Unified View.

Actor-Critic Methods. Explicit representation of the policy as well as the value function. Minimal computation to select actions. Can learn an explicit stochastic policy. Can put constraints on policies. Appealing as psychological and neural models.

Actor-Critic Details. The TD error is used to evaluate actions: $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$. If actions are determined by preferences $p(s,a)$ as follows: $\pi_t(s,a) = \Pr\{a_t = a \mid s_t = s\} = \frac{e^{p(s,a)}}{\sum_b e^{p(s,b)}}$, then you can update the preferences like this: $p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta\,\delta_t$.
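
A compact sketch of these actor-critic updates (softmax preferences for the actor, TD error for the critic) is below; the step sizes `alpha` and `beta`, and the dictionary-based data structures, are illustrative choices rather than anything prescribed by the slides.

```python
# Actor-critic with softmax (Gibbs) preferences and a TD-error critic: a sketch.
import math
import random
from collections import defaultdict

V = defaultdict(float)      # critic: state values
pref = defaultdict(float)   # actor: preferences p(s, a)

def select_action(s, n_actions):
    """Sample an action from the softmax policy over the preferences."""
    weights = [math.exp(pref[(s, a)]) for a in range(n_actions)]
    total = sum(weights)
    return random.choices(range(n_actions), weights=[w / total for w in weights])[0]

def actor_critic_step(s, a, r, s_next, done, alpha=0.1, beta=0.1, gamma=1.0):
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]  # TD error
    V[s] += alpha * delta                 # critic update
    pref[(s, a)] += beta * delta          # actor update: adjust the chosen action's preference
```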

Dopamine Neurons and TD Error (W. Schultz et al., Université de Fribourg).

Average Reward Per Time Step. Average expected reward per time step under policy $\pi$: $\rho^\pi = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E_\pi\{r_t\}$ (the same for each state if ergodic). Value of a state relative to $\rho^\pi$: $V^\pi(s) = \sum_{k=1}^{\infty} E_\pi\{r_{t+k} - \rho^\pi \mid s_t = s\}$. Value of a state-action pair relative to $\rho^\pi$: $Q^\pi(s,a) = \sum_{k=1}^{\infty} E_\pi\{r_{t+k} - \rho^\pi \mid s_t = s, a_t = a\}$.

R-Learning.
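
As a hedged companion to the R-learning slide, here is one step of the tabular R-learning updates as they are commonly presented (average-reward, off-policy control); the helper signature, step sizes, and data structures are illustrative assumptions.

```python
# One R-learning step: a sketch of the commonly presented tabular updates.
# `Q` maps (state, action) to values; `rho` is the average-reward estimate.
def r_learning_step(Q, rho, s, a, r, s_next, actions, alpha=0.1, beta=0.01):
    max_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r - rho + max_next - Q[(s, a)])
    max_curr = max(Q[(s, b)] for b in actions)
    # Adjust the average-reward estimate only when the chosen action was greedy.
    if Q[(s, a)] == max_curr:
        rho += beta * (r - rho + max_next - max_curr)
    return rho
```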

Access-Control Queuing Task. There are n servers. Customers have four different priorities, which pay a reward of 1, 2, 4, or 8 if served. At each time step, the customer at the head of the queue is either accepted (assigned to a server) or removed from the queue. The proportion of randomly distributed high-priority customers in the queue is h. A busy server becomes free with probability p on each time step. Statistics of arrivals and departures are unknown. Apply R-learning with n = 10, h = 0.5, p = 0.06.

Afterstates. Usually, a state-value function evaluates states in which the agent can take an action. But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe. Why is this useful? What is this in general?

Summary. TD prediction. Introduced one-step, tabular, model-free TD methods. Extend prediction to control by employing some form of GPI: on-policy control (Sarsa); off-policy control (Q-learning and R-learning). These methods bootstrap and sample, combining aspects of DP and MC methods.

Questions. What can I tell you about RL? What is common to all three classes of methods (DP, MC, TD)? What are the principal strengths and weaknesses of each? In what sense is our RL view complete? In what senses is it incomplete? What are the principal things missing? The broad applicability of these ideas. What does the term bootstrapping refer to? What is the relationship between DP and learning?