Chapter 9: Planning and Learning


Objectives of this chapter:
- Use of environment models
- Integration of planning and learning methods

The Original Idea (Sutton, 1990)

Models

- Model: anything the agent can use to predict how the environment will respond to its actions
- Distribution model: a description of all possibilities and their probabilities, e.g., P^a_{ss'} and R^a_{ss'} for all s, s', and a in A(s)
- Sample model: produces sample experiences, e.g., a simulation model
- Both types of models can be used to produce simulated experience
- Often sample models are much easier to come by
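
A minimal sketch of the two model types for a small tabular MDP; the class names, dictionary layout, and example transitions are illustrative assumptions, not from the slides:

```python
import random

# Hypothetical tabular MDP fragment: (state, action) -> list of
# (probability, next_state, reward) triples.
TRANSITIONS = {
    ("s0", "a0"): [(0.8, "s1", 0.0), (0.2, "s0", 0.0)],
    ("s1", "a0"): [(1.0, "goal", 1.0)],
}

class DistributionModel:
    """Enumerates every possible outcome with its probability."""
    def outcomes(self, s, a):
        return TRANSITIONS[(s, a)]

class SampleModel:
    """Returns a single sampled outcome, the way a simulator would."""
    def sample(self, s, a):
        outcomes = TRANSITIONS[(s, a)]
        probs = [p for p, _, _ in outcomes]
        _, s_next, r = random.choices(outcomes, weights=probs, k=1)[0]
        return s_next, r
```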

Planning

- Planning: any computational process that uses a model to create or improve a policy
  (model -> planning -> policy)
- Planning in AI:
  - state-space planning
  - plan-space planning (e.g., partial-order planner)
- We take the following (unusual) view:
  - all state-space planning methods involve computing value functions, either explicitly or implicitly
  - they all apply backups to simulated experience
  (model -> simulated experience -> backups -> values -> policy)

Planning (cont.)

- Classical DP methods are state-space planning methods
- Heuristic search methods are state-space planning methods
- A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning (see the sketch below)
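
A minimal sketch of random-sample one-step tabular Q-planning, assuming a sample model like the one above and a tabular Q dictionary; the names sample_model, ALPHA, and GAMMA are placeholders:

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.95           # step size and discount (assumed values)
Q = defaultdict(float)             # Q[(s, a)] -> estimated action value

def q_planning_step(sample_model, states, actions):
    """One iteration of random-sample one-step tabular Q-planning."""
    # 1. Select a state and an action at random.
    s = random.choice(states)
    a = random.choice(actions)
    # 2. Ask the sample model for a reward and next state.
    s_next, r = sample_model.sample(s, a)
    # 3. Apply a one-step tabular Q-learning backup to the simulated transition.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```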

Learning, Planning, and Acting

- Two uses of real experience:
  - model learning: to improve the model
  - direct RL: to directly improve the value function and policy
- Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.

(Diagram: experience -> direct RL -> value/policy; experience -> model learning -> model -> planning -> value/policy; value/policy -> acting -> experience)

Direct vs. Indirect RL

- Indirect methods make fuller use of experience: they get a better policy with fewer environment interactions
- Direct methods are simpler and are not affected by bad models
- But the two are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can all occur simultaneously and in parallel

The Dyna Architecture (Sutton, 1990)

(Diagram: real experience from the Environment drives both the direct RL update of the policy/value functions and model learning; the Model produces simulated experience via search control, which drives the planning update of the same policy/value functions.)

The Dyna-Q Algorithm

(Algorithm box, with steps labeled direct RL, model learning, and planning; a sketch follows below.)
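
A minimal tabular Dyna-Q sketch, assuming a deterministic environment exposed through an env.step() call; the env interface and the EPSILON, ALPHA, GAMMA, and N_PLANNING settings are assumptions, not part of the slides:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON, N_PLANNING = 0.1, 0.95, 0.1, 50  # assumed settings

Q = defaultdict(float)   # Q[(s, a)]
model = {}               # model[(s, a)] = (reward, next_state); deterministic env assumed

def epsilon_greedy(s, actions):
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def dyna_q_step(env, s, actions):
    # (a)-(c) act in the real environment
    a = epsilon_greedy(s, actions)
    r, s_next = env.step(s, a)
    # (d) direct RL: one-step Q-learning on the real transition
    q_update(s, a, r, s_next, actions)
    # (e) model learning: remember what the environment did
    model[(s, a)] = (r, s_next)
    # (f) planning: N_PLANNING Q-learning backups on simulated transitions
    for _ in range(N_PLANNING):
        sp, ap = random.choice(list(model.keys()))
        rp, sp_next = model[(sp, ap)]
        q_update(sp, ap, rp, sp_next, actions)
    return s_next
```

With N_PLANNING = 0 this reduces to plain one-step Q-learning, which is the "without planning" case in the maze snapshots below; N_PLANNING = 50 corresponds to the "with planning" case.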

Dyna-Q on a Simple Maze

- rewards = 0 until goal, when reward = 1

Dyna-Q Snapshots: Midway in 2nd Episode

(Figure: two maze policies from start S to goal G, one WITHOUT PLANNING (N=0) and one WITH PLANNING (N=50).)

When the Model is Wrong: Blocking Maze

- The changed environment is harder

(Figure: the maze before and after the blocking change, and cumulative reward over 3000 time steps for Dyna-Q+, Dyna-Q, and Dyna-AC.)

Shortcut Maze

- The changed environment is easier

(Figure: the maze before and after the shortcut opens, and cumulative reward over 6000 time steps for Dyna-Q+, Dyna-Q, and Dyna-AC.)

What is Dyna-Q+?

Dyna-Q+ uses an exploration bonus:
- It keeps track of the time since each state-action pair was last tried for real
- An extra reward is added for transitions caused by state-action pairs, based on how long ago they were tried: the longer a pair has gone unvisited, the more reward for visiting it
- The agent actually plans how to visit long-unvisited states
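
A minimal sketch of the Dyna-Q+ bonus inside a planning backup, reusing the model and q_update pieces from the Dyna-Q sketch above; the KAPPA * sqrt(tau) bonus form follows the book's formulation, and KAPPA, last_tried, and current_time are illustrative names:

```python
import math
import random

KAPPA = 1e-3          # small bonus weight (assumed value)
last_tried = {}       # last_tried[(s, a)] = time step the pair was last tried for real
current_time = 0      # incremented on every real environment step

def planning_backup_with_bonus(actions):
    """One Dyna-Q+ planning backup: the reward gets a bonus that grows with
    the time since the pair was last tried for real."""
    # model and q_update come from the Dyna-Q sketch above.
    s, a = random.choice(list(model.keys()))
    r, s_next = model[(s, a)]
    tau = current_time - last_tried.get((s, a), 0)
    r_bonus = r + KAPPA * math.sqrt(tau)     # exploration bonus
    q_update(s, a, r_bonus, s_next, actions)
```

The full Dyna-Q+ agent also lets planning consider actions never yet tried from a visited state; the sketch above shows only the bonus itself.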

Exploration vs. Exploitation

R-Max (Brafman & Tennenholtz, 2003)
- Model-based algorithm
- Classifies states by whether they are sufficiently explored or not ("known" vs. "unknown")
- The optimistic model is one in which unknown states lead to a terminal state with the best possible reward (see the sketch below)
- Solve the optimistic model and follow the resulting policy

UC-RL (Auer & Ortner, 2006)
- Given the uncertainty in the estimated model, picks the world that is consistent with the observations and gives the highest average reward
- Logarithmic regret bounds
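
A minimal sketch of how R-Max's optimistic model could be assembled from empirical estimates, assuming a visit-count threshold M for "known" pairs; R_MAX, M, counts, and empirical_model are illustrative names, not from the slides:

```python
R_MAX = 1.0   # best possible one-step reward (assumed)
M = 10        # visit threshold for a state-action pair to count as "known" (assumed)

def optimistic_model(counts, empirical_model, actions):
    """Known pairs use their empirical estimates; unknown pairs (and the
    fictitious absorbing state) yield R_MAX and stay absorbed."""
    model = {}
    for (s, a), n in counts.items():
        if n >= M:
            model[(s, a)] = empirical_model[(s, a)]        # list of (prob, next_state, reward)
        else:
            model[(s, a)] = [(1.0, "ABSORBING", R_MAX)]    # optimism in the face of uncertainty
    for a in actions:
        model[("ABSORBING", a)] = [(1.0, "ABSORBING", R_MAX)]
    return model
```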

Prioritized Sweeping

- Which states or state-action pairs should be generated during planning?
- Work backwards from states whose values have just changed:
  - Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
  - When a new backup occurs, insert predecessors according to their priorities
  - Always perform backups from the first item in the queue
- Moore and Atkeson, 1993; Peng and Williams, 1993
- Improved prioritized sweeping (McMahan & Gordon, 2005)

A sketch of the priority-queue planning loop follows below.
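
A minimal sketch of that loop, reusing the model, Q, GAMMA, and q_update pieces from the Dyna-Q sketch and adding a predecessors map; THETA and the heap layout are assumptions:

```python
import heapq

THETA = 1e-4          # priority threshold (assumed)
pqueue = []           # max-priority queue, stored as a min-heap of (-priority, s, a)
predecessors = {}     # predecessors[s] = set of (s_bar, a_bar) pairs observed to lead to s

def priority(s, a, actions):
    """How much Q[(s, a)] would change if it were backed up now."""
    r, s_next = model[(s, a)]
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    return abs(r + GAMMA * best_next - Q[(s, a)])

def prioritized_sweeping(n_backups, actions):
    for _ in range(n_backups):
        if not pqueue:
            break
        _, s, a = heapq.heappop(pqueue)           # highest-priority pair first
        r, s_next = model[(s, a)]
        q_update(s, a, r, s_next, actions)        # back up the popped pair
        # Insert predecessors whose values would change enough if backed up.
        for s_bar, a_bar in predecessors.get(s, ()):
            p = priority(s_bar, a_bar, actions)
            if p > THETA:
                heapq.heappush(pqueue, (-p, s_bar, a_bar))
```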

Prioritized Sweeping

Prioritized Sweeping vs. Dyna-Q

- Both use N=5 backups per environmental interaction

Rod Maneuvering (Moore and Atkeson, 1993)

(Figure: the rod-maneuvering task, moving the rod from Start to Goal.)

Full and Sample (One-Step) Backups

Value estimated | Full backups (DP)    | Sample backups (one-step TD)
V_pi(s)         | policy evaluation    | TD(0)
V*(s)           | value iteration      | -
Q_pi(s,a)       | Q-policy evaluation  | Sarsa
Q*(s,a)         | Q-value iteration    | Q-learning

(The original figure shows the backup diagram for each case.)
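
For the Q* row, a minimal sketch contrasting the two kinds of backup on a tabular model; it reuses the Q, ALPHA, GAMMA, and model classes assumed in the earlier sketches:

```python
def full_backup_q(s, a, dist_model, actions):
    """Q-value-iteration (full) backup: expectation over all possible next states."""
    Q[(s, a)] = sum(
        p * (r + GAMMA * max(Q[(s_next, a2)] for a2 in actions))
        for p, s_next, r in dist_model.outcomes(s, a)
    )

def sample_backup_q(s, a, sample_model, actions):
    """Q-learning (sample) backup: one sampled next state, moved toward with step size ALPHA."""
    s_next, r = sample_model.sample(s, a)
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```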

Full vs. Sample Backups

- Mixing rate (stochasticity): b successor states, equally likely; initial error = 1; assume all next-state values are correct

(Figure: estimation error as a function of computation for full vs. sample backups at different branching factors b.)

Trajectory Sampling

- Trajectory sampling: perform backups along simulated trajectories
- This samples from the on-policy distribution
- Advantages when function approximation is used (Chapter 8)
- Focusing of computation: can cause vast, uninteresting parts of the state space to be (usefully) ignored

(Figure: the full state space, the subset reachable under optimal control from the initial states, and the irrelevant states outside it.)
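
A minimal sketch contrasting a uniform sweep with on-policy trajectory sampling of where to apply backups, reusing the sample model, Q, epsilon_greedy, and q_update pieces sketched earlier; start_state, the "goal" terminal check, and max_len are assumptions:

```python
def uniform_planning(sample_model, states, actions, n_backups):
    """Cycle through all state-action pairs, backing each one up in turn."""
    pairs = [(s, a) for s in states for a in actions]
    for i in range(n_backups):
        s, a = pairs[i % len(pairs)]
        s_next, r = sample_model.sample(s, a)
        q_update(s, a, r, s_next, actions)

def on_policy_trajectory_planning(sample_model, start_state, actions, n_backups, max_len=100):
    """Back up only the pairs encountered along simulated on-policy trajectories."""
    done = 0
    while done < n_backups:
        s = start_state
        for _ in range(max_len):
            a = epsilon_greedy(s, actions)
            s_next, r = sample_model.sample(s, a)
            q_update(s, a, r, s_next, actions)
            done += 1
            if s_next == "goal" or done >= n_backups:   # "goal" is the assumed terminal state
                break
            s = s_next
```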

Trajectory Sampling Experiment

- one-step full tabular backups
- uniform: cycled through all state-action pairs
- on-policy: backed up along simulated trajectories
- 200 randomly generated undiscounted episodic tasks
- 2 actions for each state, each with b equally likely next states
- 0.1 probability of transition to the terminal state
- expected reward on each transition drawn from a mean-0, variance-1 Gaussian

(Figure: value of the start state under the greedy policy vs. computation time in full backups, for on-policy and uniform backup distributions with b = 1, 3, 10, on tasks with 1000 states and 10,000 states.)

Heuristic Search

- Used for action selection, not for changing a value function (= heuristic evaluation function)
- Backed-up values are computed, but typically discarded
- Extension of the idea of a greedy policy, only deeper
- Also suggests ways to select states to back up: smart focusing
- UCT (Kocsis & Szepesvári, 2006): the algorithm used in all the best Go programs as of 2007; a 500 ELO increase; MoGo, ...

(Figure: a search tree with nodes numbered in the order in which their backups are performed.)

A sketch of UCT's selection rule follows below.
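
A minimal sketch of UCT's child-selection rule (the UCB1 formula applied at each tree node); the Node class and exploration constant C are illustrative, not from the slides:

```python
import math

C = 1.4  # exploration constant (assumed value)

class Node:
    def __init__(self):
        self.children = {}   # action -> Node
        self.visits = 0      # N(node)
        self.value = 0.0     # running mean of returns backed up through this node

def uct_select(node):
    """Pick the (action, child) pair maximizing mean value plus a UCB exploration term."""
    def score(child):
        if child.visits == 0:
            return float("inf")                      # always try unvisited children first
        return child.value + C * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.items(), key=lambda kv: score(kv[1]))
```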

Summary

- Emphasized the close relationship between planning and learning
- Important distinction between distribution models and sample models
- Looked at some ways to integrate planning and learning: synergy among planning, acting, and model learning
- Distribution of backups: the focus of the computation
  - trajectory sampling: back up along trajectories
  - prioritized sweeping
  - heuristic search
- Size of backups: full vs. sample; deep vs. shallow