Chapter 9: Planning and Learning
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Objectives of this chapter:
- Use of environment models
- Integration of planning and learning methods

Models
- Model: anything the agent can use to predict how the environment will respond to its actions
- Distribution model: description of all possibilities and their probabilities
  - e.g., P^a_{ss'} and R^a_{ss'} for all s, s', and a ∈ A(s)
- Sample model: produces sample experiences
  - e.g., a simulation model
- Both types of models can be used to produce simulated experience
- Often sample models are much easier to come by

Planning
- Planning: any computational process that uses a model to create or improve a policy
- Planning in AI:
  - state-space planning
  - plan-space planning (e.g., partial-order planner)
- We take the following (unusual) view:
  - all state-space planning methods involve computing value functions, either explicitly or implicitly
  - they all apply backups to simulated experience

Planning Cont.
- Classical DP methods are state-space planning methods
- Heuristic search methods are state-space planning methods
- A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning (sketched below)
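A minimal sketch of Random-Sample One-Step Tabular Q-Planning, assuming a sample model exposed as a function sample_model(s, a) that returns (next_state, reward, done); the interface, helper names, and parameter values are illustrative, not taken from the slides.

```python
import random
from collections import defaultdict

def q_planning(sample_model, states, actions, n_updates, alpha=0.1, gamma=0.95):
    """Random-Sample One-Step Tabular Q-Planning (sketch).

    sample_model(s, a) -> (next_state, reward, done) is an assumed
    sample-model interface; it could be a learned model or a simulator.
    """
    Q = defaultdict(float)                      # Q[(s, a)], initialized to 0
    for _ in range(n_updates):
        # 1. select a state-action pair at random
        s = random.choice(states)
        a = random.choice(actions)
        # 2. ask the sample model for a simulated transition
        s_next, r, done = sample_model(s, a)
        # 3. apply a one-step tabular Q-learning backup to the simulated experience
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```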

Learning, Planning, and Acting
- Two uses of real experience:
  - model learning: to improve the model
  - direct RL: to directly improve the value function and policy
- Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.

Direct vs. Indirect RL
- Indirect (model-based) methods:
  - make fuller use of experience: get a better policy with fewer environment interactions
- Direct methods:
  - simpler
  - not affected by bad models
- But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel

The Dyna Architecture (Sutton 1990)
(Diagram on slide: planning, acting, model learning, and direct RL operating around a shared value function/policy and model.)

The Dyna-Q Algorithm
(Algorithm box on slide, with its steps labeled direct RL, model learning, and planning; a sketch follows below.)
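A minimal sketch of the tabular Dyna-Q loop referenced above: act, do one direct-RL (Q-learning) update on the real transition, update the model, then run n_planning simulated one-step backups. It assumes a deterministic environment so the model can store the last observed outcome per state-action pair; env.reset/env.step and all names are illustrative, not the book's pseudocode verbatim.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_episodes=50, n_planning=5, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q (sketch): direct RL + model learning + planning on every real step."""
    Q = defaultdict(float)          # Q[(s, a)]
    model = {}                      # model[(s, a)] = (reward, next_state) -- deterministic assumption

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)                        # act in the real environment
            # (a) direct RL: one-step Q-learning on the real transition
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (b) model learning: remember what this state-action pair did
            model[(s, a)] = (r, s_next)
            # (c) planning: n_planning simulated one-step backups from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                ptarget = pr + gamma * max(Q[(ps_next, b)] for b in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q
```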

Dyna-Q on a Simple Maze
- reward = 0 on every step until the goal is reached, when it is 1 (a minimal maze environment of this kind is sketched below)

Dyna-Q Snapshots: Midway in 2nd Episode

When the Model is Wrong: Blocking Maze
- the changed environment is harder

Shortcut Maze
- the changed environment is easier
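For concreteness, here is a minimal grid-maze environment of the kind these experiments use (reward 0 per step, 1 at the goal), compatible with the dyna_q sketch above. The dimensions, wall set, and API are illustrative assumptions, not the exact maze from the slides.

```python
class SimpleMaze:
    """Minimal deterministic grid maze: reward 0 per step, 1 on reaching the goal (sketch)."""

    ACTIONS = ["up", "down", "left", "right"]

    def __init__(self, width=9, height=6, walls=frozenset(), start=(0, 3), goal=(8, 5)):
        self.width, self.height = width, height
        self.walls, self.start, self.goal = set(walls), start, goal

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        x, y = self.pos
        dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
        nx, ny = x + dx, y + dy
        # stay in place if the move hits a wall or leaves the grid
        if (nx, ny) in self.walls or not (0 <= nx < self.width and 0 <= ny < self.height):
            nx, ny = x, y
        self.pos = (nx, ny)
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done

# illustrative usage with the dyna_q sketch above:
# Q = dyna_q(SimpleMaze(walls={(2, 2), (2, 3), (2, 4)}), SimpleMaze.ACTIONS)
```

Changing the wall set partway through a run gives a blocking-maze or shortcut-maze style experiment, in which the learned model is temporarily wrong.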

What is Dyna-Q+?
- Uses an exploration bonus:
  - Keeps track of the time since each state-action pair was tried for real
  - An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting
- The agent actually plans how to visit long-unvisited states

Prioritized Sweeping
- Which states or state-action pairs should be generated during planning?
- Work backwards from states whose values have just changed:
  - Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
  - When a new backup occurs, insert predecessors according to their priorities
  - Always perform backups from the first in the queue
- Moore and Atkeson 1993; Peng and Williams 1993
- (A sketch of the planning loop follows below.)

Prioritized Sweeping vs. Dyna-Q
- Both use N = 5 backups per environmental interaction
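A minimal sketch of the prioritized-sweeping planning loop just described, reusing the deterministic tabular model of the Dyna-Q sketch. The priority threshold theta, the predecessors mapping, and the per-step (non-persistent) queue are illustrative simplifications; the textbook version keeps the queue across time steps. (For Dyna-Q+, the textbook adds a bonus of kappa * sqrt(tau) to simulated rewards, where tau is the time since the pair was last tried for real; that formula is from the book, not these slides.)

```python
import heapq

def prioritized_sweeping_planning(Q, model, predecessors, actions, s, a,
                                  n_backups=5, alpha=0.1, gamma=0.95, theta=1e-4):
    """Prioritized-sweeping planning after a real transition from (s, a) -- a per-step sketch.

    model[(s, a)] = (reward, next_state)            # deterministic model, as in the Dyna-Q sketch
    predecessors[s] = set of (s_prev, a_prev) pairs the model predicts lead to s
    """
    def priority(state, action):
        r, s_next = model[(state, action)]
        return abs(r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(state, action)])

    pqueue, tick = [], 0                            # max-heap via negated priorities; tick breaks ties
    p = priority(s, a)
    if p > theta:
        heapq.heappush(pqueue, (-p, tick, (s, a)))
        tick += 1

    for _ in range(n_backups):
        if not pqueue:
            break
        _, _, (ps, pa) = heapq.heappop(pqueue)      # always back up the highest-priority pair
        r, ps_next = model[(ps, pa)]
        target = r + gamma * max(Q[(ps_next, b)] for b in actions)
        Q[(ps, pa)] += alpha * (target - Q[(ps, pa)])
        # insert predecessors whose backed-up values would change a lot
        for (s_prev, a_prev) in predecessors[ps]:
            p_prev = priority(s_prev, a_prev)
            if p_prev > theta:
                heapq.heappush(pqueue, (-p_prev, tick, (s_prev, a_prev)))
                tick += 1
    return Q
```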

Rod Maneuvering (Moore and Atkeson 1993)

Full and Sample (One-Step) Backups

Full vs. Sample Backups
- Comparison setup: b successor states, equally likely; initial error = 1; assume all next states' values are correct

Trajectory Sampling
- Trajectory sampling: perform backups along simulated trajectories
- This samples from the on-policy distribution
- Advantages when function approximation is used
- Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored (see the sketch below)
- (Figure labels on slide: initial states; reachable under optimal control; irrelevant states)
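A minimal sketch contrasting the two ways of distributing planning backups discussed here and in the experiment that follows: a uniform sweep over all recorded state-action pairs versus trajectory sampling under the current ε-greedy policy. It reuses the tabular Q and model conventions of the earlier sketches; start_states, max_len, and the ε value are illustrative assumptions.

```python
import random

def planning_backups_uniform(Q, model, actions, n_backups, alpha=0.1, gamma=0.95):
    """Uniform distribution: cycle through all recorded state-action pairs."""
    pairs = list(model.keys())
    if not pairs:
        return
    for i in range(n_backups):
        s, a = pairs[i % len(pairs)]
        r, s_next = model[(s, a)]
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

def planning_backups_on_policy(Q, sample_model, actions, start_states, n_backups,
                               alpha=0.1, gamma=0.95, eps=0.1, max_len=100):
    """Trajectory sampling: back up along trajectories simulated under the current policy."""
    done_backups = 0
    while done_backups < n_backups:
        s = random.choice(start_states)
        for _ in range(max_len):
            # epsilon-greedy action under the current value estimates (the on-policy distribution)
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, terminal = sample_model(s, a)
            target = r if terminal else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            done_backups += 1
            if terminal or done_backups >= n_backups:
                break
            s = s_next
```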

Trajectory Sampling Experiment
- one-step full tabular backups
- uniform: cycled through all state-action pairs
- on-policy: backed up along simulated trajectories
- 200 randomly generated undiscounted episodic tasks
- 2 actions for each state, each with b equally likely next states
- 0.1 probability of transition to the terminal state
- expected reward on each transition selected from a mean-0, variance-1 Gaussian

Heuristic Search
- Used for action selection, not for changing a value function (= heuristic evaluation function)
- Backed-up values are computed, but typically discarded
- Extension of the idea of a greedy policy, only deeper
- Also suggests ways to select states to back up: smart focusing

Summary
- Emphasized the close relationship between planning and learning
- Important distinction between distribution models and sample models
- Looked at some ways to integrate planning and learning:
  - synergy among planning, acting, model learning
- Distribution of backups: focus of the computation
  - trajectory sampling: backup along trajectories
  - prioritized sweeping
  - heuristic search
- Size of backups: full vs. sample; deep vs. shallow