Unified View
[Figure: reinforcement learning methods arranged along two dimensions, the width of backup (sample vs. full backups) and the height/depth of backup (one-step vs. full returns); temporal-difference learning, Dyna, dynamic programming, eligibility traces, Monte Carlo, MCTS, and exhaustive search all occupy points in this space.]

Introduction to Reinforcement Learning Part 7: Planning & Learning

Models
Model: anything the agent can use to predict how the environment will respond to its actions.
Distribution model: a description of all possibilities and their probabilities, e.g., p̂(s′, r | s, a) for all s, a, s′, r.
Sample model (a.k.a. a simulation model): produces sample experiences for a given s, a; allows reset and exploring starts; often much easier to come by.
Both types of model can be used to produce hypothetical experience.
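To make the distinction concrete, here is a minimal Python sketch (not from the slides) of the two model types; the class names and the count-based tabular implementation are illustrative assumptions.

```python
import random
from collections import defaultdict

class TabularSampleModel:
    """Sample model: stores observed transitions, returns one sampled (r, s') per query."""
    def __init__(self):
        self.transitions = defaultdict(list)      # (s, a) -> list of observed (r, s')

    def update(self, s, a, r, s_next):
        self.transitions[(s, a)].append((r, s_next))

    def sample(self, s, a):
        return random.choice(self.transitions[(s, a)])

class TabularDistributionModel:
    """Distribution model: returns empirical probabilities p-hat(s', r | s, a)."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {(r, s'): count}

    def update(self, s, a, r, s_next):
        self.counts[(s, a)][(r, s_next)] += 1

    def distribution(self, s, a):
        total = sum(self.counts[(s, a)].values())
        return {outcome: n / total for outcome, n in self.counts[(s, a)].items()}
```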

Planning
Planning: any computational process that uses a model to create or improve a policy.
Planning in AI: state-space planning; plan-space planning (e.g., partial-order planning).
We take the following (unusual) view: all state-space planning methods involve computing value functions, either explicitly or implicitly, and they all apply backups to simulated experience.

Planning Cont.
Classical DP methods are state-space planning methods. Heuristic search methods are state-space planning methods.
A planning method based on Q-learning (random-sample one-step tabular Q-planning):
Do forever:
1. Select a state S ∈ S and an action A ∈ A(S) at random
2. Send S, A to a sample model, and obtain a sample next reward R and a sample next state S′
3. Apply one-step tabular Q-learning to S, A, R, S′:
Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
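A minimal Python sketch of this planning loop, assuming a sample_model(s, a) function that returns (r, s_next), a tabular Q stored in a defaultdict, and hypothetical hyperparameter values for alpha and gamma:

```python
import random
from collections import defaultdict

def q_planning(sample_model, states, actions, steps, alpha=0.1, gamma=0.95):
    Q = defaultdict(float)                       # Q[(s, a)], initialized to 0
    for _ in range(steps):
        s = random.choice(states)                # 1. pick a state and an action at random
        a = random.choice(actions)
        r, s_next = sample_model(s, a)           # 2. query the sample model
        # 3. one-step tabular Q-learning backup on the simulated transition
        td_target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q
```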

Learning, Planning, and Acting
Two uses of real experience:
model learning: to improve the model
direct RL: to directly improve the value function and policy
Improving the value function and/or policy via a model is sometimes called indirect RL. Here, we call it planning.

Direct (model-free) vs. Indirect (model-based) RL
Direct methods: simpler; not affected by bad models.
Indirect methods: make fuller use of experience; get a better policy with fewer environment interactions.
But the two are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can all occur simultaneously and in parallel.

The Dyna-Q Algorithm
Initialize Q(s, a) and Model(s, a) for all s ∈ S and a ∈ A(s)
Do forever:
(a) S ← current (nonterminal) state
(b) A ← ε-greedy(S, Q)
(c) Execute action A; observe resultant reward R and state S′
(d) Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]   (direct RL)
(e) Model(S, A) ← R, S′ (assuming a deterministic environment)   (model learning)
(f) Repeat n times:   (planning)
S ← random previously observed state
A ← random action previously taken in S
R, S′ ← Model(S, A)
Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
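A hedged Python sketch of tabular Dyna-Q as outlined above; the Gym-like env interface, the explicit action list, and the hyperparameter values are assumptions, and the model is stored deterministically as on the slide.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes, n=50, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)
    model = {}                                    # Model[(s, a)] = (r, s')
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # (b) epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            # (c) act in the real environment
            s_next, r, done = env.step(a)
            # (d) direct RL: one-step Q-learning on the real transition
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
            # (e) model learning (deterministic environment assumed)
            model[(s, a)] = (r, s_next)
            # (f) planning: n simulated one-step backups using the learned model
            for _ in range(n):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps_next, b)] for b in actions) - Q[(ps, pa)])
            s = s_next
    return Q
```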

Dyna-Q on a Simple Maze
Rewards are 0 until the goal is reached, when the reward is +1.

Dyna-Q Snapshots: Midway in 2nd Episode
[Figure: policies learned from start S to goal G, without planning (n = 0) and with planning (n = 50).]

When the Model is Wrong: Blocking Maze The changed environment is harder

When the Model is Wrong: Shortcut Maze The changed environment is easier

What is Dyna-Q+?
Uses an exploration bonus:
keeps track of the time since each state-action pair was last tried for real
an extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting
the modeled reward becomes R + κ√τ, where τ is the time since the state-action pair was last visited
The agent actually plans how to visit long-unvisited states.
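A small sketch of how this bonus could be applied to the model's reward during planning; kappa and the last_tried bookkeeping are illustrative assumptions layered on top of the Dyna-Q sketch above.

```python
import math

def bonus_reward(r, s, a, current_time, last_tried, kappa=1e-3):
    """Augment the modeled reward with kappa * sqrt(tau), where tau is the time
    since (s, a) was last tried in the real environment."""
    tau = current_time - last_tried.get((s, a), 0)
    return r + kappa * math.sqrt(tau)
```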

Prioritized Sweeping
Which states or state-action pairs should be generated during planning?
Work backwards from states whose values have just changed:
Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change.
When a new backup occurs, insert its predecessors according to their priorities.
Always perform backups from the first pair in the queue.
Moore & Atkeson 1993; Peng & Williams 1993; improved by McMahan & Gordon 2005; van Seijen 2013.
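A minimal sketch of the prioritized-sweeping planning loop, not the exact algorithm of the cited papers; it assumes a deterministic tabular model[(s, a)] = (r, s'), a predecessors map from each state to the (s, a) pairs that lead to it, a Q table stored as a defaultdict(float), and a priority threshold theta.

```python
import heapq
import itertools

def prioritized_sweeping_planning(Q, model, predecessors, actions, start_sa,
                                  n, alpha=0.1, gamma=0.95, theta=1e-4):
    counter = itertools.count()                  # tie-breaker so states need not be comparable
    queue = []                                   # max-priority queue via negated priorities

    def push_if_urgent(s, a):
        r, s_next = model[(s, a)]
        p = abs(r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        if p > theta:                            # only queue pairs whose value would change a lot
            heapq.heappush(queue, (-p, next(counter), (s, a)))

    push_if_urgent(*start_sa)
    for _ in range(n):
        if not queue:
            break
        _, _, (s, a) = heapq.heappop(queue)      # always back up the highest-priority pair
        r, s_next = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        # work backwards: predecessors of s are re-queued by their new priorities
        for (ps, pa) in predecessors.get(s, ()):
            push_if_urgent(ps, pa)
    return Q
```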

Improved Prioritized Sweeping with Small Backups
Planning is a form of state-space search: a massive computation which we want to control to maximize its efficiency.
Prioritized sweeping is a form of search control, focusing the computation where it will do the most good.
But can we focus better? Can we focus more tightly?
Small backups are perhaps the smallest unit of search work and thus permit the most flexible allocation of effort.

Full and Sample (One-Step) Backups
[Figure: backup diagrams for full and sample one-step backups, summarized below.]
Value estimated | Full backups (DP)   | Sample backups (one-step TD)
v_π(s)          | policy evaluation   | TD(0)
v_*(s)          | value iteration     | —
q_π(s, a)       | Q-policy evaluation | Sarsa
q_*(s, a)       | Q-value iteration   | Q-learning
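As a worked contrast between the two columns (stated in LaTeX for readability; standard tabular notation, not taken verbatim from the slides): the full backup for q_* takes an expectation under the distribution model, while the sample backup (Q-learning) moves toward a single sampled target.

```latex
% Full backup (DP, Q-value iteration): expectation under the distribution model
Q(s,a) \leftarrow \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\,r + \gamma \max_{a'} Q(s', a')\,\bigr]

% Sample backup (one-step TD, Q-learning): a single sampled transition (S, A, R, S')
Q(S,A) \leftarrow Q(S,A) + \alpha\,\bigl[\,R + \gamma \max_{a'} Q(S', a') - Q(S,A)\,\bigr]
```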

Summary
Emphasized the close relationship between planning and learning.
Important distinction between distribution models and sample models.
Looked at some ways to integrate planning and learning: synergy among planning, acting, and model learning.
Distribution of backups (the focus of the computation): prioritized sweeping, small backups, sample backups, trajectory sampling (backing up along trajectories), heuristic search.
Size of backups: full/sample/small; deep/shallow.