Chapter 11: Case Studies (R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction)

Objectives of this chapter:
- Illustrate trade-offs and issues that arise in real applications
- Illustrate use of domain knowledge
- Illustrate representation development
- Some historical insight: Samuel's checkers player

TD-Gammon (Tesauro 1992, 1994, 1995, ...)
- White has just rolled a 5 and a 2, so can move one of his pieces 5 steps and one (possibly the same piece) 2 steps
- Objective is to advance all pieces to points 19-24
- Hitting
- Doubling
- 30 pieces and 24 locations imply an enormous number of configurations
- Effective branching factor of 400

A Few Details
- Reward: 0 at all times except those in which the game is won, when it is 1
- Episodic (game = episode), undiscounted
- Gradient-descent TD(λ) with a multi-layer neural network (see the sketch below)
  - weights initialized to small random numbers
  - backpropagation of TD error
  - four input units for each point; unary encoding of the number of white pieces, plus other features
- Use of afterstates
- Learning during self-play

Multi-layer Neural Network
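
To make the training setup concrete, here is a minimal sketch of gradient-descent TD(λ) with a one-hidden-layer sigmoid network, in the spirit of TD-Gammon's self-play updates. The layer sizes, step size `alpha`, and trace decay `lam` are illustrative assumptions, not Tesauro's actual settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TDNet:
    """One-hidden-layer value network trained by gradient-descent TD(lambda)."""

    def __init__(self, n_inputs, n_hidden, alpha=0.1, lam=0.7, seed=0):
        rng = np.random.default_rng(seed)
        # weights initialized to small random numbers
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
        self.W2 = rng.normal(scale=0.1, size=(1, n_hidden))
        self.alpha, self.lam = alpha, lam
        self.reset_traces()

    def reset_traces(self):
        # one eligibility trace per weight, cleared at the start of each game
        self.e1 = np.zeros_like(self.W1)
        self.e2 = np.zeros_like(self.W2)

    def value(self, x):
        h = sigmoid(self.W1 @ x)
        v = sigmoid(self.W2 @ h)[0]        # estimated probability of winning
        return v, h

    def update(self, x, reward, x_next, terminal):
        v, h = self.value(x)
        v_next = 0.0 if terminal else self.value(x_next)[0]
        delta = reward + v_next - v        # undiscounted TD error

        # backpropagate dV/dw to accumulate eligibility traces
        dv = v * (1 - v)                   # sigmoid derivative at the output
        grad_W2 = dv * h[np.newaxis, :]
        dh = dv * self.W2[0] * h * (1 - h)
        grad_W1 = np.outer(dh, x)

        self.e2 = self.lam * self.e2 + grad_W2
        self.e1 = self.lam * self.e1 + grad_W1
        self.W2 += self.alpha * delta * self.e2
        self.W1 += self.alpha * delta * self.e1
```

During self-play, each candidate afterstate vector would be scored with `value` to pick a move, and `update` would be called once per move, with reward 1 only on the winning terminal transition; `reset_traces` is called at the start of each game.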

Summary of TD-Gammon Results

Samuel's Checkers Player (Arthur Samuel 1959, 1967)
- Score board configurations by a scoring polynomial (after Shannon, 1950)
- Minimax to determine the backed-up score of a position
- Alpha-beta cutoffs (see the sketch below)
- Rote learning: save each board configuration encountered together with its backed-up score
  - needed a "sense of direction": like discounting
- Learning by generalization: similar to the TD algorithm

Samuel's Backups

The Basic Idea
"... we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play."
A. L. Samuel, "Some Studies in Machine Learning Using the Game of Checkers," 1959
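
As a reminder of how backed-up scores are computed, here is a minimal sketch of depth-limited minimax with alpha-beta cutoffs. The `score` and `legal_moves` callables are hypothetical placeholders for a game-specific evaluation (such as Samuel's scoring polynomial) and a move generator; they are not Samuel's actual routines.

```python
def alphabeta(position, depth, alpha, beta, maximizing, score, legal_moves):
    """Backed-up score of `position` from a depth-limited search with cutoffs."""
    children = legal_moves(position)
    if depth == 0 or not children:
        return score(position)              # leaf: evaluate with the scoring polynomial
    if maximizing:
        best = float("-inf")
        for child in children:
            best = max(best, alphabeta(child, depth - 1, alpha, beta, False,
                                       score, legal_moves))
            alpha = max(alpha, best)
            if alpha >= beta:                # beta cutoff: opponent avoids this line
                break
        return best
    else:
        best = float("inf")
        for child in children:
            best = min(best, alphabeta(child, depth - 1, alpha, beta, True,
                                       score, legal_moves))
            beta = min(beta, best)
            if alpha >= beta:                # alpha cutoff
                break
        return best
```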

More Samuel Details
- Did not include explicit rewards
- Instead used the piece-advantage feature with a fixed weight
- No special treatment of terminal positions
  - This can lead to problems...
- Generalization method produced better-than-average play; tricky but beatable
- Ability to search through the feature set and combine features
- Supervised mode: book learning
- Signature tables

The Acrobot (Spong 1994)

A bit more detail
- Sarsa(λ) was used; the tilings were over the positions and velocities of the links (see the sketch below).

Acrobot Learning Curves for Sarsa(λ)
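
A minimal sketch of Sarsa(λ) with linear function approximation over binary tile-coded features, in the spirit of the Acrobot experiments. The environment interface (`env.reset`, `env.step`), the tile coder `active_tiles` (which returns the indices of the active tiles for a state), and all parameter values are assumptions for illustration, not the original experimental settings.

```python
import numpy as np

def sarsa_lambda(env, active_tiles, n_tiles, n_actions, episodes=100,
                 alpha=0.1, lam=0.9, gamma=1.0, epsilon=0.05, seed=0):
    """Sarsa(lambda) with a linear action-value function over tile features."""
    rng = np.random.default_rng(seed)
    w = np.zeros((n_actions, n_tiles))            # one weight vector per action

    def q(state, action):
        return w[action, active_tiles(state)].sum()

    def choose(state):                            # epsilon-greedy action selection
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([q(state, a) for a in range(n_actions)]))

    for _ in range(episodes):
        e = np.zeros_like(w)                      # eligibility traces
        state = env.reset()
        action = choose(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            delta = reward - q(state, action)
            e[action, active_tiles(state)] = 1.0  # replacing traces
            if not done:
                next_action = choose(next_state)
                delta += gamma * q(next_state, next_action)
            w += alpha * delta * e
            e *= gamma * lam
            if not done:
                state, action = next_state, next_action
    return w
```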

Typical Acrobot Learned Behavior

Elevator Dispatching (Crites and Barto 1996)

Elevator Dispatching: Cost Functions
- Average waiting time
- Average system time (from when the passenger starts waiting until they are delivered)
- Percentage of passengers whose wait time is > 60 seconds
- Crites & Barto used average squared waiting time:
  - tends to keep waiting times low
  - encourages fairness to all passengers (illustrated below)

Passenger Arrival Patterns
- Up-peak and down-peak traffic
- Not equivalent: down-peak handling capacity is much greater than up-peak handling capacity, so up-peak capacity is the limiting factor.
- Up-peak is easiest to analyze: once everyone is onboard at the lobby, the rest of the trip is determined. The only decision is when to open and close the doors at the lobby. The optimal policy for the pure case is to close the doors when a threshold number are on board; the threshold depends on traffic intensity.
- There are more policies to consider for two-way and down-peak traffic.
- We focus on the down-peak traffic pattern.
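
A tiny numerical illustration (hypothetical wait times, not data from the paper) of why the *squared* waiting time encourages fairness: two schedules with the same average wait, one of which leaves a single passenger waiting very long.

```python
import numpy as np

even   = np.array([20.0, 20.0, 20.0, 20.0])   # wait times in seconds
uneven = np.array([5.0, 5.0, 5.0, 65.0])      # same mean wait (20 s)

print(even.mean(), uneven.mean())              # 20.0 20.0  -> identical average wait
print((even**2).mean(), (uneven**2).mean())    # 400.0 1075.0 -> squared cost penalizes the long wait
```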

Constraints
- Standard:
  - A car cannot pass a floor if a passenger wants to get off there
  - A car cannot change direction until it has serviced all onboard passengers traveling in the current direction
- Special heuristic:
  - Don't stop at a floor if another car is already stopping, or is stopped, there
  - Don't stop at a floor unless someone wants to get off there
  - Given a choice, always move up
- Two actions when arriving at a floor with passengers waiting: Stop or Continue

Control Strategies
- Zoning: divide the building into zones; park in the zone when idle. Robust in heavy traffic.
- Search-based methods: greedy or non-greedy; receding-horizon control.
- Rule-based methods: expert systems / fuzzy logic, from human experts.
- Other heuristic methods: Longest Queue First (LQF), Highest Unanswered Floor First (HUFF), Dynamic Load Balancing (DLB).
- Adaptive/learning methods: neural nets for prediction, parameter-space search using simulation, DP on a simplified model, non-sequential RL.

The Elevator Model (from Lewis, 1991)
- Discrete-event system: continuous time, asynchronous elevator operation
- Parameters:
  - Floor time (time to move one floor at max speed): 1.45 secs
  - Stop time (time to decelerate, open and close doors, and accelerate again): 7.19 secs
  - Turn time (time needed by a stopped car to change directions): 1 sec
  - Load time (time for one passenger to enter or exit a car): a random variable ranging from 0.6 to 6.0 secs, mean 1 sec
  - Car capacity: 20 passengers
- Traffic profile: Poisson arrivals with rates changing every 5 minutes; down-peak

State Space
- 18 hall call buttons: 2^18 combinations
- Positions and directions of the 4 cars: 18^4 (rounding to the nearest floor)
- Motion states of the cars (accelerating, moving, decelerating, stopped, loading, turning): 6
- 40 car buttons: 2^40
- Set of passengers waiting at each floor, each passenger's arrival time and destination: unobservable. However, 18 real numbers are available giving the elapsed time since each hall button was pushed; we discretize these.
- Set of passengers riding each car and their destinations: observable only through the car buttons
- Conservatively about 10^22 states (see the arithmetic check below)
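
A quick arithmetic check of that state-count bound, combining only the discrete components listed above (and taking the motion-state count of 6 as quoted on the slide); the discretized elapsed-time features would make the count larger still.

```python
hall_calls  = 2 ** 18      # 18 hall call buttons, each on or off
positions   = 18 ** 4      # position/direction of each of the 4 cars
motion      = 6            # motion states as quoted on the slide
car_buttons = 2 ** 40      # 40 car buttons, each on or off

total = hall_calls * positions * motion * car_buttons
print(f"{total:.1e}")      # ~1.8e+23, i.e. conservatively more than 10^22 states
```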

Actions
- When moving (halfway between floors): stop at the next floor, or continue past the next floor
- When stopped at a floor: go up, or go down
- Asynchronous

Semi-Markov Q-Learning
- Continuous-time problem, but decisions occur in discrete jumps
- Reward is integrated over the actual time period between jumps:
  $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ becomes $R_t = \int_0^{\infty} e^{-\beta\tau} r_{t+\tau}\, d\tau$
- Suppose the system takes action $a$ from state $s$ at time $t_1$, and the next decision is needed at time $t_2$ in state $s'$:
  $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ \int_{t_1}^{t_2} e^{-\beta(\tau - t_1)} r_\tau\, d\tau + e^{-\beta(t_2 - t_1)} \max_{a'} Q(s',a') - Q(s,a) \right]$

Average Squared Wait Time
- Instantaneous cost: $r_\tau = \sum_p \bigl(\mathrm{wait}_p(\tau)\bigr)^2$
- Define the return as an integral rather than a sum (Bradtke and Duff, 1994):
  $\sum_{t=0}^{\infty} \gamma^t r_t$ becomes $\int_0^{\infty} e^{-\beta\tau} r_\tau\, d\tau$

Algorithm
Repeat forever:
1. In state $x$ at time $t_x$, car $c$ must decide to STOP or CONTINUE
2. It selects an action using softmax (with decreasing temperature) based on current Q values
3. The next decision by car $c$ is required in state $y$ at time $t_y$
4. It implements the gradient-descent version of the following backup using backprop (see the sketch below):
   $Q(x,a) \leftarrow Q(x,a) + \alpha \left[ \int_{t_x}^{t_y} e^{-\beta(\tau - t_x)} r_\tau\, d\tau + e^{-\beta(t_y - t_x)} \min_{a'} Q(y,a') - Q(x,a) \right]$
5. $x \leftarrow y$, $t_x \leftarrow t_y$
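
A minimal sketch of that semi-Markov backup for a tabular Q, assuming the simulator reports the instantaneous cost as a piecewise-constant function: `reward_segments` is a hypothetical list of (start_time, end_time, rate) pieces covering the interval from `t1` to `t2`. The values of `alpha` and `beta` are illustrative only; the actual system used a neural-network Q and backprop rather than a table.

```python
import math

def discounted_integral(reward_segments, t1, beta):
    """Integral of e^{-beta (tau - t1)} * r_tau over the pieces, in closed form."""
    total = 0.0
    for start, end, rate in reward_segments:
        total += rate / beta * (math.exp(-beta * (start - t1)) -
                                math.exp(-beta * (end - t1)))
    return total

def semi_markov_backup(Q, s, a, s_next, t1, t2, reward_segments,
                       alpha=0.01, beta=0.01, minimize=True):
    """One semi-Markov Q-learning backup; costs are minimized, hence min over a'."""
    best_next = (min if minimize else max)(Q[s_next].values())
    target = (discounted_integral(reward_segments, t1, beta)
              + math.exp(-beta * (t2 - t1)) * best_next)
    Q[s][a] += alpha * (target - Q[s][a])
```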

Computing Rewards
- Must calculate $\int e^{-\beta(\tau - t_x)} r_\tau\, d\tau$
- Omniscient rewards: the simulator knows how long each passenger has been waiting.
- On-line rewards: assumes only the arrival time of the first passenger in each queue is known (the elapsed hall-button time); estimate the other arrival times.

Function Approximator: Neural Network
- 47 inputs, 20 sigmoid hidden units, 1 or 2 output units
- Inputs (assembled into a 47-element vector; see the sketch below):
  - 9 binary: state of each hall down button
  - 9 real: elapsed time of each hall down button, if pushed
  - 16 binary: one on at a time: position and direction of the car making the decision
  - 10 real: location/direction of the other cars ("footprint")
  - 1 binary: at the highest floor with a waiting passenger?
  - 1 binary: at the floor with the longest-waiting passenger?
  - 1 bias unit: always 1

Elevator Results

Dynamic Channel Allocation (Singh and Bertsekas 1997)
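
A sketch of assembling that 47-element input vector. The argument names (`down_buttons`, `elapsed`, `car_slot`, `footprint`, and so on) are hypothetical; the real features were computed inside Crites & Barto's simulator.

```python
import numpy as np

def build_input(down_buttons, elapsed, car_slot, footprint,
                at_highest_waiting, at_longest_waiting):
    """Concatenate the feature groups listed on the slide into one 47-vector."""
    x = np.concatenate([
        np.asarray(down_buttons, dtype=float),  # 9 binary: hall down buttons
        np.asarray(elapsed, dtype=float),       # 9 real: elapsed button times
        np.eye(16)[car_slot],                   # 16 binary: one-hot position/direction of deciding car
        np.asarray(footprint, dtype=float),     # 10 real: location/direction of other cars
        [float(at_highest_waiting)],            # 1 binary
        [float(at_longest_waiting)],            # 1 binary
        [1.0],                                  # bias unit, always 1
    ])
    assert x.shape == (47,)                     # 9 + 9 + 16 + 10 + 1 + 1 + 1
    return x
```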

Job-Shop Scheduling (Zhang and Dietterich 1995, 1996)