Chapter 11: Case Studies


Objectives of this chapter:
- Illustrate trade-offs and issues that arise in real applications
- Illustrate use of domain knowledge
- Illustrate representation development
- Some historical insight: Samuel's checkers player

TD-Gammon (Tesauro 1992, 1994, 1995, ...)
- White has just rolled a 5 and a 2, so can move one of his pieces 5 steps and one (possibly the same) 2 steps
- Objective is to advance all pieces to points 19-24
- Hitting
- Doubling
- 30 pieces, 24 locations imply an enormous number of configurations
- Effective branching factor of 400

A Few Details
- Reward: 0 at all times except when the game is won, when it is 1
- Episodic (game = episode), undiscounted
- Gradient-descent TD(λ) with a multi-layer neural network (a minimal sketch follows this list)
  - weights initialized to small random numbers
  - backpropagation of TD error
  - four input units for each point; unary encoding of number of white pieces, plus other features
- Use of afterstates
- Learning during self-play
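A minimal sketch of gradient-descent TD(λ) with a small sigmoid network as the value function, in the spirit described above. The layer sizes, initialization scale, and update details are illustrative assumptions, not Tesauro's exact configuration.

import numpy as np

# Sketch of gradient-descent TD(lambda) with a sigmoid value network.
# Layer sizes, features, and the training loop are illustrative assumptions.

class TDLambdaNet:
    def __init__(self, n_inputs, n_hidden, alpha=0.1, lam=0.7, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_inputs))   # small random weights
        self.W2 = rng.normal(0.0, 0.1, (n_hidden,))
        self.alpha, self.lam = alpha, lam
        self.reset_traces()

    def reset_traces(self):
        # eligibility traces, one per weight; reset at the start of each game
        self.e1 = np.zeros_like(self.W1)
        self.e2 = np.zeros_like(self.W2)

    def value(self, x):
        self.h = 1.0 / (1.0 + np.exp(-self.W1 @ x))             # hidden activations
        self.v = 1.0 / (1.0 + np.exp(-self.W2 @ self.h))        # V(s) in (0, 1)
        return self.v

    def update(self, x, target):
        # one TD(lambda) step: move V(x) toward `target`
        v = self.value(x)
        dv = v * (1.0 - v)                                        # output sigmoid gradient
        g2 = dv * self.h                                          # dV/dW2
        g1 = np.outer(dv * self.W2 * self.h * (1.0 - self.h), x)  # dV/dW1 via backprop
        delta = target - v                                        # TD error
        self.e1 = self.lam * self.e1 + g1                         # undiscounted, so gamma = 1
        self.e2 = self.lam * self.e2 + g2
        self.W1 += self.alpha * delta * self.e1
        self.W2 += self.alpha * delta * self.e2

During self-play, the target for a nonterminal step would be the value of the next afterstate, and for the final step the reward (1 for a win, 0 otherwise).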

Multi-layer Neural Network (figure)

Summary of TD-Gammon Results (figure)

Samuel's Checkers Player (Arthur Samuel, 1959, 1967)
- Score board configurations by a scoring polynomial (after Shannon, 1950)
- Minimax to determine the backed-up score of a position
- Alpha-beta cutoffs (sketched below)
- Rote learning: save each board configuration encountered together with its backed-up score
  - needed a "sense of direction": like discounting
- Learning by generalization: similar to the TD algorithm
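For reference, a compact sketch of minimax search with alpha-beta cutoffs. The evaluate (scoring polynomial) and children (move generator) arguments are hypothetical placeholders, not Samuel's actual code.

# Minimal sketch of minimax with alpha-beta cutoffs.  `evaluate` stands in for a
# scoring polynomial over board features and `children` for a move generator.

def alphabeta(board, depth, alpha, beta, maximizing, evaluate, children):
    moves = children(board)
    if depth == 0 or not moves:
        return evaluate(board)              # backed-up score bottoms out here
    if maximizing:
        best = float("-inf")
        for child in moves:
            best = max(best, alphabeta(child, depth - 1, alpha, beta, False,
                                       evaluate, children))
            alpha = max(alpha, best)
            if alpha >= beta:               # beta cutoff: opponent avoids this line
                break
        return best
    else:
        best = float("inf")
        for child in moves:
            best = min(best, alphabeta(child, depth - 1, alpha, beta, True,
                                       evaluate, children))
            beta = min(beta, best)
            if alpha >= beta:               # alpha cutoff
                break
        return best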

Samuel's Backups (figure)

The Basic Idea

"... we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play."

A. L. Samuel, "Some Studies in Machine Learning Using the Game of Checkers," 1959

More Samuel Details
- Did not include explicit rewards
- Instead used the piece-advantage feature with a fixed weight
- No special treatment of terminal positions
  - This can lead to problems...
- Generalization method produced better-than-average play; tricky but beatable
- Ability to search through the feature set and combine features
- Supervised mode: "book learning"
- Signature tables

The Acrobot (Spong, 1994) (figure)

Acrobot Learning Curves for Sarsa(λ) (figure)

Typical Acrobot Learned Behavior (figure)

Elevator Dispatching (Crites and Barto, 1996)

Semi-Markov Q-Learning

Continuous-time problem, but decisions occur at discrete jumps. The discrete-time return

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

becomes

$$R_t = \int_0^{\infty} e^{-\beta \tau} r_{t+\tau}\, d\tau$$

Suppose the system takes action $a$ from state $s$ at time $t_1$, and the next decision is needed at time $t_2$ in state $s'$:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ \int_{t_1}^{t_2} e^{-\beta(\tau - t_1)} r_\tau\, d\tau + e^{-\beta(t_2 - t_1)} \max_{a'} Q(s',a') - Q(s,a) \right]$$
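A minimal numerical sketch of this backup, assuming a tabular Q indexed by (state, action) pairs and a list of sampled (time, reward rate) points between the two decision times. The rectangle-rule integral and the helper names are illustrative assumptions.

import math

# Sketch of the semi-Markov Q-learning backup above, with a tabular Q.
# The reward integral is approximated from (time, reward_rate) samples
# recorded between the two decision times.

def smdp_q_update(Q, s, a, t1, t2, reward_samples, s_next, actions,
                  alpha=0.1, beta=0.01):
    # integral of e^{-beta (tau - t1)} r_tau d tau, by left-endpoint rectangles
    integral = 0.0
    for (tau, r_rate), (tau_next, _) in zip(reward_samples,
                                            reward_samples[1:] + [(t2, 0.0)]):
        integral += math.exp(-beta * (tau - t1)) * r_rate * (tau_next - tau)
    # discounted bootstrap from the next decision point
    bootstrap = math.exp(-beta * (t2 - t1)) * max(
        Q.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (integral + bootstrap - q_sa)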

Passenger Arrival Patterns
- Up-peak and down-peak traffic
- Not equivalent: down-peak handling capacity is much greater than up-peak handling capacity, so up-peak capacity is the limiting factor
- Up-peak is easiest to analyse: once everyone is on board at the lobby, the rest of the trip is determined; the only decision is when to open and close the doors at the lobby
  - Optimal policy for the pure case: close the doors when a threshold number are on board; the threshold depends on traffic intensity
- More policies to consider for two-way and down-peak traffic
- We focus on the down-peak traffic pattern

Control Strategies
- Zoning: divide the building into zones; park in zone when idle. Robust in heavy traffic.
- Search-based methods: greedy or non-greedy; receding-horizon control
- Rule-based methods: expert systems / fuzzy logic, from human experts
- Other heuristic methods: Longest Queue First (LQF), Highest Unanswered Floor First (HUFF), Dynamic Load Balancing (DLB)
- Adaptive/learning methods: neural networks for prediction, parameter-space search using simulation, DP on a simplified model, non-sequential RL

The Elevator Model (from Lewis, 1991)

Discrete-event system: continuous time, asynchronous elevator operation

Parameters:
- Floor time (time to move one floor at maximum speed): 1.45 s
- Stop time (time to decelerate, open and close doors, and accelerate again): 7.19 s
- Turn time (time needed by a stopped car to change direction): 1 s
- Load time (time for one passenger to enter or exit a car): a random variable ranging from 0.6 to 6.0 s, with a mean of 1 s
- Car capacity: 20 passengers
- Traffic profile: Poisson arrivals with rates changing every 5 minutes; down-peak
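For concreteness, a small sketch of how a discrete-event simulator might sample passengers using the parameters above. The bounded log-normal load-time distribution and the helper names are assumptions, since the slide does not specify the exact distributions.

import random

# Sketch of sampling passenger data for a discrete-event elevator simulator
# with the parameters above.  Distribution shapes are illustrative assumptions.

FLOOR_TIME, STOP_TIME, TURN_TIME = 1.45, 7.19, 1.0   # seconds
CAR_CAPACITY = 20

def load_time():
    # random load time in [0.6, 6.0] s with a mean of about 1 s (shape assumed)
    return min(6.0, max(0.6, random.lognormvariate(-0.2, 0.5)))

def arrival_times(rate_per_sec, interval=300.0, t0=0.0):
    # Poisson arrivals over one 5-minute interval: exponential inter-arrival gaps
    t, times = t0, []
    while True:
        t += random.expovariate(rate_per_sec)
        if t >= t0 + interval:
            return times
        times.append(t)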

State Space
- 18 hall call buttons: 2^18 combinations
- Positions and directions of the 4 cars: 18^4 (rounding position to the nearest floor)
- Motion states of the cars (accelerating, moving, decelerating, stopped, loading, turning): 6
- 40 car buttons: 2^40
- Set of passengers waiting at each floor, each passenger's arrival time and destination: unobservable. However, 18 real numbers are available giving elapsed time since each hall button was pushed; we discretize these.
- Set of passengers riding each car and their destinations: observable only through the car buttons
- Conservatively, about 10^22 states

Actions
- When moving (halfway between floors):
  - stop at the next floor
  - continue past the next floor
- When stopped at a floor:
  - go up
  - go down
- Asynchronous

Constraints
- Standard:
  - A car cannot pass a floor if a passenger wants to get off there
  - A car cannot change direction until it has serviced all onboard passengers traveling in the current direction
- Special heuristic:
  - Don't stop at a floor if another car is already stopping, or is stopped, there
  - Don't stop at a floor unless someone wants to get off there
  - Given a choice, always move up
- With these constraints, the remaining decisions reduce to Stop and Continue

Performance Criteria

Minimize:
- Average wait time
- Average system time (wait + travel time)
- Percentage of passengers waiting more than T seconds (e.g., T = 60)
- Average squared wait time (to encourage fast and fair service)

Average Squared Wait Time

Instantaneous cost:

$$r_\tau = \sum_p \big(\text{wait}_p(\tau)\big)^2$$

Define the return as an integral rather than a sum (Bradtke and Duff, 1994):

$$\sum_{t=0}^{\infty} \gamma^t r_t \quad\text{becomes}\quad \int_0^{\infty} e^{-\beta \tau} r_\tau\, d\tau$$

Algorithm

Repeat forever:
1. In state x at time t_x, car c must decide to STOP or CONTINUE
2. It selects an action using a Boltzmann distribution (with decreasing temperature) based on the current Q values (a sketch follows this list)
3. The next decision by car c is required in state y at time t_y
4. It implements the gradient-descent version of the following backup using backpropagation:

$$Q(x,a) \leftarrow Q(x,a) + \alpha \left[ \int_{t_x}^{t_y} e^{-\beta(\tau - t_x)} r_\tau\, d\tau + e^{-\beta(t_y - t_x)} \max_{a'} Q(y,a') - Q(x,a) \right]$$

5. $x \leftarrow y$, $t_x \leftarrow t_y$
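A small sketch of step 2, Boltzmann (softmax) action selection over the current Q values with a decreasing temperature. The exponential temperature schedule and the dictionary-based action encoding are illustrative assumptions; for the elevator problem the action set would just be {STOP, CONTINUE}.

import math, random

# Sketch of Boltzmann (softmax) action selection with a decaying temperature.

def boltzmann_action(q_values, temperature):
    # q_values: dict action -> Q(x, action); higher Q => higher selection probability
    m = max(q_values.values())                      # subtract max for numerical stability
    prefs = {a: math.exp((q - m) / temperature) for a, q in q_values.items()}
    z = sum(prefs.values())
    r, acc = random.random() * z, 0.0
    for a, p in prefs.items():
        acc += p
        if r <= acc:
            return a
    return a                                        # fallback for rounding error

def temperature(step, t0=2.0, decay=1e-4, t_min=0.05):
    # exponentially decreasing temperature (schedule assumed)
    return max(t_min, t0 * math.exp(-decay * step))

# usage: action = boltzmann_action({"STOP": q_stop, "CONTINUE": q_cont}, temperature(step))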

Computing Rewards

Must calculate

$$\int_{t_x}^{t_y} e^{-\beta(\tau - t_x)} r_\tau\, d\tau$$

- Omniscient rewards: the simulator knows how long each passenger has been waiting
- On-line rewards: assumes only the arrival time of the first passenger in each queue is known (elapsed hall-button time); other arrival times are estimated

Neural Networks

47 inputs, 20 sigmoid hidden units, 1 or 2 output units

Inputs:
- 9 binary: state of each hall down button
- 9 real: elapsed time of each hall down button, if pushed
- 16 binary: exactly one on at a time: position and direction of the car making the decision
- 10 real: locations/directions of the other cars ("footprint")
- 1 binary: at the highest floor with a waiting passenger?
- 1 binary: at the floor with the longest-waiting passenger?
- 1 bias unit: always 1

Elevator Results (figure)

Dynamic Channel Allocation (Singh and Bertsekas, 1997)

Job-Shop Scheduling (Zhang and Dietterich, 1995, 1996)

Job-Shop Scheduling (figure)

Autonomous Helicopter Flight
A. Ng, Stanford; H. Kim, M. Jordan, S. Sastry, Berkeley

Model-Based Direct Policy Search
- Identify a model of helicopter dynamics as flown by a human pilot
- Model using locally weighted linear regression
- Estimate values via Monte Carlo evaluation
- Simple stochastic hillclimbing to adapt the policy neural network (sketched below)
- Does not store a value function
- To hover: 30 evaluations of 35 seconds of flying time each
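A minimal sketch of the policy-search loop described above: estimate the value of a policy parameter vector by Monte Carlo rollouts in the learned model, and accept random perturbations that improve it. The simulate_return helper and the flat parameter vector are hypothetical stand-ins for the identified helicopter model and the policy network's weights.

import random

# Sketch of model-based direct policy search by stochastic hillclimbing.
# `simulate_return` is a hypothetical rollout of the learned dynamics model.

def hillclimb(theta, simulate_return, n_iters=100, n_rollouts=10,
              horizon_s=35.0, step=0.05):
    def value(params):
        # average return over Monte Carlo rollouts of the learned model
        return sum(simulate_return(params, horizon_s)
                   for _ in range(n_rollouts)) / n_rollouts

    best, best_v = list(theta), value(theta)
    for _ in range(n_iters):
        candidate = [p + random.gauss(0.0, step) for p in best]
        v = value(candidate)
        if v > best_v:                      # keep only improvements
            best, best_v = candidate, v
    return best, best_v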

Quadrupedal Locomotion
Nate Kohl & Peter Stone, Univ. of Texas at Austin
- All training done with physical robots: Sony Aibo ERS-210A
- Before learning vs. after 1000 trials, or about 3 hours

Direct Policy Search

Policy parameterization: half-elliptical locus for each foot, 12 parameters:
- Position of the front locus (x, y, z)
- Position of the rear locus (x, y, z)
- Locus length
- Locus skew (for turning)
- Height of the front of the body
- Height of the rear of the body
- Time for each foot to move through its locus
- Fraction of time each foot spends on the ground

Simple stochastic hillclimbing to increase speed (see the sketch below)
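A sketch of stochastic hillclimbing over the 12 gait parameters above. The per-parameter ±ε perturbation scheme and the measure_speed helper (timing the robot over a fixed distance) are illustrative assumptions, not Kohl and Stone's exact method.

import random

# Sketch of hillclimbing over gait parameters: perturb each parameter by
# -eps, 0, or +eps of its allowed range, keep the candidate only if the
# measured walking speed improves.

def tune_gait(params, ranges, measure_speed, n_trials=1000, eps=0.02):
    # params: dict name -> value for the 12 gait parameters
    # ranges: dict name -> (low, high) allowed interval for each parameter
    best = dict(params)
    best_speed = measure_speed(best)
    for _ in range(n_trials):
        candidate = {}
        for name, value in best.items():
            low, high = ranges[name]
            delta = random.choice((-1, 0, 1)) * eps * (high - low)
            candidate[name] = min(high, max(low, value + delta))
        speed = measure_speed(candidate)
        if speed > best_speed:
            best, best_speed = candidate, speed
    return best, best_speed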

Learning Control for Dynamically Stable Walking Robots
Russ Tedrake, Teresa Zhang, H. Sebastian Seung, MIT
- Start with a passive walker

Value Function + Policy Adaptation
- Sophisticated form of actor-critic algorithm (a generic sketch follows this list)
- Passive walker + 4 actuators: roll and pitch of each foot
- Value function and policy represented by linear function approximators
- Behavior is periodic; learning tunes the return map
- Goal: walk on flat ground the way the passive walker walks on a slope
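A generic sketch of a one-step actor-critic with linear function approximators for both the value function and a Gaussian policy, to illustrate the components named above. The feature map, step sizes, and action model are assumptions, not the walking robot's actual controller.

import numpy as np

# Sketch of a one-step actor-critic with linear function approximation.
# Critic: V(s) = w . phi(s).  Actor: Gaussian policy with mean theta . phi(s).

class LinearActorCritic:
    def __init__(self, n_features, alpha_w=0.1, alpha_theta=0.01,
                 gamma=0.99, sigma=0.1):
        self.w = np.zeros(n_features)       # critic weights
        self.theta = np.zeros(n_features)   # actor weights
        self.alpha_w, self.alpha_theta = alpha_w, alpha_theta
        self.gamma, self.sigma = gamma, sigma

    def act(self, phi):
        return np.random.normal(self.theta @ phi, self.sigma)

    def update(self, phi, action, reward, phi_next, done):
        v = self.w @ phi
        v_next = 0.0 if done else self.w @ phi_next
        delta = reward + self.gamma * v_next - v            # TD error
        self.w += self.alpha_w * delta * phi                # critic step
        # policy-gradient step for a Gaussian policy with fixed sigma
        grad_log_pi = (action - self.theta @ phi) / self.sigma**2 * phi
        self.theta += self.alpha_theta * delta * grad_log_pi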

http://hebb.mit.edu/~russt/robots

Grasp Control
R. Platt, A. Fagg, R. Grupen, University of Massachusetts
UMass torso: Dexter

Control Basis Approach
- A set of parameterized closed-loop controllers
- Multiple controllers can operate at the same time
- Sequencing controllers, and combinations of controllers, can generate a variety of behavior
- ADP done in a smallish abstract state space