ICRA 2012 Tutorial on Reinforcement Learning 4. Value Function Methods

Pieter Abbeel (UC Berkeley) and Jan Peters (TU Darmstadt)

A Reinforcement Learning Ontology
Prior knowledge and data {(x_t, u_t, x_{t+1}, r_t)} can be turned into an optimal policy π* along three routes:
- Optimal control (with model learning): data → model T, R → optimal value function V* → π*
- Model-free value function methods: data → V* → π*
- Model-free policy search methods: data → π* directly

Outline
Challenge: most real-world problems have large, often infinite and continuous, state spaces.
Value function methods (model-free learning):
- Monte Carlo, TD-learning and Q-learning (tabular)
- Function approximation
- Q-learning with feature-based representations
- Fitted Q-learning
Often a good approach, even when a model is available.

Model-Based Learning
Step 1: Learn the model: supervised learning to find T(x, u, x') and R(x, u) from experiences (x, u, x', r).
Step 2: Solve for the optimal policy: can be done with optimal control methods, such as value iteration.
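To make Step 1 concrete, here is a minimal Python sketch (not from the slides) that estimates T and R by counting transitions and averaging rewards over a batch of experience tuples; the dictionary representation is an illustrative choice.

```python
from collections import defaultdict

def estimate_model(experiences):
    """Estimate T(x,u,x') and R(x,u) from experience tuples (x, u, x', r)
    by counting transitions and averaging rewards (illustrative sketch)."""
    counts = defaultdict(lambda: defaultdict(int))  # (x, u) -> {x': count}
    reward_sum = defaultdict(float)                 # (x, u) -> summed reward
    visits = defaultdict(int)                       # (x, u) -> visit count
    for x, u, x_next, r in experiences:
        counts[(x, u)][x_next] += 1
        reward_sum[(x, u)] += r
        visits[(x, u)] += 1
    T = {xu: {xn: c / visits[xu] for xn, c in nxt.items()}
         for xu, nxt in counts.items()}
    R = {xu: reward_sum[xu] / visits[xu] for xu in visits}
    return T, R
```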

Model-free: 1. Monte Carlo / Direct Evaluation
Repeatedly execute the policy. Estimate the value of a state s as the average, taken over all times s was visited, of the sum of discounted rewards accumulated from s onwards.
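A minimal sketch of direct evaluation, assuming each episode is given as a list of (state, reward) pairs generated by the fixed policy; this is illustrative code, not from the tutorial.

```python
def direct_evaluation(episodes, gamma=1.0):
    """Direct (Monte Carlo) evaluation: V(s) = average, over every visit to s,
    of the discounted return accumulated from s onwards."""
    returns = {}  # state -> list of sampled returns
    for episode in episodes:
        G = 0.0
        # Walk backwards so G is always the return from the visited state onwards.
        for s, r in reversed(episode):
            G = r + gamma * G
            returns.setdefault(s, []).append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```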

Exercise: Direct Evaluation (γ = 1, per-step reward R = -1)
[Gridworld figure: exit rewards +100 at (4,3) and -100 at (4,2)]
Episode 1: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done)
Episode 2: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done)
(a) According to direct evaluation, what is V(3,3)?
(b) According to direct evaluation, what is V(2,3)?
(c) Just based on these samples, what could be a better estimate for V(2,3)?

Limitations of Direct Evaluation
Assume a random initial state.
Assume the value of state (1,2) is known perfectly from past runs.
Now we encounter (1,1) for the first time: can we do better than estimating V(1,1) as the reward outcome of that single run?

Model-free: 2. TD Learning
Who needs T and R? Approximate the expectation with samples of the next state s' (drawn from T!).
Almost! But we can't rewind time to get sample after sample from state s.

Exponential Moving Average
- Makes recent samples more important
- Forgets about the past (distant past values were wrong anyway)
- Easy to compute from the running average
- Decreasing the learning rate can give converging averages
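Putting the two previous slides together, a single TD(0) backup with an exponential-moving-average update might look like this (a hedged sketch; the function name and defaults are illustrative):

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) step: move V(s) toward the sample r + gamma * V(s')
    using an exponential moving average with learning rate alpha."""
    sample = r + (0.0 if terminal else gamma * V.get(s_next, 0.0))
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    return V
```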

Problems with TD Value Learning
TD value learning is a model-free way to do policy evaluation.
However, if we want to turn values into a (new) policy, we're sunk: acting greedily with respect to V requires a one-step lookahead, which needs T and R.
Idea: learn Q-values directly. This makes action selection model-free too!

Detour: Q-Value Iteration
Value iteration: find successive approximations to the optimal values.
- Start with V_0*(x) = 0, which we know is right (why?)
- Given V_i*, calculate the values for all states at depth i+1.
But Q-values are more useful!
- Start with Q_0*(x, u) = 0, which we know is right (why?)
- Given Q_i*, calculate the Q-values for all Q-states at depth i+1 (see the sketch below).
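For reference, a small sketch of Q-value iteration, assuming the dictionary-style T and R estimated earlier; the representation and defaults are assumptions for illustration, not the tutorial's code.

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """Q-value iteration: start from Q_0(x,u) = 0 and repeatedly apply
    Q_{i+1}(x,u) = sum_x' T(x,u,x') * [R(x,u) + gamma * max_u' Q_i(x',u')]."""
    Q = {(x, u): 0.0 for x in states for u in actions}
    for _ in range(iters):
        Q_new = {}
        for x in states:
            for u in actions:
                total = 0.0
                for x_next, p in T.get((x, u), {}).items():
                    best_next = max(Q[(x_next, u2)] for u2 in actions)
                    total += p * (R.get((x, u), 0.0) + gamma * best_next)
                Q_new[(x, u)] = total
        Q = Q_new
    return Q
```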

Q-Learning
Q-learning: sample-based Q-value iteration. Learn Q* values.
- Receive a sample (x, u, x', r)
- Consider your old estimate Q(x, u)
- Consider your new sample estimate: r + γ max_u' Q(x', u')
- Incorporate the new estimate into a running average (see the update sketch below)
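A minimal tabular Q-learning update corresponding to this slide (sketch only; the default α and γ are illustrative):

```python
def q_learning_update(Q, x, u, r, x_next, actions,
                      alpha=0.1, gamma=0.9, terminal=False):
    """Tabular Q-learning: blend the old estimate with the new sample
    r + gamma * max_u' Q(x', u') using learning rate alpha."""
    best_next = 0.0 if terminal else max(Q.get((x_next, u2), 0.0) for u2 in actions)
    sample = r + gamma * best_next
    Q[(x, u)] = (1 - alpha) * Q.get((x, u), 0.0) + alpha * sample
    return Q
```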

Q-Learning Properties
Amazing result: Q-learning converges to the optimal policy
- if you explore enough, and
- if you make the learning rate small enough, but don't decrease it too quickly!
Basically it doesn't matter how you select actions (!)
Neat property: off-policy learning, i.e. learning the optimal policy without following it.

Q-Learning
In realistic situations, we cannot possibly learn about every single state!
- Too many states to visit them all in training
- Too many states to hold the Q-tables in memory
Instead, we want to generalize:
- Learn about a small number of training states from experience
- Generalize that experience to new, similar states
This is a fundamental idea in machine learning, and we'll see it over and over again.

Example: Pacman
Let's say we discover through experience that one state is bad. In naïve (tabular) Q-learning, that tells us nothing about an almost identical state or its Q-states, or even about a third, closely related one! [The slide shows near-identical Pacman boards.]

Feature-Based Representations
Solution: describe a state using a vector of features.
Features are functions from states to real numbers (often 0/1) that capture important properties of the state.
Example features:
- Distance to closest ghost
- Distance to closest dot
- Number of ghosts
- 1 / (distance to dot)^2
- Is Pacman in a tunnel? (0/1)
- etc.
Can also describe a Q-state (s, a) with features (e.g. "action moves closer to food"); a hypothetical extractor is sketched below.
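A hypothetical feature extractor in this spirit; the state-dict fields and feature names below are made up for illustration and are not a real Pacman API.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def features(x, u):
    """Hypothetical feature extractor for a Q-state (x, u). `x` is assumed to
    be a dict with Pacman's position, ghost and dot positions, and legal moves."""
    dr, dc = x["moves"][u]
    pos = (x["pacman"][0] + dr, x["pacman"][1] + dc)
    return {
        "bias": 1.0,
        "dist_to_closest_ghost": float(min(manhattan(pos, g) for g in x["ghosts"])),
        "inv_dist_to_closest_dot": 1.0 / (1.0 + min(manhattan(pos, d) for d in x["dots"])),
        "num_ghosts": float(len(x["ghosts"])),
    }
```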

Linear Feature Functions
Using a feature representation, we can write a Q-function (or value function) for any state using a few weights: Q(x, u) = w_1 f_1(x, u) + w_2 f_2(x, u) + ... + w_n f_n(x, u).
Advantage: our experience is summed up in a few powerful numbers.
Disadvantage: states may share features but be very different in value!

Tabular Q-function vs. Linear Q-function
Tabular case (Q table):
- Sample: sample = r + γ max_u' Q(x', u')
- Difference: difference = sample - Q(x, u)
- Update: Q(x, u) ← Q(x, u) + α · difference

Linear Q-function:
- Sample: sample = r + γ max_u' Q(x', u')
- Difference: difference = sample - Q(x, u)
- Update: w_i ← w_i + α · difference · f_i(x, u)
Intuitive interpretation: adjust the weights of the active features. E.g. if something unexpectedly bad happens, disprefer all states with that state's features.
Formal justification: online least squares (see the sketch below and the next slides).
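A sketch of the approximate Q-learning update above; it accepts any feature extractor with the signature features(x, u) -> dict (e.g. the hypothetical one sketched earlier). This is illustrative code, not the tutorial's implementation.

```python
def linear_q(w, feats):
    """Q(x,u) = sum_i w_i * f_i(x,u), with features given as a dict."""
    return sum(w.get(name, 0.0) * value for name, value in feats.items())

def approx_q_update(w, x, u, r, x_next, actions, features, alpha=0.05, gamma=0.9):
    """Approximate Q-learning: compute the same TD difference as in the tabular
    case, then nudge each active feature's weight by alpha * difference * f_i(x,u)."""
    sample = r + gamma * max(linear_q(w, features(x_next, u2)) for u2 in actions)
    feats = features(x, u)
    difference = sample - linear_q(w, feats)
    for name, value in feats.items():
        w[name] = w.get(name, 0.0) + alpha * difference * value
    return w
```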

Example: Q-Pacman

Ordinary Least Squares (OLS)
[Figure: observations vs. a linear prediction; the vertical gaps between them are the errors (residuals)]

Minimizing Error
The value update explained: the approximate Q-learning update can be read as one step of gradient descent on the squared prediction error (derivation sketched below).
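One way to see this (a standard derivation, filled in here for completeness): with the squared-error objective E(w) = 1/2 (sample - Q(x, u; w))^2 and a linear Q-function Q(x, u; w) = Σ_i w_i f_i(x, u), the gradient is ∂E/∂w_i = -(sample - Q(x, u; w)) · f_i(x, u). One gradient-descent step, w_i ← w_i - α ∂E/∂w_i, therefore gives exactly the update w_i ← w_i + α · difference · f_i(x, u) from the linear Q-function slide.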

Function Approximation
The update we covered = gradient descent on one sample. How about a batch version? That is called fitted Q-iteration.

Fitted Q-Iteration
Assume a Q-function of the form Q(x, u; w), e.g. Q(x, u; w) = Σ_i w_i f_i(x, u).
Iterate for k = 1, 2, ... (improving w in each iteration):
- Obtain samples (x^(j), u^(j), x'^(j), r^(j)), j = 1, 2, ..., J (from a model or from experience; the sample set can be kept fixed or grown over time)
- Supervised learning on: w^(k+1) = argmin_w Σ_j loss( Q(x^(j), u^(j); w), sample^(j) ), where sample^(j) = r^(j) + γ max_u' Q(x'^(j), u'; w^(k))
A minimal implementation sketch follows below.
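A compact sketch of fitted Q-iteration with a linear Q-function, using ordinary least squares as the supervised step. The sample format and the `features` function (here assumed to map (x, u) to a 1-D numpy vector) are assumptions for illustration.

```python
import numpy as np

def fitted_q_iteration(samples, features, actions, gamma=0.9, iters=20):
    """Fitted Q-iteration with Q(x, u; w) = w . f(x, u).
    `samples` is a fixed batch of (x, u, x_next, r) tuples. Each iteration
    regresses w onto the bootstrapped targets r + gamma * max_u' Q(x', u'; w_k)."""
    F = np.array([features(x, u) for x, u, _, _ in samples])
    w = np.zeros(F.shape[1])
    for _ in range(iters):
        targets = []
        for x, u, x_next, r in samples:
            q_next = max(features(x_next, u2) @ w for u2 in actions)
            targets.append(r + gamma * q_next)
        # Supervised step: w_{k+1} = argmin_w sum_j (Q(x_j, u_j; w) - target_j)^2
        w, *_ = np.linalg.lstsq(F, np.array(targets), rcond=None)
    return w
```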

Outline (recap)
Challenge: most real-world problems have large, often infinite and continuous, state spaces.
Value function methods (model-free learning):
- Monte Carlo, TD-learning and Q-learning (tabular)
- Function approximation
- Q-learning with feature-based representations
- Fitted Q-learning
Often a good approach, even when a model is available.

Fitted Q-Iteration Demos (Martin Riedmiller and collaborators)
Neural fitted Q-iteration: learning from scratch, without a model, with a growing batch; typically, improving the Q-function and collecting transitions are done in alternating fashion.
Dribbling with soccer robots: difficult to solve analytically due to the physical interaction of robot and ball. The robot first plays randomly with the ball and then learns to dribble, being rewarded if it turns to the desired target direction without losing the ball and punished otherwise.
Also: slot-car racing, cart and double pole, active suspension of a convertible car, steering of an autonomous car, magnetic levitation, ...

Mini Project! (Optional)
Consolidate your understanding! Implement and experiment with:
- Value iteration
- Q-learning
- Q-learning with function approximation
Time frame: now and the lunch break.