Reinforcement Learning Part 2: Value Function Methods. Jan Peters, Gerhard Neumann

Reinforcement Learning Part 2: Value Function Methods. Jan Peters, Gerhard Neumann

The Bigger Picture: How to Learn Policies

Purpose of this Lecture. Often, learning a good model is too hard. The optimization inherent in optimal control is prone to model errors: the controller may achieve the objective only because model errors get exploited. Optimal control methods based on linearization of the dynamics work only for moderately non-linear tasks. Model-free approaches are needed that make no assumptions about the structure of the model. Classical reinforcement learning: solve the optimal control problem by learning the value function, not the model!

Outline of the Lecture: 1. Quick Recap of Dynamic Programming; 2. Reinforcement Learning with Temporal Differences; 3. Value Function Approximation; 4. Batch Reinforcement Learning Methods (Least-Squares Temporal Difference Learning, Fitted Q-Iteration); 5. Robot Application: Robot Soccer; Final Remarks.

Markov Decision Processes (MDPs). Classical reinforcement learning is typically formulated for the infinite-horizon objective. Infinite horizon: maximize the discounted accumulated reward $J^\pi = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, where the discount factor $\gamma \in [0, 1)$ trades off long-term versus immediate reward.

Value Functions and State-Action Value Functions. Refresher: the value function $V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$ and the state-action value function $Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$ can be computed iteratively from the Bellman equation $V^\pi(s) = \sum_a \pi(a|s)\left(r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V^\pi(s')\right)$.
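
As a concrete illustration of this iterative computation, here is a minimal policy-evaluation sketch (not from the lecture; the tabular arrays P and R and the policy pi are assumed inputs):

    import numpy as np

    def policy_evaluation(P, R, pi, gamma=0.95, tol=1e-8):
        """Iterate the Bellman equation for a fixed policy pi.
        P[s, a, s2]: transition probabilities, R[s, a]: rewards,
        pi[s, a]: action probabilities."""
        V = np.zeros(P.shape[0])
        while True:
            # V(s) = sum_a pi(a|s) * (R(s,a) + gamma * sum_s' P(s'|s,a) V(s'))
            V_new = np.einsum('sa,sa->s', pi, R + gamma * (P @ V))
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new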

Finding an Optimal Value Function. Bellman equation of optimality: $V^*(s) = \max_a \left(r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V^*(s')\right)$. Iterating the Bellman equation converges to the optimal value function; this procedure is called value iteration. Alternatively, we can also iterate Q-functions: $Q^*(s,a) = r(s,a) + \gamma \sum_{s'} p(s'|s,a) \max_{a'} Q^*(s',a')$.
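
A minimal value-iteration sketch under the same assumed tabular model (P[s, a, s'] transition probabilities, R[s, a] rewards); this just illustrates the Bellman backup above and is not code from the lecture:

    import numpy as np

    def value_iteration(P, R, gamma=0.95, tol=1e-8):
        """Iterate V(s) <- max_a (R(s,a) + gamma * sum_s' P(s'|s,a) V(s'))."""
        V = np.zeros(P.shape[0])
        while True:
            Q = R + gamma * (P @ V)              # Q[s, a] from one-step lookahead
            V_new = Q.max(axis=1)                # Bellman optimality backup
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmax(axis=1)   # optimal values and greedy policy
            V = V_new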

Outline of the Lecture: 1. Quick Recap of Dynamic Programming; 2. Reinforcement Learning with Temporal Differences; 3. Value Function Approximation; 4. Batch Reinforcement Learning Methods (Least-Squares Temporal Difference Learning, Fitted Q-Iteration); 5. Robot Application: Robot Soccer; Final Remarks.

Value-Based Reinforcement Learning. Classical reinforcement learning updates the value function based on samples: we do not have a model and we do not want to learn one, so we use the samples to update the Q-function (or V-function) directly. Let's start simple: discrete states and actions, and a tabular Q-function.

Temporal Difference Learning. Given a transition $(s_t, a_t, r_t, s_{t+1})$, we want to update the V-function. Estimate of the current value: $V(s_t)$. 1-step prediction of the current value: $r_t + \gamma V(s_{t+1})$. 1-step prediction error (called the temporal difference (TD) error): $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. Update the current value with the temporal difference error: $V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t$.

Temporal Difference Learning. The TD error compares the one-step lookahead prediction with the current estimate of the value function: if $\delta_t > 0$, then $V(s_t)$ is increased; if $\delta_t < 0$, then $V(s_t)$ is decreased.

Dopamine as TD Error? Temporal difference error signals can be measured in the brains of monkeys. Monkey brains seem to have it...

Algorithmic Description of TD Learning. Init: $V(s) = 0$ for all states (or an arbitrary initial guess). Repeat: observe a transition $(s_t, a_t, r_t, s_{t+1})$; compute the TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$; update the V-function $V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t$; until convergence of V. TD learning is used to compute the value function of the behavior policy; it is a sample-based version of policy evaluation.
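
A minimal sketch of this TD(0) loop for a tabular V-function; the environment interface (env.reset/env.step returning next state, reward, done) and the behavior policy are illustrative assumptions, not part of the lecture:

    import numpy as np

    def td0(env, policy, n_states, episodes=500, alpha=0.1, gamma=0.95):
        """Sample-based policy evaluation with the TD(0) update."""
        V = np.zeros(n_states)
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                # TD error: 1-step prediction minus current estimate
                delta = r + gamma * (0.0 if done else V[s_next]) - V[s]
                V[s] += alpha * delta
                s = s_next
        return V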

Temporal Difference Learning for Control. So far: policy evaluation with TD methods. Can we also do the policy improvement step with samples? Yes, but we need to enforce exploration, i.e., not always take the greedy action. Epsilon-greedy policy: with probability $1-\epsilon$ take the greedy action $\arg\max_a Q(s,a)$, with probability $\epsilon$ take a uniformly random action. Soft-max policy: $\pi(a|s) \propto \exp(Q(s,a)/\tau)$ with temperature $\tau$.
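
Minimal sketches of the two exploration policies named above, assuming a tabular Q array indexed as Q[s, a]:

    import numpy as np

    def epsilon_greedy(Q, s, epsilon=0.1):
        """With probability epsilon pick a random action, otherwise the greedy one."""
        if np.random.rand() < epsilon:
            return np.random.randint(Q.shape[1])
        return int(np.argmax(Q[s]))

    def softmax_policy(Q, s, temperature=1.0):
        """Sample an action with probability proportional to exp(Q(s,a)/temperature)."""
        prefs = Q[s] / temperature
        prefs -= prefs.max()                      # shift for numerical stability
        probs = np.exp(prefs) / np.sum(np.exp(prefs))
        return int(np.random.choice(Q.shape[1], p=probs))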

Temporal Difference Learning for Control. Update equations for learning the Q-function; there are two different methods to estimate it. Q-learning: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,(r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t))$ estimates the Q-function of the optimal policy from off-policy samples. SARSA: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,(r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))$, where $a_{t+1}$ is the action actually taken in $s_{t+1}$; it estimates the Q-function of the exploration policy from on-policy samples. Note: the policy generating the actions depends on the Q-function, so it is a non-stationary policy.
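
For comparison, minimal sketches of the two update rules on a tabular Q array (variable names are illustrative; actions would be chosen, e.g., with the epsilon-greedy sketch above):

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
        """Off-policy: bootstrap with the greedy (max) action in the next state."""
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
        """On-policy: bootstrap with the action actually taken by the exploration policy."""
        target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])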

Outline of the Lecture: 1. Quick Recap of Dynamic Programming; 2. Reinforcement Learning with Temporal Differences; 3. Value Function Approximation; 4. Batch Reinforcement Learning Methods (Least-Squares Temporal Difference Learning, Fitted Q-Iteration); 5. Robot Application: Robot Soccer; Final Remarks.

Approximating the Value Function. In the continuous case, we need to approximate the V-function (except for LQR). Let's keep it simple and use a linear model to represent the V-function: $V_\theta(s) = \phi(s)^\top \theta$, with feature vector $\phi(s)$ and parameter vector $\theta$. How can we find the parameters? Again with temporal difference learning.

TD Learning with Function Approximation. Derivation: use the recursive definition of the V-function, $V^\pi(s_t) = \mathbb{E}\left[r_t + \gamma V^\pi(s_{t+1})\right]$, with bootstrapping (BS): use the old approximation to get the target values for a new approximation, i.e., minimize $\text{MSE}_{\text{BS}}(\theta) = \sum_t \left(r_t + \gamma\, \phi(s_{t+1})^\top \theta_{\text{old}} - \phi(s_t)^\top \theta\right)^2$. How can we minimize this function? Let's use stochastic gradient descent.

Refresher: Stochastic Gradient Descent. Consider an expected error function $E(\theta) = \mathbb{E}_x\!\left[e(\theta; x)\right]$. We can find a local minimum of E by gradient descent: $\theta_{k+1} = \theta_k - \alpha \nabla_\theta E(\theta_k)$. Stochastic gradient descent does the gradient update already after a single sample: $\theta_{k+1} = \theta_k - \alpha_k \nabla_\theta e(\theta_k; x_k)$. It converges under the stochastic approximation conditions $\sum_k \alpha_k = \infty$ and $\sum_k \alpha_k^2 < \infty$.

Temporal Difference Learning. Stochastic gradient descent on our error function $\text{MSE}_{\text{BS}}$ gives the update rule (for the current time step t): $\theta_{t+1} = \theta_t + \alpha\, \delta_t\, \phi(s_t)$, with TD error $\delta_t = r_t + \gamma\, \phi(s_{t+1})^\top \theta_t - \phi(s_t)^\top \theta_t$.

Temporal Difference Learning. TD with function approximation differs from the discrete algorithm in that the TD error is multiplied by the feature vector. It is equivalent to the tabular algorithm if tabular feature coding is used, i.e., $\phi(s)$ is a unit vector with a 1 at the entry corresponding to state s and 0 elsewhere. Similar update rules can be obtained for SARSA and Q-learning.
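
A minimal sketch of this linear TD(0) update, assuming a feature function phi(s) that returns a NumPy vector (an illustrative assumption):

    import numpy as np

    def linear_td_update(theta, phi, s, r, s_next, done, alpha=0.01, gamma=0.95):
        """theta <- theta + alpha * delta * phi(s), with
        delta = r + gamma * phi(s')^T theta - phi(s)^T theta."""
        v_next = 0.0 if done else phi(s_next) @ theta
        delta = r + gamma * v_next - phi(s) @ theta
        return theta + alpha * delta * phi(s)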

Temporal Difference Learning. Some remarks on temporal difference learning: it is not a proper stochastic gradient descent! Why? The target values change after each parameter update; we ignore the fact that the target $r_t + \gamma\, \phi(s_{t+1})^\top \theta$ also depends on $\theta$. Side note: this actually introduces a bias into our optimization, such that we are optimizing a different objective than the MSE. In certain cases we also get divergence (e.g., with off-policy samples). TD learning is very fast in terms of computation time, O(#features), but not data-efficient: each sample is used only once! Dann, Neumann, Peters: Policy Evaluation with Temporal Differences: A Survey and Comparison, JMLR, in press.

Successful Examples. Linear function approximation: Tetris, Go. Non-linear function approximation: TD-Gammon (world-champion level), Atari games (learning from raw pixel input).

Outline of the Lecture: 1. Quick Recap of Dynamic Programming; 2. Reinforcement Learning with Temporal Differences; 3. Value Function Approximation; 4. Batch Reinforcement Learning Methods (Least-Squares Temporal Difference Learning, Fitted Q-Iteration); 5. Robot Application: Robot Soccer; Final Remarks.

Batch-Mode Reinforcement Learning. Online methods are typically data-inefficient, as they use each data point only once. Can we re-use the whole batch of data to increase data efficiency? Yes: Least-Squares Temporal Difference (LSTD) learning and Fitted Q-Iteration. Both are computationally much more expensive than TD learning!

Least-Squares Temporal Difference (LSTD). Let's minimize the bootstrapped MSE objective ($\text{MSE}_{\text{BS}}$) over a whole batch of transitions. Least-squares solution: $\theta = (\Phi^\top \Phi)^{-1} \Phi^\top (r + \gamma \Phi' \theta_{\text{old}})$, with $\Phi$ the feature matrix of the visited states, $\Phi'$ the feature matrix of the successor states, and $r$ the vector of rewards.

Least-Squares Temporal Difference (LSTD). Least-squares solution: $\theta = (\Phi^\top \Phi)^{-1} \Phi^\top (r + \gamma \Phi' \theta_{\text{old}})$. Fixed point: in case of convergence, we want to have $\theta_{\text{old}} = \theta$.

Least-Squares Temporal Difference (LSTD). LSTD solution: $\theta_{\text{LSTD}} = \left(\Phi^\top (\Phi - \gamma \Phi')\right)^{-1} \Phi^\top r$. This is the same solution as the convergence point of TD learning, but obtained in one shot: no iterations are necessary for policy evaluation. LSQ is the adaptation for learning the Q-function, used for Least-Squares Policy Iteration (LSPI). Lagoudakis and Parr, Least-Squares Policy Iteration, JMLR 2003.
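
A minimal LSTD sketch for the closed-form solution above, assuming a batch of (state, reward, next state, done) transitions and a feature function phi; the small ridge term is an implementation choice for numerical stability, not part of the slides:

    import numpy as np

    def lstd(transitions, phi, n_features, gamma=0.95, ridge=1e-6):
        """Solve A theta = b with A = sum_t phi_t (phi_t - gamma phi_{t+1})^T
        and b = sum_t phi_t r_t."""
        A = ridge * np.eye(n_features)
        b = np.zeros(n_features)
        for s, r, s_next, done in transitions:
            phi_t = phi(s)
            phi_next = np.zeros(n_features) if done else phi(s_next)
            A += np.outer(phi_t, phi_t - gamma * phi_next)
            b += phi_t * r
        return np.linalg.solve(A, b)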

Learning to Ride a Bicycle. State space: angle of the handlebar, vertical angle of the bike, angle to the goal. Action space: 5 discrete actions (torque applied to the handlebar, displacement of the rider). Feature space: 20 basis functions.

Fitted Q-Iteration. In batch-mode RL it is also much easier to use non-linear function approximators. Many of them only exist in the batch setting, e.g., regression trees. There is no catastrophic forgetting, which is an issue, e.g., for neural networks trained online. Strong divergence problems can occur; for neural networks they are fixed by ensuring that there is a goal state where the Q-function value is always zero (see Lange et al. below). Fitted Q-iteration uses non-linear function approximators for approximate value iteration. Ernst, Geurts and Wehenkel, Tree-Based Batch Mode Reinforcement Learning, JMLR 2005. Lange, Gabel and Riedmiller, Batch Reinforcement Learning, in Reinforcement Learning: State of the Art.

Fitted Q-Iteration. Given: a dataset of transitions $D = \{(s_i, a_i, r_i, s_i')\}_{i=1,\dots,N}$. Algorithm: initialize $\hat{Q}_0 = 0$ and set the input data to the state-action pairs $X = \{(s_i, a_i)\}$; for k = 1 to L, generate target values $y_i^k = r_i + \gamma \max_{a'} \hat{Q}_{k-1}(s_i', a')$ and learn the new Q-function $\hat{Q}_k$ by regressing $y^k$ on $X$; end. This is like value iteration, but we use supervised learning methods to approximate the Q-function at each iteration k.
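
A minimal fitted Q-iteration sketch following the loop above; the regressor (scikit-learn's ExtraTreesRegressor, in the spirit of Ernst et al.), the discrete action indices, and the omission of terminal-state handling are illustrative simplifications, not the lecture's implementation:

    import numpy as np
    from sklearn.ensemble import ExtraTreesRegressor

    def fitted_q_iteration(D, n_actions, L=50, gamma=0.95):
        """D: list of (s, a, r, s_next) with s a 1-D array and a an action index."""
        X = np.array([np.append(s, a) for s, a, _, _ in D])      # inputs (s, a)
        R = np.array([r for _, _, r, _ in D])
        S_next = np.array([s_next for _, _, _, s_next in D])
        Q = None
        for _ in range(L):
            if Q is None:
                y = R                                             # Q_0 targets: immediate reward
            else:
                # y_i = r_i + gamma * max_a' Q_{k-1}(s'_i, a')
                q_next = np.column_stack([
                    Q.predict(np.column_stack([S_next, np.full(len(S_next), a)]))
                    for a in range(n_actions)])
                y = R + gamma * q_next.max(axis=1)
            Q = ExtraTreesRegressor(n_estimators=50).fit(X, y)    # supervised regression step
        return Q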

Fitted Q-Iteration. Some remarks: the regression takes care of the expectation for us; the max operator is still hard to solve for continuous action spaces. For continuous actions, see: Neumann and Peters, Fitted Q-Iteration by Advantage Weighted Regression, NIPS 2008.

Case Study I: Learning Defense

Success

Dueling Behavior

Case Study II: Learning Motor Speeds

Case Study III: Learning to Dribble

Value Function Methods... were the driving reinforcement learning approach in the 1990s. You can do loads of cool things with them: learn chess at professional level, learn backgammon and checkers at grandmaster level... and win the robot soccer cup with a minimum of manpower. So why are they not always the method of choice? You need to fill up your state-action space with sufficient samples, which is another curse of dimensionality with an exponential explosion (although it scales better than covering the whole space, as we only need samples at relevant locations). Errors in the value function approximation might have a catastrophic effect on the policy and can be very hard to control.