Reinforcement Learning (Model-free RL), R&N Chapter 21

Demos and data contributions from Vivek Mehta (vivekm@cs.cmu.edu) and Rohit Kelkar (ryk@cs.cmu.edu).

[Figure: the 4x3 grid world from the earlier MDP lectures, with terminal rewards +1 and -1; the intended action a succeeds with probability T(s,a,s') = 0.8 and slips to either side with probability 0.1.]

Same (fully observable) MDP as before, except:
- We don't know the model of the environment
- We don't know T(.,.,.)
- We don't know R(.)
The task is still the same: find an optimal policy.

General Problem
All we can do is try to execute actions and record the resulting rewards.
World: You are in state 102; you have a choice of 4 actions.
Robot: I'll take action 2.
World: You get a reward of 1 and you are now in state 63; you have a choice of 3 actions.
Robot: I'll take action 3.
World: You get a reward of -10 and you are now in state 12; you have a choice of 4 actions...
Notice that we do have state observability!

Classes of Techniques
Reinforcement learning techniques split into two classes:
- Model-based: try to learn an explicit model of T(.,.,.) and R(.).
- Model-free: recover an optimal policy without ever estimating a model.
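
For concreteness, here is a minimal sketch of the interaction loop in the General Problem slide, assuming a hypothetical Environment object whose reset() returns a start state with its available actions and whose step(a) returns the reward, the next state, and the actions available there.

import random

def run_episode(env, choose_action, max_steps=100):
    # Repeatedly: observe the state, pick an action, record the reward.
    state, actions = env.reset()
    history = []
    for _ in range(max_steps):
        a = choose_action(state, actions)
        reward, next_state, next_actions = env.step(a)
        history.append((state, a, reward, next_state))
        state, actions = next_state, next_actions
    return history

# Simplest possible action-selection rule: pick uniformly at random.
def random_policy(state, actions):
    return random.choice(list(actions))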

Model-Free
We are not interested in T(.,.,.); we are only interested in the resulting values and policies. Can we compute something without an explicit model of T(.,.,.)? First, let's fix a policy and compute the resulting values.

Temporal Differencing
Upon taking action a = π(s), the values satisfy:
U(s) = R(s) + γ Σ_s' T(s,a,s') U(s')
For any successor s' of s, U(s) lies between:
- the new value considering only the actual s' reached: R(s) + γ U(s'), and
- the old value U(s).
Here γ is the discount factor on future rewards.

Temporal Differencing
Upon moving from s to s' using action a, the new estimate of U(s) is:
U(s) ← (1-α) U(s) + α (R(s) + γ U(s'))
Equivalently, when moving from any state s to a state s', update:
U(s) ← U(s) + α (R(s) + γ U(s') - U(s))
The parenthesized term is the discrepancy between the current value and the new guess at the value after moving to s'. Note that the transition probabilities do not appear anywhere!
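
As a concrete illustration, here is a minimal sketch of this tabular TD update, assuming hashable states and that rewards are observed directly; the names U and td_update, and the default α and γ, are illustrative.

from collections import defaultdict

# Tabular temporal-difference update for a fixed policy:
# U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s)).
U = defaultdict(float)   # value estimates, default 0

def td_update(s, reward, s_next, alpha=0.1, gamma=0.9):
    target = reward + gamma * U[s_next]    # new guess from the actual successor
    U[s] = U[s] + alpha * (target - U[s])  # move the estimate toward the guess

# Usage on one observed transition (hypothetical states and reward):
td_update(s="(1,1)", reward=-0.04, s_next="(1,2)")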

Temporal Differencing: Learning Rate
U(s) ← U(s) + α (R(s) + γ U(s') - U(s))
How do we choose the learning rate 0 < α < 1?
- Too small: converges slowly; tends to always trust the current estimate of U.
- Too large: changes very quickly; tends to always replace the current estimate with the new guess.
A good compromise: start with a large α (we are not confident in our current estimate, so we allow it to change a lot), and decrease α as we explore more (we become more and more confident in our estimate, so we do not want to change it much).
[Figure: α plotted against iterations, decreasing over time.]
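
For concreteness, here is a sketch of one such decaying schedule; the specific form α = K/(K + iteration) matches the decaying-rate experiments shown later, and the value of K is an illustrative tuning constant.

# Decaying learning-rate schedule (a sketch of the idea above).
# alpha = K / (K + n) starts near 1 and decays toward 0 as the
# iteration count n grows; K controls how quickly it decays.
def alpha_schedule(iteration, K=100.0):
    return K / (K + iteration)

# e.g. alpha_schedule(0) == 1.0, alpha_schedule(100) == 0.5,
#      alpha_schedule(900) == 0.1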

Summary So Far
Learning = exploring the environment and recording the received rewards.
Model-based techniques:
- Estimate the transition probabilities, then apply the previous MDP techniques to find values and policies.
- More efficient variants: a single value update at each state; selection of interesting states to update (prioritized sweeping); exploration strategies.
Model-free techniques (so far):
- Temporal-difference updates estimate values without ever estimating the transition model.
- Parameter: the learning rate, which must decay over the iterations.

Temporal Differencing
U(s) ← U(s) + α (R(s) + γ U(s') - U(s))
The update moves the current value toward a new guess at the value after moving to s'; the transition probabilities do not appear anywhere. But how do we find the optimal policy?

Q-Learning
U(s) = utility of state s = expected sum of future discounted rewards.
Q(s,a) = value of taking action a at state s = expected sum of future discounted rewards after taking action a at state s.
(s,a) is a state-action pair: Q-learning maintains a table of Q(s,a) instead of U(s).

Q-Learning
For the optimal Q*:
Q*(s,a) = R(s) + γ Σ_s' T(s,a,s') max_a' Q*(s',a')
π*(s) = argmax_a Q*(s,a)
Reading the first equation: Q*(s,a) is the best expected value for the state-action pair (s,a); R(s) is the reward at state s; the sum averages over all possible states s' that can be reached from s after executing action a; and max_a' Q*(s',a') is the best value at the next state, i.e., the maximum over all actions that could be executed at s'.

Q-Learning: Updating Q without a Model
Use temporal differencing. After moving from state s to state s' using action a:
Q(s,a) ← Q(s,a) + α (R(s) + γ max_a' Q(s',a') - Q(s,a))
The left-hand side is the new estimate of Q(s,a); the first term on the right is the old estimate; the parenthesized term is the difference between the old estimate and the new guess after taking action a; and α is the learning rate, 0 < α < 1.
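
A minimal sketch of this tabular Q-update, assuming hashable states and a fixed action set; the names ACTIONS and q_update, and the default α and γ, are illustrative rather than taken from the slides.

from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]   # illustrative action set
Q = defaultdict(float)                      # Q[(s, a)] defaults to 0

def q_update(s, a, reward, s_next, alpha=0.1, gamma=0.9):
    # Best value achievable from the next state: max over next actions.
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    # Move Q(s,a) toward the new guess R(s) + gamma * max_a' Q(s',a').
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])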

Q-Learning: Estimating the Policy
Q-update: after moving from state s to state s' using action a:
Q(s,a) ← Q(s,a) + α (R(s) + γ max_a' Q(s',a') - Q(s,a))
Policy estimation: π(s) = argmax_a Q(s,a)
Key point: we do not use T(.,.,.) anywhere. We can compute optimal values and policies without ever computing a model of the MDP!

Q-Learning: Convergence
Q-learning is guaranteed to converge to an optimal policy (Watkins), provided every state-action pair keeps being tried and the learning rate decays appropriately. It is a very general procedure (because it is completely model-free), but it may be slow (for the same reason).

[Worked example (figure): the resulting optimal policy is π*(s1) = a1 and π*(s2) = a1.]

Q-Learning: Exploration Strategies
How do we choose the next action while we are learning?
- Random: choose an action uniformly at random.
- Greedy: always choose the estimated best action π(s).
- ε-greedy: choose the estimated best action with probability 1-ε, otherwise a random action.
- Boltzmann: choose action a with probability proportional to e^(Q(s,a)/T), where T is a temperature.
(A sketch of ε-greedy and Boltzmann selection follows after the Evaluation slide below.)

Evaluation
How do we measure how well the learning procedure is doing?
U(s) = value estimated at s at the current learning iteration.
U*(s) = optimal value if we knew everything about the environment.
Error = U - U*, the gap between the estimated and the optimal values.
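
A minimal sketch of the ε-greedy and Boltzmann selection rules listed above, assuming the tabular Q dictionary and action list from the earlier Q-update sketch; the function names and the default ε and temperature values are illustrative.

import math
import random

def epsilon_greedy(s, Q, actions, epsilon=0.1):
    # With probability epsilon explore at random, otherwise exploit.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(s, Q, actions, temperature=1.0):
    # Choose a with probability proportional to exp(Q(s,a)/T).
    weights = [math.exp(Q[(s, a)] / temperature) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]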

[Plots: constant learning rate (α = 0.001 and α = 0.1) versus decaying learning rate α = K/(K + iteration #); data from Rohit & Vivek, 2005.]

[Plots: changing environments; adaptive learning rate. Data from Rohit & Vivek, 2005.]

Example: Pushing Robot
Task: learn how to push boxes around. States: sensor readings. Actions: move forward, turn.
State = 1 bit for each of the NEAR and FAR gates on each of 8 sonar sensors, plus 1 bit for BUMP and 1 bit for STUCK = 18 bits.
Actions = move forward, turn +/- 22°, or turn +/- 45° = 5 actions.
Example from Mahadevan and Connell, "Automatic Programming of Behavior-based Robots using Reinforcement Learning," Proceedings AAAI 1991.
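
A sketch of how such an 18-bit state index might be packed into an integer, purely to make the state-space size concrete; the bit ordering and the action names are assumptions, not taken from the paper.

# Pack the 18 boolean sensor bits into a single integer state index.
# Ordering is illustrative: 8 NEAR bits, 8 FAR bits, then BUMP and STUCK.
def encode_state(near_bits, far_bits, bump, stuck):
    assert len(near_bits) == 8 and len(far_bits) == 8
    bits = list(near_bits) + list(far_bits) + [bump, stuck]
    state = 0
    for b in bits:
        state = (state << 1) | int(b)
    return state   # an integer in [0, 2**18)

ACTIONS = ["forward", "turn_left_22", "turn_right_22",
           "turn_left_45", "turn_right_45"]   # the 5 robot actions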

Learn How to Find the Boxes
A box is found when the NEAR bits are on for all the front sonars.
Reward: R(s) = +3 if the NEAR bits are on; R(s) = -1 if they are off.

Learn How to Push the Box
Try to maintain contact with the box while moving forward.
Reward: R(s) = +1 if BUMP while moving forward; R(s) = -3 if the robot loses contact.
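
A sketch of these two reward functions, assuming the sensor bits are available as booleans; the function names and the zero reward in the remaining case are assumptions.

# Reward for the "find the box" behavior: +3 when the front NEAR bits
# are all on, -1 otherwise.
def reward_find(front_near_bits):
    return 3 if all(front_near_bits) else -1

# Reward for the "push the box" behavior: +1 for bumping while moving
# forward (keeping contact), -3 when contact with the box is lost.
def reward_push(bump, moving_forward, lost_contact):
    if bump and moving_forward:
        return 1
    if lost_contact:
        return -3
    return 0   # assumption: no reward in the remaining cases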

Learn How to Get Unwedged
The robot may get wedged against walls, in which case the STUCK bit is raised.
Reward: R(s) = +1 if STUCK is 0; R(s) = -3 if STUCK is 1.

Q-Learning (as run on the robot)
Initialize Q(s,a) to 0 for all state-action pairs.
Repeat:
- Observe the current state s.
- 90% of the time, choose the action a that maximizes Q(s,a); otherwise choose a random action a.
- Update Q(s,a).
(A sketch of this loop follows below.)
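
A minimal sketch of that loop, reusing the Q, ACTIONS, q_update, and epsilon_greedy sketches from earlier; the Environment interface is the same hypothetical one as before.

# Sketch of the robot's Q-learning loop: 90% greedy, 10% random,
# built from the earlier q_update and epsilon_greedy sketches.
def q_learning(env, num_steps=10000):
    state, _ = env.reset()
    for _ in range(num_steps):
        a = epsilon_greedy(state, Q, ACTIONS, epsilon=0.1)   # 90% greedy
        reward, next_state, _ = env.step(a)
        q_update(state, a, reward, next_state)
        state = next_state
    # Greedy policy read off the learned Q-table.
    return lambda s: max(ACTIONS, key=lambda a2: Q[(s, a2)])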

Q-Learning: Improvement
Same loop as above, with one improvement: when updating Q(s,a), also update all the states that are similar to s. In this case, similarity between two states is measured by the Hamming distance between their bit strings. (A sketch follows after the performance plot below.)

Performance
[Plot: performance of the hand-coded controller, Q-learning with two different versions of the similarity measure, and a random agent.]
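
A sketch of that similarity-based generalization, assuming the integer bit-packed states from the encoding sketch above; the Hamming-distance threshold of 2 bits is an illustrative choice, not a value from the paper.

# Generalized Q-update: also update states within a small Hamming
# distance of s, applying the same observed transition to each of them.
def hamming(s1, s2):
    return bin(s1 ^ s2).count("1")   # number of differing bits

def q_update_similar(s, a, reward, s_next, all_states, max_dist=2):
    for s_sim in all_states:
        if hamming(s, s_sim) <= max_dist:
            q_update(s_sim, a, reward, s_next)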

Generalization
In real problems there are too many states (or state-action pairs) to store in a table. Example: backgammon, with about 10^20 states!
We need to:
- store U for a subset of states {s_1, ..., s_K}
- generalize to compute U(s) for any other state s
[Figure: two plots of the value U(s) against the states s. We have sample values of U for some states s_1, s_2, ...; we interpolate a function f(.) such that, for any query state s_n, f(s_n) approximates U(s_n).]
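
A minimal sketch of one such interpolating f(.): TD learning with a linear value function over hand-chosen features. The feature map phi, the weight vector w, and the linear form itself are illustrative assumptions; the slides leave the choice of approximator open.

# TD(0) with a linear value-function approximator f(s) = w . phi(s),
# used instead of a table. phi_s is the feature vector of state s.
def f(w, phi_s):
    return sum(wi * xi for wi, xi in zip(w, phi_s))

def td_update_linear(w, phi_s, reward, phi_s_next, alpha=0.01, gamma=0.9):
    # TD error: new guess minus current estimate.
    delta = reward + gamma * f(w, phi_s_next) - f(w, phi_s)
    # For a linear f, the gradient with respect to w is just phi(s).
    return [wi + alpha * delta * xi for wi, xi in zip(w, phi_s)]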

Generalization
Possible function approximators, and other solutions for representing U over large state spaces: neural networks, memory-based methods, decision trees, clustering, hierarchical representations, and many others.
[Figure: a function approximator mapping state s to value U(s).]

Example: Backgammon
States: the number of red and white checkers at each location; on the order of 10^20 states! The branching factor prevents direct search.
Actions: the set of legal moves from any state.
Example from G. Tesauro, "Temporal Difference Learning and TD-Gammon," Communications of the ACM, 1995.

Example: Backgammon
Represent the mapping from states to expected outcomes by a multilayer neural net. Run a large number of training games; for each state s in a training game, update using temporal differencing. At every step of the game, choose the best move according to the current estimate of U. Initially the moves are random; after learning, the program converges to a good selection of moves.

Performance
It can learn starting with no knowledge at all! Example: 200,000 training games with 40 hidden units. Enhancements use a better encoding and additional hand-designed features. Example: 1,500,000 training games with 80 hidden units, losing about 1 point per 40 games against a world-class opponent.

Example: Control and Robotics
Devil-stick juggling (Schaal and Atkeson): nonlinear control at 200 ms per decision. The program learns to keep juggling after about 40 trials; a human requires roughly 10 times more practice.
Helicopter control (Andrew Ng): control of a helicopter for specific flight patterns, learning policies from a simulator. It learns policies for control patterns that are difficult even for human experts (e.g., inverted flight). http://heli.stanford.edu/

Summary
- Certainty-equivalent learning for estimating future rewards
- Exploration strategies
- One-backup update, prioritized sweeping
- Model-free temporal differencing (TD) for estimating future rewards
- Q-learning for model-free estimation of future rewards and the optimal policy
- Exploration strategies and selection of actions

(Some) References
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press.
L. Kaelbling, M. Littman, and A. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, Volume 4, 1996.
G. Tesauro. TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play. Neural Computation 6(2), 1995.
http://ai.stanford.edu/~ang/
http://www-all.cs.umass.edu/rlr/