Reinforcement learning (Chapter 21)


Reinforcement learning
Regular MDP:
- Given: transition model P(s' | s, a), reward function R(s)
- Find: policy π(s)
Reinforcement learning:
- Transition model and reward function initially unknown
- Still need to find the right policy
- "Learn by doing"

Offline (MDPs) vs. Online (RL)
Offline solution vs. online learning. Source: Berkeley CS188

Reinforcement learning: basic scheme
In each time step:
- Take some action
- Observe the outcome of the action: successor state and reward
- Update some internal representation of the environment and policy
If you reach a terminal state, just start over (each pass through the environment is called a trial).
Why is this called reinforcement learning?

Applications of reinforcement learning: Backgammon
http://www.research.ibm.com/massive/tdl.html
http://en.wikipedia.org/wiki/td-gammon

Applications of reinforcement learning: AlphaGo
https://deepmind.com/research/alphago/

Applications of reinforcement learning: Learning a fast gait for Aibos
[Video: initial gait vs. learned gait]
Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion. Nate Kohl and Peter Stone. IEEE International Conference on Robotics and Automation, 2004.

Applications of reinforcement learning: Stanford autonomous helicopter
Pieter Abbeel et al.

Applications of reinforcement learning: Playing Atari with deep reinforcement learning
[Video] V. Mnih et al., Nature, February 2015

Applications of reinforcement learning: End-to-end training of deep visuomotor policies
[Video] Sergey Levine et al., Berkeley

Applications of reinforcement learning: Object detection
[Video] J. Caicedo and S. Lazebnik, Active Object Localization with Deep Reinforcement Learning, ICCV 2015

OpenAI Gym https://gym.openai.com/

Reinforcement learning strategies
Model-based: learn the model of the MDP (transition probabilities and rewards) and try to solve the MDP concurrently.
Model-free: learn how to act without explicitly learning the transition probabilities P(s' | s, a).
- Q-learning: learn an action-utility function Q(s, a) that tells us the value of doing action a in state s.

Model-based reinforcement learning
Learning the model:
- Keep track of how many times state s' follows state s when you take action a, and update the transition probability P(s' | s, a) according to the relative frequencies.
- Keep track of the rewards R(s).
Learning how to act:
- Estimate the utilities U(s) using Bellman's equations.
- Choose the action that maximizes expected future utility:
  π*(s) = argmax_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')
Is there any problem with this greedy approach?
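
A minimal sketch (in Python) of the counting step for learning the transition model from observed transitions. The dictionary and function names here are illustrative, not from the lecture:

from collections import defaultdict

counts = defaultdict(int)    # counts[(s, a, s_next)]: times s_next followed (s, a)
totals = defaultdict(int)    # totals[(s, a)]: times action a was taken in state s
rewards = {}                 # observed reward R(s) for each visited state

def record_transition(s, a, s_next, r_next):
    """Update counts after observing s --a--> s_next, where s_next has reward r_next."""
    counts[(s, a, s_next)] += 1
    totals[(s, a)] += 1
    rewards[s_next] = r_next

def estimated_P(s_next, s, a):
    """Relative-frequency estimate of P(s' | s, a)."""
    if totals[(s, a)] == 0:
        return 0.0
    return counts[(s, a, s_next)] / totals[(s, a)]

These estimates, together with the recorded rewards, can then be plugged into the Bellman equations to estimate U(s) and pick the greedy action.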

Exploration vs. exploitation Source: Berkeley CS188

Exploration vs. exploitation
Exploration: take a new action with unknown consequences.
- Pros: get a more accurate model of the environment; discover higher-reward states than the ones found so far.
- Cons: when you're exploring, you're not maximizing your utility; something bad might happen.
Exploitation: go with the best strategy found so far.
- Pros: maximize reward as reflected in the current utility estimates; avoid bad stuff.
- Cons: might also prevent you from discovering the true optimal strategy.

Exploration strategies
Idea: explore more in the beginning, become more and more greedy over time.
- ε-greedy: with probability 1 − ε, follow the greedy policy; with probability ε, take a random action. Possibly decrease ε over time.
- More complex exploration functions bias toward less-visited state-action pairs: e.g., keep track of how many times each state-action pair has been seen, and return an over-optimistic utility estimate if a given pair has not been seen enough times.
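
A minimal sketch of ε-greedy action selection with an ε that decays over time; the decay schedule and its constants are illustrative assumptions:

import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon take a random action; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def epsilon_at(t, eps0=1.0, decay=0.001, eps_min=0.05):
    """One possible schedule: explore a lot early, become more greedy over time."""
    return max(eps_min, eps0 / (1.0 + decay * t))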

Model-free reinforcement learning
Idea: learn how to act without explicitly learning the transition probabilities P(s' | s, a).
Q-learning: learn an action-utility function Q(s, a) that tells us the value of doing action a in state s.

Model-free reinforcement learning
Idea: learn how to act without explicitly learning the transition probabilities P(s' | s, a).
Q-learning: learn an action-utility function Q(s, a) that tells us the value of doing action a in state s.
Relationship between Q-values and utilities:
  U(s) = max_a Q(s, a)
With Q-values, you don't need the transition model to select the next action:
  π*(s) = argmax_a Q(s, a)
Compare with:
  π*(s) = argmax_a Σ_{s'} P(s' | s, a) U(s')

Model-free reinforcement learning
Q-learning: learn an action-utility function Q(s, a) that tells us the value of doing action a in state s.
  U(s) = max_a Q(s, a)
Bellman equation for Q-values:
  Q(s, a) = R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')
Compare to the Bellman equation for utilities:
  U(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')

Model-free reinforcement learning
Q-learning: learn an action-utility function Q(s, a) that tells us the value of doing action a in state s.
  U(s) = max_a Q(s, a)
Bellman equation for Q-values:
  Q(s, a) = R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')
Problem: we don't know (and don't want to learn) P(s' | s, a).
Solution: build up estimates of Q(s, a) over time by making small updates based on observed transitions.

TD learning
Motivation: the mean of a sequence x_1, x_2, ... can be computed incrementally:
  μ_k = (1/k) Σ_{i=1}^{k} x_i
      = (1/k) ( x_k + Σ_{i=1}^{k-1} x_i )
      = (1/k) ( x_k + (k − 1) μ_{k-1} )
      = μ_{k-1} + (1/k) ( x_k − μ_{k-1} )
By analogy, temporal difference (TD) updates to Q(s, a) have the form
  Q(s, a) ← Q(s, a) + α ( Q_target(s, a) − Q(s, a) )
Source: D. Silver
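
A quick numerical check of the incremental-mean identity above (the sample values are illustrative):

xs = [2.0, 5.0, 3.0, 8.0]
mu = 0.0
for k, x in enumerate(xs, start=1):
    mu = mu + (1.0 / k) * (x - mu)   # mu_k = mu_{k-1} + (1/k)(x_k - mu_{k-1})
print(mu, sum(xs) / len(xs))         # both print 4.5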

TD learning
TD update:
  Q(s, a) ← Q(s, a) + α ( Q_target(s, a) − Q(s, a) )
Suppose we have observed the transition (s, a, s'):
  Q_target(s, a) = R(s) + γ max_{a'} Q(s', a')
The target is the return if (s, a, s') was the only possible transition:
  Q(s, a) = R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')

TD learning
TD update:
  Q(s, a) ← Q(s, a) + α ( Q_target(s, a) − Q(s, a) )
Suppose we have observed the transition (s, a, s'):
  Q_target(s, a) = R(s) + γ max_{a'} Q(s', a')
Full update equation:
  Q(s, a) ← Q(s, a) + α ( R(s) + γ max_{a'} Q(s', a') − Q(s, a) )
"Updating a guess towards a guess"

TD algorithm outline
At each time step t:
- From current state s, select an action a given the exploration policy
- Get the successor state s'
- Perform the TD update:
  Q(s, a) ← Q(s, a) + α ( R(s) + γ max_{a'} Q(s', a') − Q(s, a) )
The learning rate α should start at 1 and decay as O(1/t), e.g., α(t) = c / (c − 1 + t).
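
A minimal sketch of the tabular TD Q-learning loop. The environment interface (reset/step/actions/is_terminal), the ε-greedy exploration policy, and the hyperparameter values are illustrative assumptions, not part of the lecture:

import random
from collections import defaultdict

def td_q_learning(env, gamma=0.9, c=60.0, epsilon=0.1, num_trials=1000):
    Q = defaultdict(float)            # Q[(s, a)], initialized to 0
    N = defaultdict(int)              # visit counts N[(s, a)]
    for _ in range(num_trials):
        s, r = env.reset()            # current state and its reward R(s)
        while not env.is_terminal(s):
            # Exploration policy (epsilon-greedy here for simplicity)
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda a_: Q[(s, a_)])
            s2, r2 = env.step(s, a)   # successor state and its reward
            N[(s, a)] += 1
            alpha = c / (c - 1 + N[(s, a)])   # decaying learning rate, O(1/t)
            if env.is_terminal(s2):
                future = 0.0          # simplification: no future value at terminal states
            else:
                future = max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (r + gamma * future - Q[(s, a)])
            s, r = s2, r2
    return Q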

Exploration policies
Standard ("greedy") selection of optimal action:
  a = argmax_{a'} Q(s, a')
ε-greedy: with probability ε, take a random action.
Policy recommended by the textbook:
  a = argmax_{a' ∈ A(s)} f( Q(s, a'), N(s, a') )
where N(s, a') is the number of times we've taken action a' in state s, and f is an exploration function:
  f(u, n) = R+ if n < N_e, u otherwise
(R+ is an optimistic reward estimate.)
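
A sketch of this exploration function; the optimistic reward R_PLUS and the threshold N_E are assumed constants chosen for illustration:

R_PLUS = 10.0   # optimistic estimate of the best possible reward (assumption)
N_E = 5         # try each state-action pair at least this many times (assumption)

def f(u, n):
    """Return an optimistic value while (s, a) has been tried fewer than N_E times."""
    return R_PLUS if n < N_E else u

def exploratory_action(Q, N, s, actions):
    """Pick a = argmax_{a'} f(Q(s, a'), N(s, a')) over the available actions."""
    return max(actions, key=lambda a: f(Q.get((s, a), 0.0), N.get((s, a), 0)))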

SARSA
In TD Q-learning, we're learning about the optimal policy while following the exploration policy:
  Q(s, a) ← Q(s, a) + α ( R(s) + γ max_{a'} Q(s', a') − Q(s, a) )

SARSA
In TD Q-learning, we're learning about the optimal policy while following the exploration policy.
Alternative (SARSA): also select the next action a' (in s') according to the exploration policy, and update towards it:
  Q(s, a) ← Q(s, a) + α ( R(s) + γ Q(s', a') − Q(s, a) )
SARSA vs. Q-learning example
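
A side-by-side sketch of the two updates for a single observed step. Here a2 denotes the action actually selected in s' by the exploration policy; Q is assumed to be a defaultdict(float) keyed by (state, action), and the names are illustrative:

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    # Off-policy: bootstrap on the best action available in s2,
    # regardless of what the exploration policy will actually do next.
    best_next = max(Q[(s2, a_)] for a_ in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    # On-policy: bootstrap on the action a2 actually chosen in s2.
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])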

TD Q-learning demos
Andrej Karpathy's demo
Older Java-based demo

Function approximation
So far, we've assumed a lookup table representation for the utility function U(s) or the action-utility function Q(s, a). But what if the state space is really large or continuous?
Alternative idea: approximate the utility function, e.g., as a weighted linear combination of features:
  U(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s)
RL algorithms can be modified to estimate these weights. More generally, functions can be nonlinear (e.g., neural networks).
Recall: features for designing evaluation functions in games.
Benefits:
- Can handle very large state spaces (games) and continuous state spaces (robot control)
- Can generalize to previously unseen states
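
A minimal sketch of the Q-learning weight update with a linear approximator Q(s, a) ≈ Σ_i w_i f_i(s, a). The feature function (assumed to return a NumPy vector) and the names are illustrative assumptions:

import numpy as np

def q_value(w, features, s, a):
    """Approximate Q(s, a) as a dot product of weights and features f(s, a)."""
    return float(np.dot(w, features(s, a)))

def linear_td_update(w, features, s, a, r, s2, actions, alpha, gamma):
    """TD update applied to the weight vector instead of a table entry."""
    target = r + gamma * max(q_value(w, features, s2, a2) for a2 in actions)
    delta = target - q_value(w, features, s, a)
    return w + alpha * delta * features(s, a)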

Other techniques
Policy search: instead of getting the Q-values right, you simply need to get their ordering right. Write down the policy as a function of some parameters and adjust the parameters to improve the expected reward.
Learning from imitation: instead of an explicit reward function, you have expert demonstrations of the task to learn from.