CSC 411: Lecture 19: Reinforcement Learning

CSC 411: Lecture 19: Reinforcement Learning
Class based on Raquel Urtasun & Rich Zemel's lectures
Sanja Fidler, University of Toronto
April 3, 2016

Today
Learn to play games: Reinforcement Learning [pic from: Peter Abbeel]

Playing Games: Atari
https://www.youtube.com/watch?v=v1eynij0rnk

Playing Games: Super Mario
https://www.youtube.com/watch?v=wfl4l_l4u9a

Making Pancakes!
https://www.youtube.com/watch?v=w_gxlksssie

Reinforcement Learning Resources
RL tutorial on the course website
Reinforcement Learning: An Introduction, Sutton & Barto (1998)

What is Reinforcement Learning? [pic from: Peter Abbeel]

Reinforcement Learning
Learning algorithms differ in the information available to the learner:
- Supervised: correct outputs are given
- Unsupervised: no feedback; must construct a measure of good output
- Reinforcement learning: a more realistic learning scenario with a continuous stream of input information and actions, where the effects of an action depend on the state of the world, and we obtain a reward that depends on the world state and actions (not the correct response, just some feedback)

Reinforcement Learning [pic from: Peter Abbeel]

Example: Tic Tac Toe, Notation

Formulating Reinforcement Learning
The world is described by a discrete, finite set of states and actions. At every time step $t$ we are in a state $s_t$, and we:
- take an action $a_t$ (possibly the null action)
- receive some reward $r_{t+1}$
- move into a new state $s_{t+1}$
An RL agent may include one or more of these components:
- Policy $\pi$: the agent's behaviour function
- Value function: how good is each state and/or action
- Model: the agent's representation of the environment
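
To make this notation concrete, here is a minimal Python sketch of the interaction loop. The `env` and `agent` objects and their methods (`reset`, `step`, `act`, `observe`) are hypothetical placeholders, not part of the lecture.

```python
# Minimal sketch of the RL interaction loop; `env` and `agent` are
# hypothetical objects with the interfaces assumed in the comments.
def run_episode(env, agent, max_steps=1000):
    s = env.reset()                      # start in some initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        a = agent.act(s)                 # choose action a_t (policy)
        r, s_next, done = env.step(a)    # receive reward r_{t+1}, move to s_{t+1}
        agent.observe(s, a, r, s_next)   # let the learner update its estimates
        total_reward += r
        s = s_next
        if done:                         # reached an absorbing state
            break
    return total_reward
```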

Policy
A policy is the agent's behaviour: it selects which action to take based on the current state.
Deterministic policy: $a = \pi(s)$
Stochastic policy: $\pi(a \mid s) = P[a_t = a \mid s_t = s]$ [Slide credit: D. Silver]
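
As a small illustration, the sketch below represents both kinds of policy as tables over a made-up state and action set (the names and probabilities are purely illustrative).

```python
import random

# Deterministic policy: a = pi(s), a plain lookup from state to action.
pi_det = {"s0": "left", "s1": "right"}

# Stochastic policy: pi(a|s) = P[a_t = a | s_t = s], one distribution per state.
pi_stoch = {"s0": {"left": 0.8, "right": 0.2},
            "s1": {"left": 0.1, "right": 0.9}}

def sample_action(policy, s):
    """Draw an action a ~ pi(.|s) from a stochastic policy table."""
    actions, probs = zip(*policy[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(pi_det["s0"])                    # always 'left'
print(sample_action(pi_stoch, "s0"))   # 'left' about 80% of the time
```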

Value Function
The value function is a prediction of future reward, used to evaluate the goodness/badness of states. Our aim will be to maximize the value function (the total reward we receive over time): find the policy with the highest expected reward.
By following a policy $\pi$, the value function is defined as
$V^\pi(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$
$\gamma$ is called the discount rate, and it always satisfies $0 \le \gamma \le 1$. If $\gamma$ is close to 1, rewards further in the future count more, and we say that the agent is farsighted. $\gamma$ is kept less than 1 because there is usually a time limit to the sequence of actions needed to solve a task (we prefer rewards sooner rather than later). [Slide credit: D. Silver]
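
As a quick numeric check of this definition, the sketch below evaluates the discounted return of a short, made-up reward sequence.

```python
def discounted_return(rewards, gamma=0.9):
    """V^pi(s_t) = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite rollout."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Made-up rewards [1, 0, 0, 10] with gamma = 0.9:
# 1 + 0.9*0 + 0.81*0 + 0.729*10 = 8.29
print(discounted_return([1, 0, 0, 10], gamma=0.9))
```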

Model
The model describes the environment by a distribution over rewards and state transitions:
$P(s_{t+1} = s', r_{t+1} = r \mid s_t = s, a_t = a)$
We assume the Markov property: the future depends on the past only through the current state.

Maze Example
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent's location [Slide credit: D. Silver]

Maze Example
Arrows represent the policy $\pi(s)$ for each state $s$ [Slide credit: D. Silver]

Maze Example
Numbers represent the value $V^\pi(s)$ of each state $s$ [Slide credit: D. Silver]

Example: Tic-Tac-Toe
Consider the game tic-tac-toe:
- reward: win/lose/tie the game (+1 / -1 / 0), received only at the final move of a given game
- state: positions of X's and O's on the board
- policy: mapping from states to actions, based on the rules of the game: a choice of one open position
- value function: prediction of future reward based on the current state
In tic-tac-toe, since the state space is tractable, we can use a table to represent the value function.

RL & Tic-Tac-Toe
Each board position (taking into account symmetry) has some probability of winning.
A simple learning process:
- start with all values = 0.5
- policy: choose the move with the highest probability of winning, given the current legal moves from the current state
- update entries in the table based on the outcome of each game
After many games, the value function will represent the true probability of winning from each state.
We can also try an alternative policy: sometimes select moves randomly (exploration).
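
One possible way to code this up is a table of win-probability estimates that is nudged toward each game's outcome; this is only a sketch under assumed details (the step size and the state encoding are not specified in the lecture).

```python
from collections import defaultdict

# Value table: estimated probability of winning from each board state.
# States might be strings encoding the board; all entries start at 0.5.
V = defaultdict(lambda: 0.5)

def greedy_move(candidate_states):
    """Policy: among states reachable by a legal move, pick the one with
    the highest current estimated probability of winning."""
    return max(candidate_states, key=lambda s: V[s])

def update_after_game(visited_states, outcome, step=0.1):
    """Nudge every visited state's value toward the game outcome
    (1.0 win, 0.0 loss, 0.5 tie). The step size is illustrative."""
    for s in visited_states:
        V[s] += step * (outcome - V[s])
```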

Basic Problems
Markov Decision Problem (MDP): a tuple $(S, A, P, \gamma)$, where $P$ is
$P(s_{t+1} = s', r_{t+1} = r \mid s_t = s, a_t = a)$
Standard MDP problems:
1. Planning: given the complete Markov decision problem as input, compute the policy with optimal expected return.
2. Learning: we don't know which states are good or what the actions do; we must try out actions and states to learn what to do. [Pic: P. Abbeel]

Example of Standard MDP Problem
1. Planning: given the complete Markov decision problem as input, compute the policy with optimal expected return.
2. Learning: we only have access to experience in the MDP; learn a near-optimal strategy.
We will focus on learning, but discuss planning along the way.

Exploration vs. Exploitation
If we knew how the world works (embodied in $P$), then the policy should be deterministic: just select the optimal action in each state.
Reinforcement learning is like trial-and-error learning: the agent should discover a good policy from its experiences of the environment, without losing too much reward along the way.
Since we do not have complete knowledge of the world, taking what appears to be the optimal action may prevent us from finding better states/actions.
Interesting trade-off: immediate reward (exploitation) vs. gaining knowledge that might enable higher future reward (exploration).

Examples
Restaurant Selection
- Exploitation: go to your favourite restaurant
- Exploration: try a new restaurant
Online Banner Advertisements
- Exploitation: show the most successful advert
- Exploration: show a different advert
Oil Drilling
- Exploitation: drill at the best known location
- Exploration: drill at a new location
Game Playing
- Exploitation: play the move you believe is best
- Exploration: play an experimental move
[Slide credit: D. Silver]

MDP Formulation
Goal: find the policy $\pi$ that maximizes the expected accumulated future reward $V^\pi(s_t)$ obtained by following $\pi$ from state $s_t$:
$V^\pi(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$
Game show example: assume a series of questions, increasingly difficult but with increasing payoff; the choice is to accept the accumulated earnings and quit, or to continue and risk losing everything.
Notice that $V^\pi(s_t) = r_t + \gamma V^\pi(s_{t+1})$.

What to Learn
We might try to learn the optimal value function (which we write as $V^*$):
$V^*(s) = \max_a \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$
Here $\delta(s, a)$ gives the next state if we perform action $a$ in current state $s$.
We could then do a lookahead search to choose the best action from any state $s$:
$\pi^*(s) = \arg\max_a \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$
But there is a problem: this works well if we know $\delta(\cdot)$ and $r(\cdot)$, but when we don't, we cannot choose actions this way.
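
If $\delta$ and $r$ were known, the lookahead choice could be written directly; the sketch below assumes hypothetical `delta` and `reward` callables and a value table `V`, which is exactly the knowledge plain RL does not have.

```python
def greedy_lookahead(s, actions, delta, reward, V, gamma=0.9):
    """pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ],
    assuming delta(s, a) -> next state and reward(s, a) -> immediate
    reward are both known."""
    return max(actions, key=lambda a: reward(s, a) + gamma * V[delta(s, a)])
```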

Q Learning
Define a new function, very similar to $V^*$:
$Q(s, a) = r(s, a) + \gamma V^*(\delta(s, a))$
If we learn $Q$, we can choose the optimal action even without knowing $\delta$:
$\pi^*(s) = \arg\max_a \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right] = \arg\max_a Q(s, a)$
$Q$ is then the evaluation function we will learn.


Training Rule to Learn Q
$Q$ and $V^*$ are closely related: $V^*(s) = \max_{a'} Q(s, a')$
So we can write $Q$ recursively:
$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$
Let $\hat{Q}$ denote the learner's current approximation to $Q$, and consider the training rule
$\hat{Q}(s, a) \leftarrow r(s, a) + \gamma \max_{a'} \hat{Q}(s', a')$
where $s'$ is the state resulting from applying action $a$ in state $s$.

Q Learning for Deterministic World
For each $s, a$, initialize the table entry $\hat{Q}(s, a) \leftarrow 0$. Start in some initial state $s$ and repeat forever:
- select an action $a$ and execute it
- receive immediate reward $r$
- observe the new state $s'$
- update the table entry $\hat{Q}(s, a)$ using the Q-learning rule: $\hat{Q}(s, a) \leftarrow r(s, a) + \gamma \max_{a'} \hat{Q}(s', a')$
- set $s \leftarrow s'$
If we reach an absorbing state, restart in the initial state and run through the loop again until we reach an absorbing state.
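
A minimal tabular version of this loop is sketched below. The environment interface (`reset()` returning a state, `step(a)` returning reward, next state, and an absorbing-state flag) is an assumption made for illustration, and action selection is left uniformly random for simplicity.

```python
import random
from collections import defaultdict

def q_learning_deterministic(env, actions, gamma=0.9, episodes=500):
    """Tabular Q-learning for a deterministic world:
    Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')."""
    Q = defaultdict(float)                        # all entries start at 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:                           # run until absorbing state
            a = random.choice(actions)            # exploration kept trivially simple
            r, s_next, done = env.step(a)
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q
```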

Updating Estimated Q
Assume the robot is in state $s_1$ and some of its current estimates of $Q$ are as shown; it executes a rightward move:
$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = r + 0.9 \max\{63, 81, 100\} = 90$ (the immediate reward $r$ is 0 here)
Important observation: at each time step (taking an action $a$ in state $s$), only one entry of $\hat{Q}$ will change (the entry $\hat{Q}(s, a)$).
Notice that if rewards are non-negative, then the $\hat{Q}$ values only increase from 0 and approach the true $Q$.

Q Learning: Summary
The training set consists of a series of intervals (episodes): sequences of (state, action, reward) triples, each ending at an absorbing state.
Each executed action $a$ results in a transition from state $s_i$ to $s_j$; the algorithm updates $\hat{Q}(s_i, a)$ using the learning rule.
Intuition for a simple grid world with reward only upon entering the goal state: the $Q$ estimates improve backwards from the goal state.
1. All $\hat{Q}(s, a)$ start at 0.
2. In the first episode, only the $\hat{Q}(s, a)$ for the transition leading to the goal state is updated.
3. In the next episode, if we go through this next-to-last transition, $\hat{Q}(s, a)$ is updated one more step back.
4. Eventually the information from transitions with non-zero reward propagates throughout the state-action space.

Q Learning: Exploration/Exploitation
We have not specified how actions are chosen during learning. We can choose actions to maximize $\hat{Q}(s, a)$, but is that a good idea?
We can instead employ a stochastic action selection (policy):
$P(a_i \mid s) = \dfrac{\exp(k \hat{Q}(s, a_i))}{\sum_j \exp(k \hat{Q}(s, a_j))}$
We can vary $k$ during learning: more exploration early on, shifting towards exploitation later.
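
The selection rule above is a softmax (Boltzmann) distribution over the $\hat{Q}$ values; a minimal sketch, assuming $\hat{Q}$ is stored as a dict keyed by (state, action):

```python
import math
import random

def softmax_action(Q, s, actions, k=1.0):
    """P(a_i | s) proportional to exp(k * Q_hat(s, a_i)).
    Larger k favours exploitation; smaller k favours exploration."""
    vals = [k * Q[(s, a)] for a in actions]
    m = max(vals)                                  # subtract max for numerical stability
    prefs = [math.exp(v - m) for v in vals]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]
```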

Non-deterministic Case
What if the reward and next state are non-deterministic? We redefine $V$ and $Q$ in terms of probabilistic estimates, i.e. their expected values:
$V^\pi(s) = E_\pi\!\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\right] = E_\pi\!\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\right]$
and
$Q(s, a) = E\!\left[r(s, a) + \gamma V^*(\delta(s, a))\right] = E\!\left[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a')\right]$

Non-deterministic Case: Learning Q
The training rule does not converge in this setting (it can keep changing $\hat{Q}$ even if it is initialized to the true $Q$ values). So we modify the training rule to change more slowly:
$\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n)\,\hat{Q}_{n-1}(s, a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \right]$
where $s'$ is the state we land in after $s$, and $a'$ indexes the actions that can be taken in state $s'$, with
$\alpha_n = \dfrac{1}{1 + \text{visits}_n(s, a)}$
where $\text{visits}_n(s, a)$ is the number of times action $a$ has been taken in state $s$.
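
A sketch of this modified update with the visit-count learning rate follows; as before, the environment interface and the random action selection are assumptions made for illustration, not part of the lecture.

```python
import random
from collections import defaultdict

def q_learning_stochastic(env, actions, gamma=0.9, episodes=1000):
    """Q-learning with a decaying learning rate for non-deterministic worlds:
    Q_hat(s, a) <- (1 - alpha_n) * Q_hat(s, a)
                   + alpha_n * (r + gamma * max_a' Q_hat(s', a')),
    with alpha_n = 1 / (1 + visits_n(s, a))."""
    Q = defaultdict(float)
    visits = defaultdict(int)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = random.choice(actions)             # simple exploratory policy
            r, s_next, done = env.step(a)
            visits[(s, a)] += 1
            alpha = 1.0 / (1.0 + visits[(s, a)])
            target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```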