Reinforcement Learning: An Introduction. Deep Learning Indaba September 2017 Vukosi Marivate and Benjamin Rosman


Contents
1. What is reinforcement learning?
2. Value-based methods
3. Model-based methods and policy search
4. Inverse reinforcement learning and applications

What is reinforcement learning? We've seen how to solve many cool problems with supervised and unsupervised learning, but a major component of intelligence is decision making.

What is reinforcement learning? Reinforcement learning is the branch of machine learning concerned with learning in sequential decision-making settings: behaviour learning.

From supervised to reinforcement: supervised learning involves a single decision point; reinforcement learning involves multiple decision points. How do I know if I'm doing the right thing? How do my decisions now impact the future? Actions affect the environment!

Interacting with an environment: the decision maker (agent) exists within an environment. The agent takes actions based on the environment state; the environment state updates, and the agent receives feedback as rewards.

A model for decision making: the Markov Decision Process (MDP), M = (S, A, T, R, γ)
- States S: encode world configurations
- Actions A: choices made by the agent
- Transition function T: how the world evolves under actions
- Rewards R: feedback signal to the agent
- Discount factor γ ∈ [0,1]: discounting for future rewards
Markov property: the future is independent of the past, given the present.

An example: a cleaning robot.
States: position on the grid, e.g. the start S is (1,1) and the goal is (4,3).
Actions: moves on the grid.
Reward: +1 for finding dirt, -1 for falling into a hole, -0.001 for every move.
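To make the MDP ingredients concrete, here is a minimal Python sketch of such a gridworld. The start and goal match the slide; the hole position, the `step()` interface and the exact layout are illustrative assumptions. The 0.8/0.1/0.1 action noise matches the probabilities shown on the later slides.

```python
import random

# Illustrative grid layout: start and goal match the slide; the hole position
# and everything else here is an assumption, not read off the slides.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
WIDTH, HEIGHT = 4, 3
START, GOAL, HOLE = (1, 1), (4, 3), (4, 2)
STEP_COST = -0.001

def step(state, action):
    """One environment step: 0.8 chance of the intended move,
    0.1 for each perpendicular slip."""
    dx, dy = ACTIONS[action]
    moves = [(dx, dy), (dy, dx), (-dy, -dx)]          # intended + two slips
    mx, my = random.choices(moves, weights=[0.8, 0.1, 0.1])[0]
    next_state = (min(max(state[0] + mx, 1), WIDTH),
                  min(max(state[1] + my, 1), HEIGHT))
    if next_state == GOAL:
        return next_state, 1.0, True                  # +1: found the dirt
    if next_state == HOLE:
        return next_state, -1.0, True                 # -1: fell into the hole
    return next_state, STEP_COST, False               # -0.001 per move
```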

What is the optimal policy? (The slide shows action transition probabilities of 0.8 for the intended move and 0.1 for each slip direction.) What happens to the optimal policy if we change the action transitions, e.g. to 0.45, 0.45 and 0.1?

Practically, why RL? Consider treating disease in an individual. Chronic disease (HIV, cancer, schizophrenia, etc.) is not a single decision event. We have information about the patient (demographics, family history), the body (test results, etc.) and the disease (genomics, progression, etc.). How do we find the best treatment strategy?

Evaluating behaviours: many different trajectories are possible through a space. Use the total discounted accumulated reward (the return) to evaluate them. (The slide illustrates trajectories with example returns of 42, -18 and 37.6.)
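Written out, the return being maximised is the discounted sum of rewards along a trajectory. This is the standard definition (written G_t here to avoid clashing with the reward function R in the MDP tuple; the indexing convention is ours):

```latex
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}
```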

Rewards: a scalar feedback signal that encodes (un)desirable features of behaviour, e.g. winning/losing, collisions, taking expensive actions, and so on. Rewards can be sparse and delayed, and only have relative value.

The Rats of Hanoi

Policies: a policy π (or behaviour, or strategy) is any mapping from states to actions, and can be deterministic or stochastic. The optimal policy π* accumulates maximal reward over a trajectory. This is what we want to learn!

Immediate vs delayed rewards: we cannot just rely on the instantaneous reward function. The tradeoff is to not act myopically (short term, e.g. 1 step vs 5 steps). We use the notion of value to codify the goodness of a state, considering a policy running into the future, represented as a value function.

Value functions. The value function V^π(s) is the accumulated reward: the expected return starting at state s and then executing policy π. How good is s under π?
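In symbols (the standard definition, consistent with the return G_t above):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right]
           = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s \right]
```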

Example value functions (gridworld with a reward of -1 for every move): the slides show the resulting values under the random policy and under the optimal policy.

So what? How do we use these ideas to do something useful?

Value functions: recursion. V^π(s) is the expected return starting at s and following π, which suggests a dependence on V^π(s') for the next state s'. Bellman equation (one standard form):
V^π(s) = Σ_a π(a|s) Σ_s' T(s,a,s') [ R(s,a,s') + γ V^π(s') ]
i.e. the value of s is the immediate reward plus, summed over all possible next states, the probability of reaching that state times the discounted value of s'.

Value functions: optimality. Similarly, for an optimal policy π* with optimal value function V*, the Bellman optimality equation takes the best possible action:
V*(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]

Value functions. The action-value function Q^π(s,a) is the expected return starting at state s, executing action a, and then following policy π. How good is a in s under π?
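Written out (the "transition probability" label on the slide refers to T in the one-step expansion):

```latex
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s,\, a_t = a \right]
             = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^{\pi}(s') \right]
```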

Optimal policies and value functions: π*(a|s) := 1 if a = argmax_a' Q*(s,a'), 0 otherwise. Move in the direction of greatest value. Finding Q* (or V*) is equivalent to finding π*. Every MDP has an optimal policy.

The goal of RL: given this formulation, how do we learn a policy?

Solving Bellman: given the Bellman equation, we could solve it as a large system of value-function equations, but it is non-linear (the max operator), so we solve it iteratively. What are we trying to do here? Learn how good each state of the world is, when looking perfectly into the future.

Dynamic programming. Value iteration: iteratively update V (synchronous version). At each iteration i, for all states s in S, update V(s) with the Bellman optimality backup. But this requires the full MDP (T, R, S, A)! In general, T and R are unknown.
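A minimal sketch of synchronous value iteration, assuming the MDP is given explicitly as Python dictionaries; the `T[s][a]` data structure (a list of (probability, next_state, reward) triples) is an illustrative assumption, not the representation used on the slides.

```python
def value_iteration(states, actions, T, gamma=0.9, tol=1e-6):
    """Synchronous value iteration over a fully known MDP.

    T[s][a] is assumed to be a list of (prob, next_state, reward) triples.
    """
    V = {s: 0.0 for s in states}
    while True:
        # Bellman optimality backup for every state, using the old V.
        V_new = {
            s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])
                   for a in actions)
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```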

Value-based methods

Algorithm setup for value-based methods: no transition model, no reward model; access to the environment for experimentation, or access to training data (s, a, r, s'). Goal: learn the value of states and state-action pairs, and derive a policy through the learned values.

Data generation: T and R are unknown! Instead, generate samples of training data (s, a, r, s') from the environment.

Learning from experience. We need: a method to choose actions, and some model to keep track of and learn the value function.

The bandit problem: consider a row of one-armed bandit machines in a casino, i.e. a set of arms (actions) that each generate rewards from different distributions. Exploration vs exploitation.

Action selection: the exploration-exploitation tradeoff! Maximising expected returns means balancing between exploiting gained knowledge (greedy: take the best known action) and exploring new actions/states (random: try something new).

Action selection strategies. ε-greedy (0 < ε ≤ 1): with probability 1-ε exploit (choose the best action for the state); with probability ε explore (randomly choose an action). ε is usually higher at the beginning of learning and decayed later. Softmax: sample an action from a softmax over the action values.
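As a concrete sketch, ε-greedy over a tabular Q could look like the following; the assumption here is that Q is a Python dict mapping (state, action) pairs to estimated values.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore (random action), otherwise exploit
    (greedy action under Q). Q maps (state, action) pairs to values."""
    if random.random() < epsilon:
        return random.choice(list(actions))                      # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))    # exploit
```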

Learning from experience (recap). We need: a method to choose actions, and some model to keep track of and learn the value function.

TD learning. Temporal Difference (TD) learning:
- Initialise V(s) for all s in S
- For each experience tuple (s, r, s') under policy π, update V:
  V(s) ← V(s) + α [ r + γ V(s') − V(s) ]
where r + γ V(s') is the estimated return (the TD target), the bracketed difference is the TD error, and α is a learning rate.

Eligibility traces: keep track of where the agent has been, for more efficient updates.

TD(0) learning:
- Initialise V(s) for all s
- For each trajectory/episode:
  - for all s: e(s) = 0
  - for each experience tuple (s, r, s') under policy π in the episode:
    - e(s) = e(s) + 1
    - for all s in S: update V(s) using the TD error weighted by e(s), then set e(s) = 0
We are back to normal TD learning.

TD rollouts (diagram on the slide).

TD(1) learning:
- Initialise V(s) for all s
- For each trajectory/episode:
  - for all s: e(s) = 0
  - for each experience tuple (s, r, s') under policy π in the episode:
    - e(s) = e(s) + 1 (mark the whole trajectory)
    - for all s in S: update V(s) using the TD error weighted by e(s), then e(s) = γ e(s) (decay the trace)

Tuning the decay: TD(0) uses no traces; in TD(1) the traces decay with γ; TD(λ) controls the decay rate with λ.

TD(λ) learning:
- Initialise V(s) for all s
- For each trajectory/episode:
  - for all s: e(s) = 0
  - for each experience tuple (s, r, s') under policy π in the episode:
    - e(s) = e(s) + 1
    - for all s in S: update V(s) using the TD error weighted by e(s), then e(s) = γλ e(s) (λ controls the speed of decay)
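The following is a minimal tabular sketch of TD(λ) with accumulating eligibility traces. The assumptions (not from the slides) are that episodes arrive as lists of (s, r, s') tuples gathered under some fixed policy and that a constant learning rate alpha is used; λ = 0 and λ = 1 recover the TD(0) and TD(1) variants above.

```python
from collections import defaultdict

def td_lambda(episodes, gamma=0.99, lam=0.9, alpha=0.1):
    """Tabular TD(lambda) with accumulating eligibility traces.

    `episodes` is assumed to be an iterable of episodes, each a list of
    (s, r, s_next) experience tuples.
    """
    V = defaultdict(float)
    for episode in episodes:
        e = defaultdict(float)                     # eligibility traces
        for s, r, s_next in episode:
            delta = r + gamma * V[s_next] - V[s]   # TD error
            e[s] += 1.0                            # mark the visited state
            for state in list(e):
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam            # decay the trace
    return dict(V)
```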

Intermission (15 minutes)

Onwards from TD. Recap: we can now learn by estimating V from experience. But we are not using the actions A, and we would rather learn Q for easier policy extraction, since extracting a policy from V requires a one-step lookahead model.

SARSA: learn from (s, a, r, s', a').
- Initialise Q(s, a) for all s, a
- For each episode:
  - Initialise s; choose a in s from Q (act, e.g. ε-greedy)
  - For each step t in the episode:
    - Take a, observe r, s'
    - Choose a' in s' from Q (look ahead)
    - Learn: Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]
    - s ← s', a ← a'

SARSA: where did we get the a' from? By taking the next action under Q. This is an on-policy algorithm. What about off-policy? Learn about the optimal policy while exploring, reuse experience from other policies, learn from observations.

Q-learning:
- Initialise Q(s, a) for all s, a
- For each episode:
  - Initialise s
  - For each step t in the episode:
    - Choose a in s from Q (act, e.g. ε-greedy)
    - Take a, observe r, s'
    - Learn, taking the best next action found so far: Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]
    - s ← s'
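A minimal tabular Q-learning sketch. The `env_step(s, a) -> (s_next, r, done)` interface is an assumption (e.g. the gridworld `step` sketched earlier), and it reuses the `epsilon_greedy` helper from the action-selection sketch; none of these names come from the slides.

```python
from collections import defaultdict

def q_learning(env_step, start_state, actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning (off-policy).

    SARSA differs only in the learning target: it bootstraps from the action
    actually chosen in s_next (on-policy) rather than from the max.
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)       # act
            s_next, r, done = env_step(s, a)                 # observe
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            # learn: move Q(s,a) towards the TD target r + γ max_a' Q(s',a')
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```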

Q-learning demo: Shreyas Skandan, https://www.youtube.com/watch?v=rtu7g0y4os4

Typical learning curves (shown on the slide).

Generalising... What about extending behaviour to different tasks? What about building a simulator, or asking questions about the domain? Solution: we need a model!

Model-based methods

From values to environment models. Model-based reinforcement learning: learn a model (T and R) from experience, which is a supervised learning problem. Models let you predict the next state and reward, and reason about uncertainty.

Algorithm setup for model-based RL: no transition model, no reward model; access to the environment for experimentation, or access to training data (s, a, r, s'). Goal: learn transition and reward models, and derive a policy through the learned models.

Model-based RL: learn a transition and reward model, updating the model estimates on receiving each experience tuple (s, a, r, s').

Dyna-Q algorithm:
- For each step t in the episode:
  - Choose a in s from Q; take a, observe r, s'
  - Update Q (Q-learning update)
  - Given (s, a, r, s'), update T and R (model update)
  - Repeat n times (sample the model to update Q):
    - Sample a previously observed s
    - Sample a previously taken a (in s)
    - Get r and s' from the model
    - Update Q
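A minimal Dyna-Q sketch under simplifying assumptions: the learned model just stores the last observed outcome for each (s, a), i.e. it treats the environment as (near-)deterministic, and it reuses the assumed `env_step` interface and `epsilon_greedy` helper from the earlier sketches.

```python
import random
from collections import defaultdict

def dyna_q(env_step, start_state, actions, episodes=100, n_planning=10,
           alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Dyna-Q with a simple last-observation model (an assumption)."""
    Q = defaultdict(float)
    model = {}                                        # (s, a) -> (r, s_next, done)
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)         # act
            s_next, r, done = env_step(s, a)                   # real experience
            # direct Q-learning update
            best = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
            model[(s, a)] = (r, s_next, done)                  # model update
            # planning: n simulated updates sampled from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
                pbest = 0.0 if pdone else max(Q[(ps_next, a2)] for a2 in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * pbest - Q[(ps, pa)])
            s = s_next
    return Q
```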

What else can I do with a model? Quantify uncertainty in value functions. Uncertainty comes from data sparsity, inherent stochasticity and latent structure. Approaches: Monte Carlo sampling, simulation.

A little bit of overkill? OK, so we've gone to all this trouble to learn T, R and Q. Can't we just learn the policy?

Policy search

Algorithm setup for direct policy learning: no transition model, no reward model; access to the environment for experimentation, or access to training data (s, a, r, s'). Goal: learn the policy directly.

Policy gradient: parametrise the policy as π_θ. Choices: a linear combination of basis functions, a set of state features, or a deep neural network. Goal: find the best θ. This is an optimisation problem!

Optimising the policy: define an objective function J(θ), e.g. the start-state value or the average reward per time step. Find the θ that maximises J(θ), e.g. by gradient ascent on the policy gradient ∇_θ J(θ).
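One standard (likelihood-ratio) form of the policy gradient, included here as background since the slide leaves the expression to the figure; α is a step size:

```latex
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}(s,a) \right],
\qquad \theta \leftarrow \theta + \alpha\, \nabla_{\theta} J(\theta)
```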

Why policy gradient?
+ High-dimensional action spaces
+ Continuous action spaces
+ Many recent successes in robotics
- Converges only locally
- Policy evaluation has high variance

Recap of RL approaches: policy search (learn π directly), value-function-based (learn Q), and model-based (learn T and R). (The slide shows a diagram relating states, actions and rewards for each approach.)

Inverse reinforcement learning

Inferring a reward function: designing reward functions is hard! It is often not clear what should be done or how it should be rewarded. Where do rewards come from? Learn the incentives that explain behaviour observed from an expert: we do not observe the reward, but want to learn it.

Inverse reinforcement learning: RL maps an environment and a reward to a policy/behaviour; IRL goes the other way, from an environment and an observed policy/behaviour to a reward.

Algorithm setup for inverse RL: a transition model (which can be learned), no reward model; observe training data (s, a, s'). Goal: learn a reward model that explains the behaviour observed through the training data.

IRL: from paths to rewards. Observe one or more trajectories of (s, a, s') tuples. We would like to know: what was the goal of the agent? What was the reward? For example, get to G and avoid the water?

Maximum likelihood IRL. (The slide shows a possible reward function for the grid example.) ML-IRL algorithm (intuition):
- Given sample trajectories D
- Initialise a reward function R
- Calculate a policy from R and T
- Calculate the likelihood P(D) under that policy
- Calculate the gradient and update R
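As background, roughly following the Babes et al. approach cited on the next slide: the demonstrations are typically modelled as coming from a Boltzmann (softmax) policy over Q under the current reward parameters θ, and the data log-likelihood is maximised by gradient ascent. The particular notation below (β is a temperature parameter) is an assumption, not taken from the slide.

```latex
\pi_{\theta}(a \mid s) = \frac{e^{\beta Q_{\theta}(s,a)}}{\sum_{a'} e^{\beta Q_{\theta}(s,a')}},
\qquad
\log P(D \mid \theta) = \sum_{(s,a) \in D} \log \pi_{\theta}(a \mid s)
```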

IRL: from paths to rewards. What about different teachers? That information is not in the data when we get it. Solution: ML-IRL with multiple intentions! (M. Babes et al., Apprenticeship learning about multiple intentions.)

IRL enables learning from demonstration: crowdsourcing, showing tasks to robots, learning from experts.

(Some) reinforcement learning applications

Application areas: randomised controlled trials; efficacy in sequential multiple assignment randomised trials. (An Introduction to Dynamic Treatment Regimes: Marie Davidian.)

Application areas: advertising :( Nuff said!

Application areas: strategies to improve donations or tax collection :) (Tax Collections Optimization for New York State: Gerard Miller et al.)

Application areas: mobile health interventions. (Experimental Design & Machine Learning Opportunities in Mobile Health: Susan Murphy.)

HIV treatment: a possible formulation.
Features: baseline viral load, CD4 count, baseline CD4 percentage, age, number of previous treatments.
States: viral load tracked monthly over 24 months; the patient's treatment-stage bins for the viral load, in copies/ml, were [0.0, 50, 100, 1K, 100K].
Actions: therapy/drug-cocktail groups occurring in the data set.
Reward: negated AUC.
(V. Marivate: Improved empirical methods in reinforcement-learning evaluation.)

Application areas: robotics, learning behaviours.

Application areas: games, which offer standardised testbeds and long decision horizons.

Application areas: automated trading. (1: ..., 2: ???, 3: ...)

Thank you + resources.
Recommended: the 2nd edition draft of Sutton and Barto's Reinforcement Learning: An Introduction, available online at http://incompleteideas.net/sutton/book/the-book-2nd.html
RL class: https://www.udacity.com/course/reinforcement-learning--ud600
Vukosi Marivate and Benjamin Rosman, vmarivate@csir.co.za, brosman@csir.co.za