A Brief Introduction to Reinforcement Learning. Jingwei Zhang (zhang@informatik.uni-freiburg.de)


Outline:
- Characteristics of Reinforcement Learning (RL)
- Components of RL (MDP, value, policy, Bellman)
- Planning (policy iteration, value iteration)
- Model-free Prediction (MC, TD)
- Model-free Control (Q-Learning)
- Deep Reinforcement Learning (DQN)


Characteristics of RL: Supervised Learning vs Reinforcement Learning
Supervised Learning:
- i.i.d. data
- direct and strong supervision (a label says what the right thing to do is)
- instantaneous feedback
Reinforcement Learning:
- sequential, non-i.i.d. data
- no supervisor, only a reward signal (it only tells you whether what you did was good or bad)
- delayed feedback



Components of RL: MDP
A general framework for sequential decision making. An MDP is a tuple $\langle S, A, P, R, \gamma \rangle$:
- $S$: states
- $A$: actions
- $P$: transition probabilities, $P^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $R$: reward function, $R^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma$: discount factor, $\gamma \in [0, 1]$
Markov property: the future is independent of the past given the present.
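
To make the tuple concrete, here is a minimal sketch (my own toy example, not from the slides) of how such a finite MDP could be written down in Python; the names P, R, GAMMA and the two-state dynamics are made up purely for illustration.

```python
# A made-up two-state MDP with actions "stay" and "go".
# P[s][a] is a list of (next_state, probability) pairs; R[s][a] is the expected reward.
P = {
    "s0": {"stay": [("s0", 1.0)], "go": [("s1", 0.9), ("s0", 0.1)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]},
}
R = {
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 2.0, "go": 0.0},
}
GAMMA = 0.9  # discount factor, gamma in [0, 1]

# Sanity check: transition probabilities for each (s, a) sum to one.
assert all(abs(sum(p for _, p in P[s][a]) - 1.0) < 1e-9 for s in P for a in P[s])
```

The Markov property is implicit in this representation: the distribution over next states depends only on the current state and action.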


Components of RL: Policy, Return & Value
- Policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
- Return: $G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
- State-value function: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- Action-value function: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
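
As a quick check on the return definition, this small sketch (my own example, not from the slides) computes $G_t$ for a hand-picked finite reward sequence and discount factor:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... for a finite sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards observed from time t+1 onwards; gamma = 0.9 is an arbitrary choice.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
```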


Components of RL: Bellman Expectation Equation
$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
$v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \left( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} v_\pi(s') \right)$

Components of RL: Bellman Optimality Equation
$v_*(s) = \max_{a} \left( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} v_*(s') \right)$

Components of RL: Prediction vs Control
- Prediction: given a policy, evaluate how much reward can be obtained by following that policy
- Control: find an optimal policy that maximizes the cumulative future reward


Components of RL: Planning vs Learning
Planning:
- the underlying MDP is known
- the agent only needs to perform computations on the given model
- dynamic programming (policy iteration, value iteration)
Learning:
- the underlying MDP is initially unknown
- the agent needs to interact with the environment
- model-free (learn a value function / policy) or model-based (learn a model, then plan on it)



Planning: Dynamic Programming
Applicable when optimal solutions can be decomposed into subproblems.
For prediction (iterative policy evaluation):
- Input: MDP $\langle S, A, P, R, \gamma \rangle$ and policy $\pi$
- Output: value function $v_\pi$
For control (policy iteration, value iteration):
- Input: MDP $\langle S, A, P, R, \gamma \rangle$
- Output: optimal value function $v_*$ and optimal policy $\pi_*$

Planning: Iterative Policy Evaluation
Iterative application of the Bellman expectation backup: $v_1 \to v_2 \to \ldots \to v_\pi$
$v_{k+1}(s) = \sum_{a \in A} \pi(a \mid s) \left( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} v_k(s') \right)$
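
A minimal sketch of this backup in Python, using the same P/R dictionary format as the toy MDP above; the uniform random policy, the fixed number of sweeps, and all names are illustrative choices of mine, not part of the slides:

```python
def policy_evaluation(P, R, gamma, policy, n_sweeps=100):
    """Iteratively apply the Bellman expectation backup to evaluate `policy`.

    P[s][a] -> list of (next_state, prob); R[s][a] -> expected reward;
    policy[s][a] -> probability of choosing action a in state s.
    """
    v = {s: 0.0 for s in P}
    for _ in range(n_sweeps):
        v = {s: sum(policy[s][a] * (R[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a]))
                    for a in P[s])
             for s in P}
    return v

# Toy two-state MDP and a uniform random policy.
P = {"s0": {"stay": [("s0", 1.0)], "go": [("s1", 1.0)]},
     "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]}}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
policy = {s: {"stay": 0.5, "go": 0.5} for s in P}
print(policy_evaluation(P, R, gamma=0.9, policy=policy))
```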

Planning: Policy Iteration
- Evaluate the given policy $\pi$ to get $v_\pi$
- Get an improved policy by acting greedily: $\pi' = \mathrm{greedy}(v_\pi)$
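
A compact sketch of the evaluate-then-improve loop on a made-up two-state MDP. The helper names, the approximate evaluation (a fixed number of sweeps), and the stopping test are my own choices, shown only to illustrate the idea:

```python
def evaluate(P, R, gamma, policy, sweeps=200):
    """Bellman expectation backups until v is approximately v_pi."""
    v = {s: 0.0 for s in P}
    for _ in range(sweeps):
        v = {s: sum(policy[s][a] * (R[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a]))
                    for a in P[s]) for s in P}
    return v

def greedy(P, R, gamma, v):
    """Deterministic greedy policy with respect to a value function."""
    pol = {}
    for s in P:
        best = max(P[s], key=lambda a: R[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a]))
        pol[s] = {a: float(a == best) for a in P[s]}
    return pol

def policy_iteration(P, R, gamma):
    policy = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}  # start uniform
    while True:
        v = evaluate(P, R, gamma, policy)
        improved = greedy(P, R, gamma, v)
        if improved == policy:        # greedy policy stopped changing
            return policy, v
        policy = improved

P = {"s0": {"stay": [("s0", 1.0)], "go": [("s1", 1.0)]},
     "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]}}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
print(policy_iteration(P, R, gamma=0.9))
```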

Planning: Value Iteration
Iterative application of the Bellman optimality backup: $v_1 \to v_2 \to \ldots \to v_*$
$v_{k+1}(s) = \max_{a \in A} \left( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} v_k(s') \right)$
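
The same toy MDP works for a value-iteration sketch; the convergence tolerance and all names are again my own illustrative choices:

```python
def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup until the values stop moving."""
    v = {s: 0.0 for s in P}
    while True:
        v_new = {s: max(R[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a])
                        for a in P[s]) for s in P}
        if max(abs(v_new[s] - v[s]) for s in P) < tol:
            return v_new
        v = v_new

# Same made-up two-state MDP as before.
P = {"s0": {"stay": [("s0", 1.0)], "go": [("s1", 1.0)]},
     "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]}}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
print(value_iteration(P, R, gamma=0.9))  # approximately {'s0': 19.0, 's1': 20.0}
```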

Planning: Synchronous DP Algorithms


Recap (Planning vs Learning): in planning, the underlying MDP is known and the agent only needs to perform computations on the given model (dynamic programming); in learning, the MDP is initially unknown and the agent must interact with the environment, either model-free (learn a value function / policy) or model-based (learn a model, then plan on it).

Model-free Prediction: MC vs TD
Monte Carlo (MC) learning:
- learns from complete trajectories, no bootstrapping
- estimates values from sample returns: the empirical mean return
Temporal Difference (TD) learning:
- learns from incomplete episodes by bootstrapping: the remainder of the trajectory is replaced by our current estimate
- updates a guess towards a guess


Model-free Prediction: MC
- Goal: learn $v_\pi$ from episodes of experience gathered under policy $\pi$
- Recall: the return is the total discounted reward, $G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
- Recall: the value function is the expected return, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- MC policy evaluation (every-visit MC): use the empirical mean return instead of the expected return
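
A minimal every-visit MC sketch, assuming episodes are given as lists of (state, reward) pairs; the episode data and all names are made up for illustration, not taken from the slides:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma):
    """Every-visit Monte Carlo: V(s) = empirical mean of the returns observed from s.

    Each episode is a list of (state, reward) pairs, where `reward` is the reward
    received after leaving `state`.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards so the return can be accumulated incrementally.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Two tiny hand-made episodes.
episodes = [[("s0", 1.0), ("s1", 0.0)], [("s0", 0.0), ("s1", 2.0)]]
print(mc_prediction(episodes, gamma=0.9))  # {'s1': 1.0, 's0': 1.4}
```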


Model-free Prediction: MC -> TD
- Goal: learn $v_\pi$ from episodes of experience gathered under policy $\pi$
- MC updates $V(S_t)$ towards the actual return $G_t$: $V(S_t) \leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right)$
- TD updates $V(S_t)$ towards the estimated return $R_{t+1} + \gamma V(S_{t+1})$: $V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)$
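
The TD(0) backup is a one-liner; this sketch applies it to a single made-up transition (all names and numbers are mine, not from the slides):

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) backup: move V(s) towards the bootstrapped target r + gamma * V(s')."""
    td_target = r + gamma * V[s_next]
    V[s] += alpha * (td_target - V[s])
    return V

# One transition of made-up experience: s0 --(reward 1)--> s1.
V = {"s0": 0.0, "s1": 10.0}
print(td0_update(V, "s0", 1.0, "s1", alpha=0.1, gamma=0.9))  # V(s0) becomes 0.1 * (1 + 9) = 1.0
```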

Model-free Prediction: MC vs TD (Driving Home example)

Model-free Prediction: MC Backup

Model-free Prediction: TD Backup

Model-free Prediction: DP Backup

Model-free Prediction: Unified View


Model-free Control: Why model-free?
- The MDP is unknown, but experience can be sampled
- The MDP is known, but it is too big to use directly, except by sampling from it

Recap (Planning: Policy Iteration): evaluate the given policy to get $v_\pi$, then get an improved policy by acting greedily: $\pi' = \mathrm{greedy}(v_\pi)$.

Model-free Control: Generalized Policy Iteration
- Evaluate the given policy and get $v_\pi$
- Get an improved policy by acting greedily: $\pi' = \mathrm{greedy}(v_\pi)$

Model-free Control: V -> Q
- Greedy policy improvement over $V(s)$ requires a model of the MDP: $\pi'(s) = \arg\max_{a \in A} \left( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} V(s') \right)$
- Greedy policy improvement over $Q(s, a)$ is model-free: $\pi'(s) = \arg\max_{a \in A} Q(s, a)$
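
The contrast is easy to see in code; a tiny sketch with made-up names and values (not from the slides):

```python
def greedy_from_v(s, actions, P, R, gamma, V):
    """Greedy improvement over V: needs the transition model P and rewards R."""
    return max(actions, key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))

def greedy_from_q(s, actions, Q):
    """Greedy improvement over Q: purely model-free, just look up the action values."""
    return max(actions, key=lambda a: Q[(s, a)])

# Made-up numbers purely to exercise the two functions.
P = {"s0": {"stay": [("s0", 1.0)], "go": [("s1", 1.0)]}}
R = {"s0": {"stay": 0.0, "go": 1.0}}
V = {"s0": 0.0, "s1": 20.0}
Q = {("s0", "stay"): 0.0, ("s0", "go"): 19.0}
print(greedy_from_v("s0", ["stay", "go"], P, R, 0.9, V), greedy_from_q("s0", ["stay", "go"], Q))
```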

Model-free Control: SARSA
$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$
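
A sketch of the full on-policy SARSA loop with an epsilon-greedy behaviour policy. The tiny TwoStateEnv, its reset()/step() interface (returning next_state, reward, done), and all hyperparameters are assumptions of mine for illustration, not part of the slides:

```python
import random
from collections import defaultdict

class TwoStateEnv:
    """A made-up episodic task: in s0, action "go" ends the episode with reward 1."""
    def reset(self):
        self.s = "s0"
        return self.s
    def step(self, a):
        if self.s == "s0" and a == "go":
            return "s1", 1.0, True        # (next_state, reward, done)
        return "s0", 0.0, False

def epsilon_greedy(Q, s, actions, eps):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, episodes=200, alpha=0.1, gamma=0.99, eps=0.1):
    """On-policy TD control: the target bootstraps from the action A' actually taken next."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, eps)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q

print(sarsa(TwoStateEnv(), actions=["stay", "go"]))
```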

Model-free Control: Q-Learning
$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right)$
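
For comparison, a sketch of Q-learning assuming the same hypothetical Gym-style reset()/step() environment interface as the SARSA sketch above; the only substantive change is the max over next actions in the target (off-policy):

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=200, alpha=0.1, gamma=0.99, eps=0.1):
    """Off-policy TD control: behave epsilon-greedily, but bootstrap the target
    from max_a' Q(S', a') instead of the action actually taken next."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda a_: Q[(s, a_)]))
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q

# Usage mirrors the SARSA sketch, e.g. q_learning(TwoStateEnv(), actions=["stay", "go"]).
```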

Model-free Control: SARSA vs Q-Learning

Model-free Control: DP vs TD



Deep Reinforcement Learning: Why?
So far we have represented the value function by a lookup table:
- every state s has an entry V(s)
- every state-action pair (s, a) has an entry Q(s, a)
Problems with large MDPs:
- there are too many states and/or actions to store in memory
- it is too slow to learn the value of each state individually


Deep Reinforcement Learning: How?
Use deep networks to represent:
- the value function (value-based methods)
- the policy (policy-based methods)
- the model (model-based methods)
Then optimize the value function / policy / model end-to-end.


Deep Reinforcement Learning: Q-learning -> DQN
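
The original slide shows a figure; as a rough sketch of the idea behind DQN, the PyTorch fragment below combines the three standard ingredients: a Q-network, an experience-replay buffer, and a separate target network used in the TD target. The network size, optimizer, MSE loss, and all hyperparameters here are arbitrary illustrative choices of mine, not the settings from the DQN paper.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """A small MLP mapping a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
    def forward(self, x):
        return self.net(x)

state_dim, n_actions, gamma = 4, 2, 0.99
q_net, target_net = QNet(state_dim, n_actions), QNet(state_dim, n_actions)
target_net.load_state_dict(q_net.state_dict())   # periodically re-synced copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                     # experience replay buffer of (s, a, r, s', done)

def train_step(batch_size=32):
    """One DQN update: regress Q(s, a) towards r + gamma * max_a' Q_target(s', a')."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = (torch.stack([torch.as_tensor(x[i], dtype=torch.float32)
                                      for x in batch]) for i in range(5))
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In use, transitions (s, a, r, s', done) are appended to `replay` while acting epsilon-greedily in the environment, `train_step` is called every few environment steps, and the target network is periodically re-synced via `load_state_dict`.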


Deep Reinforcement Learning: AI = RL + DL
Reinforcement Learning (RL):
- a general-purpose framework for decision making
- learn policies to maximize future reward
Deep Learning (DL):
- a general-purpose framework for representation learning
- given an objective, learn the representation required to achieve it
Deep RL: a single agent that can solve any human-level task
- RL defines the objective
- DL gives the mechanism
- RL + DL = general intelligence


Some Recommendations
- Reinforcement Learning lectures by David Silver on YouTube
- Reinforcement Learning: An Introduction, Richard Sutton and Andrew Barto, 2nd edition
- DQN Nature paper: Human-level Control Through Deep Reinforcement Learning
- Flappy Bird: tabular RL: https://github.com/sarvagyavaish/flappybirdrl ; deep RL: https://github.com/songrotek/drl-flappybird
- Many third-party implementations; just search for "deep reinforcement learning", "dqn", or "a3c" on GitHub
- My implementations in PyTorch: https://github.com/jingweiz/pytorch-rl