Reinforcement Learning


Markov Decision Processes and Reinforcement Learning

Machine Learning 10-701, April 26, 2010
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University

Readings:
Mitchell, chapter 13
Kaelbling et al., "Reinforcement Learning: A Survey," JAIR, 1996
For much more: Sutton & Barto, Reinforcement Learning: An Introduction

Reinforcement Learning [Sutton and Barto 1981; Samuel 1957; ...]

Reinforcement Learning: Backgammon [Tesauro, 1995]
Learning task: choose a move at arbitrary board states
Training signal: final win or loss
Training: played 300,000 games against itself
Algorithm: reinforcement learning + neural network
Result: world-class backgammon player

Outline
Learning control strategies
Credit assignment and delayed reward
Discounted rewards
Markov Decision Processes
Solving a known MDP
Online learning of control strategies
  When the next-state function is known: value function V*(s)
  When the next-state function is unknown: learning Q*(s,a)
Role in modeling reward learning in animals

Markov Decision Process = Reinforcement Learning Setting
Set of states S
Set of actions A
At each time step the agent observes state st ∈ S, then chooses action at ∈ A
It then receives reward rt, and the state changes to st+1
Markov assumption: P(st+1 | st, at, st-1, at-1, ...) = P(st+1 | st, at)
Also assume the reward is Markov: P(rt | st, at, st-1, at-1, ...) = P(rt | st, at)
The task: learn a policy π: S → A for choosing actions that maximizes the expected discounted reward
E[rt + γ rt+1 + γ² rt+2 + ...]   (0 ≤ γ < 1 is the discount factor)
for every possible starting state s0
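To make the setting concrete, here is a minimal Python sketch of such an MDP, using a hypothetical three-state deterministic chain (the states, actions, δ, and reward values are illustrative, not taken from the lecture):

```python
# A hypothetical 3-state deterministic chain MDP: states 0..2, two actions,
# and a reward of +10 for stepping into the terminal state 2.
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]

def delta(s, a):
    """Deterministic next-state function: s_{t+1} = delta(s_t, a_t)."""
    return min(s + 1, 2) if a == "right" else max(s - 1, 0)

def reward(s, a):
    """Immediate reward r(s, a): +10 for entering state 2, else 0."""
    return 10 if s != 2 and delta(s, a) == 2 else 0

# A (deterministic) policy is simply a mapping from states to actions.
always_right = {s: "right" for s in STATES}
```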

HMM, Markov Process, Markov Decision Process
(Comparison figure not transcribed.)

Reinforcement Learning Task for Autonomous Agent
Execute actions in the environment, observe the results, and learn a control policy π: S → A that maximizes the discounted reward from every state s ∈ S
Note: the function to be learned is π: S → A, but the training examples are not of the form <s, a>; they are instead of the form <<s, a>, r>
Example: robot grid world with deterministic reward r(s,a)

Value Function for each Policy
Given a policy π: S → A, define
Vπ(s) ≡ E[rt + γ rt+1 + γ² rt+2 + ...]
assuming the action sequence is chosen according to π, starting at state s
Then we want the optimal policy π* where
π* = argmaxπ Vπ(s), for every state s
For any MDP, such a policy exists!
We'll abbreviate Vπ*(s) as V*(s)
Note: if we have V*(s) and P(st+1 | st, a), we can compute π*(s)
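As a computational sketch of what Vπ means, the function below iterates the deterministic recurrence V(s) ← r(s, π(s)) + γ V(δ(s, π(s))) until it settles; delta and reward stand for the (hypothetical) next-state and reward functions of a small deterministic world, as in the earlier sketch:

```python
def evaluate_policy(states, pi, delta, reward, gamma=0.9, sweeps=200):
    """Approximate V^pi for a fixed policy pi in a deterministic MDP."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        # One synchronous backup of the policy-evaluation recurrence.
        V = {s: reward(s, pi[s]) + gamma * V[delta(s, pi[s])] for s in states}
    return V
```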

Value Function examples (grid world): what are the Vπ(s) values? What are the V*(s) values? (Grid-world figures not transcribed.)

Immediate rewards r(s,a), state values V*(s)
Recursive definition for V*(s), assuming actions are chosen according to the optimal policy π*:
V*(s) = maxa [ r(s,a) + γ Σs' P(s' | s,a) V*(s') ]

Value Iteration for learning V*: assumes P(st+1 | st, a) is known
Initialize V(s) arbitrarily
Loop until policy good enough
  Loop for s in S
    Loop for a in A
      Q(s,a) ← r(s,a) + γ Σs' P(s' | s,a) V(s')
    End loop
    V(s) ← maxa Q(s,a)
  End loop
End loop
V(s) converges to V*(s) (dynamic programming)

Value Iteration
Interestingly, value iteration works even if we randomly traverse the environment instead of looping through each state and action methodically, but we must still visit each state infinitely often on an infinite run
For details: [Bertsekas 1989]
Implication: online learning as the agent randomly roams
If the maximum (over states) difference between two successive value-function estimates is less than ε, then the value of the greedy policy differs from the optimal policy by no more than 2εγ / (1 - γ)
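The following is a minimal value-iteration sketch for the deterministic, known-model case, repeatedly applying V(s) ← maxa [ r(s,a) + γ V(δ(s,a)) ] until successive estimates differ by less than ε (delta and reward are hypothetical model functions, as in the earlier sketch):

```python
def value_iteration(states, actions, delta, reward, gamma=0.9, eps=1e-6):
    """Approximate V* for a deterministic MDP with a known model."""
    V = {s: 0.0 for s in states}
    while True:
        max_diff = 0.0
        for s in states:
            best = max(reward(s, a) + gamma * V[delta(s, a)] for a in actions)
            max_diff = max(max_diff, abs(best - V[s]))
            V[s] = best  # in-place (asynchronous) update; the slide notes this still converges
        if max_diff < eps:  # stop once successive estimates differ by less than eps
            return V
```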

So far: learning the optimal policy when we know P(st+1 | st, at)
What if we don't?

Q learning
Define a new function, closely related to V*:
Q(s,a) ≡ r(s,a) + γ Σs' P(s' | s,a) V*(s')
If the agent knows Q(s,a), it can choose the optimal action without knowing P(st+1 | st, a):
π*(s) = argmaxa Q(s,a)
And it can learn Q without knowing P(st+1 | st, a)

Consider first the deterministic case: P(s' | s,a) is deterministic, with the next state denoted δ(s,a)
Immediate rewards r(s,a), state values V*(s), state-action values Q*(s,a)
Bellman equation:
Q*(s,a) = r(s,a) + γ maxa' Q*(δ(s,a), a')
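Below is a sketch of tabular Q-learning for this deterministic case: the agent never consults δ or r directly, only the transitions it observes, and applies the update Q(s,a) ← r + γ maxa' Q(s',a'). The environment interface env_step(s, a) -> (r, s_next) is a hypothetical stand-in for acting in the world:

```python
import random

def q_learning(states, actions, env_step, gamma=0.9, episodes=1000, horizon=50):
    """Learn state-action values for a deterministic world from observed transitions only."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(horizon):
            a = random.choice(actions)              # purely random exploration
            r, s_next = env_step(s, a)              # observe reward and next state
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q
```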

Use the general fact that for any two functions f1 and f2:
| maxa f1(a) - maxa f2(a) | ≤ maxa | f1(a) - f2(a) |


MDPs and Reinforcement Learning: Further Issues
What strategy for choosing actions will optimize
  the learning rate? (explore uninvestigated states)
  the obtained reward? (exploit what you know so far)
Can we bound sample complexity? (R-Max learns within δ, ε bounds in a polynomial number of actions)
Partially observable Markov decision processes: the state is not fully observable, so the agent must maintain a probability distribution over the possible states it may be in
Are there convergence guarantees with function approximators?
Correspondence to human learning?
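As one illustration of the explore/exploit trade-off raised above, an ε-greedy action-selection rule (a common heuristic, not something the lecture prescribes) picks a random action with probability ε and otherwise the action with the highest current Q estimate:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Choose an action from state s by epsilon-greedy selection over the table Q."""
    if random.random() < epsilon:
        return random.choice(actions)               # explore an uninvestigated option
    return max(actions, key=lambda a: Q[(s, a)])    # exploit what is known so far
```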

Dopamine as Reward Signal [Schultz et al., Science, 1997]
(Two figure panels of recordings over time t; not transcribed.)

Dopamine as Reward Signal [Schultz et al., Science, 1997]
(Figure panel over time t; not transcribed.)

RL Models for Human Learning [Seymour et al., Nature 2004]

[Seymour et al., Nature 2004] (figure not transcribed)

One Theory of RL in the Brain (from [Nieuwenhuis et al.])
The basal ganglia monitor events and predict future rewards
When a prediction is revised upward (downward), this causes an increase (decrease) in the activity of midbrain dopaminergic neurons, influencing the ACC
This dopamine-based activation somehow results in revising the reward prediction function, possibly through direct influence on the basal ganglia and via the prefrontal cortex

Summary: Temporal Difference ML Model Predicts Dopaminergic Neuron Activity during Learning
Evidence now of neural reward signals from:
  direct neural recordings in monkeys
  fMRI in humans (1 mm spatial resolution)
  EEG in humans (1-10 msec temporal resolution)
Dopaminergic responses encode the Bellman error
Some differences remain, and there are efforts to refine the model:
  How/where is the value function encoded in the brain?
  Study timing (e.g., does the basal ganglia learn faster than PFC?)
  Role of prior knowledge, rehearsal of experience, multi-task learning?