Reinforcement Learning


Reinforcement Learning. Maria-Florina Balcan, Carnegie Mellon University, April 20, 2015. Today: learning of control policies; Markov Decision Processes; temporal difference learning; Q-learning. Readings: Mitchell, Chapter 13; Kaelbling et al., Reinforcement Learning: A Survey. Slides courtesy of Tom Mitchell.

Overview. Different from the ML problems so far: the decisions we make will be about actions to take, such as a robot deciding which way to move next, and they will influence what we see next. Our decisions influence the next example we see. The goal will be not just to predict (say, whether there is a door in front of us or not) but to decide what to do. Model: Markov Decision Processes.

Reinforcement Learning [Sutton and Barto 1981; Samuel 1957; ...]. The main impact of our actions will not come right away; it will only come later. V*(s) = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...]

Reinforcement Learning: Backgammon [Tesauro, 1995]. Learning task: choose a move at arbitrary board states. Training signal: final win or loss at the end of the game. Training: played 300,000 games against itself. Algorithm: reinforcement learning + a neural network. Result: a world-class backgammon player.

Outline: Learning control strategies. Credit assignment and delayed reward. Discounted rewards. Markov Decision Processes. Solving a known MDP. Online learning of control strategies: when the next-state function is known, the value function V*(s); when the next-state function is unknown, learning Q*(s,a). Role in modeling reward learning in animals.

The agent lives in some environment, in some state. Robot: where the robot is, what direction it is pointing, etc. Backgammon: the state of the board (where all the pieces are). Goal: maximize long-term discounted reward, i.e., we want a lot of reward, and we prefer getting it earlier to getting it later.

Markov Decision Process = Reinforcement Learning Setting. Set of states S; set of actions A. At each time step, the agent observes state s_t ∈ S, then chooses action a_t ∈ A; it then receives reward r_t, and the state changes to s_{t+1}. Markov assumption: P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(s_{t+1} | s_t, a_t). Also assume the reward is Markov: P(r_t | s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(r_t | s_t, a_t). E.g., if we tell a robot to move forward one meter, maybe it ends up moving forward 1.5 meters by mistake, so where the robot is at time t+1 can be a probabilistic function of where it was at time t and the action taken, but it shouldn't depend on how we got to that state. The task: learn a policy π: S → A for choosing actions that maximizes the expected discounted reward E[r_0 + γ r_1 + γ^2 r_2 + ...] for every possible starting state s_0.
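To make the setting concrete, here is a minimal sketch (not from the slides) of a tabular MDP and the agent-environment loop in Python; the class name, the dictionary layout, and run_episode are hypothetical choices made for this example.

```python
import random

class MDP:
    """Minimal tabular MDP: P[s][a] is a list of (next_state, prob) pairs,
    and R[s][a] is the reward for taking action a in state s."""
    def __init__(self, P, R, gamma=0.9):
        self.P, self.R, self.gamma = P, R, gamma

    def step(self, s, a):
        """Sample s_{t+1} from P(s' | s, a) and return (reward, next_state)."""
        next_states, probs = zip(*self.P[s][a])
        s_next = random.choices(next_states, weights=probs)[0]
        return self.R[s][a], s_next

def run_episode(mdp, policy, s0, horizon=20):
    """Roll out a policy pi: S -> A and accumulate the discounted reward sum."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        r, s = mdp.step(s, a)
        total += discount * r
        discount *= mdp.gamma
    return total
```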

Reinforcement Learning Task for an Autonomous Agent. Execute actions in the environment, observe the results, and learn a control policy π: S → A that maximizes the expected discounted reward from every state s ∈ S. Example: robot grid world with deterministic reward r(s,a). Actions: move up, down, left, and right (except that once you are in the top-right state you stay there, and any action that bumps you into a wall leaves you where you were). The reward function r(s,a) is deterministic, with reward 100 for entering the top-right state and 0 everywhere else.

Reinforcement Learning Task for an Autonomous Agent. Execute actions in the environment, observe the results, and learn a control policy π: S → A that maximizes the expected discounted reward from every state s ∈ S. Yikes!! The function to be learned is π: S → A, but the training examples are not of the form ⟨s, a⟩; they are instead of the form ⟨⟨s, a⟩, r⟩.

Value Function for each Policy. Given a policy π: S → A, define V^π(s): the expected discounted reward we will get starting from state s if we follow policy π (i.e., the action sequence is chosen according to π, starting at state s). Goal: find the optimal policy π*, the policy whose value function is the maximum out of all policies, simultaneously for all states. For any MDP, such a policy exists! We'll abbreviate V^{π*}(s) as V*(s). Note: if we have V*(s) and P(s_{t+1} | s_t, a), we can compute π*(s).
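The definition itself did not survive the transcription; a reconstruction consistent with the notation used elsewhere in these slides is:

```latex
V^{\pi}(s) \;=\; E\!\left[\, r_t + \gamma\, r_{t+1} + \gamma^{2} r_{t+2} + \cdots \;\middle|\; s_t = s,\ \text{actions chosen according to } \pi \,\right],
\qquad
V^{\pi^{*}}(s) \;=\; \max_{\pi} V^{\pi}(s) \ \text{ for all } s.
```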

Value Function: what are the V^π(s) values?

Value Function: what are the V*(s) values?

Immediate rewards r(s,a). State values V*(s).

Recursive definition for V*(s), assuming actions are chosen according to the optimal policy π*. The value V*(s_1) of performing the optimal policy from s_1 is the expected reward of the first action a_1 taken, plus γ times the expected value, over states s_2 reached by performing action a_1 from s_1, of the value V*(s_2) of performing the optimal policy from then on. In other words, the optimal value of any state s is the expected reward of performing π*(s) from s, plus γ times the expected value, over states s' reached by performing that action from state s, of the optimal value of s'.
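Written out as an equation (a reconstruction; the slide's formula was lost in transcription):

```latex
V^{*}(s) \;=\; E\big[\, r(s, \pi^{*}(s)) \,\big]
  \;+\; \gamma \sum_{s'} P\big(s' \mid s, \pi^{*}(s)\big)\, V^{*}(s').
```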

Value Iteration for learning V*: assumes P(s_{t+1} | s_t, a) is known. Initialize V(s) to 0 [the optimal value you can get in zero steps]. For t = 1, 2, ... [loop until the policy is good enough]: loop over each state s in S and each action a in A, computing Q(s,a) and then updating V(s). Inductively, if V is the optimal discounted reward you can get in t-1 steps, then Q(s,a) is the value of performing action a from state s and then being optimal for the next t-1 steps; the optimal expected discounted reward you can get by taking one action and then being optimal for t-1 steps equals the optimal expected discounted reward you can get in t steps. V(s) converges to V*(s). This is dynamic programming.
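The update performed inside the two loops was not legible in the transcription; the standard value-iteration update that matches the slide's description is:

```latex
Q(s,a) \;\leftarrow\; E\big[\, r(s,a) \,\big] \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, V(s'),
\qquad
V(s) \;\leftarrow\; \max_{a} Q(s,a).
```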

Value Iteration for learning V*: assumes P(s_{t+1} | s_t, a) is known. Initialize V(s) to 0 [the optimal value you can get in zero steps]. For t = 1, 2, ... [loop until the policy is good enough]: loop over each state s in S and each action a in A. In each round we are computing the value of performing the optimal t-step policy, starting from t=0, then t=1, t=2, etc.; since γ^t goes to 0, once t is large enough this will be close to the optimal value V* for the infinite-horizon case. V(s) converges to V*(s). This is dynamic programming.

Value Iteration for learning V*: assumes P(s_{t+1} | s_t, a) is known. Initialize V(s) to 0 [the optimal value you can get in zero steps]. For t = 1, 2, ... [loop until the policy is good enough]: loop over each state s in S and each action a in A. In the grid world: at round t=0 we have V(s)=0 for all s. After round t=1, the top row is 0, 100, 0 and the bottom row is 0, 0, 100. After the next round (t=2), the top row is 90, 100, 0 and the bottom row is 0, 90, 100. After the next round (t=3) we have a top row of 90, 100, 0 and a bottom row of 81, 90, 100, and it then stays there forever.
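Here is a short Python sketch (not part of the slides) that runs value iteration on the 2x3 grid world just described, with γ = 0.9, an absorbing top-right goal, and a deterministic reward of 100 for entering the goal; the state encoding and helper names are choices made for this example. Running it reproduces the final values above.

```python
GAMMA = 0.9
ROWS, COLS = 2, 3
GOAL = (0, 2)  # top-right corner, absorbing
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic transition: bumping into a wall leaves you where you were,
    and the goal state is absorbing. Reward is 100 for entering the goal, else 0."""
    if state == GOAL:
        return state, 0.0
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < ROWS and 0 <= nc < COLS):
        nr, nc = r, c  # bumped into a wall
    reward = 100.0 if (nr, nc) == GOAL else 0.0
    return (nr, nc), reward

# Value iteration: V(s) <- max_a [ r(s,a) + GAMMA * V(s') ]
V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
for t in range(25):
    V = {s: max(step(s, a)[1] + GAMMA * V[step(s, a)[0]] for a in ACTIONS)
         for s in V}

for r in range(ROWS):
    print([round(V[(r, c)], 1) for c in range(COLS)])
# prints [90.0, 100.0, 0.0] and [81.0, 90.0, 100.0]
```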

Value Iteration, continued. So far, in our dynamic program, each round we cycled through each state exactly once. Interestingly, value iteration works even if we randomly traverse the environment instead of looping through each state and action methodically, but we must still visit each state infinitely often on an infinite run. For details, see [Bertsekas 1989]. Implication: online learning as the agent randomly roams. If, in our dynamic program, the maximum (over states) difference between two successive value-function estimates is less than ε, then the value of the greedy policy differs from that of the optimal policy by no more than 2εγ/(1-γ).

So far: learning the optimal policy when we know P(s_{t+1} | s_t, a_t). What if we don't?

Q-learning. Define a new function, closely related to V*. Whereas V*(s) is the expected discounted reward of following the optimal policy from time 0 onward, Q(s,a) is the expected discounted reward of first doing action a and then following the optimal policy from the next step onward. If the agent knows Q(s,a), it can choose the optimal action without knowing P(s_{t+1} | s_t, a): just choose the action that maximizes the Q value. And it can learn Q without knowing P(s_{t+1} | s_t, a), using something very much like the dynamic programming algorithm we used to compute V*.

Immediate rewards r(s,a). State values V*(s). State-action values Q*(s,a). Bellman equation. Consider first the case where P(s' | s, a) is deterministic.
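The equations from this slide were lost in transcription; standard forms consistent with the definitions above are:

```latex
V^{*}(s) \;=\; \max_{a} Q^{*}(s,a),
\qquad
Q^{*}(s,a) \;=\; E\big[\, r(s,a) \,\big] \;+\; \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^{*}(s',a'),
```

which in the deterministic case reduces to Q*(s,a) = r(s,a) + γ max_{a'} Q*(s',a'), where s' is the state reached by taking action a in state s.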

[For simplicity, assume the transitions and rewards are deterministic.] The optimal value of a state s is the maximum, over actions a', of Q(s, a'). Given our current approximation Q̂ to Q: if we are in state s, perform action a, and get to state s', we update our estimate Q̂(s, a) to the reward r we got plus γ times the maximum over a' of Q̂(s', a').
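A minimal Q-learning sketch for this deterministic case (not from the slides; the function signature, the ε-greedy exploration rule, and the episode bounds are assumptions made for this example):

```python
import random

GAMMA = 0.9

def q_learning_deterministic(step, states, actions, episodes=2000, eps=0.2):
    """Tabular Q-learning for a deterministic environment.
    step(s, a) must return (next_state, reward); states and actions are finite."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = random.choice(list(states))       # start each episode in a random state
        for _ in range(50):                   # bounded episode length
            # epsilon-greedy: usually exploit the current Q estimate, sometimes explore
            if random.random() < eps:
                a = random.choice(list(actions))
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r = step(s, a)
            # deterministic update: Q(s,a) <- r + gamma * max_a' Q(s', a')
            Q[(s, a)] = r + GAMMA * max(Q[(s_next, a_)] for a_ in actions)
            s = s_next
    return Q
```

With the step function, state set, and action names from the value-iteration sketch above, the learned max_a Q(s,a) approaches the V*(s) values shown earlier (90, 100, 0 in the top row; 81, 90, 100 in the bottom row).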

What if the transitions and rewards are not deterministic? Use the following general fact about averaging noisy estimates:

Rather than replacing the old estimate with the new estimate, you want to compute a weighted average of them: (1 - α_n) times your old estimate plus α_n times your new estimate. This way you average out the probabilistic fluctuations, and one can show that this still converges.
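Written out (a standard form of the update the slide describes; the conditions on α_n are the usual stochastic-approximation requirements and are not stated on the slide):

```latex
\hat{Q}_{n}(s,a) \;=\; (1-\alpha_n)\,\hat{Q}_{n-1}(s,a)
  \;+\; \alpha_n \Big[\, r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a') \,\Big],
\qquad
\sum_{n} \alpha_n = \infty,\quad \sum_{n} \alpha_n^{2} < \infty.
```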

MDPs and RL: What You Should Know. Learning to choose optimal actions, from delayed reward, by learning evaluation functions like V(s) and Q(s,a). Key ideas: If the next-state function s_t × a_t → s_{t+1} is known, we can use dynamic programming to learn V(s); once it is learned, choose the action a_t that maximizes V(s_{t+1}). If the next-state function s_t × a_t → s_{t+1} is unknown, learn Q(s_t, a_t) = E[r_t + γ V*(s_{t+1})]; to learn it, sample s_t × a_t → s_{t+1} in the actual world; once it is learned, choose the action a_t that maximizes Q(s_t, a_t).