Reinforcement Learning


Reinforcement Learning
- Basic idea: receive feedback in the form of rewards
- The agent's utility is defined by the reward function
- Must (learn to) act so as to maximize expected rewards
- This slide deck courtesy of Dan Klein at UC Berkeley

Reinforcement Learning
- Still assume an MDP:
  - A set of states s ∈ S
  - A set of actions (per state) A
  - A model T(s,a,s′)
  - A reward function R(s,a,s′)
- Still looking for a policy π(s) [DEMO]
- New twist: we don't know T or R, i.e. we don't know which states are good or what the actions do
- Must actually try out actions and states to learn

Example: Animal Learning
- RL has been studied experimentally for more than 60 years in psychology
- Rewards: food, pain, hunger, drugs, etc.
- Mechanisms and sophistication debated
- Example: foraging
  - Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies
  - Bees have a direct neural connection from nectar intake measurement to the motor planning area

Example: Backgammon
- Reward only for win / loss in terminal states, zero otherwise
- TD-Gammon learns a function approximation to V(s) using a neural network
- Combined with depth-3 search, it was one of the top 3 players in the world
- You could imagine training Pacman this way, but it's tricky! (It's also P3)

Passive RL
- Simplified task:
  - You are given a policy π(s)
  - You don't know the transitions T(s,a,s′)
  - You don't know the rewards R(s,a,s′)
  - Goal: learn the state values (what policy evaluation did)
- In this case:
  - The learner is along for the ride: no choice about what actions to take
  - Just execute the policy and learn from experience
  - We'll get to the active case soon
- This is NOT offline planning! You actually take actions in the world and see what happens

Example: Direct Evaluation
- Episodes (γ = 1, living reward −1 per step):
  - Episode 1: (1,1) up −1, (1,2) up −1, (1,2) up −1, (1,3) right −1, (2,3) right −1, (3,3) right −1, (3,2) up −1, (3,3) right −1, (4,3) exit +100 (done)
  - Episode 2: (1,1) up −1, (1,2) up −1, (1,3) right −1, (2,3) right −1, (3,3) right −1, (3,2) up −1, (4,2) exit −100 (done)
- Average the observed returns from each state (sketch below):
  - V(2,3) ≈ (96 − 103) / 2 = −3.5
  - V(3,3) ≈ (99 + 97 − 102) / 3 ≈ 31.3
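A minimal sketch of this direct (Monte-Carlo) evaluation, assuming every-visit averaging and that each episode is a list of (state, reward) pairs; the function and variable names are illustrative, not from the slides.

    from collections import defaultdict

    def direct_evaluation(episodes, gamma=1.0):
        # Estimate V(s) as the average return observed after each visit to s.
        returns = defaultdict(list)
        for episode in episodes:                  # episode: list of (state, reward)
            G = 0.0
            for state, reward in reversed(episode):
                G = reward + gamma * G            # return from this time step onward
                returns[state].append(G)
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}

    # The two episodes above (living reward -1, exit rewards +100 / -100):
    ep1 = [((1,1),-1), ((1,2),-1), ((1,2),-1), ((1,3),-1), ((2,3),-1),
           ((3,3),-1), ((3,2),-1), ((3,3),-1), ((4,3),+100)]
    ep2 = [((1,1),-1), ((1,2),-1), ((1,3),-1), ((2,3),-1),
           ((3,3),-1), ((3,2),-1), ((4,2),-100)]
    V = direct_evaluation([ep1, ep2])
    print(V[(2,3)], V[(3,3)])   # -3.5 and about 31.3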

Recap: Model-Based Policy Evaluation
- Simplified Bellman updates to calculate V for a fixed policy π (sketch below):
  V_{k+1}(s) = Σ_{s′} T(s, π(s), s′) [ R(s, π(s), s′) + γ V_k(s′) ]
- The new V is an expected one-step lookahead using the current V
- Unfortunately, this needs T and R
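A minimal sketch of iterative policy evaluation when T and R are known, assuming a hypothetical model format (not from the slides): T[(s, a)] maps to a dict of next-state probabilities, and R[(s, a, s2)] is a deterministic reward.

    def evaluate_policy(states, policy, T, R, gamma=0.9, iterations=100):
        # Repeatedly apply the fixed-policy Bellman update above.
        V = {s: 0.0 for s in states}
        for _ in range(iterations):
            V_new = {}
            for s in states:
                a = policy[s]
                V_new[s] = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                               for s2, p in T[(s, a)].items())
            V = V_new
        return V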

Model-Based Learning
- Idea: learn the model empirically through experience, then solve for values as if the learned model were correct
- Simple empirical model learning (sketch below):
  - Count outcomes for each (s, a)
  - Normalize to give an estimate of T(s,a,s′)
  - Discover R(s,a,s′) when we experience (s,a,s′)
- Solving the MDP with the learned model:
  - Iterative policy evaluation, for example
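A sketch of the counting step, under the assumption that experience arrives as (s, a, s′, r) tuples and that rewards are deterministic, as in the grid examples; the names and the returned format are illustrative.

    from collections import Counter, defaultdict

    def estimate_model(transitions):
        # Build empirical T and R from observed (s, a, s_next, r) tuples.
        counts = defaultdict(Counter)        # (s, a) -> Counter over next states
        rewards = {}                         # (s, a, s_next) -> observed reward
        for s, a, s_next, r in transitions:
            counts[(s, a)][s_next] += 1
            rewards[(s, a, s_next)] = r      # rewards assumed deterministic here
        T = {}
        for (s, a), outcomes in counts.items():
            total = sum(outcomes.values())
            T[(s, a)] = {s_next: n / total for s_next, n in outcomes.items()}
        return T, rewards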

Example: Model-Based Learning
- Episodes (the same two as above, γ = 1):
  - Episode 1: (1,1) up −1, (1,2) up −1, (1,2) up −1, (1,3) right −1, (2,3) right −1, (3,3) right −1, (3,2) up −1, (3,3) right −1, (4,3) exit +100 (done)
  - Episode 2: (1,1) up −1, (1,2) up −1, (1,3) right −1, (2,3) right −1, (3,3) right −1, (3,2) up −1, (4,2) exit −100 (done)
- Estimated transition probabilities from the outcome counts:
  - T(<3,3>, right, <4,3>) = 1 / 3
  - T(<2,3>, right, <3,3>) = 2 / 2

Example: Expected Age
- Goal: compute the expected age of cs343 students
- Known P(A): E[A] = Σ_a P(a) · a
- Without P(A), instead collect samples [a_1, a_2, ..., a_N] (sketch below)
  - Unknown P(A), model-based: estimate P̂(a) from the samples, then compute E[A] ≈ Σ_a P̂(a) · a
  - Unknown P(A), model-free: E[A] ≈ (1/N) Σ_i a_i
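A small illustration of the two estimators on made-up age data; the numbers are stand-ins, not real cs343 statistics.

    import random
    from collections import Counter

    # Hypothetical sample of student ages (stand-in data for illustration).
    samples = [random.choice([19, 20, 21, 22, 23]) for _ in range(1000)]

    # Model-based: estimate P(a) from the samples, then compute the expectation with it.
    P_hat = {a: n / len(samples) for a, n in Counter(samples).items()}
    expected_age_model_based = sum(p * a for a, p in P_hat.items())

    # Model-free: average the samples directly.
    expected_age_model_free = sum(samples) / len(samples)

    # Both converge to E[A]; on the same finite sample they are numerically identical.
    print(expected_age_model_based, expected_age_model_free)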

Model-Free Learning
- Want to compute an expectation weighted by P(x): E[f(x)] = Σ_x P(x) f(x)
- Model-based: estimate P̂(x) from samples, then compute the expectation with P̂
- Model-free: estimate the expectation directly from samples, E[f(x)] ≈ (1/N) Σ_i f(x_i)
- Why does this work? Because samples appear with the right frequencies!

Sample-Based Policy Evaluation?
- Who needs T and R? Approximate the expectation with samples of s′ (drawn from T!)
- Almost! But we can't rewind time to get sample after sample from state s.

Temporal Difference Learning
- Big idea: learn from every experience!
  - Update V(s) each time we experience a transition (s, a, s′, r)
  - Likely successors s′ will contribute updates more often
- Temporal difference learning (sketch below)
  - Policy still fixed!
  - Move values toward the value of whatever successor occurs: a running average!
- Sample of V(s): sample = R(s, π(s), s′) + γ V^π(s′)
- Update to V(s): V^π(s) ← (1 − α) V^π(s) + α · sample
- Same update: V^π(s) ← V^π(s) + α (sample − V^π(s))
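A minimal sketch of one TD update, assuming values are stored in a dictionary keyed by state and that unseen states default to 0. The TD policy-evaluation example on the next slide just applies this update along each episode with γ = 1 and α = 0.5.

    def td_update(V, s, s_next, reward, alpha=0.5, gamma=1.0):
        # Move V(s) toward the observed sample: a running average over experienced transitions.
        sample = reward + gamma * V.get(s_next, 0.0)
        V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample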

Exponential Moving Average
- The running interpolation update (sketch below): x̄_n = (1 − α) x̄_{n−1} + α x_n
- Makes recent samples more important
- Forgets about the past (distant past values were wrong anyway)
- A decreasing learning rate can give converging averages
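The same interpolation written as a standalone helper, with a fixed α for simplicity; decreasing α over time instead gives converging averages, as the slide notes.

    def exponential_moving_average(samples, alpha=0.1):
        # Running interpolation: each new sample is blended into the current estimate.
        estimate = 0.0
        for x in samples:
            estimate = (1 - alpha) * estimate + alpha * x
        return estimate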

Example: TD Policy Evaluation
- Episode 1: (1,1) up −1, (1,2) up −1, (1,2) up −1, (1,3) right −1, (2,3) right −1, (3,3) right −1, (3,2) up −1, (3,3) right −1, (4,3) exit +100 (done)
- Episode 2: (1,1) up −1, (1,2) up −1, (1,3) right −1, (2,3) right −1, (3,3) right −1, (3,2) up −1, (4,2) exit −100 (done)
- Take γ = 1, α = 0.5

Problems with TD Value Learning
- TD value learning is a model-free way to do policy evaluation
- However, if we want to turn values into a (new) policy, we're sunk: extracting π(s) = argmax_a Q(s,a) from V alone requires a one-step lookahead through T and R
- Idea: learn Q-values directly (sketch below)
- This makes action selection model-free too!
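Once Q-values are available, action selection needs no model at all; a one-line sketch, assuming Q is a dictionary keyed by (state, action) pairs.

    def greedy_action(Q, s, actions):
        # With Q-values in hand, action selection needs no model: just take the argmax.
        return max(actions, key=lambda a: Q.get((s, a), 0.0))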

Active RL
- Full reinforcement learning:
  - You don't know the transitions T(s,a,s′)
  - You don't know the rewards R(s,a,s′)
  - You can choose any actions you like
  - Goal: learn the optimal policy / values (what value iteration did!)
- In this case:
  - The learner makes choices!
  - Fundamental tradeoff: exploration vs. exploitation
- This is NOT offline planning! You actually take actions in the world and find out what happens

Detour: Q-Value Iteration
- Value iteration: find successive approximations to the optimal values
  - Start with V_0*(s) = 0, which we know is right (why?)
  - Given V_i*, calculate the values for all states at depth i+1:
    V_{i+1}*(s) = max_a Σ_{s′} T(s,a,s′) [ R(s,a,s′) + γ V_i*(s′) ]
- But Q-values are more useful! (sketch below)
  - Start with Q_0*(s,a) = 0, which we know is right (why?)
  - Given Q_i*, calculate the Q-values for all Q-states at depth i+1:
    Q_{i+1}*(s,a) = Σ_{s′} T(s,a,s′) [ R(s,a,s′) + γ max_{a′} Q_i*(s′,a′) ]
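A sketch of Q-value iteration using the same hypothetical model format as the policy-evaluation sketch above; it also assumes the same action set is legal in every state.

    def q_value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
        # T[(s, a)] is a dict of next-state probabilities; R[(s, a, s2)] is a reward.
        Q = {(s, a): 0.0 for s in states for a in actions}
        for _ in range(iterations):
            Q_new = {}
            for s in states:
                for a in actions:
                    Q_new[(s, a)] = sum(
                        p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                        for s2, p in T[(s, a)].items())
            Q = Q_new
        return Q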

Q-Learning [DEMO Grid Q's]
- Q-learning: sample-based Q-value iteration (sketch below)
- Learn Q*(s,a) values:
  - Receive a sample (s, a, s′, r)
  - Consider your old estimate: Q(s,a)
  - Consider your new sample estimate: sample = R(s,a,s′) + γ max_{a′} Q(s′,a′)
  - Incorporate the new estimate into a running average: Q(s,a) ← (1 − α) Q(s,a) + α · sample
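The sample-based version drops T and R entirely; a minimal sketch of one Q-learning update, with the same dictionary conventions as the earlier sketches.

    def q_learning_update(Q, s, a, s_next, reward, actions, alpha=0.5, gamma=1.0):
        # One Q-learning update from a single observed sample (s, a, s_next, reward).
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        sample = reward + gamma * best_next
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample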

Q-Learning Properties
- Amazing result: Q-learning converges to the optimal policy
  - If you explore enough
  - If you make the learning rate small enough, but don't decrease it too quickly!
  - Basically, it doesn't matter how you select actions (!)
- Neat property: off-policy learning, i.e. you learn the optimal policy without following it (some caveats)

Exploration / Exploitation
- Several schemes for forcing exploration
- Simplest: random actions (ε-greedy; sketch below)
  - Every time step, flip a coin
  - With probability ε, act randomly
  - With probability 1 − ε, act according to the current policy
- Problems with random actions?
  - You do explore the space, but you keep thrashing around once learning is done
  - One solution: lower ε over time
  - Another solution: exploration functions
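A sketch of ε-greedy action selection as described above; the function name and dictionary conventions are illustrative, not an API from any particular library.

    import random

    def epsilon_greedy_action(Q, s, actions, epsilon=0.1):
        # With probability epsilon act randomly; otherwise act greedily w.r.t. the current Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))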

Q-Learning
- Q-learning produces tables of Q-values.

Exploration Functions
- When to explore?
  - Random actions: explore a fixed amount
  - Better idea: explore areas whose badness is not (yet) established
- Exploration function (sketch below)
  - Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k / n (exact form not important)
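One possible exploration function, hedged exactly as the slide is ("exact form not important"); the constant k and the guard for never-tried pairs are choices made here for illustration.

    def exploration_value(u, n, k=1.0):
        # Optimistic utility: boost the estimate for rarely tried state-action pairs.
        # One possible form, f(u, n) = u + k / n; the guard handles n = 0.
        return u + k / max(n, 1)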

The Story So Far: MDPs and RL
- Things we know how to do, and the corresponding techniques:
  - If we know the MDP:
    - Compute V*, Q*, π* exactly, and evaluate a fixed policy π (model-based DPs: value iteration, policy evaluation)
  - If we don't know the MDP:
    - We can estimate the MDP and then solve it (model-based RL)
    - We can estimate V for a fixed policy π (model-free RL: value learning)
    - We can estimate Q*(s,a) for the optimal policy while executing an exploration policy (model-free RL: Q-learning)