Reinforcement Learning I: Temporal Differences


Slide 1: Reinforcement Learning I: Temporal Differences
Hal Daumé III, Computer Science, University of Maryland (me@hal3.name)
CS 421: Introduction to Artificial Intelligence, 23 Feb 2012
Many slides courtesy of Dan Klein, Stuart Russell, or Andrew Moore

Slide 2: Announcements
None...

Slide 3: Survey Results
[Charts of survey ratings for Pace, Coverage (Cvg), HW, P1, and P2.]

Slide 4: Reinforcement Learning
Reinforcement learning: we still have an MDP:
- A set of states s ∈ S
- A set of actions (per state) A
- A model T(s,a,s')
- A reward function R(s,a,s')
Still looking for a policy π(s). [DEMO]
New twist: we don't know T or R, i.e. we don't know which states are good or what the actions do. We must actually try actions and states out to learn.
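
Because T and R are unknown, all the learner ever gets is sampled transitions obtained by acting. A minimal sketch of that interaction loop, assuming a hypothetical Gym-style environment with reset() and step() (not anything defined in the lecture):

    def run_episode(env, policy):
        """Collect experience by acting: the agent never sees T or R directly,
        only sampled (s, a, r, s') transitions. `env` is a hypothetical object
        with reset() -> s and step(a) -> (s', r, done)."""
        s = env.reset()
        samples = []
        done = False
        while not done:
            a = policy(s)                  # follow the current policy
            s_next, r, done = env.step(a)  # only a sample is revealed
            samples.append((s, a, r, s_next))
            s = s_next
        return samples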

Slide 5: Example: Animal Learning
RL has been studied experimentally for more than 60 years in psychology. Rewards: food, pain, hunger, drugs, etc. The mechanisms and their sophistication are debated.
Example: foraging. Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies; bees have a direct neural connection from nectar intake measurement to the motor planning area.

Slide 6: Example: Backgammon
Reward only for win/loss in terminal states, zero otherwise. TD-Gammon learns a function approximation to V(s) using a neural network. Combined with depth-3 search, it was one of the top 3 players in the world. You could imagine training Pacman this way... but it's tricky!

Slide 7: Passive Learning
Simplified task:
- You don't know the transitions T(s,a,s')
- You don't know the rewards R(s,a,s')
- You are given a policy π(s)
- Goal: learn the state values (and maybe the model)
In this case there is no choice about what actions to take; just execute the policy and learn from experience. We'll get to the general case soon.

Slide 8: Example: Direct Estimation
[Gridworld diagram; exits +100 at (4,3) and -100 at (4,2).]
Episodes:
Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done)
Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done)
With γ = 1 and R = -1:
U(1,1) ≈ (92 + (-106)) / 2 = -7
U(3,3) ≈ (99 + 97 + (-102)) / 3 ≈ 31.3
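
The slide's numbers are just averages of the observed returns-to-go. A minimal sketch of this direct (Monte Carlo) estimation with the two episodes above hard-coded; function and variable names are ours, not the course's:

    from collections import defaultdict

    def direct_estimation(episodes, gamma=1.0):
        # Average the observed return-to-go over every visit to each state.
        totals = defaultdict(float)
        counts = defaultdict(int)
        for episode in episodes:            # an episode is a list of (state, reward)
            ret = 0.0
            visits = []
            for state, reward in reversed(episode):
                ret = reward + gamma * ret  # return-to-go from this visit onward
                visits.append((state, ret))
            for state, ret in visits:
                totals[state] += ret
                counts[state] += 1
        return {s: totals[s] / counts[s] for s in totals}

    # The two episodes from the slide (step reward -1, exit rewards +100 / -100):
    ep1 = [((1, 1), -1), ((1, 2), -1), ((1, 2), -1), ((1, 3), -1), ((2, 3), -1),
           ((3, 3), -1), ((3, 2), -1), ((3, 3), -1), ((4, 3), 100)]
    ep2 = [((1, 1), -1), ((1, 2), -1), ((1, 3), -1), ((2, 3), -1),
           ((3, 3), -1), ((3, 2), -1), ((4, 2), -100)]
    U = direct_estimation([ep1, ep2])
    print(U[(1, 1)], U[(3, 3)])  # -7.0 and ~31.3, matching the slide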

Slide 9: Model-Based Learning
In general, we want to learn the optimal policy, not evaluate a fixed policy.
Idea: adaptive dynamic programming (ADP)
- Learn an initial model of the environment
- Solve for the optimal policy for this model (value or policy iteration)
- Refine the model through experience and repeat
Crucial: we have to make sure we actually learn about all of the model.

Slide 10: Model-Based Learning
Idea: learn the model empirically (rather than values), then solve the MDP as if the learned model were correct.
Empirical model learning, simplest case:
- Count outcomes for each (s,a)
- Normalize to give an estimate of T(s,a,s')
- Discover R(s,a,s') the first time we experience (s,a,s')
More complex learners are possible (e.g. if we know that all squares have related action outcomes, such as "stationary noise"); a counting sketch follows below.
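
A hedged sketch of the "count and normalize" idea in a tabular setting; the class and method names are illustrative, not from the course:

    from collections import defaultdict

    class EmpiricalModel:
        """Count (s, a, s') outcomes, normalize for an estimate of T, and record
        R(s, a, s') the first time each transition is experienced."""
        def __init__(self):
            self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': n}
            self.rewards = {}                                    # (s, a, s') -> r

        def observe(self, s, a, s_next, r):
            self.counts[(s, a)][s_next] += 1
            self.rewards.setdefault((s, a, s_next), r)

        def T(self, s, a, s_next):
            outcomes = self.counts[(s, a)]
            total = sum(outcomes.values())
            return outcomes[s_next] / total if total else 0.0

Feeding it the episodes on the next slide would give T(<3,3>, right, <4,3>) = 1/3, since (3,3) with action right is tried three times and reaches (4,3) once.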

Slide 11: Example: Model-Based Learning
[Same gridworld and episodes as Slide 8; exits +100 at (4,3) and -100 at (4,2).]
Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done)
Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done)
With γ = 1:
T(<3,3>, right, <4,3>) = 1/3
T(<2,3>, right, <3,3>) = 2/2

Slide 12: Example: Greedy ADP
Imagine we find the lower path to the good exit first. Some states will never be visited when following this policy from (1,1). We'll keep re-using this policy, because following it never collects data about the regions of the model we would need to learn the optimal policy.

Slide 13: What Went Wrong?
Problem with following the optimal policy for the current model: we never learn about better regions of the space if the current policy neglects them.
Fundamental tradeoff: exploration vs. exploitation.
- Exploration: we must take actions with suboptimal estimates to discover new rewards and increase eventual utility
- Exploitation: once the true optimal policy is learned, exploration reduces utility
Systems must explore in the beginning and exploit in the limit (one common scheme is sketched below).
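
One common way to "explore in the beginning and exploit in the limit" is an ε-greedy choice with a decaying ε; the slides do not commit to a particular scheme, so this is only an illustrative sketch with made-up names:

    import random

    def epsilon_greedy(values, state, actions, t, eps0=1.0, decay=0.01):
        # With probability eps, explore (random action); otherwise exploit the
        # current estimates. eps0 / (1 + decay * t) shrinks toward 0 over time.
        eps = eps0 / (1.0 + decay * t)
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: values.get((state, a), 0.0))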

Slide 14: Model-Free Learning
Big idea: why bother learning T? Update V each time we experience a transition; frequent outcomes will contribute more updates (over time).
Temporal difference learning (TD): the policy is still fixed! Move values toward the value of whatever successor actually occurs.
[Backup diagram: s → a → (s,a) → s']
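
Written out, the update the slide describes is V(s) ← V(s) + α [r + γ V(s') − V(s)], i.e. move V(s) a fraction α of the way toward the sampled target r + γ V(s'). A minimal sketch with a dict-backed value table (names are ours):

    def td_update(V, s, r, s_next, alpha, gamma=1.0, terminal=False):
        # One passive TD update: nudge V(s) toward the sample r + gamma * V(s').
        # Missing states are treated as having value 0.
        v_s = V.get(s, 0.0)
        sample = r + (0.0 if terminal else gamma * V.get(s_next, 0.0))
        V[s] = v_s + alpha * (sample - v_s)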

Slide 15: Example: Passive TD
Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done)
Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done)
Take γ = 1, α = 0.5.
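
As a hedged worked step (assuming all values start at zero), the first transition of Episode 1 with γ = 1 and α = 0.5 moves V(1,1) half-way toward the sample −1 + 1·0 = −1, using the td_update sketch above:

    V = {}
    td_update(V, (1, 1), r=-1, s_next=(1, 2), alpha=0.5, gamma=1.0)
    print(V[(1, 1)])  # -0.5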

Slide 16: Problems with TD Value Learning
TD value learning is model-free for policy evaluation. However, if we want to turn our value estimates into a policy, we're sunk: extracting the greedy action from V(s) still requires the model T(s,a,s') and R(s,a,s') to evaluate each action's outcomes.
Idea: learn Q-values directly. This makes action selection model-free too!
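
To make the contrast concrete: with Q-values, acting greedily needs no model at all, whereas the same choice from V(s) requires T and R to evaluate each action. A minimal illustrative sketch (the function name is ours):

    def greedy_action(Q, s, actions):
        # Model-free action selection: pick argmax_a Q(s, a).
        return max(actions, key=lambda a: Q.get((s, a), 0.0))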