CS 188: Artificial Intelligence, Fall 2008
Lecture 11: Reinforcement Learning
Dan Klein, UC Berkeley (many slides over the course adapted from either Stuart Russell or Andrew Moore)

Reinforcement Learning
Reinforcement learning: we still have an MDP:
- A set of states S
- A set of actions (per state) A
- A model T(s, a, s')
- A reward function R(s, a, s')
We are still looking for a policy π(s). [DEMO]
New twist: we don't know T or R, i.e. we don't know which states are good or what the actions do. We must actually try actions and states out to learn.

Example: Animal Learning
RL has been studied experimentally for more than 60 years in psychology. Rewards: food, pain, hunger, drugs, etc. The mechanisms and their sophistication are debated.
Example: foraging. Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies. Bees have a direct neural connection from nectar intake measurement to motor planning areas.

Example: Backgammon
Reward only for win / loss in terminal states, zero otherwise. TD-Gammon learns a function approximation to V(s) using a neural network. Combined with depth-3 search, it was one of the top 3 players in the world. You could imagine training Pacman this way... but it's tricky!

Passive Learning [DEMO: Optimal Policy]
Simplified task: you don't know the transitions T(s, a, s') or the rewards R(s, a, s'), but you are given a policy π(s). Goal: learn the state values (and maybe the model), i.e. policy evaluation.
In this case the learner is along for the ride: no choice about what actions to take, just execute the policy and learn from experience. We'll get to the active case soon. This is NOT offline planning!

Example: Direct Estimation
[Gridworld figure: 4x3 grid with exit rewards +100 at (4,3) and -100 at (4,2); γ = 1, living reward R = -1.]
Episodes end with "(4,3) exit +100" or "(4,2) exit -100". Averaging the observed returns from each state gives, for example:
V(1,1) ~ (92 + -106) / 2 = -7
V(3,3) ~ (99 + 97 + -102) / 3 = 31.3
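
To make direct estimation concrete, here is a minimal sketch (not from the slides): record the return observed from every visited state and average them. The episode encoding, a list of (state, reward) pairs with made-up trajectories, is an assumption for illustration, not the course's project format.

```python
from collections import defaultdict

def direct_estimation(episodes, gamma=1.0):
    """Estimate V(s) by averaging the returns observed from each visited state."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards so the return from each state accumulates easily.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Two made-up episodes in a gridworld with living reward -1 and exits +100 / -100.
ep_good = [((1, 1), -1), ((1, 2), -1), ((1, 3), -1), ((2, 3), -1),
           ((3, 3), -1), ((4, 3), 100)]
ep_bad = [((1, 1), -1), ((2, 1), -1), ((3, 1), -1), ((3, 2), -1),
          ((4, 2), -100)]
print(direct_estimation([ep_good, ep_bad])[(1, 1)])  # average of the two observed returns from (1,1)
```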

Model-Based Learning
Idea: learn the model empirically (rather than the values), then solve the MDP as if the learned model were correct.
Empirical model learning, simplest case:
- Count outcomes for each (s, a).
- Normalize to get an estimate of T(s, a, s').
- Discover R(s, a, s') the first time we experience (s, a, s').
More complex learners are possible (e.g. if we know that all squares have related action outcomes, such as "stationary noise").

Example: Model-Based Learning
Using the same gridworld episodes (γ = 1, exits +100 and -100), counting and normalizing gives, for example:
T(<3,3>, right, <4,3>) = 1 / 3
T(<2,3>, right, <3,3>) = 2 / 2

Recap: Model-Based Policy Evaluation
Simplified Bellman updates calculate V for a fixed policy: the new V is an expected one-step lookahead using the current V:
    V_{i+1}^π(s) ← Σ_s' T(s, π(s), s') [ R(s, π(s), s') + γ V_i^π(s') ]
Unfortunately, this needs T and R.

Sample Averages to Replace Expectations?
Who needs T and R? Approximate the expectation with samples (drawn from T!).

Model-Free Learning
Big idea: why bother learning T? Update V(s) each time we experience a transition; frequent outcomes will contribute more updates over time. This is temporal difference (TD) learning. The policy is still fixed! Move values toward the value of whatever successor occurs, i.e. keep a running average:
    V^π(s) ← (1 - α) V^π(s) + α [ R(s, π(s), s') + γ V^π(s') ]

Example: TD Policy Evaluation
The same gridworld episodes, ending "(4,2) exit -100" and "(4,3) exit +100", with γ = 1 and α = 0.5. (A minimal TD(0) sketch follows below.)
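
A minimal TD(0) sketch of the update above, assuming the agent's experience is available as lists of (s, r, s') transitions gathered while following the fixed policy (this encoding is an illustration, not the course's project format):

```python
def td_policy_evaluation(episodes, gamma=1.0, alpha=0.5):
    """TD(0) evaluation of a fixed policy from observed transitions."""
    V = {}
    for episode in episodes:
        for s, r, s2 in episode:
            sample = r + gamma * V.get(s2, 0.0)                   # one-step sample of V(s)
            V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample   # running average toward the sample
    return V
```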

Problems with TD Value Learning
TD value learning is model-free for policy evaluation. However, if we want to turn our value estimates into a policy, we're sunk: extracting a policy requires a one-step lookahead,
    π(s) = argmax_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V(s') ],
which needs T and R. Idea: learn Q-values directly. That makes action selection model-free too!

Active Learning
Full reinforcement learning: you don't know the transitions T(s, a, s'), you don't know the rewards R(s, a, s'), and you can choose any actions you like. Goal: learn the optimal policy (and maybe the values). In this case the learner makes choices! Fundamental tradeoff: exploration vs. exploitation. This is NOT offline planning!

Model-Based Learning
In general, we want to learn the optimal policy, not evaluate a fixed policy. Idea: adaptive dynamic programming (ADP).
- Learn an initial model of the environment.
- Solve for the optimal policy for this model (value or policy iteration).
- Refine the model through experience and repeat.
Crucial: we have to make sure we actually learn about all of the model.

Example: Greedy ADP
Imagine we find the lower path to the good exit first. Some states will never be visited when following this policy from (1,1). We'll keep re-using this policy, because following it never collects the regions of the model we need in order to learn the optimal policy.

What Went Wrong?
Problem with following the optimal policy for the current model: we never learn about better regions of the space if the current policy neglects them.
Fundamental tradeoff: exploration vs. exploitation.
- Exploration: must take actions with suboptimal estimates to discover new rewards and increase eventual utility.
- Exploitation: once the true optimal policy is learned, exploration reduces utility.
Systems must explore in the beginning and exploit in the limit.

Q-Value Iteration
Value iteration finds successive approximations of the optimal values. Start with V_0*(s) = 0, which we know is right (why?). Given V_i*, calculate the values for all states at depth i+1:
    V_{i+1}(s) ← max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V_i(s') ]
But Q-values are more useful! Start with Q_0*(s, a) = 0, which we know is right (why?). Given Q_i*, calculate the q-values for all q-states at depth i+1:
    Q_{i+1}(s, a) ← Σ_s' T(s, a, s') [ R(s, a, s') + γ max_a' Q_i(s', a') ]
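
A minimal sketch of the Q-value iteration update above, for a known MDP. The interfaces assumed here (actions(s) returning a list of actions, T(s, a) returning (next_state, probability) pairs, R(s, a, s') returning a reward) are illustrative, not the course's project API.

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    """Q-value iteration on a known MDP."""
    Q = {(s, a): 0.0 for s in states for a in actions(s)}  # Q_0 = 0, which we know is right
    for _ in range(iterations):
        newQ = {}
        for s in states:
            for a in actions(s):
                # Expected one-step lookahead using the current q-values.
                newQ[(s, a)] = sum(
                    p * (R(s, a, s2) + gamma * max(
                        (Q[(s2, a2)] for a2 in actions(s2)), default=0.0))
                    for s2, p in T(s, a))
        Q = newQ
    return Q
```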

Q-Learning [DEMO: Grid Q]
Learn Q*(s, a) values. Receive a sample (s, a, s', r).
- Consider your old estimate: Q(s, a).
- Consider your new sample estimate: sample = r + γ max_a' Q(s', a').
- Incorporate the new estimate into a running average: Q(s, a) ← (1 - α) Q(s, a) + α · sample.

Q-Learning Properties [DEMO: Grid Q]
Q-learning will converge to the optimal policy:
- if you explore enough,
- and if you make the learning rate small enough, but don't decrease it too quickly!
Basically, it doesn't matter how you select actions (!). Neat property: it learns the optimal q-values regardless of action selection noise (with some caveats).

Exploration / Exploitation [DEMO: RL Pacman]
There are several schemes for forcing exploration. Simplest: random actions (ε-greedy). Every time step, flip a coin: with probability ε, act randomly; with probability 1 - ε, act according to the current policy.
Problems with random actions? You do explore the space, but you keep thrashing around once learning is done. One solution: lower ε over time. Another solution: exploration functions.

Exploration Functions
When to explore? Random actions explore a fixed amount. Better idea: explore areas whose badness is not (yet) established. An exploration function takes a value estimate and a visit count and returns an optimistic utility, e.g. the estimate plus a bonus that shrinks with the count (the exact form is not important).

Q-Learning [DEMO: Crawler Q]
Q-learning produces tables of q-values. (A minimal tabular sketch with ε-greedy exploration follows below.)

Q-Learning
In realistic situations, we cannot possibly learn about every single state! There are too many states to visit them all in training, and too many states to hold the q-tables in memory. Instead, we want to generalize: learn about some small number of training states from experience, and generalize that experience to new, similar states. This is a fundamental idea in machine learning, and we'll see it over and over again.
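
Putting the pieces together, here is a minimal sketch of tabular Q-learning with ε-greedy action selection. The environment interface assumed here (env.reset() returning a state, env.step(a) returning (s', r, done), env.actions(s) returning legal actions) is an illustration, not the course's Gridworld or Pacman API.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = defaultdict(float)  # Q[(s, a)] defaults to 0 for unseen pairs

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # Move Q(s, a) toward the new sample estimate (running average).
            sample = r if done else r + gamma * max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
            s = s2
    return Q
```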

Example: Pacman
Let's say we discover through experience that a particular state is bad. In naïve q-learning, this tells us nothing about other, nearly identical states or their q-states, or even about the same situation mirrored elsewhere on the board!

Feature-Based Representations
Solution: describe a state using a vector of features. Features are functions from states to real numbers (often 0/1) that capture important properties of the state.
Example features:
- Distance to closest ghost
- Distance to closest dot
- Number of ghosts
- 1 / (distance to dot)^2
- Is Pacman in a tunnel? (0/1)
- etc.
We can also describe a q-state (s, a) with features (e.g. "action moves closer to food").

Linear Feature Functions
Using a feature representation, we can write a q function (or value function) for any state using a few weights:
    Q(s, a) = w1 f1(s, a) + w2 f2(s, a) + ... + wn fn(s, a)
Advantage: our experience is summed up in a few powerful numbers. Disadvantage: states may share features but be very different in value!

Function Approximation
Q-learning with linear q-functions: on each transition (s, a, r, s'), compute the error
    difference = [ r + γ max_a' Q(s', a') ] - Q(s, a)
and update each weight:
    w_i ← w_i + α · difference · f_i(s, a)
Intuitive interpretation: adjust the weights of the active features. E.g. if something unexpectedly bad happens, disprefer all states with that state's features. Formal justification: online least squares. (A code sketch of this update follows below.)

Example: Q-Pacman
[Slide shows a worked example of this weight update on Pacman features; the numbers were not captured in the transcription.]

Linear Regression
[Scatter-plot figure.] Given examples (x, y), predict y for a new point x.
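
A minimal sketch of the approximate Q-learning weight update described above. The helpers features(s, a), returning a dict of feature values, and legal_actions(s) are assumptions for illustration, not the project's API.

```python
def approx_q_update(weights, features, legal_actions, s, a, r, s2, alpha=0.01, gamma=0.9):
    """One approximate Q-learning update with a linear q-function:
    Q(s, a) = sum_i w_i * f_i(s, a)."""
    def q(state, action):
        return sum(weights.get(f, 0.0) * v for f, v in features(state, action).items())

    # Sample estimate and the error ("difference") between it and the current Q(s, a).
    target = r + gamma * max((q(s2, a2) for a2 in legal_actions(s2)), default=0.0)
    difference = target - q(s, a)

    # Nudge the weight of every active feature in the direction of the error.
    for f, v in features(s, a).items():
        weights[f] = weights.get(f, 0.0) + alpha * difference * v
    return weights
```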

Linear Regression and Ordinary Least Squares (OLS)
[Figure: observations vs. predictions; the vertical gaps between them are the errors, or residuals.]

Minimizing Error
Least squares minimizes the total squared error of the predictions:
    error(w) = Σ_i ( y_i - Σ_k w_k f_k(x_i) )^2
Value update explained: the approximate q-learning weight update above is one online gradient step on this squared error, with the sample [ r + γ max_a' Q(s', a') ] playing the role of the target y. [DEMO]

Overfitting
[Figure: a degree-15 polynomial fit to the data; it matches the training points but generalizes poorly.]

Policy Search
Problem: often the feature-based policies that work well aren't the ones that approximate V / Q best. E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions. We'll see this distinction between modeling and prediction again later in the course.
Solution: learn the policy that maximizes rewards rather than the value that predicts rewards. This is the idea behind policy search, such as what controlled the upside-down helicopter.

Policy Search
Simplest policy search: start with an initial linear value function or q-function, then nudge each feature weight up and down and see if your policy is better than before (a minimal sketch follows below).
Problems: how do we tell that the policy got better? We need to run many sample episodes! And if there are a lot of features, this can be impractical.

Policy Search*
Advanced policy search: write a stochastic (soft) policy. It turns out you can efficiently approximate the derivative of the returns with respect to the parameters w (details in the book, but you don't have to know them). Take uphill steps, recalculate derivatives, etc.

Take a Deep Breath
We're done with search and planning! Next, we'll look at how to reason with probabilities: diagnosis, tracking objects, speech recognition, robot mapping, and lots more! The last part of the course: machine learning.
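
For completeness, here is a minimal sketch of the "nudge each weight" policy search described above. evaluate_policy(weights) is an assumed helper that runs many sample episodes with the policy induced by these weights and returns the average return; it is not part of the course's project code.

```python
def hill_climb_policy_search(weights, evaluate_policy, step=0.1, passes=10):
    """Simplest policy search: nudge each feature weight up and down and keep
    a change only if the resulting policy scores better."""
    best_score = evaluate_policy(weights)
    for _ in range(passes):
        for f in list(weights):                   # try each feature weight in turn
            for delta in (+step, -step):          # nudge it up, then down
                candidate = dict(weights)
                candidate[f] += delta
                score = evaluate_policy(candidate)  # expensive: many sample episodes!
                if score > best_score:
                    weights, best_score = candidate, score
                    break                         # keep this nudge, move on to the next weight
    return weights
```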