CS 188: Artificial Intelligence - Reinforcement Learning


CS 188: Artificial Intelligence
Reinforcement Learning
Dan Klein, Pieter Abbeel, University of California, Berkeley

Reinforcement Learning
Agent and Environment: the agent observes a State s and a Reward r, and chooses an Action a; the environment responds with the next state and reward.
Basic idea: Receive feedback in the form of rewards. The agent's utility is defined by the reward function. Must (learn to) act so as to maximize expected rewards. All learning is based on observed samples of outcomes! (A small code sketch of this loop follows below.)

Example: Learning to Walk
Before Learning / A Learning Trial / After Learning [1K Trials] [Kohl and Stone, ICRA 2004]

The Crawler!

Reinforcement Learning
Still assume a Markov decision process (MDP):
A set of states s ∈ S
A set of actions (per state) A
A model T(s, a, s')
A reward function R(s, a, s')
Still looking for a policy π(s)
New twist: don't know T or R, i.e. we don't know which states are good or what the actions do. Must actually try actions and states out to learn. [You, in Project 3]
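To make the interaction loop above concrete, here is a minimal Python sketch; the env and agent objects and their reset/step/get_action/observe methods are hypothetical placeholders, not the actual Project 3 interfaces.

def run_episode(env, agent):
    """Run one episode: observe states, choose actions, receive rewards, learn."""
    state = env.reset()                                   # hypothetical environment API
    total_reward = 0.0
    done = False
    while not done:
        action = agent.get_action(state)                  # agent chooses an action a
        next_state, reward, done = env.step(action)       # environment returns s', r
        agent.observe(state, action, next_state, reward)  # learning happens from the sample
        total_reward += reward
        state = next_state
    return total_reward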

Offline (MDPs) vs. Online (RL)
Offline Solution (known MDP) vs. Online Learning (act in the world and learn)

Passive Reinforcement Learning
Simplified task: policy evaluation
Input: a fixed policy π(s)
You don't know the transitions T(s, a, s')
You don't know the rewards R(s, a, s')
Goal: learn the state values
In this case: The learner is along for the ride. No choice about what actions to take. Just execute the policy and learn from experience. This is NOT offline planning! You actually take actions in the world.

Direct Evaluation
Goal: Compute values for each state under π
Idea: Average together observed sample values
Act according to π
Every time you visit a state, write down what the sum of discounted rewards turned out to be
Average those samples
This is called direct evaluation (sketched in code below)

Example: Direct Evaluation
Input Policy π, assume γ = 1
Observed Episodes (Training):
Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10
Output Values: A = -10, B = +8, C = +4, D = +10, E = -2

Problems with Direct Evaluation
What's good about direct evaluation?
It's easy to understand
It doesn't require any knowledge of T, R
It eventually computes the correct average values, using just sample transitions
What's bad about it?
It wastes information about state connections
Each state must be learned separately
So, it takes a long time to learn
If B and E both go to C under this policy, how can their values be different?
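A minimal Python sketch of direct evaluation as described above: act according to π, record the discounted return observed after each state visit, and average. The (state, action, next_state, reward) episode format is an assumption for illustration.

from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Average the observed discounted returns that follow each state visit."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk backwards so the return from each step accumulates in one pass.
        for state, action, next_state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)
    return {s: sum(g) / len(g) for s, g in returns.items()}

# Episodes 1 and 4 from the example above (gamma = 1): yields V(B) = 8, V(A) = -10.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]
print(direct_evaluation(episodes))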

Why Not Use Policy Evaluation?
Simplified Bellman updates calculate V for a fixed policy:
Each round, replace V with a one-step-look-ahead layer over V:
V_0^π(s) = 0
V_{k+1}^π(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V_k^π(s') ]
This approach fully exploited the connections between the states
Unfortunately, we need T and R to do it!
Key question: how can we do this update to V without knowing T and R?
In other words, how do we take a weighted average without knowing the weights?

Example: Expected Age
Goal: Compute the expected age of cs188 students
Known P(A): E[A] = Σ_a P(a) · a
Without P(A), instead collect samples [a_1, a_2, ..., a_N]
Unknown P(A), "Model Based": P̂(a) = num(a) / N, then E[A] ≈ Σ_a P̂(a) · a
Why does this work? Because eventually you learn the right model.
Unknown P(A), "Model Free": E[A] ≈ (1/N) Σ_i a_i
Why does this work? Because samples appear with the right frequencies.

Model-Based Learning
Model-Based Idea: Learn an approximate model based on experience, then solve for values as if the learned model were correct
Step 1: Learn empirical MDP model (sketched in code below)
Count outcomes s' for each s, a
Normalize to give an estimate of T̂(s, a, s')
Discover each R̂(s, a, s') when we experience (s, a, s')
Step 2: Solve the learned MDP
For example, use policy evaluation

Example: Model-Based Learning
Input Policy π, assume γ = 1
Observed Episodes (Training): Episodes 1-4 from the direct evaluation example above
Learned Model:
T(s, a, s'): T(B, east, C) = 1.00, T(C, east, D) = 0.75, T(C, east, A) = 0.25, ...
R(s, a, s'): R(B, east, C) = -1, R(C, east, D) = -1, R(D, exit, x) = +10, ...

Model-Free Learning
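A minimal Python sketch of Step 1 of model-based learning above: count the outcomes s' seen for each (s, a), normalize the counts into T̂, and record the observed rewards as R̂. Deterministic rewards and a flat list of (s, a, s', r) samples are assumptions for illustration.

from collections import defaultdict

def learn_empirical_mdp(samples):
    """Estimate T_hat(s, a, s') and R_hat(s, a, s') from observed (s, a, s', r) samples."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = N(s, a, s')
    R_hat = {}                                       # assumes rewards are deterministic
    for s, a, s_next, r in samples:
        counts[(s, a)][s_next] += 1
        R_hat[(s, a, s_next)] = r
    T_hat = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T_hat[(s, a)] = {s_next: n / total for s_next, n in outcomes.items()}
    return T_hat, R_hat

# Fed the four training episodes above, this recovers e.g.
# T_hat[("C", "east")] == {"D": 0.75, "A": 0.25} and R_hat[("B", "east", "C")] == -1.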

Sample-Based Policy Evaluation?
We want to improve our estimate of V by computing these averages:
V_{k+1}^π(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V_k^π(s') ]
Idea: Take samples of outcomes s' (by doing the action!) and average:
sample_1 = R(s, π(s), s'_1) + γ V_k^π(s'_1)
sample_2 = R(s, π(s), s'_2) + γ V_k^π(s'_2)
...
sample_n = R(s, π(s), s'_n) + γ V_k^π(s'_n)
V_{k+1}^π(s) ← (1/n) Σ_i sample_i
Almost! But we can't rewind time to get sample after sample from state s.

Temporal Difference Learning
Big idea: learn from every experience!
Update V(s) each time we experience a transition (s, a, s', r)
Likely outcomes s' will contribute updates more often
Temporal difference learning of values
Policy still fixed, still doing evaluation!
Move values toward value of whatever successor occurs: running average
Sample of V(s): sample = R(s, π(s), s') + γ V^π(s')
Update to V(s): V^π(s) ← (1 - α) V^π(s) + α · sample
Same update: V^π(s) ← V^π(s) + α (sample - V^π(s))

Exponential Moving Average
Exponential moving average: the running interpolation update x̄_n = (1 - α) x̄_{n-1} + α x_n
Makes recent samples more important: x̄_n = [ x_n + (1-α) x_{n-1} + (1-α)² x_{n-2} + ... ] / [ 1 + (1-α) + (1-α)² + ... ]
Forgets about the past (distant past values were wrong anyway)
A decreasing learning rate (alpha) can give converging averages

Example: Temporal Difference Learning
Assume: γ = 1, α = 1/2
Observed Transitions: B, east, C, -2, then C, east, D, -2
(In the slide's grid, starting from V(D) = 8 and all other values 0, the first transition moves V(B) to -1 and the second moves V(C) to 3; a code sketch of this update follows below.)

Problems with TD Value Learning
TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
However, if we want to turn values into a (new) policy, we're sunk:
π(s) = argmax_a Q(s, a)
Q(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V(s') ]
Idea: learn Q-values, not values
Makes action selection model-free too!
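A minimal Python sketch of the TD value update above, with values kept in a dictionary; the numbers reproduce the slide's example (γ = 1, α = 1/2, starting from V(D) = 8 and all other values 0).

def td_update(V, s, s_next, r, alpha=0.5, gamma=1.0):
    """Move V(s) toward the observed sample r + gamma * V(s') by the learning rate alpha."""
    sample = r + gamma * V.get(s_next, 0.0)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample

V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}
td_update(V, "B", "C", -2)   # first observed transition:  V(B) 0 -> -1
td_update(V, "C", "D", -2)   # second observed transition: V(C) 0 -> 3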

Active Reinforcement Learning
Full reinforcement learning: optimal policies (like value iteration)
You don't know the transitions T(s, a, s')
You don't know the rewards R(s, a, s')
You choose the actions now
Goal: learn the optimal policy / values
In this case: The learner makes choices!
Fundamental tradeoff: exploration vs. exploitation
This is NOT offline planning! You actually take actions in the world and find out what happens

Detour: Q-Value Iteration
Value iteration: find successive (depth-limited) values
Start with V_0(s) = 0, which we know is right
Given V_k, calculate the depth k+1 values for all states:
V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
But Q-values are more useful, so compute them instead
Start with Q_0(s, a) = 0, which we know is right
Given Q_k, calculate the depth k+1 q-values for all q-states:
Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]

Q-Learning
Q-Learning: sample-based Q-value iteration (sketched in code below)
Learn Q(s, a) values as you go
Receive a sample (s, a, s', r)
Consider your old estimate: Q(s, a)
Consider your new sample estimate: sample = r + γ max_{a'} Q(s', a')
Incorporate the new estimate into a running average: Q(s, a) ← (1 - α) Q(s, a) + α · sample

Q-Learning Properties
Amazing result: Q-learning converges to an optimal policy, even if you're acting suboptimally!
This is called off-policy learning
Caveats:
You have to explore enough
You have to eventually make the learning rate small enough, but not decrease it too quickly
Basically, in the limit, it doesn't matter how you select actions (!)
[demo: gridworld Q, crawler Q]

CS 188: Artificial Intelligence (Reinforcement Learning II)
Dan Klein, Pieter Abbeel, University of California, Berkeley
We still assume an MDP: a set of states s ∈ S, a set of actions (per state) A, a model T(s, a, s'), a reward function R(s, a, s'). Still looking for a policy π(s).
New twist: don't know T or R, i.e. don't know which states are good or what the actions do. Must actually try actions and states out to learn.
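A minimal Python sketch of the Q-learning update above, with Q-values in a dictionary keyed by (state, action); the actions_in function listing the legal actions of a state is a hypothetical placeholder.

from collections import defaultdict

def q_update(Q, s, a, s_next, r, actions_in, alpha=0.5, gamma=0.9):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha [ r + gamma max_a' Q(s',a') ]."""
    best_next = max((Q[(s_next, a2)] for a2 in actions_in(s_next)), default=0.0)
    sample = r + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

Q = defaultdict(float)
q_update(Q, "C", "east", "D", -1, actions_in=lambda s: ["exit"])  # Q[("C", "east")] becomes -0.5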

The Story So Far: MDPs and RL
Known MDP: Offline Solution
Goal: Compute V*, Q*, π*  /  Technique: Value / policy iteration
Goal: Evaluate a fixed policy π  /  Technique: Policy evaluation
Unknown MDP: Model-Based
Goal: Compute V*, Q*, π*  /  Technique: VI/PI on the approx. MDP
Goal: Evaluate a fixed policy π  /  Technique: PE on the approx. MDP
Unknown MDP: Model-Free
Goal: Compute V*, Q*, π*  /  Technique: Q-learning
Goal: Evaluate a fixed policy π  /  Technique: Value learning

Model-Free Learning
Model-free (temporal difference) learning
Experience the world through episodes
Update estimates each transition
Over time, updates will mimic Bellman updates
Q-Value Iteration (model-based, requires a known MDP) vs. Q-Learning (model-free, requires only experienced transitions)

Q-Learning
We'd like to do Q-value updates to each Q-state:
Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]
But we can't compute this update without knowing T, R
Instead, compute average as we go
Receive a sample transition (s, a, r, s')
This sample suggests Q(s, a) ≈ r + γ max_{a'} Q(s', a')
But we want to average over results from (s, a) (Why?)
So keep a running average: Q(s, a) ← (1 - α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ]

Q-Learning Properties
Amazing result: Q-learning converges to an optimal policy, even if you're acting suboptimally!
This is called off-policy learning
Caveats:
You have to explore enough
You have to eventually make the learning rate small enough, but not decrease it too quickly
Basically, in the limit, it doesn't matter how you select actions (!)
[demo: off-policy]

Exploration vs. Exploitation

How to Explore?
Several schemes for forcing exploration
Simplest: random actions (ε-greedy), sketched in code below
Every time step, flip a coin
With (small) probability ε, act randomly
With (large) probability 1 - ε, act on the current policy
Problems with random actions?
You do eventually explore the space, but keep thrashing around once learning is done
One solution: lower ε over time
Another solution: exploration functions
[demo: crawler]
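A minimal Python sketch of ε-greedy action selection as described above; Q is a dictionary of Q-values keyed by (state, action), as in the earlier sketch, and legal_actions is assumed to be a non-empty list.

import random

def epsilon_greedy(Q, s, legal_actions, epsilon=0.1):
    """With (small) probability epsilon act randomly; otherwise act on the current Q-values."""
    if random.random() < epsilon:
        return random.choice(legal_actions)                        # explore
    return max(legal_actions, key=lambda a: Q.get((s, a), 0.0))    # exploit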

Exploration Functions
When to explore?
Random actions: explore a fixed amount
Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
Exploration function
Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n
Regular Q-update: move Q(s, a) toward R(s, a, s') + γ max_{a'} Q(s', a')
Modified Q-update: move Q(s, a) toward R(s, a, s') + γ max_{a'} f(Q(s', a'), N(s', a'))
Note: this propagates the "bonus" back to states that lead to unknown states as well! (sketched in code below)
[demo: crawler]

Regret
Even if you learn the optimal policy, you still make mistakes along the way!
Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards
Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal
Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret

Approximate Q-Learning

Generalizing Across States
Basic Q-Learning keeps a table of all q-values
In realistic situations, we cannot possibly learn about every single state!
Too many states to visit them all in training
Too many states to hold the q-tables in memory
Instead, we want to generalize:
Learn about some small number of training states from experience
Generalize that experience to new, similar situations
This is a fundamental idea in machine learning, and we'll see it over and over again

Example: Pacman
Let's say we discover through experience that this state is bad:
In naïve q-learning, we know nothing about this state:
Or even this one!

Feature-Based Representations
Solution: describe a state using a vector of features (properties)
Features are functions from states to real numbers (often 0/1) that capture important properties of the state
Example features:
Distance to closest ghost
Distance to closest dot
Number of ghosts
1 / (dist to dot)²
Is Pacman in a tunnel? (0/1)
etc.
Is it the exact state on this slide?
Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
[demo: RL pacman]
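A minimal Python sketch of the modified Q-update with an exploration function, as described above; the constant k, the visit-count table N, and the use of k / (n + 1) to keep the bonus finite for unvisited q-states are assumptions for illustration.

from collections import defaultdict

def exploration_q_update(Q, N, s, a, s_next, r, actions_in, k=2.0, alpha=0.5, gamma=0.9):
    """Back up optimistic utilities f(Q, N) instead of raw Q-values."""
    N[(s, a)] += 1
    f = lambda u, n: u + k / (n + 1)   # exploration function: big bonus while n is small
    best_next = max((f(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in actions_in(s_next)),
                    default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

Q, N = defaultdict(float), defaultdict(int)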

Linear Value Functions
Using a feature representation, we can write a q function (or value function) for any state using a few weights:
V(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s)
Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + ... + w_n f_n(s, a)
Advantage: our experience is summed up in a few powerful numbers
Disadvantage: states may share features but actually be very different in value!

Approximate Q-Learning
Q-learning with linear Q-functions (sketched in code below):
transition = (s, a, r, s')
difference = [ r + γ max_{a'} Q(s', a') ] - Q(s, a)
Exact Q's: Q(s, a) ← Q(s, a) + α · difference
Approximate Q's: w_i ← w_i + α · difference · f_i(s, a)
Intuitive interpretation:
Adjust weights of active features
E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features
Formal justification: online least squares

Example: Q-Pacman
[demo: RL pacman]

Q-Learning and Least Squares

Linear Approximation: Regression*
Prediction: ŷ = w_0 + w_1 f_1(x)
(The slide shows a scatterplot of observations with a fitted line.)

Optimization: Least Squares*
total error = Σ_i (y_i - ŷ_i)² = Σ_i ( y_i - Σ_k w_k f_k(x_i) )²
(The slide's plot marks an observation y, the prediction ŷ, and the error or residual between them.)
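A minimal Python sketch of the approximate Q-learning update above: Q(s, a) is a dot product of weights and features, and each active weight is nudged by α · difference · f_i(s, a). The feature dictionary shown is a hypothetical Pacman-style example.

def q_value(w, feats):
    """Linear Q-function: Q(s,a) = sum_i w_i * f_i(s,a), with features given as a dict."""
    return sum(w.get(name, 0.0) * value for name, value in feats.items())

def approx_q_update(w, feats_sa, r, best_next_q, alpha=0.05, gamma=0.9):
    """w_i <- w_i + alpha * difference * f_i(s,a), where difference = target - prediction."""
    difference = (r + gamma * best_next_q) - q_value(w, feats_sa)
    for name, value in feats_sa.items():
        w[name] = w.get(name, 0.0) + alpha * difference * value

# Hypothetical features for one observed (s, a, r, s') where a ghost was one step away:
w = {"bias": 0.0, "dist-to-closest-dot": 0.0, "num-ghosts-1-step-away": 0.0}
approx_q_update(w, {"bias": 1.0, "dist-to-closest-dot": 0.5, "num-ghosts-1-step-away": 1.0},
                r=-500.0, best_next_q=0.0)   # the ghost feature's weight goes strongly negative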

Minimizing Error*
Imagine we had only one point x, with features f(x), target value y, and weights w:
error(w) = ½ ( y - Σ_k w_k f_k(x) )²
∂ error(w) / ∂ w_m = - ( y - Σ_k w_k f_k(x) ) f_m(x)
w_m ← w_m + α ( y - Σ_k w_k f_k(x) ) f_m(x)
Approximate q update explained:
w_m ← w_m + α [ r + γ max_a Q(s', a) - Q(s, a) ] f_m(s, a)
That is, "target" minus "prediction", times the feature value.

Overfitting: Why Limiting Capacity Can Help*
(The slide shows a degree-15 polynomial fit wildly overshooting a handful of data points.)

Policy Search
Problem: often the feature-based policies that work well (win games, maximize utilities) aren't the ones that approximate V / Q best
E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
Q-learning's priority: get Q-values close (modeling)
Action selection priority: get the ordering of Q-values right (prediction)
We'll see this distinction between modeling and prediction again later in the course
Solution: learn policies that maximize rewards, not the values that predict them
Policy search: start with an ok solution (e.g. Q-learning), then fine-tune by hill climbing on feature weights

Policy Search
Simplest policy search (sketched in code below):
Start with an initial linear value function or Q-function
Nudge each feature weight up and down and see if your policy is better than before
Problems:
How do we tell the policy got better? Need to run many sample episodes!
If there are a lot of features, this can be impractical
Better methods exploit lookahead structure, sample wisely, change multiple parameters...

Conclusion
We're done with Part I: Search and Planning!
We've seen how AI methods can solve problems in:
Search
Constraint Satisfaction Problems
Games
Markov Decision Problems
Next up: Part II: Uncertainty and Learning!
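A minimal Python sketch of the simplest policy search described above: nudge each feature weight up and down, estimate policy quality by running sample episodes, and keep a change only if it helps. evaluate_policy is a hypothetical helper that returns the average reward of the policy induced by the weights.

def hill_climb_weights(w, evaluate_policy, step=0.1, passes=5):
    """Nudge each weight up and down; keep the nudge only if the policy improves."""
    best_score = evaluate_policy(w)          # expensive: runs many sample episodes
    for _ in range(passes):
        for name in list(w):
            for delta in (+step, -step):
                candidate = dict(w)
                candidate[name] += delta
                score = evaluate_policy(candidate)
                if score > best_score:
                    w, best_score = candidate, score
    return w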