CSE 573: Artificial Intelligence Reinforcement Learning


CSE 573: Artificial Intelligence — Reinforcement Learning. Dan Weld / University of Washington. [Many slides taken from Dan Klein and Pieter Abbeel's CS188 Intro to AI at UC Berkeley; materials available at http://ai.berkeley.edu.]

Logistics: PS 3 due today. PS 4 due in one week (Thurs 2/16). Research paper comments due on Tues; the paper itself will be on the Web calendar after class.

Reinforcement Learning

Reinforcement Learning: the agent (state s, actions a) interacts with the environment and receives rewards r. Basic idea: receive feedback in the form of rewards; the agent's utility is defined by the reward function; the agent must (learn to) act so as to maximize expected rewards. All learning is based on observed samples of outcomes!

Example: Animal Learning. RL has been studied experimentally for more than 60 years in psychology. Rewards: food, pain, hunger, drugs, etc. Mechanisms and sophistication debated. Example: foraging. Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies. Bees have a direct neural connection from the nectar intake measurement to the motor planning area.

Example: Backgammon. Reward only for win / loss in terminal states, zero otherwise. TD-Gammon learns a function approximation to V(s) using a neural network. Combined with depth 3 search, it was one of the top 3 players in the world. You could imagine training Pacman this way, but it's tricky! (It's also PS 4.)

Example: Learning to Walk [Kohl and Stone, ICRA 2004] Initial [Video: AIBO WALK initial]

Example: Learning to Walk [Kohl and Stone, ICRA 2004] Finished [Video: AIBO WALK finished]

Example: Sidewinding [Andrew Ng] [Video: SNAKE climbstep+sidewinding]


Parallel Parking. Few driving tasks are as intimidating as parallel parking. https://www.youtube.com/watch?v=pb_ify2jidi

Other Applications: Go playing; robotic control (helicopter maneuvering, autonomous vehicles); Mars rover (path planning, oversubscription planning); elevator planning; game playing (backgammon, Tetris, checkers); neuroscience; computational finance and sequential auctions; assisting the elderly in simple tasks; spoken dialog management; communication networks (switching, routing, flow control); war planning and evacuation planning.

Reinforcement Learning. Still assume a Markov decision process (MDP): a set of states s ∈ S; a set of actions (per state) A; a model T(s, a, s′); a reward function R(s, a, s′) & discount γ. Still looking for a policy π(s). New twist: we don't know T or R, i.e., we don't know which states are good or what the actions do. Must actually try out actions and states to learn.

Offline (MDPs) vs. Online (RL). Offline solution (planning) and Monte Carlo planning both work against a simulator; online learning (RL) gathers experience by acting in the real environment. Key differences: 1) in a simulator, dying is ok; 2) a simulator has a (re)set button.

Four Key Ideas for RL. 1) The credit-assignment problem: what was the real cause of a reward? 2) The exploration-exploitation tradeoff. 3) Model-based vs. model-free learning: what function is being learned? 4) Approximating the value function: smaller → easier to learn & better generalization.

Credit Assignment Problem

Exploration-Exploitation tradeoff. You have visited part of the state space and found a reward of 1: is this the best you can hope for? Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge, at the risk of missing out on a better reward somewhere? Exploration: should I look for states with more reward, at the risk of wasting time & collecting some negative reward?

Model-Based Learning

Model-Based Learning. Model-based idea: learn an approximate model based on experiences, then solve for values as if the learned model were correct. Step 1: Learn an empirical MDP model. Explore (e.g., move randomly); count outcomes s′ for each (s, a); normalize to give an estimate of T(s, a, s′); discover each R(s, a, s′) when we experience (s, a, s′). Step 2: Solve the learned MDP. For example, use value iteration, as before.

Example: Model-Based Learning. Random policy π; states A, B, C, D, E; assume γ = 1. Observed Episodes (Training): Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +1. Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +1. Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +1. Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -1. Learned model: T(s,a,s′): T(B, east, C) = 1.00, T(C, east, D) = 0.75, T(C, east, A) = 0.25. R(s,a,s′): R(B, east, C) = -1, R(C, east, D) = -1, R(D, exit, x) = +1.
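To make Step 1 concrete, here is a minimal sketch (not from the slides) that tallies the four training episodes above and normalizes the counts into estimates of T and R; the episode encoding and function names are illustrative assumptions.

```python
from collections import defaultdict

# Each episode is a list of (s, a, s_next, r) transitions, transcribed from the slide.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +1)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +1)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +1)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -1)],
]

def learn_model(episodes):
    """Count outcomes s' for each (s, a), normalize to estimate T, and record R."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = N
    rewards = {}                                     # rewards[(s, a, s')] = observed r
    for episode in episodes:
        for s, a, s_next, r in episode:
            counts[(s, a)][s_next] += 1
            rewards[(s, a, s_next)] = r
    T = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s_next, n in outcomes.items():
            T[(s, a, s_next)] = n / total
    return T, rewards

T_hat, R_hat = learn_model(episodes)
print(T_hat[("C", "east", "D")])   # 0.75, matching the slide
print(T_hat[("C", "east", "A")])   # 0.25
```

Step 2 would then run value iteration on (T_hat, R_hat) exactly as for a known MDP.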

Convergence. If the policy explores enough (doesn't starve any state), then T & R converge, so VI, PI, LAO*, etc. (using the Bellman equations) will find the optimal policy. When can the agent start exploiting? (We'll answer this question later.)

Two main reinforcement learning approaches. Model-based approaches: explore the environment & learn a model, T = P(s′ | s, a) and R(s, a), (almost) everywhere; use the model to plan a policy, MDP-style. This leads to the strongest theoretical results and often works well when the state space is manageable. Model-free approach: don't learn a model of T & R; instead, learn the Q-function (or policy) directly. Weaker theoretical results, but often works better when the state space is large.

Two main reinforcement learning approaches. Model-based approaches: learn T + R, which takes |S|²|A| + |S||A| parameters. Model-free approach: learn Q, which takes |S||A| parameters.
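As a rough illustration of the gap, with assumed sizes |S| = 1000 states and |A| = 10 actions (numbers chosen for this example, not from the slides):

```latex
\underbrace{|S|^2|A| + |S||A|}_{\text{model-based: } \hat{T},\ \hat{R}}
  = 1000^2 \cdot 10 + 1000 \cdot 10 \approx 10^{7}
\qquad\text{vs.}\qquad
\underbrace{|S||A|}_{\text{model-free: } Q}
  = 1000 \cdot 10 = 10^{4}
```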

Model-Free Learning

Nothing is Free in Life! What exactly is free? No model of T, no model of R (instead, just model Q).

Reminder: Q-Value Iteration. Forall s, a: initialize Q_0(s, a) = 0 (no time steps left means an expected reward of zero). k = 0. Repeat (do Bellman backups): for every (s, a) pair, Q_{k+1}(s, a) ← Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ max_{a′} Q_k(s′, a′) ]; k += 1. Until convergence, i.e., Q values don't change much. Note that V_k(s′) = max_{a′} Q_k(s′, a′) is easy to compute; the expectation over s′ is the part we can sample.
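For reference, the Bellman backup used above, written out in full (the slide's own equation image is not in the transcription; this is the standard form):

```latex
Q_{k+1}(s,a) \;\leftarrow\; \sum_{s'} T(s,a,s')\,\Big[ R(s,a,s') + \gamma \max_{a'} Q_k(s',a') \Big],
\qquad V_k(s') = \max_{a'} Q_k(s',a').
```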

Puzzle: Q-Learning. Same algorithm: forall s, a, initialize Q_0(s, a) = 0 (no time steps left means an expected reward of zero); k = 0; repeat Bellman backups for every (s, a) pair; k += 1; until convergence (Q values don't change much). Q: How can we compute the backup without R, T?!? A: Compute averages using sampled outcomes.

Simple Example: Expected Age. Goal: compute the expected age of CSE students. Known P(A): E[A] = Σ_a P(a) · a. Without P(A), instead collect samples [a_1, a_2, ..., a_N]. Unknown P(A), model-based: estimate P̂(a) = num(a)/N, then E[A] ≈ Σ_a P̂(a) · a (note: we never know the true P(age=22)). Why does this work? Because eventually you learn the right model. Unknown P(A), model-free: E[A] ≈ (1/N) Σ_i a_i. Why does this work? Because samples appear with the right frequencies.

Anytime Model-Free Expected Age. Goal: compute the expected age of CSE students without P(A); instead collect samples [a_1, a_2, ..., a_N] (unknown P(A): model-free). Running average with a fixed learning rate: let A = 0; loop for i = 1 to N: a_i ← ask "what is your age?"; A ← (1-α)·A + α·a_i. Exact running mean: let A = 0; loop for i = 1 to N: a_i ← ask "what is your age?"; A ← ((i-1)/i)·A + (1/i)·a_i.
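A minimal sketch of the two running averages above (the sample ages are made up for illustration):

```python
import random

ages = [random.randint(18, 30) for _ in range(1000)]  # made-up sample ages

# Running average with a fixed learning rate alpha (exponential moving average).
alpha = 0.05
A_ema = 0.0
for a_i in ages:
    A_ema = (1 - alpha) * A_ema + alpha * a_i

# Exact running mean: using alpha_i = 1/i reproduces the ordinary average.
A_mean = 0.0
for i, a_i in enumerate(ages, start=1):
    A_mean = ((i - 1) / i) * A_mean + (1.0 / i) * a_i

print(A_mean, sum(ages) / len(ages))  # these two match exactly
print(A_ema)                          # close, but weights recent samples more heavily
```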

Sampling Q-Values. Big idea: learn from every experience! Follow the exploration policy a ← π(s) and update Q(s,a) each time we experience a transition (s, a, s′, r). Likely outcomes s′ will contribute updates more often. Update towards a running average. Get a sample of Q(s,a): sample = R(s,a,s′) + γ max_{a′} Q(s′, a′). Update to Q(s,a): Q(s,a) ← (1-α) Q(s,a) + α · sample. Same update, rearranged: Q(s,a) ← Q(s,a) + α (sample − Q(s,a)) = Q(s,a) + α · difference, where difference = (R(s,a,s′) + γ max_{a′} Q(s′, a′)) − Q(s,a).

Q Learning. Forall s, a: initialize Q(s, a) = 0. Repeat forever: Where are you? s. Choose some action a. Execute it in the real world: observe (s, a, r, s′). Do update: difference ← [R(s,a,s′) + γ max_{a′} Q(s′, a′)] − Q(s,a); Q(s,a) ← Q(s,a) + α · difference.
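A minimal tabular sketch of this loop (the environment interface env.reset() / env.step(a) and the epsilon-greedy action choice are assumptions for illustration, not from the slides):

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning: act, observe (s, a, r, s'), and nudge Q(s,a) toward the sample."""
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Choose some action a (here: epsilon-greedy on the current Q).
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            # Execute it in the real world and observe the transition.
            s_next, r, done = env.step(a)
            # Do update: move Q(s,a) toward the sampled target r + gamma * max_a' Q(s',a').
            target = r if done else r + gamma * max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```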

Example. Assume γ = 1, α = 1/2. Gridworld with states A, B, C, D, E; all Q-values start at 0 except that the grid shows max_a Q(D, a) = 8. In state B, what should you do? Suppose (for now) we follow a random exploration policy → go east. Observed transition: B, east, C, -2. Sample = -2 + γ · max_a Q(C, a) = -2 + 0 = -2, so Q(B, east) ← ½ · 0 + ½ · (-2) = -1. Next observed transition: C, east, D, -2. Sample = -2 + γ · max_a Q(D, a) = -2 + 8 = 6, so Q(C, east) ← ½ · 0 + ½ · 6 = 3.

Q-Learning Properties. Q-learning converges to the optimal Q function (and hence learns the optimal policy) even if you're acting suboptimally! This is called off-policy learning. Caveats: you have to explore enough; you have to eventually shrink the learning rate α, but not decrease it too quickly; and if you want to act optimally, you have to switch from exploring to exploiting. [Demo: Q-learning auto cliff grid (L11D1)]
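The standard way to make "shrink α, but not too quickly" precise is the classical stochastic-approximation condition on the learning-rate schedule (stated here for reference; it is not spelled out on the slide):

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty
\qquad\text{and}\qquad
\sum_{t=1}^{\infty} \alpha_t^2 < \infty,
\qquad\text{e.g. } \alpha_t = \tfrac{1}{t}.
```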

Video of Demo Q-Learning Auto Cliff Grid

Q Learning. Forall s, a: initialize Q(s, a) = 0. Repeat forever: Where are you? s. Choose some action a. Execute it in the real world: observe (s, a, r, s′). Do update (as above).

Exploration vs. Exploitation


Questions. How to explore? Random exploration: uniform exploration; epsilon-greedy: with (small) probability ε, act randomly, and with (large) probability 1-ε, act on the current policy. Exploration functions (such as UCB). Thompson sampling. When to exploit? How to even think about this tradeoff?
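A minimal epsilon-greedy selector matching the description above (Q is assumed to be a dict keyed by (state, action), e.g. a defaultdict(float); names are illustrative):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With small probability epsilon act randomly; otherwise act greedily on the current Q."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit the current policy
```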

When to explore? Exploration Functions. Random actions explore a fixed amount. Better idea: explore areas whose badness is not (yet) established, and eventually stop exploring. An exploration function takes a value estimate u and a visit count n, and returns an optimistic utility; the regular Q-update then becomes a modified Q-update that uses this optimistic utility for the next state (see the equations below). Note: this propagates the bonus back to states that lead to unknown states as well!
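Written out, assuming the commonly used exploration function f(u, n) = u + k/n (the form of f and the constant k are assumptions; the slide's own equations are not in the transcription):

```latex
% Exploration function: optimistic utility from value estimate u and visit count n
f(u, n) = u + \frac{k}{n}

% Regular Q-update
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\Big[ R(s,a,s') + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]

% Modified Q-update: be optimistic about rarely tried (s', a') pairs
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\Big[ R(s,a,s') + \gamma \max_{a'} f\big(Q(s',a'),\, N(s',a')\big) - Q(s,a) \Big]
```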

Video of Demo Crawler Bot More demos at: http://inst.eecs.berkeley.edu/~ee128/fa11/videos.html

Approximate Q-Learning

Generalizing Across States. Basic Q-learning keeps a table of all Q-values. In realistic situations, we cannot possibly learn about every single state! Too many states to visit them all in training; too many states to hold the Q-table in memory. Instead, we want to generalize: learn about some small number of training states from experience, and generalize that experience to new, similar situations. This is a fundamental idea in machine learning, and we'll see it over and over again. [demo: RL pacman]

Example: Pacman. Let's say we discover through experience that this state is bad. In naïve Q-learning, we know nothing about this state:

Example: Pacman. Let's say we discover through experience that this state is bad. Or even this one!

Feature-Based Representations. Solution: describe a state using a vector of features (aka "properties"). Features are functions from states to real numbers (often 0/1) capturing important properties of the state. Example features: distance to closest ghost or dot; number of ghosts; 1 / (distance to dot)²; is Pacman in a tunnel? (0/1); etc.; is it the exact state on this slide? Can also describe a q-state (s, a) with features (e.g., action moves closer to food).

Linear Combination of Features. Using a feature representation, we can write a Q function (or value function) for any state using a few weights: Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + ... + w_n f_n(s, a). Advantage: our experience is summed up in a few powerful numbers. Disadvantage: states sharing features may actually have very different values!
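In display form (the standard linear value/Q-function approximation; the slide's equation image is not in the transcription):

```latex
V(s)   = w_1 f_1(s)   + w_2 f_2(s)   + \dots + w_n f_n(s)
\qquad
Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \dots + w_n f_n(s,a)
```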

Approximate Q-Learning. Q-learning with linear Q-functions. Exact Q's: Q(s,a) ← Q(s,a) + α · difference. Approximate Q's: forall i do w_i ← w_i + α · difference · f_i(s, a), where difference = [R(s,a,s′) + γ max_{a′} Q(s′, a′)] − Q(s, a). Intuitive interpretation: adjust the weights of active features. E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features. Formal justification: in a few slides!
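A minimal sketch of that weight update (the feature extractor features(s, a), which returns a dict of feature values, is an assumed interface for illustration, not from the slides):

```python
def approx_q_update(w, features, s, a, r, s_next, next_actions, alpha=0.05, gamma=0.9):
    """One approximate Q-learning step with a linear Q-function: Q(s,a) = sum_i w_i * f_i(s,a)."""
    def q(state, action):
        return sum(w.get(name, 0.0) * value for name, value in features(state, action).items())

    # difference = [r + gamma * max_a' Q(s', a')] - Q(s, a)
    target = r + gamma * max(q(s_next, a2) for a2 in next_actions) if next_actions else r
    difference = target - q(s, a)

    # Adjust the weights of the active features in proportion to their values.
    for name, value in features(s, a).items():
        w[name] = w.get(name, 0.0) + alpha * difference * value
    return w
```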

Q Learning. Forall s, a: initialize Q(s, a) = 0. Repeat forever: Where are you? s. Choose some action a. Execute it in the real world: observe (s, a, r, s′). Do update: difference ← [R(s,a,s′) + γ max_{a′} Q(s′, a′)] − Q(s,a); Q(s,a) ← Q(s,a) + α · difference.

Approximate Q Learning. Forall i: initialize w_i = 0. Repeat forever: Where are you? s. Choose some action a. Execute it in the real world: observe (s, a, r, s′). Do update: difference ← [R(s,a,s′) + γ max_{a′} Q(s′, a′)] − Q(s,a); forall i: w_i ← w_i + α · difference · f_i(s, a).