Learning and Planning with Tabular Methods


Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Learning and Planning with Tabular Methods Lecture 6, CMU 10703 Katerina Fragkiadaki

What can I learn by interacting with the world? Previous week: the agent learned to estimate value functions and optimal policies from experience. [Figure: agent-environment interaction loop: state S_t, action A_t, reward R_t]

Model-free RL. [Figure: the learning-planning-acting loop (interaction with environment → experience → direct RL methods → value function → greedification → policy), with the model learning / simulation / planning path unused.]

What can I learn by interacting with the world? Two weeks ago: we did not interact with the environment! We knew the true environment (dynamics and rewards) and just used it to plan and estimate value functions v_*, q_* (value iteration, policy iteration using exhaustive state sweeps of Bellman backup operations, which is very slow when there are many states). Planning: any computational process that uses a model to create or improve a policy. [Figure: model → planning → policy]

Planning. [Figure: the same loop, with the model → planning → value function / policy path highlighted.]

What can I learn by interacting with the world? This lecture: model-based RL, where we combine learning from experience and planning: 1. If the model is unknown, we will learn the model. 2. We will learn value functions using both real and simulated experience. 3. We will learn value functions online using model-based lookahead search.

Model-based RL. [Figure: the same loop, with both the direct-RL path and the model learning / planning path active.]

Advantages of Model-Based RL. Advantages: model learning transfers across tasks and environment configurations (learning physics); it better exploits experience in the case of sparse rewards; it is probably what the brain does (more to come); it helps exploration, since we can reason about model uncertainty. Disadvantages: we first learn a model and then construct a value function, which gives two sources of approximation error.

What is a Model? A model is anything the agent can use to predict how the environment will respond to its actions; concretely, the transition (dynamics) function T(s'|s,a) and the reward function R(s,a). This includes transitions of the state of the environment and of the state of the agent.

What is a Model? A model is anything the agent can use to predict how the environment will respond to its actions, concretely: 1. the transition function (dynamics), 2. the reward function. Distribution model: a description of all possibilities and their probabilities, T(s'|s,a) for all (s, a, s'). Sample model, a.k.a. a simulation model: produces sample experiences for a given (s, a); often much easier to come by. Both types of models can be used to produce hypothetical experience ("what if...").
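To make the distinction concrete, here is a minimal sketch (my own illustration, not from the lecture) contrasting the two model types for a small tabular MDP; the class and method names are hypothetical.

```python
import random

class DistributionModel:
    """Distribution model: exposes T(s'|s,a) and the expected reward for every (s, a)."""
    def __init__(self, transition_probs, rewards):
        # transition_probs[(s, a)] = {s_next: probability}
        # rewards[(s, a)] = expected immediate reward
        self.transition_probs = transition_probs
        self.rewards = rewards

    def probs(self, s, a):
        return self.transition_probs[(s, a)], self.rewards[(s, a)]

class SampleModel:
    """Sample (simulation) model: only produces sampled experience (r, s') for a given (s, a)."""
    def __init__(self, distribution_model):
        self.dm = distribution_model

    def sample(self, s, a):
        probs, reward = self.dm.probs(s, a)
        next_states, weights = zip(*probs.items())
        s_next = random.choices(next_states, weights=weights)[0]
        return reward, s_next
```

Both interfaces can generate hypothetical experience; the sample model never needs the full probability table, which is why it is often easier to come by.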

Model Learning. 1. If the model is unknown, we will learn the model. 2. Learn value functions using both real and simulated experience. 3. Learn value functions online using model-based lookahead search.

Model Learning. Goal: estimate a model M from the experience {S_1, A_1, R_2, ..., S_T}. This can be thought of as a supervised learning problem: S_1, A_1 → R_2, S_2; S_2, A_2 → R_3, S_3; ...; S_{T-1}, A_{T-1} → R_T, S_T. Learning s, a → r is a regression problem; learning s, a → s' is a density estimation problem. Pick a loss function (e.g. mean-squared error, KL divergence) and find the parameters that minimize the empirical loss.

Examples of Models for T(s'|s,a). Table lookup model (tabular): bookkeeping a probability of occurrence for each transition (s, a, s'). Alternatively, the transition function can be approximated with some function approximator.

A supervised learning problem? To look far ahead into the future you need to chain your dynamics predictions. Data is sequential: i.i.d. assumptions break, and errors accumulate over time! Solutions: hierarchical dynamics models, local linear approximations, etc. (later lectures).

Examples of Models for T(s'|s,a). Table lookup model (tabular): bookkeeping a probability of occurrence for each transition (s, a, s') [this lecture]. Transition function approximated through some features and a function approximator [later].

[Figure: the same learning-planning-acting loop, with the experience → model learning → model path highlighted.]

Table Lookup Model. The model is an explicit MDP, ⟨T̂, R̂⟩. Count the visits N(s,a) to each state-action pair:
T̂(s'|s,a) = (1/N(s,a)) Σ_{t=1}^{T} 1(S_t, A_t, S_{t+1} = s, a, s')
R̂(s,a) = (1/N(s,a)) Σ_{t=1}^{T} 1(S_t, A_t = s, a) R_{t+1}
Alternatively: at each time step t, record the experience tuple ⟨S_t, A_t, R_{t+1}, S_{t+1}⟩; to sample the model, randomly pick a tuple matching ⟨s, a, ·, ·⟩.

A Simple Example. Two states A, B; no discounting; 8 episodes of experience: A,0,B,0; B,1; B,1; B,1; B,1; B,1; B,1; B,0. From this experience we construct a table lookup model: A goes to B with probability 1 and reward 0, and from B the episode terminates with reward 1 with probability 0.75 and reward 0 with probability 0.25.
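A minimal sketch (illustrative code, not the lecture's) of the count-based table lookup model from the previous slide, applied to this two-state example; the dummy action name 'step' and the terminal marker 'end' are assumptions.

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Count-based tabular model: T_hat(s'|s,a) and R_hat(s,a) from visit counts."""
    def __init__(self):
        self.n = defaultdict(int)            # N(s, a)
        self.next_counts = defaultdict(int)  # N(s, a, s')
        self.reward_sum = defaultdict(float) # sum of rewards observed at (s, a)
        self.transitions = defaultdict(list) # alternative: store the raw tuples

    def update(self, s, a, r, s_next):
        self.n[(s, a)] += 1
        self.next_counts[(s, a, s_next)] += 1
        self.reward_sum[(s, a)] += r
        self.transitions[(s, a)].append((r, s_next))

    def t_hat(self, s, a, s_next):
        return self.next_counts[(s, a, s_next)] / self.n[(s, a)]

    def r_hat(self, s, a):
        return self.reward_sum[(s, a)] / self.n[(s, a)]

    def sample(self, s, a):
        # Sample model: randomly pick a stored tuple matching (s, a, ., .).
        return random.choice(self.transitions[(s, a)])

# The eight episodes from the slide, as (state, action, reward, next_state) tuples.
model = TableLookupModel()
episodes = [[('A', 'step', 0, 'B'), ('B', 'step', 0, 'end')]] + \
           [[('B', 'step', 1, 'end')]] * 6 + [[('B', 'step', 0, 'end')]]
for episode in episodes:
    for s, a, r, s_next in episode:
        model.update(s, a, r, s_next)

print(model.t_hat('A', 'step', 'B'))  # 1.0
print(model.r_hat('B', 'step'))       # 0.75 (6 of the 8 transitions from B had reward 1)
```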

Planning with a Model. Given a model M = ⟨T̂, R̂⟩, solve the MDP ⟨S, A, T̂, R̂⟩ using your favorite planning algorithm: value iteration, policy iteration, tree search (curse of dimensionality!), or sample-based planning (right next).

Sample-based Planning. Use the model only to generate samples, not its explicit transition probabilities and expected immediate rewards. Sample experience from the model, S_{t+1} ~ T̂(S_{t+1} | S_t, A_t), R_{t+1} = R̂(R_{t+1} | S_t, A_t), and apply model-free RL to the samples, e.g. Monte-Carlo control, Sarsa, or Q-learning. Sample-based planning methods are often more efficient: rather than exhaustive state sweeps, we focus on what is likely to happen.
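A minimal sketch of sample-based planning under assumptions: `model.sample(s, a)` is the sample model from the sketch above (returning `(r, s_next)`), `states` and `actions` are the tabular sets for which the model has data, and the hyperparameters are illustrative.

```python
import random
from collections import defaultdict

def sample_based_planning(model, states, actions, n_updates=10000,
                          alpha=0.1, gamma=1.0, terminal='end'):
    """Plan by running Q-learning on experience sampled from the model only."""
    Q = defaultdict(float)
    for _ in range(n_updates):
        # Pick an arbitrary (s, a) pair for which the model has data.
        s = random.choice(states)
        a = random.choice(actions)
        r, s_next = model.sample(s, a)            # simulated experience
        if s_next == terminal:
            target = r
        else:
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # standard Q-learning backup
    return Q

# e.g. Q = sample_based_planning(model, states=['A', 'B'], actions=['step'])
```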

Sample-based planning. [Figure: the same loop, with the model → simulation → planning path highlighted.]

A Simple Example. Construct a table-lookup model from real experience, then apply model-free RL to experience sampled from it. Real experience: A,0,B,0; B,1; B,1; B,1; B,1; B,1; B,1; B,0. Sampled experience: B,1; B,0; B,1; A,0,B,1; B,1; A,0,B,1; B,1; B,0. E.g. Monte-Carlo learning on the sampled experience gives v(A) = 1 (both sampled episodes from A return 1) and v(B) = 0.75 (6 of the 8 sampled returns from B are 1).
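As a sanity check, this tiny sketch (illustrative, not from the lecture) runs every-visit Monte-Carlo evaluation on the eight sampled episodes listed above and recovers v(A) = 1 and v(B) = 0.75.

```python
from collections import defaultdict

# Sampled episodes from the slide, as lists of (state, reward) pairs (no discounting).
episodes = [[('B', 1)], [('B', 0)], [('B', 1)], [('A', 0), ('B', 1)],
            [('B', 1)], [('A', 0), ('B', 1)], [('B', 1)], [('B', 0)]]

returns, visits = defaultdict(float), defaultdict(int)
for episode in episodes:
    rewards = [r for _, r in episode]
    for i, (s, _) in enumerate(episode):
        returns[s] += sum(rewards[i:])  # undiscounted return from this visit onward
        visits[s] += 1

v = {s: returns[s] / visits[s] for s in returns}
print(v)  # {'B': 0.75, 'A': 1.0}
```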

Planning with an Inaccurate Model. Given an imperfect model ⟨T̂, R̂⟩ ≠ ⟨T, R⟩, the performance of model-based RL is limited to the optimal policy of the approximate MDP ⟨S, A, T̂, R̂⟩, i.e. model-based RL is only as good as the estimated model. When the model is inaccurate, the planning process will compute a suboptimal policy. Solution 1: when the model is wrong, use model-free RL. Solution 2: reason explicitly about model uncertainty.

Combine real and simulated experience. 1. If the model is unknown, we will learn the model. 2. Learn value functions using both real and simulated experience. 3. Learn value functions online using model-based lookahead search.

Real and Simulated Experience. We consider two sources of experience. Real experience, sampled from the environment (the true MDP): S' ~ T(s'|s,a), R = r(s,a). Simulated experience, sampled from the model (the approximate MDP): S' ~ T̂(S'|S,A), R = R̂(R|S,A).

Integrating Learning and Planning. Model-free RL: no model; learn a value function (and/or policy) from real experience. Model-based RL (using sample-based planning): learn a model from real experience; plan a value function (and/or policy) from simulated experience. Dyna: learn a model from real experience; learn and plan a value function (and/or policy) from both real and simulated experience.

Dyna. [Figure: the same loop with all paths active: direct RL from real experience, model learning, and planning from simulated experience.]

Dyna-Q Algorithm
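The Dyna-Q pseudocode itself is in the slide figure; below is a minimal Python rendering of the standard algorithm (as in Sutton and Barto), under assumptions: a hypothetical `env` exposing `reset()` and `step(s, a) -> (r, s_next, done)`, an epsilon-greedy behaviour policy, and a deterministic environment so the model can store a single outcome per (s, a).

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_episodes=50, n_planning=50,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: direct RL from each real transition, plus n_planning
    simulated Q-learning updates from a learned deterministic model."""
    Q = defaultdict(float)
    model = {}                       # (s, a) -> (r, s_next, done)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # (a) epsilon-greedy action from the current Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            # (b) act in the real environment
            r, s_next, done = env.step(s, a)
            # (c) direct RL: one-step Q-learning on the real transition
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (d) model learning: remember the observed outcome of (s, a)
            model[(s, a)] = (r, s_next, done)
            # (e) planning: Q-learning on transitions replayed from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
                ptarget = pr if pdone else pr + gamma * max(Q[(ps_next, b)] for b in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q
```

Setting `n_planning=0` recovers plain one-step Q-learning, which is the comparison made on the maze slides that follow.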

Dyna-Q on a Simple Maze

Midway in the 2nd episode. [Figure: the simple maze (S = start, G = goal) showing the policies learned midway through the second episode, without planning (n = 0) and with planning (n = 50).]

Dyna-Q with an Inaccurate Model. The changed environment is harder.

Dyna-Q with an Inaccurate Model (Cont.). The changed environment is easier.

Sampling-based lookahead search. 1. If the model is unknown, we will learn the model. 2. Learn value functions using both real and simulated experience. 3. Learn value functions online using model-based lookahead search.

Forward Search. [Figure: the same loop, where planning now outputs an action from the given state s rather than a full policy.]

Forward Search. Prioritizes the state the agent is currently in! Uses a model of the MDP to look ahead (exhaustively), building a search tree with the current state s_t at the root. Focuses on the sub-MDP starting from now, which is often dramatically easier than solving the whole MDP. [Figure: lookahead search tree rooted at s_t.]

Why Forward Search? Why don't we learn a value function directly for every state offline, so that we do not waste time online? Because the environment has many, many states (consider Go, ~10^170; chess, ~10^48; the real world...). It is very hard to compute a good value function for each one of them, and most you will never even visit. Thus it makes sense, conditioned on the current state you are in, to estimate the value function of the relevant part of the state space online, focusing your resources, and to use the online forward search to pick the best action. Disadvantage: nothing is learnt from episode to episode.

Simulation-based Search I. Forward search paradigm using sample-based planning: simulate episodes of experience starting from now with the model, and apply model-free RL to the simulated episodes. [Figure: simulated trajectories branching out from s_t.]

Simulation-based Search II. Simulate episodes of experience starting from now with the model: {S_t^k, A_t^k, R_{t+1}^k, ..., S_T^k}_{k=1}^K ~ M. Apply model-free RL to the simulated episodes: Monte-Carlo control → Monte-Carlo search.

Simple Monte-Carlo Search. Given a model M and a simulation policy π: for each action a ∈ A, simulate K episodes from the current (real) state s_t: {s_t, a, R_{t+1}^k, S_{t+1}^k, A_{t+1}^k, ..., S_T^k}_{k=1}^K ~ M, π. Evaluate the action value of the root by the mean return (Monte-Carlo evaluation): Q(s_t, a) = (1/K) Σ_{k=1}^K G_t^k → q_π(s_t, a) (in probability). Select the current (real) action with maximum value: a_t = argmax_{a∈A} Q(s_t, a).
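A minimal sketch of simple Monte-Carlo search under assumptions: `model.sample(s, a)` is a hypothetical sample model returning `(r, s_next, done)`, and `simulation_policy(s)` is a hypothetical fixed rollout policy returning an action.

```python
def simple_mc_search(model, actions, simulation_policy, s_t,
                     n_sims=100, gamma=1.0, max_depth=200):
    """Evaluate each root action by the mean return of rollouts under the
    fixed simulation policy, then act greedily at the root."""
    def rollout(s, a):
        ret, discount, depth = 0.0, 1.0, 0
        while depth < max_depth:
            r, s, done = model.sample(s, a)
            ret += discount * r
            if done:
                break
            discount *= gamma
            a = simulation_policy(s)   # default policy after the forced first action
            depth += 1
        return ret

    # Q(s_t, a) = mean return over K simulations starting with action a.
    q = {a: sum(rollout(s_t, a) for _ in range(n_sims)) / n_sims for a in actions}
    return max(q, key=q.get)           # a_t = argmax_a Q(s_t, a)
```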

Monte-Carlo Tree Search (Evaluation). Given a model M, simulate K episodes from the current state s_t using the current simulation policy π: {s_t, A_t^k, R_{t+1}^k, S_{t+1}^k, ..., S_T^k}_{k=1}^K ~ M, π. Build a search tree containing the visited states and actions. Evaluate Q(s, a) by the mean return of episodes from (s, a), for all states and actions in the tree: Q(s, a) = (1/N(s,a)) Σ_{k=1}^K Σ_{u=t}^T 1(S_u^k, A_u^k = s, a) G_u^k → q_π(s, a) (in probability). After the search is finished, select the current (real) action with maximum value in the search tree: a_t = argmax_{a∈A} Q(s_t, a).

Monte-Carlo Tree Search (Simulation). In MCTS, the simulation policy improves. Each simulation consists of two phases (in-tree, out-of-tree): the tree policy (improves) picks actions to maximize Q(s, a); the default policy (fixed) picks actions randomly. Repeat (for each simulation): evaluate states Q(s, a) by Monte-Carlo evaluation; improve the tree policy, e.g. by ε-greedy(Q). This is Monte-Carlo control applied to simulated experience, and it converges on the optimal search tree: Q(S, A) → q_*(S, A).
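A compact sketch of MCTS under assumptions: a hypothetical `model.sample(s, a) -> (r, s_next, done)`, a uniform-random default policy, and a UCB1-style tree policy (a standard concrete choice for the improving in-tree policy described above) with an illustrative exploration constant `c`.

```python
import math
import random
from collections import defaultdict

def mcts(model, actions, s_root, n_sims=1000, gamma=1.0, c=1.4, max_depth=100):
    """MCTS sketch: UCB1 tree policy inside the tree (improves with visit counts),
    random default policy outside it, mean-return estimates Q(s, a) backed up."""
    Q = defaultdict(float)   # running mean return from (s, a)
    N = defaultdict(int)     # N(s, a)
    Ns = defaultdict(int)    # N(s)
    tree = {s_root}          # states currently in the search tree

    def tree_policy(s):
        # Exploit high Q(s, a) but explore rarely tried actions.
        return max(actions, key=lambda a: Q[(s, a)] +
                   c * math.sqrt(math.log(Ns[s] + 1) / (N[(s, a)] + 1e-8)))

    for _ in range(n_sims):
        s, done, steps, to_expand = s_root, False, [], None
        while not done and len(steps) < max_depth:
            if s in tree:
                a = tree_policy(s)             # in-tree phase (improves)
            else:
                if to_expand is None:
                    to_expand = s              # expand one new node per simulation
                a = random.choice(actions)     # out-of-tree default policy (fixed)
            r, s_next, done = model.sample(s, a)
            steps.append((s, a, r))
            s = s_next
        if to_expand is not None:
            tree.add(to_expand)
        g = 0.0                                # back up suffix returns along the path
        for s, a, r in reversed(steps):
            g = r + gamma * g
            if s in tree:
                Ns[s] += 1
                N[(s, a)] += 1
                Q[(s, a)] += (g - Q[(s, a)]) / N[(s, a)]   # incremental mean return
    return max(actions, key=lambda a: Q[(s_root, a)])      # real action at the root
```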

Case Study: the Game of Go The ancient oriental game of Go is 2500 years old Considered to be the hardest classic board game Considered a grand challenge task for AI (John McCarthy) Traditional game-tree search has failed in Go

Rules of Go Usually played on 19x19, also 13x13 or 9x9 board Simple rules, complex strategy Black and white place down stones alternately Surrounded stones are captured and removed The player with more territory wins the game

Position Evaluation in Go. How good is a position s? Reward function (undiscounted): R_t = 0 for all non-terminal steps t < T; R_T = 1 if Black wins, 0 if White wins. The policy π = ⟨π_B, π_W⟩ selects moves for both players. Value function (how good is position s): v_π(s) = E_π[R_T | S = s] = P[Black wins | S = s]; v_*(s) = max_{π_B} min_{π_W} v_π(s).

Monte-Carlo Evaluation in Go. [Figure: from the current position s, four simulated rollouts with outcomes 1, 1, 0, 0, giving V(s) = 2/4 = 0.5.]

Applying Monte-Carlo Tree Search. [Figures: a sequence of diagrams illustrating successive MCTS iterations on the search tree.]

Advantages of MC Tree Search: highly selective best-first search; evaluates states dynamically (unlike e.g. DP); uses sampling to break the curse of dimensionality; computationally efficient, anytime, and parallelizable.

Combining offline and online value function estimation. Use policy networks to provide priors on Q(s,a): a_t = argmax_a (Q(s_t, a) + u(s_t, a)), where u(s, a) ∝ P(s, a) / (1 + N(s, a)) and P(s, a) = π(a|s), the policy network's probability of a in s. Use fast, lightweight policy networks for rollouts (instead of a random policy). Use a value function approximation computed offline to evaluate leaf nodes in the tree: v(s_L) = (1 − λ) v_θ(s_L) + λ z_L, where v_θ is the offline value network and z_L is the rollout outcome.
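A minimal sketch of the two combination ideas on this slide: prior-weighted action selection at a tree node and mixed leaf evaluation. `policy_prior`, `value_net`, and `rollout_return` are hypothetical callables standing in for the offline-trained networks and the fast rollout policy, `node_stats` maps (s, a) to (Q, N), and the proportionality constant in u(s, a) is omitted.

```python
def select_action(node_stats, actions, policy_prior, s):
    """Pick a_t = argmax_a Q(s,a) + u(s,a), with u(s,a) proportional to
    P(s,a) / (1 + N(s,a)), where P(s,a) is the policy-network prior."""
    def score(a):
        q, n = node_stats.get((s, a), (0.0, 0))
        u = policy_prior(s, a) / (1.0 + n)
        return q + u
    return max(actions, key=score)

def evaluate_leaf(s_leaf, value_net, rollout_return, lam=0.5):
    """Mixed leaf evaluation: v(s_L) = (1 - lambda) * v_theta(s_L) + lambda * z_L,
    blending the offline value network with the return of a fast rollout."""
    return (1.0 - lam) * value_net(s_leaf) + lam * rollout_return(s_leaf)
```

The prior bonus u(s, a) steers early simulations toward moves the policy network likes and decays as N(s, a) grows, so online search statistics eventually dominate.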

Combining offline and online value function estimation