Reinforcement Learning


Reinforcement Learning: Policy Optimization and Planning (material not examinable). Subramanian Ramamoorthy, School of Informatics, 31 March 2017

Plan for Lecture: Policies and Plans. Policy optimization: policies can be optimized directly, without learning value functions; policy-gradient methods; special case: how we could learn with real-valued (continuous) actions. Planning: uses of environment models; integration of planning, learning, and execution; model-based reinforcement learning.

Policy-gradient methods (note: slightly different notation in this section, following the 2nd ed. of Sutton & Barto)

Approaches to control: 1. Previous approach, action-value methods: learn the value of each state-action pair; usually pick the max. 2. New approach, policy-gradient methods: learn the parameters of a stochastic policy and update them by gradient ascent in performance; includes actor-critic methods, which learn both value and policy parameters.

Actor-critic architecture. [diagram: actor-critic system interacting with the world]

Why Approximate Policies rather than Values? In many problems, the policy is simpler to approximate than the value function. In many problems, the optimal policy is stochastic (e.g., bluffing, POMDPs). To enable smoother change in policies. To avoid a search on every step (the max). To better relate to biology.

Policy Approximation. A policy is a function from state to action. How does the agent select actions? In such a way that it can be affected by learning? In such a way as to assure exploration? Approximation: there are too many states and/or actions to represent all policies, and it lets us handle large/continuous action spaces.

Gradient Bandit Algorithm
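The slide's equations did not survive transcription. For reference, here is a minimal sketch of the gradient-bandit algorithm as described in Sutton & Barto (Sec. 2.8): softmax action preferences updated by stochastic gradient ascent against a running reward baseline. The `bandit_pull` interface is an assumption for illustration.

```python
import numpy as np

def gradient_bandit(bandit_pull, k, n_steps, alpha=0.1):
    """k-armed gradient bandit; bandit_pull(a) -> reward is assumed."""
    H = np.zeros(k)                                # action preferences
    baseline, t = 0.0, 0
    for _ in range(n_steps):
        pi = np.exp(H - H.max())                   # softmax over preferences
        pi /= pi.sum()
        a = np.random.choice(k, p=pi)
        r = bandit_pull(a)
        t += 1
        baseline += (r - baseline) / t             # incremental average reward
        onehot = np.zeros(k)
        onehot[a] = 1.0
        # raise the chosen arm's preference if r beats the baseline, lower it otherwise
        H += alpha * (r - baseline) * (onehot - pi)
    return H
```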

Core Principle: Policy Gradient Methods. A parameterized policy selects actions without consulting a value function; a value function can still be used to learn the policy weights, but is not needed for action selection. Gradient ascent on a performance measure $\eta(\theta)$ with respect to the policy weights: $\theta_{t+1} = \theta_t + \alpha \widehat{\nabla \eta(\theta_t)}$, where the update's expectation approximates the true gradient (hence "policy gradient").

Linear-exponential policies (discrete actions). [slide annotation: a factor to modulate the TD update, going beyond TD(0) to TD(λ)]
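The formulas themselves were not transcribed; the standard linear-exponential (softmax) form from Sutton & Barto's 2nd edition, with feature vector $\phi(s,a)$, is:

$$\pi(a \mid s, \theta) = \frac{e^{\theta^\top \phi(s,a)}}{\sum_b e^{\theta^\top \phi(s,b)}}, \qquad \nabla_\theta \ln \pi(a \mid s, \theta) = \phi(s,a) - \sum_b \pi(b \mid s, \theta)\, \phi(s,b).$$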

e.g., linear-Gaussian policies (continuous actions): the action probability density is Gaussian, with μ and σ linear in the state. [figure: Gaussian density over actions]

e.g., linear-Gaussian policies (continuous actions), continued.
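The slide's equations were not transcribed; the usual form (as in Sutton & Barto, where σ is kept positive by exponentiating a linear function rather than being strictly linear) is:

$$\pi(a \mid s, \theta) = \frac{1}{\sigma(s,\theta)\sqrt{2\pi}} \exp\!\left(-\frac{(a-\mu(s,\theta))^2}{2\sigma(s,\theta)^2}\right), \qquad \mu(s,\theta) = \theta_\mu^\top \phi(s), \quad \sigma(s,\theta) = \exp\!\left(\theta_\sigma^\top \phi(s)\right).$$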

Gaussian eligibility functions
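The formulas did not survive transcription; for the Gaussian policy above, the standard eligibility (score) functions are:

$$\nabla_{\theta_\mu} \ln \pi(a \mid s, \theta) = \frac{a - \mu(s,\theta)}{\sigma(s,\theta)^2}\, \phi(s), \qquad \nabla_{\theta_\sigma} \ln \pi(a \mid s, \theta) = \left(\frac{(a - \mu(s,\theta))^2}{\sigma(s,\theta)^2} - 1\right) \phi(s).$$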

Policy Gradient Setup
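The setup equations were not transcribed; the key result is the policy gradient theorem, which in Sutton & Barto's notation relates the gradient of performance to the on-policy state distribution $d_\pi$:

$$\nabla \eta(\theta) \propto \sum_s d_\pi(s) \sum_a q_\pi(s,a)\, \nabla_\theta \pi(a \mid s, \theta).$$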

REINFORCE: Monte-Carlo Policy Gradient, from the Policy Gradient Theorem
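The algorithm itself was not transcribed; below is a minimal sketch of episodic REINFORCE with the linear-exponential (softmax) policy above. The `env` interface (reset() -> state, step(action) -> (next_state, reward, done)) and the feature function `phi(state, action) -> vector` are assumptions for illustration.

```python
import numpy as np

def softmax_probs(theta, state, actions, phi):
    """pi(a|s, theta) proportional to exp(theta . phi(s, a))."""
    prefs = np.array([theta @ phi(state, a) for a in actions])
    prefs -= prefs.max()                       # numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

def reinforce_episode(env, theta, actions, phi, alpha=0.01, gamma=1.0):
    # Generate one episode following pi(.|., theta).
    states, acts, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        probs = softmax_probs(theta, s, actions, phi)
        a = np.random.choice(len(actions), p=probs)
        s_next, r, done = env.step(actions[a])
        states.append(s); acts.append(a); rewards.append(r)
        s = s_next
    # Monte-Carlo policy-gradient updates: theta += alpha * G_t * grad log pi.
    G = 0.0
    for t in reversed(range(len(states))):
        G = rewards[t] + gamma * G             # return from time t
        probs = softmax_probs(theta, states[t], actions, phi)
        feats = np.array([phi(states[t], a) for a in actions])
        grad_log_pi = feats[acts[t]] - probs @ feats   # eligibility vector
        theta += alpha * (gamma ** t) * G * grad_log_pi
    return theta
```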

The generality of the policy-gradient strategy: it can be applied whenever we can compute the effect of parameter changes on the action probabilities; e.g., it has been applied to spiking neuron models. There are many possibilities other than linear-exponential and linear-Gaussian: e.g., a mixture of random, argmax, and fixed-width Gaussian (learn the mixing weights), or drift/diffusion models.

Planning

Paths to a Policy

Schematic

Models. Model: anything the agent can use to predict how the environment will respond to its actions. Distribution model: a description of all possibilities and their probabilities, e.g., $P^a_{ss'}$ and $R^a_{ss'}$ for all $s$, $s'$, and $a \in \mathcal{A}(s)$. Sample model: produces sample experiences, e.g., a simulation model. Both types of models can be used to produce simulated experience. Often sample models are much easier to come by.
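A small sketch may make the distinction concrete. Under an assumed tabular representation (the class names and layout are illustrative, not from the lecture), a distribution model stores the probabilities explicitly, while a sample model only draws from them:

```python
import random

class DistributionModel:
    """Explicit p(s', r | s, a), in the spirit of P^a_{ss'} and R^a_{ss'}."""
    def __init__(self, outcomes):
        # outcomes[(s, a)] = list of ((s_next, reward), probability)
        self.outcomes = outcomes

    def expected_backup(self, s, a, value, gamma=1.0):
        # full expectation over all possible next states and rewards
        return sum(p * (r + gamma * value(s2))
                   for (s2, r), p in self.outcomes[(s, a)])

class SampleModel:
    """Produces sample transitions, like a simulator."""
    def __init__(self, outcomes):
        self.outcomes = outcomes

    def sample(self, s, a):
        pairs, weights = zip(*self.outcomes[(s, a)])
        return random.choices(pairs, weights=weights)[0]   # (s_next, reward)
```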

Planning. Planning: any computational process that uses a model to create or improve a policy. Planning in AI: state-space planning and plan-space planning (e.g., partial-order planners). We take the following (unusual) view: all state-space planning methods involve computing value functions, either explicitly or implicitly, and they all apply backups to simulated experience: model → simulated experience → backups → values → policy.

Planning, continued. Classical DP methods are state-space planning methods. Heuristic search methods are state-space planning methods. A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning.
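A minimal sketch of Random-Sample One-Step Tabular Q-Planning, assuming a sample model as above and finite `states`/`actions` lists:

```python
import random
from collections import defaultdict

def q_planning(model, states, actions, n_updates, alpha=0.1, gamma=0.95):
    Q = defaultdict(float)
    for _ in range(n_updates):
        s = random.choice(states)            # 1. pick a state-action pair at random
        a = random.choice(actions)
        s2, r = model.sample(s, a)           # 2. sample a transition from the model
        best = max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])   # 3. one-step Q-backup
    return Q
```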

Paths to a Policy: Dyna

Learning, Planning, and Acting. Two uses of real experience: model learning, to improve the model; and direct RL, to directly improve the value function and policy. Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL; here, we call it planning. [diagram: value/policy → acting → experience; experience feeds both direct RL and model learning; the model feeds planning, which updates the value/policy]

Direct vs. Indirect RL. Indirect methods make fuller use of experience: they get a better policy with fewer environment interactions. Direct methods are simpler and not affected by bad models. But the two are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel.

The Dyna Architecture (Sutton 1990). [diagram: the policy/value functions receive direct RL updates from real experience and planning updates from simulated experience; real experience also drives model learning; the model generates simulated experience under search control; real experience comes from the environment]

The Dyna-Q Algorithm: each step interleaves direct RL, model learning, and planning. [pseudocode not transcribed]
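Since the pseudocode did not survive, here is a minimal sketch of tabular Dyna-Q under the same assumed `env` interface as before. `n_planning` plays the role of N in the maze experiments that follow, and the model assumes deterministic transitions, as in the book's version.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_episodes, n_planning=50,
           alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s_next)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # eps-greedy action selection
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            # (a) direct RL: one-step tabular Q-learning
            best = max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
            # (b) model learning: record the observed transition
            model[(s, a)] = (r, s2)
            # (c) planning: replay randomly chosen model transitions
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                pbest = max(Q[(ps2, b)] for b in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * pbest - Q[(ps, pa)])
            s = s2
    return Q
```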

Dyna-Q on a Simple Maze: rewards are 0 until the goal is reached, when the reward is 1.

Dyna-Q Snapshots, Midway in 2nd Episode. [figure: two maze panels with start S and goal G, comparing policies without planning (N = 0) and with planning (N = 50)]

When the Model is Wrong: the changed environment is harder (Blocking Maze). [figure: maze layouts before and after the wall shifts, with cumulative reward over 3000 time steps for Dyna-Q+, Dyna-Q, and Dyna-AC]

The changed environment is easier (Shortcut Maze). [figure: maze layouts before and after a shortcut opens, with cumulative reward over 6000 time steps for Dyna-Q+, Dyna-Q, and Dyna-AC]

What is Dyna-Q+? It uses an exploration bonus: it keeps track of the time since each state-action pair was last tried for real, and during planning an extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting. The agent therefore actually plans how to visit long-unvisited states.
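The slide leaves the bonus unspecified; in Sutton & Barto's version the planning backup uses the modified reward $r + \kappa\sqrt{\tau}$, where $\tau$ is the number of time steps since that state-action pair was last tried in the real environment and $\kappa$ is a small constant.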