ECE 517: Reinforcement Learning in Artificial Intelligence


ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 14: Planning and Learning
October 27, 2015
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2015

Final projects - logistics
- Projects can be done in groups of up to 3 students
- Details on projects will be posted soon
- Students are encouraged to propose a topic
- Please email me your top three choices for a project along with a preferred date for your presentation
- Presentation dates: Nov. 17, 19, 24 and Dec. 1 + an additional time slot (TBD)
- Format: 20 min presentation + 5 min Q&A
  - ~5 min for background and motivation
  - ~15 min for description of your work, results, and conclusions
- Written report due: Monday, Dec. 7
  - Format similar to a project report

Final projects - sample topics
- DQN: playing Atari games using RL
- Tetris player using RL (and NN)
- Curiosity-based TD learning*
- Reinforcement Learning of Local Shape in the Game of Go
- AIBO learning to walk
- Study of value function definitions for TD learning
- Imitation learning in RL

Outline
- Introduction
- Use of environment models
- Integration of planning and learning methods

Introduction
- Earlier we discussed Monte Carlo and temporal-difference methods as distinct alternatives
- We then showed how they can be seamlessly integrated using eligibility traces, as in TD(λ)
- Planning methods: e.g. Dynamic Programming and heuristic search
  - Rely on knowledge of a model
  - Model: any information that helps the agent predict the way the environment will behave
- Learning methods: Monte Carlo and Temporal Difference learning
  - Do not require a model
- Our goal: explore the extent to which the two kinds of methods can be intermixed

The original idea

The original idea (cont.)

Models
- Model: anything the agent can use to predict how the environment will respond to its actions
- Distribution models: provide a description of all possibilities (of next states and rewards) and their probabilities
  - e.g. Dynamic Programming
  - Example - sum of a dozen dice: produce all possible sums and their probabilities of occurring
- Sample models: produce just one sample experience
  - In our example - produce individual sums drawn according to this probability distribution
- Both types of models can be used to produce (mimic) simulated experience
- Often sample models are much easier to come by
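As a rough illustration of the dice example above, here is a minimal Python sketch (names and structure are my own, not from the lecture): the distribution model returns every possible sum together with its probability, while the sample model just draws a single sum.

```python
import random

NUM_DICE = 12  # "a dozen dice" from the slide

def distribution_model():
    """Distribution model: every possible sum of NUM_DICE dice and its probability."""
    dist = {0: 1.0}
    for _ in range(NUM_DICE):
        new = {}
        for total, p in dist.items():
            for face in range(1, 7):
                new[total + face] = new.get(total + face, 0.0) + p / 6.0
        dist = new
    return dist

def sample_model():
    """Sample model: a single sum drawn from the same underlying distribution."""
    return sum(random.randint(1, 6) for _ in range(NUM_DICE))
```

Note how little work the sample model takes compared with enumerating the full distribution, which is the slide's point that sample models are often much easier to come by.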

Planning
- Planning: any computational process that uses a model to create or improve a policy
- Planning in AI:
  - State-space planning (such as in RL): a search through the state space for a policy
  - Plan-space planning (e.g. partial-order planners, evolutionary methods)
- We take the following (unusual) view:
  - All state-space planning methods involve computing value functions, either explicitly or implicitly
  - They all apply backups to simulated experience

Planning (cont.)
- Classical DP methods are state-space planning methods
- Heuristic search methods are state-space planning methods
- Learning methods require only experience as input, and in many cases they can be applied to simulated experience just as well as to real experience
- Example: a planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning
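A minimal tabular sketch of random-sample one-step Q-planning, assuming a `sample_model(s, a)` function that returns a simulated `(reward, next_state)` pair (the function and parameter names are illustrative, not from the slides):

```python
import random
from collections import defaultdict

def random_sample_one_step_tabular_q_planning(sample_model, states, actions,
                                              num_updates, alpha=0.1, gamma=0.95):
    """Repeat: pick a (state, action) uniformly at random, ask the sample
    model for a simulated (reward, next_state), and apply a one-step
    tabular Q-learning backup to that simulated experience."""
    Q = defaultdict(float)                       # Q[(state, action)] -> value
    for _ in range(num_updates):
        s = random.choice(states)                # 1. select S and A at random
        a = random.choice(actions)
        r, s_next = sample_model(s, a)           # 2. query the sample model
        best_next = max(Q[(s_next, b)] for b in actions)
        # 3. one-step tabular Q-learning backup on the simulated transition
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```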

Learning, Planning, and Acting
- Two uses of real experience:
  - Model learning: to improve the model
  - Direct RL: to directly improve the value function and policy
- Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL; here, we call it planning
- Q: What are the advantages/disadvantages of each?

Direct vs. Indirect RL
- Indirect methods:
  - Make fuller use of experience: get a better policy with fewer environment interactions
- Direct methods:
  - Simpler
  - Not affected by bad models
- But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel
- Q: Which scheme do you think applies to humans?

The Dyna-Q Architecture (Sutton, 1990)

The Dyna-Q Algorithm
- Builds on the random-sample one-step tabular Q-planning method
- Each time step combines direct RL, model learning (update), and planning
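A hedged Python sketch of one Dyna-Q time step combining the three components above; `env.step`, the dictionary-based model, and the parameter names are assumptions for illustration, not the exact notation used in the lecture:

```python
import random

def dyna_q_step(env, Q, model, state, actions,
                n_planning=5, alpha=0.1, gamma=0.95, epsilon=0.1):
    """One Dyna-Q time step: act, direct RL update, model update, then
    n_planning simulated (planning) updates. Assumes a deterministic
    environment; Q is a dict-like table keyed by (state, action) and
    model maps (state, action) -> (reward, next_state)."""
    # epsilon-greedy action selection in the real environment
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])

    # act for real; env.step is an assumed interface returning (reward, next_state)
    reward, next_state = env.step(state, action)

    # direct RL: one-step Q-learning backup on the real transition
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

    # model learning: remember the observed transition
    model[(state, action)] = (reward, next_state)

    # planning: n backups on simulated experience drawn from the model
    for _ in range(n_planning):
        (s, a), (r, s_next) = random.choice(list(model.items()))
        best = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

    return next_state
```

Usage would look roughly like `Q = defaultdict(float); model = {}` followed by calling `dyna_q_step` in a loop over episodes; with `n_planning = 0` this reduces to plain one-step Q-learning, and larger values trade extra computation per real step for fewer environment interactions.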

Dyna-Q on a Simple Maze
- Rewards are 0 until the goal is reached, when the reward is 1

Dyna-Q Snapshots: Midway in 2nd Episode
- Recall that in a planning context:
  - Exploration: trying actions that improve the model
  - Exploitation: behaving in the optimal way given the current model
- Balance between the two is always a key challenge!

Variations on the Dyna-Q agent
- (Regular) Dyna-Q
  - Soft exploration/exploitation with constant rewards
- Dyna-Q+
  - Encourages exploration of state-action pairs that have not been visited in a long time (in real interaction with the environment)
  - If n is the number of steps elapsed between two consecutive visits to (s,a), then the planning reward grows as a function of n
- Dyna-AC
  - Actor-Critic learning rather than Q-learning

More on Dyna-Q+
- Uses an "exploration bonus":
  - Keeps track of the time since each state-action pair was tried for real
  - An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting
  - The agent (indirectly) plans how to visit long-unvisited states
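One common form of the bonus (the one used in Sutton and Barto's Dyna-Q+ agent) adds κ√τ to the modeled reward during planning backups, where τ is the number of real time steps since (s,a) was last tried and κ is a small constant; a minimal sketch, with illustrative names:

```python
import math

def planning_reward(model_reward, tau, kappa=1e-3):
    """Dyna-Q+ style exploration bonus used in planning backups:
    the reward stored in the model plus kappa * sqrt(tau), where tau is
    the number of time steps since (state, action) was last tried for real."""
    return model_reward + kappa * math.sqrt(tau)
```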

When the Model is Wrong: Blocking Maze
- The maze example was oversimplified; in reality many things could go wrong:
  - The environment could be stochastic
  - The model can be imperfect (local minima, stochasticity, or no convergence)
  - Partial experience could be misleading
- When the model is incorrect, the planning process will compute a suboptimal policy
- This is actually a learning opportunity: discovery and correction of the modeling error

When the Model is Wrong: Blocking Maze (cont.)
- The changed environment is harder

Shortcut Maze
- The changed environment is easier

Prioritized Sweeping
- In the Dyna agents presented, simulated transitions are started in uniformly chosen state-action pairs
  - Probably not optimal
- Which states or state-action pairs should be generated during planning?
- Work backwards from states whose values have just changed:
  - Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
  - When a new backup occurs, insert predecessors according to their priorities
  - Always perform backups from the first pair in the queue
- Moore and Atkeson, 1993; Peng and Williams, 1993
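A rough tabular sketch of the queue mechanism, assuming a deterministic model `model[(s, a)] -> (reward, next_state)`, a `predecessors[s]` map of (state, action) pairs known to lead to s, and a small threshold theta below which changes are ignored (all names are illustrative):

```python
import heapq
import itertools

def prioritized_sweeping_updates(Q, model, predecessors, actions, pqueue,
                                 n_updates, alpha=0.1, gamma=0.95, theta=1e-4):
    """Pop the highest-priority (state, action) pair, back it up, then push
    its predecessors whose backed-up values would change by more than theta."""
    counter = itertools.count()   # tie-breaker so the heap never compares states
    heap = [(-p, next(counter), sa) for p, sa in pqueue]
    heapq.heapify(heap)
    for _ in range(n_updates):
        if not heap:
            break
        _, _, (s, a) = heapq.heappop(heap)       # largest priority first
        r, s_next = model[(s, a)]
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        # predecessors of s may now be stale; queue them by how much they would change
        for (sp, ap) in predecessors.get(s, ()):
            rp, _ = model[(sp, ap)]
            priority = abs(rp + gamma * max(Q[(s, b)] for b in actions) - Q[(sp, ap)])
            if priority > theta:
                heapq.heappush(heap, (-priority, next(counter), (sp, ap)))
    return Q
```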

Prioritized Sweeping (cont.)

Prioritized Sweeping vs. Dyna-Q
- Both use N = 5 backups per environmental interaction

Trajectory Sampling
- Trajectory sampling: perform backups along simulated trajectories
- This samples from the on-policy distribution
  - Distribution constructed from experience (visits)
  - Advantages when function approximation is used
- Focusing of computation: can cause vast, uninteresting parts of the state space to be (usefully) ignored
  - (Figure labels: initial states; states reachable under optimal control; irrelevant states)
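A sketch of trajectory-sampling planning, where simulated backups are generated by following the current policy from start states rather than sweeping the whole state space; `sample_model(s, a)` returning `(reward, next_state, done)` and `policy(Q, s)` are assumed interfaces for illustration only:

```python
import random

def trajectory_sampling_planning(sample_model, policy, start_states, actions, Q,
                                 num_trajectories=100, max_steps=50,
                                 alpha=0.1, gamma=0.95):
    """Simulate trajectories under the current policy and back up only the
    (state, action) pairs actually visited, i.e. sample backups according
    to the on-policy distribution rather than uniformly over the table."""
    for _ in range(num_trajectories):
        s = random.choice(start_states)
        for _ in range(max_steps):
            a = policy(Q, s)                          # follow the current policy
            r, s_next, done = sample_model(s, a)      # simulated transition
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            if done:
                break
            s = s_next
    return Q
```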

Summary
- Discussed the close relationship between planning and learning
- Important distinction between distribution models and sample models
- Looked at some ways to integrate planning and learning
  - Synergy among planning, acting, and model learning
- Distribution of backups: focus of the computation
  - Prioritized sweeping
  - Trajectory sampling: backups along trajectories