Reinforcement learning CS434


Review: MDP
Critical components of MDPs:
- State space: S
- Action space: A
- Transition model: T : S × A × S → [0, 1], such that Σ_{s'} T(s, a, s') = 1 for every state s and action a
- Reward function: R(s)
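
As a concrete illustration (not from the slides), a small MDP can be stored as explicit NumPy arrays; the sizes and numbers below are made up for the sketch:

    import numpy as np

    n_states, n_actions = 3, 2     # sizes chosen only for this sketch
    gamma = 0.9                    # discount factor

    # T[s, a, s'] = probability of landing in s' after taking action a in state s
    T = np.zeros((n_states, n_actions, n_states))
    T[0, 0] = [0.8, 0.2, 0.0]
    T[0, 1] = [0.0, 0.9, 0.1]
    T[1, 0] = [0.1, 0.9, 0.0]
    T[1, 1] = [0.0, 0.1, 0.9]
    T[2, :] = [0.0, 0.0, 1.0]      # state 2 is absorbing under both actions

    # R[s] = reward received in state s
    R = np.array([-0.04, -0.04, 1.0])

    # sanity check: every T[s, a, :] is a probability distribution
    assert np.allclose(T.sum(axis=2), 1.0)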

Review: Value Iteration
Bellman equation: U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')
  Defines the utility of a state: the maximum expected discounted reward we can get by starting at state s.
Bellman iteration: U_{i+1}(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U_i(s')
  The maximum expected discounted reward we can get by starting at state s if the agent has i steps to live.
Optimal policy: π*(s) = argmax_a Σ_{s'} T(s, a, s') U*(s')
  Given the converged U*, the best action to take at each state.
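
A minimal value-iteration sketch over the arrays above (T, R, gamma are the illustrative names from the previous sketch; tol is a made-up stopping threshold):

    def value_iteration(T, R, gamma, tol=1e-6):
        """Iterate U_{i+1}(s) = R(s) + gamma * max_a sum_{s'} T(s,a,s') U_i(s')."""
        U = np.zeros(len(R))
        while True:
            Q = T @ U                          # Q[s, a] = sum_{s'} T(s,a,s') * U[s']
            U_new = R + gamma * Q.max(axis=1)  # Bellman iteration
            if np.max(np.abs(U_new - U)) < tol:
                break
            U = U_new
        policy = (T @ U_new).argmax(axis=1)    # pi*(s) = argmax_a sum_{s'} T(s,a,s') U*(s')
        return U_new, policy

For example, value_iteration(T, R, gamma) returns both the converged utilities and the greedy policy extracted from them.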

Review: Policy Iteration
Start with a randomly chosen initial policy π_0.
Iterate until no change in utilities:
1. Policy evaluation: given a policy π_i, calculate the utility U_i(s) of every state s under π_i by solving the system of equations:
   U_i(s) = R(s) + γ Σ_{s'} T(s, π_i(s), s') U_i(s')
2. Policy improvement: calculate the new policy π_{i+1} using one-step look-ahead based on U_i:
   π_{i+1}(s) = argmax_a Σ_{s'} T(s, a, s') U_i(s')
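
A matching policy-iteration sketch (again with the illustrative T, R, gamma; here policy evaluation is done by simply sweeping the fixed-policy Bellman equation a fixed number of times, which is one of several possible choices):

    def policy_iteration(T, R, gamma, eval_sweeps=100):
        """Alternate policy evaluation and policy improvement until the policy is stable."""
        n_states = len(R)
        policy = np.zeros(n_states, dtype=int)       # arbitrary initial policy pi_0
        while True:
            # 1. Policy evaluation: U(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') U(s')
            T_pi = T[np.arange(n_states), policy]    # |S| x |S| transition matrix under pi
            U = np.zeros(n_states)
            for _ in range(eval_sweeps):
                U = R + gamma * T_pi @ U
            # 2. Policy improvement: one-step look-ahead based on U
            new_policy = (T @ U).argmax(axis=1)
            if np.array_equal(new_policy, policy):
                return policy, U
            policy = new_policy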

So far
Given an MDP model, we know how to find optimal policies:
- Value Iteration
- Policy Iteration
But what if we don't have any form of model of the world (e.g., T and R)? Like when we were babies... all we can do is wander around the world observing what happens, getting rewarded and punished. This is what reinforcement learning is about.

Why not supervised learning?
In supervised learning, we had a teacher providing us with training examples with class labels:

Has Fever | Has Cough | Has Breathing Problems | Ate Chicken Recently | Has Asian Bird Flu
true      | true      | true                   | false                | false
true      | true      | true                   | true                 | true
false     | false     | false                  | true                 | false

The agent figures out how to predict the class label given the features.

Can We Use Supervised Learning?
Now imagine a complex task such as learning to play a board game. Suppose we took a supervised learning approach to learning an evaluation function. For every possible position of your pieces, you need a teacher to provide an accurate and consistent evaluation of that position. This is not feasible.

Trial and Error
A better approach: imagine we don't have a teacher. Instead, the agent gets to experiment in its environment. The agent tries out actions and discovers by itself which actions lead to a win or loss. The agent can learn an evaluation function that estimates the probability of winning from any given position.

Reinforcement/Reward
The key to this trial-and-error approach is having some sort of feedback about what is good and what is bad. We call this feedback reward or reinforcement.
In some environments, rewards are frequent:
- Ping pong: each point scored
- Learning to crawl: forward motion
In other environments, reward is delayed:
- Chess: reward only happens at the end of the game

Importance of Credit Assignment

Reinforcement
This is very similar to what happens in nature with animals and humans.
Positive reinforcement: happiness, pleasure, food
Negative reinforcement: pain, hunger, loneliness
What happens if we get agents to learn in this way? This leads us to the world of Reinforcement Learning.

Reinforcement Learning in a nutshell
"Imagine playing a new game whose rules you don't know; after a hundred or so moves, your opponent announces, 'You lose.'"
-- Russell and Norvig, Artificial Intelligence: A Modern Approach

Reinforcement Learning
An agent is placed in an environment and must learn to behave optimally in it.
Assume that the world behaves like an MDP, except:
- The agent can act but does not know the transition model
- The agent observes its current state and its reward but doesn't know the reward function
Goal: learn an optimal policy

Factors that Make RL Difficult
- Actions have non-deterministic effects, which are initially unknown and must be learned
- Rewards / punishments can be infrequent
  Often at the end of long sequences of actions
  How do we determine what action(s) were really responsible for reward or punishment? (the credit assignment problem)
- The world is large and complex

Passive vs. Active Learning
Passive learning: The agent acts based on a fixed policy π and tries to learn how good the policy is by observing the world go by.
  Analogous to policy evaluation in policy iteration.
Active learning: The agent attempts to find an optimal (or at least good) policy by exploring different actions in the world.
  Analogous to solving the underlying MDP.

Model-Based vs. Model-Free RL
Model-based approach to RL: learn the MDP model (T and R), or an approximation of it, and use it to find the optimal policy.
Model-free approach to RL: derive the optimal policy without explicitly learning the model.
We will consider both types of approaches.

Passive Reinforcement Learning
Suppose the agent's policy π is fixed. It wants to learn how good that policy is in the world, i.e., it wants to learn U^π(s).
This is just like the policy evaluation part of policy iteration.
The big difference: the agent doesn't know the transition model or the reward function (but it gets to observe the reward in each state it is in).

Passive RL
Suppose we are given a policy π and want to determine how good it is.
Given π, we need to learn U^π(s) = E[ Σ_{t≥0} γ^t R(s_t) | π, s_0 = s ].

Adaptive Dynamic Programming (a model-based approach)
Basically, it learns the transition model T and the reward function R from the training sequences.
Based on the learned MDP (T and R), we can perform policy evaluation (which is part of policy iteration, previously taught).

Adaptive Dynamic Programming
Recall that policy evaluation in policy iteration involves solving the utility for each state if policy π_i is followed. This leads to the equations:
  U_i(s) = R(s) + γ Σ_{s'} T(s, π_i(s), s') U_i(s')
The equations above are linear, so they can be solved with linear algebra in time O(n^3), where n is the number of states.
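
For example, with the transitions under the fixed policy collected into an n × n matrix (called T_pi here, a name introduced only for this sketch), the system U = R + γ T_π U becomes (I - γ T_π) U = R and can be solved in one shot:

    import numpy as np

    def evaluate_policy_exactly(T_pi, R, gamma):
        """Solve (I - gamma * T_pi) U = R: a single O(n^3) linear solve."""
        n = len(R)
        return np.linalg.solve(np.eye(n) - gamma * T_pi, R)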

Adaptive Dynamic Programming
Make use of policy evaluation to learn the utilities of states. In order to use the policy evaluation equation:
  U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U^π(s')
the agent needs to learn the transition model T(s, a, s') and the reward function R(s).
How do we learn these models?

Adaptive Dynamic Programming
Learning the reward function R(s): Easy because it's deterministic. Whenever you see a new state, store the observed reward value as R(s).
Learning the transition model T(s, a, s'): Keep track of how often you get to state s' given that you're in state s and do action a. E.g., if you are in s = (1,3) and you execute Right three times and you end up in s' = (2,3) twice, then T(s, Right, s') = 2/3.
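
A sketch of this counting scheme (the names N_sa, N_sas, T_est, R_est are chosen for the sketch and mirror the tables in the pseudocode below):

    from collections import defaultdict

    N_sa = defaultdict(int)    # number of times action a was taken in state s
    N_sas = defaultdict(int)   # number of times taking a in s led to s'
    R_est = {}                 # learned (deterministic) reward function
    T_est = {}                 # learned transition probabilities

    def observe(s, a, s_next, r_next):
        """Update the model estimates after observing s --a--> s_next with reward r_next."""
        R_est[s_next] = r_next                 # rewards are deterministic: just record them
        N_sa[(s, a)] += 1
        N_sas[(s, a, s_next)] += 1
        # re-normalize every outcome seen so far for this (s, a) pair
        for (s2, a2, t), count in list(N_sas.items()):
            if (s2, a2) == (s, a):
                T_est[(s2, a2, t)] = count / N_sa[(s, a)]

After three calls with s = (1,3) and a = 'Right', two of which end in (2,3), T_est[((1,3), 'Right', (2,3))] is 2/3, matching the example above.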

ADP Algorithm
function PASSIVE-ADP-AGENT(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  static: π, a fixed policy
          mdp, an MDP with model T, rewards R, discount γ
          U, a table of utilities, initially empty
          N_sa, a table of frequencies for state-action pairs, initially zero
          N_sas', a table of frequencies for state-action-state triples, initially zero
          s, a, the previous state and action, initially null
  if s' is new then do U[s'] ← r'; R[s'] ← r'          // update reward model
  if s is not null then do                              // update transition model
      increment N_sa[s, a] and N_sas'[s, a, s']
      for each t such that N_sas'[s, a, t] is nonzero do
          T[s, a, t] ← N_sas'[s, a, t] / N_sa[s, a]
  U ← POLICY-EVALUATION(π, U, mdp)
  if TERMINAL?[s'] then s, a ← null else s, a ← s', π[s']
  return a

The Problem with ADP
Need to solve a system of simultaneous equations: costs O(n^3).
Very hard to do if you have 10^50 states, like in Backgammon.
Can we avoid the computational expense of full policy evaluation?

Temporal Difference Learning
Instead of calculating the exact utility for a state, can we approximate it and possibly make it less computationally expensive? Yes we can! Using Temporal Difference (TD) learning:
  U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U^π(s')
Instead of doing this sum over all successors, only adjust the utility of the state based on the successor observed in the trial. It does not estimate the transition model: it is model-free.

TD Learning Example: Suppose you see that U^π(1,3) = 0.84 and U^π(2,3) = 0.92 after the first trial. If the transition (1,3) → (2,3) happened all the time, you would expect to see (assuming γ = 1):
  U^π(1,3) = R(1,3) + γ U^π(2,3)
  U^π(1,3) = -0.04 + U^π(2,3)
  U^π(1,3) = -0.04 + 0.92 = 0.88
Since you observe U^π(1,3) = 0.84 in the first trial, it is a little lower than 0.88, so you might want to bump it towards 0.88.

Temporal Difference Update
When we move from state s to s', we apply the following update rule:
  U^π(s) ← U^π(s) + α ( R(s) + γ U^π(s') - U^π(s) )
where α is the learning rate. This is similar to one step of value iteration. We call this equation a backup.
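
A one-function sketch of this backup (U is a dictionary of utility estimates; the names and the default values of alpha and gamma are choices made for the sketch):

    def td_update(U, s, s_next, reward, alpha=0.1, gamma=1.0):
        """Apply U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s)); reward is R(s)."""
        U[s] = U[s] + alpha * (reward + gamma * U[s_next] - U[s])

Note that nothing here touches the transition model: only the single observed successor s_next is used.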

Convergence
Since we're using the observed successor s' instead of all the successors, what happens if the transition s → s' is very rare and there is a big jump in utilities from s to s'? How can U^π(s) converge to the true equilibrium value?
Answer: The average value of U^π(s) will converge to the correct value.
This means we need to observe enough trials that have transitions from s to its successors.
Essentially, the effects of the TD backups will be averaged over a large number of transitions. Rare transitions will be rare in the set of transitions observed.

ADP and TD
Learning curves for the 4x3 maze world, given the optimal policy. Which figure is ADP?

Comparison between ADP and TD
Advantages of ADP:
- Converges to the true utilities faster
- Utility estimates don't vary as much from the true utilities
Advantages of TD:
- Simpler, less computation per observation
- Crude but efficient first approximation to ADP
- Doesn't need to build a transition model in order to perform its updates (this is important because we can interleave computation with exploration rather than having to wait for the whole model to be built first)

What You Should Know
How reinforcement learning differs from supervised learning and from MDPs.
Pros and cons of:
- Adaptive Dynamic Programming
- Temporal Difference Learning
Note: Learning U^π(s) does not lead to an optimal policy. Why?