Reinforcement Learning cont. Dec 01 2008

Refresh Your Memory
- Last class, we assumed that the agent executes a fixed policy π.
- The goal was to evaluate how good π is, based on some sequence of trials performed by the agent.

Passive Learning Methods
- DUE (direct utility estimation): directly estimate the utility of the states U^π(s) by averaging the reward-to-go from each state; slow convergence, since it does not use the Bellman equation constraints.
- ADP: learn the transition model T and the reward function R, then do policy evaluation to learn U^π(s); few updates, but each update is expensive (O(n^3)).
- TD learning: maintain a running average of the state utilities by doing online mean estimation; cheap updates, but needs more updates than ADP (see the sketch below).
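
As a reminder of the TD idea, here is a minimal sketch of the passive TD(0) utility update. The dictionary-based utility table and the values of the learning rate α and discount γ are illustrative assumptions, not the lecture's own code.

```python
from collections import defaultdict

def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    """Passive TD(0): nudge U(s) toward the observed sample r + gamma * U(s')."""
    U[s] += alpha * (r + gamma * U[s_next] - U[s])

# Example: update the utility of a (hypothetical) state (1, 1) after observing
# reward -0.04 and a transition to state (1, 2).
U = defaultdict(float)
td_update(U, s=(1, 1), r=-0.04, s_next=(1, 2))
```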

Goal of Active Learning
- Let's suppose we still have access to some sequence of trials performed by the agent.
- The goal is now to learn an optimal policy.

Active Reinforcement Learning Agents
We will describe two types of active reinforcement learning agents:
1. Active ADP agent
2. Q-learner (based on the TD algorithm)

Active ADP Agent (Model-Based)
- Using the data from its trials, the agent learns a transition model T and a reward function R.
- With T(s,a,s') and R(s), it has an estimate of the underlying MDP.
- It can compute the optimal policy by solving the Bellman equations using value iteration or policy iteration:
U(s) = R(s) + γ max_a Σ_{s'} T(s,a,s') U(s')
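
A minimal sketch of the value-iteration step the active ADP agent would run on its learned model. The representations are assumptions for illustration: `T` maps (s, a) to a list of (probability, s') pairs and `R` maps states to rewards; terminal-state handling is omitted.

```python
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Iterate U(s) = R(s) + gamma * max_a sum_{s'} T(s,a,s') U(s') to convergence."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * U[s2] for p, s2 in T[(s, a)]) for a in actions)
            new_u = R[s] + gamma * best
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < eps:
            return U
```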

Active ADP Agent
- Now that we've got a policy that is optimal based on our current understanding of the world, what should we do?
- Greedy agent: an agent that executes the optimal policy for the learned model at each time step.
- Let's see what happens in the maze world.

The Greedy Agent
- The agent finds the lower route to the goal state but never finds the optimal upper route.
- The agent is stubborn and doesn't change, so it doesn't learn the true utilities or the true optimal policy.

What happened?
- How can choosing an optimal action lead to suboptimal results?
- The learned model is not the same as the true environment. In fact, the set of trials observed by the agent was insufficient to build a good model of the environment.
- How can we address this issue? We need more training experience.

Exploitation vs. Exploration
Actions are always taken for one of the two following purposes:
- Exploitation: execute the current optimal policy to get high payoff.
- Exploration: try new sequences of (possibly random) actions to improve the agent's knowledge of the environment, even though the current model doesn't believe they have high payoff.
Pure exploitation gets stuck in a rut; pure exploration is not much use if you don't put that knowledge into practice.

Optimal Exploration Strategy?
- What is the optimal exploration strategy? Greedy? Random? Mixed (sometimes greedy, sometimes random)?
- It turns out that the optimal exploration strategy has been studied in depth in the N-armed bandit problem.

N-armed Bandits
- We have N slot machines, each of which can yield $1 with some probability (different for each machine).
- In what order should we try the machines? Stay with the machine with the highest observed probability so far? Random? Something else?
- Bottom line: it's not obvious. In fact, an exact solution is usually intractable.

GLIE
- Fortunately, it is possible to come up with a reasonable exploration method that eventually leads to optimal behavior by the agent.
- Any such exploration method needs to be Greedy in the Limit of Infinite Exploration (GLIE).
- Properties: it must try each action in each state an unbounded number of times, so that it doesn't miss any optimal actions, and it must eventually become greedy.

Examples of GLIE Schemes
ε-greedy:
- Choose the optimal action with probability (1 - ε).
- Choose a random action with probability ε/(number of actions - 1).
Active ε-greedy ADP agent:
1. Start with initial T and R learned from the original sequence of trials.
2. Compute the utilities of states U(s) using value iteration.
3. Take an action using the ε-greedy exploitation/exploration strategy (see the sketch below).
4. Update T and R, and go to 2.
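
A minimal sketch of the ε-greedy choice in step 3, matching the probabilities above. The helper `action_value(s, a)` stands for the agent's current estimate of how good action a looks in state s (e.g., a one-step lookahead with T and U); it is an assumed placeholder, not the lecture's code.

```python
import random

def epsilon_greedy(s, actions, action_value, epsilon=0.1):
    """With probability (1 - eps) take the greedy action; otherwise pick uniformly
    among the remaining actions, i.e. each with probability eps / (|actions| - 1)."""
    greedy = max(actions, key=lambda a: action_value(s, a))
    if random.random() < 1 - epsilon or len(actions) == 1:
        return greedy
    return random.choice([a for a in actions if a != greedy])
```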

Another Approach
- Favor actions the agent has not tried very often; avoid actions believed to be of low utility.
- We can achieve this by altering value iteration to use U+(s), an optimistic estimate of the utility of state s (using an exploration function).

Exploration Function
The exploration function f(u, n):
f(u, n) = R+ if n < Ne, u otherwise
- Trades off greed (preference for high utilities u) against curiosity (preference for low values of n, the number of times a state-action pair has been tried).
- R+ is an optimistic estimate of the best possible reward obtainable in any state.
- If a hasn't been tried enough in s, we optimistically assume it will somehow lead to gold.
- Ne is a limit on the number of tries for a state-action pair.
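
A minimal sketch of the exploration function; `R_plus` and `N_e` correspond to the slide's R+ and Ne, and their default values are illustrative assumptions. Inside the modified value iteration, u would be the one-step lookahead value and n the visit count N(s, a).

```python
def exploration_fn(u, n, R_plus=1.0, N_e=5):
    """f(u, n): return the optimistic reward R+ while the state-action pair has
    been tried fewer than N_e times, otherwise return the ordinary value u."""
    return R_plus if n < N_e else u
```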

Using Exploration Functions
1. Start with initial T and R learned from the original sequence of trials.
2. Perform value iteration to compute U+ using the exploration function.
3. Take the greedy action.
4. Update the estimated model and go to 2.

Q-learning
- Previously, we needed to store utility values for a state, i.e., U(s) = utility of state s = expected sum of future rewards.
- Now, we will store Q-values, which are defined as: Q(a,s) = value of taking action a at state s = expected maximum sum of future discounted rewards after taking action a at state s.

Q-learning
- Now, instead of storing a table of U(s) values, we store a table of Q(a,s) values.
- Note the relationship: U(s) = max_a Q(a,s)
- Note that if we estimate Q(a,s) for all a and s, we can simply choose the action that maximizes Q, without using the model.
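
A minimal sketch of that relationship, assuming the Q-table is a dict keyed by (a, s) pairs (an illustrative representation, not the lecture's):

```python
def utility(Q, s, actions):
    """U(s) = max_a Q(a, s); the greedy action is the corresponding argmax."""
    return max(Q[(a, s)] for a in actions)
```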

Q-learning
At equilibrium, when the Q-values are correct, we can write the constraint equation:
Q(a,s) = R(s) + γ Σ_{s'} T(s,a,s') max_{a'} Q(a',s')
Note that this requires learning a transition model T.

Q-learning
At equilibrium, when the Q-values are correct, we can write the constraint equation:
Q(a,s) = R(s) + γ Σ_{s'} T(s,a,s') max_{a'} Q(a',s')
- Q(a,s): best expected value for the action-state pair (a, s).
- R(s): reward at state s.
- Σ_{s'} T(s,a,s') ...: best value averaged over all possible states s' that can be reached from s after executing action a.
- max_{a'} Q(a',s'): best value at the next state = max over all actions in state s'.

Q-learning Without a Model
We can use a temporal-differencing approach, which is model-free. After moving from state s to state s' using action a:
Q(a,s) ← Q(a,s) + α (R(s) + γ max_{a'} Q(a',s') - Q(a,s))
- Q(a,s) on the left: new estimate of Q(a,s).
- α: learning rate, 0 < α < 1.
- Q(a,s) inside the parentheses: old estimate of Q(a,s).
- R(s) + γ max_{a'} Q(a',s') - Q(a,s): difference between the old estimate Q(a,s) and the new noisy sample obtained after taking action a.

Q-learning: Estimating the Policy
Q-update, after moving from state s to state s' using action a:
Q(a,s) ← Q(a,s) + α (R(s) + γ max_{a'} Q(a',s') - Q(a,s))
Policy estimation:
π(s) = argmax_a Q(a,s)
Note that T(s,a,s') does not appear anywhere! This is a model-free learning algorithm.
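
A minimal sketch of the model-free Q-update and the greedy policy read-off. The defaultdict Q-table keyed by (a, s) and the parameter values are illustrative assumptions, not the lecture's code.

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(a, s)] for every action-state pair, initialized to 0

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(a,s) <- Q(a,s) + alpha * (R(s) + gamma * max_a' Q(a',s') - Q(a,s))."""
    best_next = max(Q[(a2, s_next)] for a2 in actions)
    Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])

def policy(s, actions):
    """pi(s) = argmax_a Q(a, s); the transition model T never appears."""
    return max(actions, key=lambda a: Q[(a, s)])
```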

Q-learning Convergence
- Guaranteed to converge to an optimal policy [Watkins].
- Very general procedure (because it's model-free).
- Converges more slowly than the ADP agent (because it is completely model-free and doesn't enforce consistency among values through the model).

Q-learning: Exploration Strategies
How do we choose the next action while we're learning?
- Random
- Greedy
- ε-greedy
- Boltzmann: choose the next action with probability proportional to e^{Q(a,s)/T}, where T is a temperature parameter that is decayed over time.
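
A minimal sketch of Boltzmann action selection over the same (a, s)-keyed Q-table; the fixed temperature value is an illustrative assumption and would be decayed over time in practice.

```python
import math
import random

def boltzmann_action(Q, s, actions, temperature=1.0):
    """Sample an action with probability proportional to exp(Q(a, s) / T).
    A high temperature behaves almost randomly; as T decays, the choice becomes greedy."""
    weights = [math.exp(Q[(a, s)] / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]
```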

Model-Based / Model-Free
Two broad categories of reinforcement learning algorithms:
1. Model-based, e.g., ADP
2. Model-free, e.g., TD, Q-learning
Which is better? The model-based approach is a knowledge-based approach (i.e., the model represents known aspects of the environment). The book claims that as the environment becomes more complex, a knowledge-based approach does better.

What You Should Know
- Exploration vs. exploitation
- GLIE schemes
- The difference between model-free and model-based methods
- Q-learning