Reinforcement Learning cont. CS434

Passive learning Assume that the agent executes a fixed policy π. The goal is to compute U^π(s) from some sequence of training trials performed by the agent. ADP (model-based learning): with each observation, update the underlying MDP model, then solve the resulting policy-evaluation problem under the current MDP model. TD (model-free learning): directly estimate U^π(s) by online estimation of its mean. When we observe a transition s -> s', the update rule is: U^π(s) ← U^π(s) + α (R(s) + γ U^π(s') - U^π(s))
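
A minimal sketch of this TD update in Python, assuming each training trial is recorded as a list of (state, reward) pairs produced by following π; the data format, learning rate, and discount below are illustrative assumptions, not part of the lecture.

from collections import defaultdict

def td_policy_evaluation(trials, alpha=0.1, gamma=0.9):
    """Passive TD(0): estimate U^pi(s) from trials generated by following pi."""
    U = defaultdict(float)                     # utility estimates, default 0
    for trial in trials:                       # each trial: [(s0, r0), (s1, r1), ...]
        for (s, r), (s_next, _) in zip(trial, trial[1:]):
            # nudge U(s) toward the one-step sample r + gamma * U(s')
            U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return dict(U)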

Comparison between ADP and TD Advantages of ADP: converges to the true utilities faster; utility estimates don't vary as much from the true utilities. Advantages of TD: simpler, with less computation per observation; a crude but efficient first approximation to ADP; doesn't need to build a transition model in order to perform its updates (this is important because we can interleave computation with exploration rather than having to wait for the whole model to be built first).

Passive learning Learning U^π(s) does not lead to an optimal policy. Why? The models are incomplete/inaccurate, and the agent has only tried a limited set of actions, so we cannot gain a good overall understanding of the transition model T. This is why we need active learning.

Goal of active learning Let's first assume that we still have access to some sequence of trials performed by the agent. The agent is not following any specific policy, and we can assume for now that the sequences include a thorough exploration of the space (we will talk about how to get such sequences later). The goal is to learn an optimal policy from such sequences.

Active Reinforcement Learning Agents We will describe two types of active reinforcement learning agents: the active ADP agent and the Q-learner (based on the TD algorithm).

Active ADP Agent (Model based) Using the data from its trials, the agent learns a transition model T̂(s,a,s') and a reward function R̂(s). With T̂ and R̂ it has an estimate of the underlying MDP, and it can compute the optimal policy by solving the Bellman equations U(s) = R̂(s) + γ max_a Σ_s' T̂(s,a,s') U(s') using value iteration or policy iteration. If T̂ and R̂ are accurate estimates of the underlying MDP model, we can find the optimal policy this way.
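
A minimal value-iteration sketch over the estimated model, assuming the trial counts have already been turned into dictionaries T_hat[s][a] (successor-state probabilities) and R_hat[s]; these data structures, the discount, and the tolerance are assumptions for illustration.

def value_iteration(states, actions, T_hat, R_hat, gamma=0.9, tol=1e-6):
    """Solve the Bellman equations for the *estimated* MDP (T_hat, R_hat).

    T_hat[s][a] maps each successor state s2 to its estimated probability.
    """
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * U[s2] for s2, p in T_hat[s][a].items())
                       for a in actions)
            new_u = R_hat[s] + gamma * best
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:
            return U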

Issues with ADP approach We need to maintain the MDP model, which can be very large. Also, finding the optimal action requires solving the Bellman equations, which is time consuming. Can we avoid this large computational complexity, both in time and in space?

Q learning So far we have focused on the utilities of states: U(s) = utility of state s = expected maximum future reward. An alternative is to store Q-values, which are defined as: Q(s,a) = utility of taking action a in state s = expected maximum future reward after taking action a in state s. Relationship between U(s) and Q(s,a)? U(s) = max_a Q(s,a)

Q learning can be model free Note that after computing U(s), to obtain the optimal policy we need to compute: π(s) = argmax_a Σ_s' T(s,a,s') U(s') This requires T, the model of the world. So even if we use TD learning (model free), we still need the model to get the optimal policy. However, if we successfully estimate Q(s,a) for all a and s, we can compute the optimal policy without using the model: π(s) = argmax_a Q(s,a)
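
A small sketch contrasting the two extraction rules above; T_hat, U, and Q are assumed to be the dictionaries produced by the learners in the earlier sketches, keyed as shown there.

def policy_from_utilities(s, actions, T_hat, U):
    """Needs the transition model: pi(s) = argmax_a sum_s' T(s,a,s') U(s')."""
    return max(actions,
               key=lambda a: sum(p * U[s2] for s2, p in T_hat[s][a].items()))

def policy_from_q(s, actions, Q):
    """Model free: pi(s) = argmax_a Q(s,a); no transition model needed."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))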

Q learning At equilibrium, when the Q-values are correct, we can write the constraint equation: Q(s,a) = R(s) + γ Σ_s' T(s,a,s') max_a' Q(s',a') Note that this requires learning a transition model.

Q learning At equilibrium, when the Q-values are correct, we can write the constraint equation: Q(s,a) = R(s) + γ Σ_s' T(s,a,s') max_a' Q(s',a') Reading the terms: Q(s,a) is the best expected value for the action-state pair (s,a); R(s) is the reward at state s; the sum averages over all possible states s' that can be reached from s after executing action a; and max_a' Q(s',a') is the best value at the next state, i.e., the max over all actions a' available in state s'.

Q learning Without a Model We can use a temporal-differencing approach, which is model free. After moving from state s to state s' using action a: Q(s,a) ← Q(s,a) + α (R(s) + γ max_a' Q(s',a') - Q(s,a)) The left-hand side is the new estimate of Q(s,a); α is the learning rate, with 0 < α < 1; Q(s,a) on the right is the old estimate; and the term in parentheses is the difference between the old estimate Q(s,a) and the new noisy sample obtained after taking action a.
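
The same update written as a short Python function; the Q table is assumed to be a dict keyed by (state, action) pairs, and the step size and discount are illustrative defaults.

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One model-free Q-learning update after observing (s, a, r, s')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    # move the old estimate toward the new noisy sample r + gamma * max_a' Q(s', a')
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)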

Q learning: Estimating the Policy Q-Update: after moving from state s to state s' using action a: Q(s,a) ← Q(s,a) + α (R(s) + γ max_a' Q(s',a') - Q(s,a)) Note that T(s,a,s') does not appear anywhere! Further, once we converge, the optimal policy can be computed without T. This is a completely model-free learning algorithm.

Q learning Convergence Q-learning is guaranteed to converge to the true Q-values given enough exploration. It is a very general procedure (because it's model free), but it converges more slowly than the ADP agent (because it is completely model free and doesn't enforce consistency among values through the model).

So far, we have assumed that all training sequences are given and that they fully explore the state space and action space. But how do we generate the training trials? We can have the agent explore randomly at first to collect training trials; once we accumulate enough trials, we perform the learning (with either ADP or Q-learning) and then choose the optimal policy. How much exploration do we need to do? And what if the agent is expected to learn and perform reasonably well throughout, not just at the end?

A greedy agent At any point, the agent has a current set of training trials and a policy that is optimal with respect to its current understanding of the world. A greedy agent executes the optimal policy for the learned model at each time step.

A greedy Q-learning agent
function Q-LEARNING-AGENT(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  static: Q, a table of action values indexed by state and action
          N_sa, a table of frequencies for state-action pairs, initially zero
          s, a, r, the previous state, action, and reward, initially null
  if s is not null then
    increment N_sa[s,a]
    Q[s,a] ← Q[s,a] + α (r + γ max_a' Q[s',a'] - Q[s,a])
  if TERMINAL?[s'] then s, a, r ← null
  else s, a, r ← s', argmax_a' Q[s',a'], r'
  return a
The agent always chooses the action that is deemed best based on the current Q table.
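
A rough Python rendering of this greedy agent, assuming a dictionary-backed Q table and that the environment calls the agent with the current state, reward, and a terminal flag; the class name, attributes, and defaults are illustrative assumptions, not from the lecture.

from collections import defaultdict

class GreedyQAgent:
    """Greedy Q-learning agent: always picks argmax_a Q(s', a), never explores."""

    def __init__(self, actions, alpha=0.1, gamma=0.9):
        self.Q = defaultdict(float)        # Q[(state, action)]
        self.N_sa = defaultdict(int)       # visit counts for (state, action)
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma
        self.s = self.a = self.r = None    # previous state, action, reward

    def __call__(self, s_next, r_next, terminal=False):
        if self.s is not None:
            self.N_sa[(self.s, self.a)] += 1
            best = max(self.Q[(s_next, a2)] for a2 in self.actions)
            self.Q[(self.s, self.a)] += self.alpha * (
                self.r + self.gamma * best - self.Q[(self.s, self.a)])
        if terminal:
            self.s = self.a = self.r = None
            return None
        # greedy choice: the action that looks best under the current Q table
        self.a = max(self.actions, key=lambda a2: self.Q[(s_next, a2)])
        self.s, self.r = s_next, r_next
        return self.a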

The Greedy Agent The agent finds the lower route to the goal state but never finds the optimal upper route. The agent is stubborn and doesn't change its policy, so it never learns the true utilities or the true optimal policy.

What happened? How can choosing an optimal action lead to suboptimal results? What we have learned (T and R, or Q) may not truly reflect the real environment; in fact, the set of trials observed by the agent is often insufficient. How can we address this issue? We need good training experience.

Exploitation vs Exploration Actions are always taken for one of two purposes. Exploitation: execute the current optimal policy to get a high payoff. Exploration: try new sequences of (possibly random) actions to improve the agent's knowledge of the environment, even though the current model doesn't believe they have high payoff. Pure exploitation gets stuck in a rut; pure exploration is not much use if you never put that knowledge into practice.

Optimal Exploration Strategy? What is the optimal exploration strategy: greedy, random, or mixed (sometimes greedy, sometimes random)? It turns out that the optimal exploration strategy has been studied in depth in the N-armed bandit problem.

N-armed Bandits We have N slot machines, each of which yields $1 with some probability (different for each machine). In what order should we try the machines? Stay with the machine with the highest observed payoff probability so far? Choose randomly? Something else? Bottom line: it's not obvious, and in fact an exact solution is usually intractable.
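
A toy simulation of the dilemma, with made-up payoff probabilities, contrasting a purely greedy player with one that occasionally pulls a random arm; everything here is illustrative and is not an optimal bandit algorithm.

import random

def play(n_pulls, probs, epsilon):
    """epsilon = 0 is pure exploitation; epsilon > 0 mixes in random pulls."""
    wins = [0.0] * len(probs)
    pulls = [0] * len(probs)
    total = 0.0
    for _ in range(n_pulls):
        if random.random() < epsilon or not any(pulls):
            arm = random.randrange(len(probs))       # explore: random machine
        else:                                        # exploit: best observed rate
            arm = max(range(len(probs)), key=lambda i: wins[i] / max(pulls[i], 1))
        reward = 1.0 if random.random() < probs[arm] else 0.0
        wins[arm] += reward
        pulls[arm] += 1
        total += reward
    return total

random.seed(0)
probs = [0.2, 0.5, 0.8]                    # hypothetical payoff probabilities
print(play(1000, probs, epsilon=0.0))      # greedy: can lock onto a mediocre arm
print(play(1000, probs, epsilon=0.1))      # mixed: usually identifies the 0.8 arm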

GLIE Fortunately, it is possible to come up with a reasonable exploration method that eventually leads to optimal behavior by the agent. Any such exploration method needs to be Greedy in the Limit of Infinite Exploration (GLIE). Properties: it must try each action in each state an unbounded number of times, so that it doesn't miss any optimal actions, and it must eventually become greedy.

Examples of GLIE schemes ε-greedy: choose the optimal action with probability (1 - ε), and choose each non-optimal action at random with probability ε/(number of actions - 1). Active ε-greedy agent: 1. Start from the original sequence of trials. 2. Compute the optimal policy under the current understanding of the world. 3. Take an action using the ε-greedy exploitation/exploration strategy. 4. Update the learner and go to 2.
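
A sketch of the ε-greedy rule above; using a decaying schedule such as ε_t = 1/t is one common way to satisfy the "eventually greedy" GLIE condition, and that schedule is an assumption here, not part of the slide.

import random

def epsilon_greedy(Q, s, actions, t):
    """GLIE epsilon-greedy: explore with probability eps_t = 1/t, exploit otherwise."""
    eps = 1.0 / t                          # decays to 0, so the scheme becomes greedy
    best = max(actions, key=lambda a: Q.get((s, a), 0.0))
    if random.random() < eps:
        others = [a for a in actions if a != best]
        return random.choice(others) if others else best
    return best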

Another approach Favor actions the agent has not tried very often, and avoid actions believed to be of low utility (based on past experience). We can achieve this using an exploration function.

An exploratory Q-learning agent
function Q-LEARNING-AGENT(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  static: Q, a table of action values indexed by state and action
          N_sa, a table of frequencies for state-action pairs, initially zero
          s, a, r, the previous state, action, and reward, initially null
  if s is not null then
    increment N_sa[s,a]
    Q[s,a] ← Q[s,a] + α (r + γ max_a' Q[s',a'] - Q[s,a])
  if TERMINAL?[s'] then s, a, r ← null
  else s, a, r ← s', argmax_a' f(Q[s',a'], N_sa[s',a']), r'
  return a
Exploration function: f(u,n) = R+ if n < N_e; u otherwise

Exploration Function The exploration function is f(q,n) = R+ if n < N_e; q otherwise.
- It trades off greed (a preference for high utilities q) against curiosity (a preference for low values of n, the number of times a state-action pair has been tried).
- R+ is an optimistic estimate of the best possible reward obtainable in any state with any action.
- If action a hasn't been tried enough times in state s, we optimistically assume it will somehow lead to gold.
- N_e is a limit on the number of tries for a state-action pair.
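
A sketch of this exploration function and the optimistic action choice it induces; the values of R+ and N_e are tuning constants picked here for illustration, and the Q and N_sa tables are the dictionaries from the earlier sketches.

def f(u, n, R_plus=2.0, N_e=5):
    """Optimistic exploration function: pretend untried pairs are worth R_plus."""
    return R_plus if n < N_e else u

def choose_action(Q, N_sa, s, actions):
    """Pick the action with the highest optimistic value f(Q(s,a), N(s,a))."""
    return max(actions, key=lambda a: f(Q.get((s, a), 0.0), N_sa.get((s, a), 0)))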

Model based / model free There are two broad categories of reinforcement learning algorithms: 1. Model-based, e.g., ADP. 2. Model-free, e.g., TD and Q-learning. Which is better? The model-based approach is a knowledge-based approach (i.e., the model represents known aspects of the environment). The book claims that as the environment becomes more complex, a knowledge-based approach does better.

What You Should Know Exploration vs. exploitation; GLIE schemes; the difference between model-free and model-based methods; Q-learning.