Reinforcement Learning I

CSC411 Fall 2014: Machine Learning & Data Mining. Reinforcement Learning I. Slides from Rich Zemel.

Reinforcement Learning
Learning settings differ in the information available to the learner:
- Supervised: correct outputs are given
- Unsupervised: no feedback; must construct a measure of good output
- Reinforcement learning: a more realistic learning scenario:
  - Continuous stream of input information and actions
  - Effects of an action depend on the state of the world
  - Obtain a reward that depends on the world state and actions: not the correct response, just some feedback

Formulating Reinforcement Learning
The world is described by a discrete, finite set of states and actions.
At every time step t, we are in a state s_t, and we:
- Take an action a_t (possibly the null action)
- Receive some reward r_{t+1}
- Move into a new state s_{t+1}
Decisions can be described by a policy: a selection of which action to take, based on the current state.
The aim is to maximize the total reward we receive over time.
Sometimes a future reward is discounted by γ^{k-1}, where k is the number of time steps in the future when it is received.
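As a quick illustration of the discounting just described, here is a minimal Python sketch (not part of the lecture) that sums a reward sequence, weighting a reward received k steps in the future by γ^{k-1}:

```python
# Minimal sketch (illustration only): discounted total reward, where the
# reward received k steps in the future is weighted by gamma**(k-1).

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**(k-1) * r_k over the sequence r_1, r_2, ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# The first reward counts fully, later ones are discounted:
print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 = 1.81
```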

Tic-Tac-Toe
Make this concrete by considering a specific example, the game tic-tac-toe:
- reward: win/lose/tie the game (+1/-1/0) [only at the final move in a given game]
- state: positions of Xs and Os on the board
- policy: mapping from states to actions, based on the rules of the game: a choice of one open position
- value function: prediction of future reward, based on the current state
In tic-tac-toe, since the state space is tractable, we can use a table to represent the value function.

RL & Tic-Tac-Toe
Each board position (taking symmetry into account) has an associated probability of winning.
Simple learning process:
- start with all values = 0.5
- policy: choose the move with the highest probability of winning among the legal moves from the current state
- update entries in the table based on the outcome of each game
After many games the value function will represent the true probability of winning from each state.
Can try an alternative policy: sometimes select moves randomly (exploration); see the sketch below.
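The following is a rough Python sketch of the learning process above: a table of win-probability estimates initialized to 0.5, a mostly greedy move choice, and an update of each visited state toward the game's outcome. The board encoding, the next_state helper, the step size ALPHA, and the exact form of the update are assumptions for illustration; the slide does not specify them.

```python
import random

values = {}          # state (a hashable board encoding) -> estimated P(win)
ALPHA = 0.1          # assumed step size

def value(state):
    return values.setdefault(state, 0.5)   # unseen states start at 0.5

def choose_move(state, legal_moves, next_state, explore=0.1):
    # Mostly greedy: pick the move whose successor state has the highest
    # estimated win probability; occasionally explore at random.
    if random.random() < explore:
        return random.choice(legal_moves)
    return max(legal_moves, key=lambda m: value(next_state(state, m)))

def update_from_game(visited_states, outcome):
    # outcome: 1.0 for a win, 0.0 for a loss, 0.5 for a tie (assumed coding)
    for s in visited_states:
        values[s] = value(s) + ALPHA * (outcome - value(s))
```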

Acting Under Uncertainty
The world and the actor may not be deterministic, or our model of the world may be incomplete.
We assume the Markov property: the future depends on the past only through the current state.
We describe the environment by a distribution over rewards and state transitions:
  P(s_{t+1} = s', r_{t+1} = r' | s_t = s, a_t = a)
The policy can also be non-deterministic:
  π(a | s) = P(a_t = a | s_t = s)
The policy is not a fixed sequence of actions, but instead a conditional plan.

Basic Problems
Markov Decision Problem (MDP): a tuple <S, A, P, γ>, where P is the distribution over rewards and next states, P(s_{t+1} = s', r_{t+1} = r' | s_t = s, a_t = a).
Standard MDP problems:
1. Planning: given a complete Markov decision problem as input, compute a policy with optimal expected return
2. Learning: we only have access to experience in the MDP; learn a near-optimal strategy
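As a concrete (and purely illustrative) way to hold the tuple <S, A, P, γ> in code, one could use a container like the one below; representing P as a map from (s, a) to a distribution over (next state, reward) pairs is an assumption, since the slides leave the representation open.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

@dataclass
class MDP:
    states: Sequence[str]
    actions: Sequence[str]
    # P[(s, a)] = {(s_next, r): probability}
    P: Dict[Tuple[str, str], Dict[Tuple[str, float], float]]
    gamma: float = 0.9
```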

Example of a Standard MDP Problem
1. Planning: given a complete Markov decision problem as input, compute a policy with optimal expected return
2. Learning: we only have access to experience in the MDP; learn a near-optimal strategy
We will focus on learning, but discuss planning along the way.

Exploration vs. Exploitation
If we knew how the world works (embodied in P), then the policy should be deterministic: just select the optimal action in each state.
But if we do not have complete knowledge of the world, taking what appears to be the optimal action may prevent us from finding better states/actions.
Interesting trade-off: immediate reward (exploitation) vs. gaining knowledge that might enable higher future reward (exploration).
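One standard way to manage this trade-off in practice is ε-greedy action selection, sketched below; the slide names the trade-off but does not prescribe this particular scheme. With ε = 0 the rule reduces to pure exploitation, and larger ε spends more steps exploring.

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """q_values: dict mapping each action to its current value estimate."""
    if random.random() < epsilon:
        return random.choice(actions)                            # explore
    return max(actions, key=lambda a: q_values.get(a, 0.0))      # exploit
```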

Bellman Equation
Decision theory: maximize expected utility (related to rewards).
Define the value function V(s): it measures the accumulated future reward (value) obtainable from state s.
The relationship between a current state and its successor state is defined by the Bellman equation:
  V(s) = E[ r_{t+1} + γ V(s_{t+1}) | s_t = s ]
Discount factor γ: controls whether we care only about immediate reward, or can appreciate delayed gratification.
One can show that if value functions are updated via the Bellman equation and γ < 1, V() converges to the optimal value (an estimate of the expected reward under the best policy).
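A sketch of what "updating via the Bellman equation" can look like in the deterministic δ()/r() setting used later in these slides (i.e., value iteration); the Python interface, with δ and r passed in as functions, is an assumption for illustration.

```python
def value_iteration(states, actions, delta, r, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        # Bellman update: V(s) <- max_a [ r(s, a) + gamma * V(delta(s, a)) ]
        V_new = {s: max(r(s, a) + gamma * V[delta(s, a)] for a in actions)
                 for s in states}
        # With gamma < 1 the update is a contraction, so this converges.
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```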

Expected Value of a Policy
Key recursive relationship between the value function at successive states.
If we fix some policy π (which defines the distribution over actions for each state), then the value of a state is the expected discounted reward obtained by following that policy from that state on:
  V^π(s) = E_π[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... | s_t = s ]
This value function satisfies the following consistency equation (a generalized Bellman equation):
  V^π(s) = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]
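Analogously, a fixed policy π can be evaluated by iterating the consistency equation; the sketch below again assumes the deterministic δ()/r() setting for brevity, whereas in general the right-hand side is an expectation over actions, next states, and rewards.

```python
def evaluate_policy(states, policy, delta, r, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        # V_pi(s) <- r(s, pi(s)) + gamma * V_pi(delta(s, pi(s)))
        V_new = {s: r(s, policy(s)) + gamma * V[delta(s, policy(s))]
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```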

RL: Some Examples
Many natural problems have the structure required for RL:
1. Game playing: we know win/lose but not the specific moves (TD-Gammon)
2. Control: for traffic lights, we can measure the delay of cars, but not how to decrease it
3. Robot juggling
4. Robot path planning: we can tell the distance traveled, but not how to minimize it

MDP Formulation
Goal: find a policy π that maximizes the expected accumulated future reward V^π(s_t) obtained by following π from state s_t:
  V^π(s_t) = E[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... ]
Game show example: assume a series of questions, increasingly difficult, but with increasing payoff.
Choice: accept the accumulated earnings and quit, or continue and risk losing everything.

What to Learn
We might try to learn the value function V (which we write as V*). We could then do a lookahead search to choose the best action from any state s:
  π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
where
  V*(s) = max_a [ r(s, a) + γ V*(δ(s, a)) ]
Here the environment distribution factorizes as
  P(s_{t+1} = s', r_{t+1} = r' | s_t = s, a_t = a) = P(s_{t+1} = s' | s_t = s, a_t = a) P(r_{t+1} = r' | s_t = s, a_t = a),
and in the deterministic case the next state and reward are given by the functions δ(s, a) and r(s, a).
But there's a problem: this works well if we know δ() and r(), but when we don't, we cannot choose actions this way.
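For completeness, the lookahead itself is a one-liner once V*, δ(), and r() are available; this sketch simply restates the π*(s) formula above in Python, with V_star assumed to be a dictionary of optimal values.

```python
def greedy_action(s, actions, delta, r, V_star, gamma=0.9):
    # One-step lookahead on V*: argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]
    return max(actions, key=lambda a: r(s, a) + gamma * V_star[delta(s, a)])
```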

What to Learn
Let us first assume that δ() and r() are deterministic:
  V*(s) = max_a [ r(s, a) + γ V*(δ(s, a)) ]
  π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
Remember: at every time step t, we are in a state s_t, and we:
- Take an action a_t (possibly the null action)
- Receive some reward r_{t+1}, given by the reward function r : (s, a) → r
- Move into a new state s_{t+1}, given by the transition function δ : (s, a) → s'
How can we do learning?

Q Learning
Define a new function, very similar to V*:
  Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
If we learn Q, we can choose the optimal action even without knowing δ!
  π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ] = argmax_a Q(s, a)
Q is then the evaluation function we will learn.

Training Rule to Learn Q
Q and V* are closely related:
  V*(s) = max_{a'} Q(s, a')
So we can write Q recursively:
  Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a')
Let Q̂ denote the learner's current approximation to Q. Consider the training rule
  Q̂(s, a) ← r(s, a) + γ max_{a'} Q̂(s', a')
where s' is the state resulting from applying action a in state s.

Q Learning for Deterministic Worlds
For each s, a initialize the table entry Q̂(s, a) ← 0.
Start in some initial state s.
Do forever:
- Select an action a and execute it
- Receive the immediate reward r
- Observe the new state s'
- Update the table entry for Q̂(s, a) using the Q-learning rule:
    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
- s ← s'
If we reach an absorbing state, restart in the initial state and run through the "do forever" loop until we again reach an absorbing state.
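A runnable sketch of this loop is given below. The environment interface (env.reset(), env.step(a) returning the next state, the reward, and an absorbing-state flag) and the ε-greedy action selection are assumptions for illustration; the slide itself only says to select an action and execute it.

```python
import random
from collections import defaultdict

def q_learning(env, actions, gamma=0.9, epsilon=0.1, episodes=1000):
    Q = defaultdict(float)                       # Q-hat(s, a), initialized to 0
    for _ in range(episodes):
        s = env.reset()                          # restart in the initial state
        done = False
        while not done:                          # run until an absorbing state
            if random.random() < epsilon:
                a = random.choice(actions)       # occasional exploration
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)        # receive reward, observe new state
            # Q-learning rule: Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            s = s_next
    return Q
```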

Updating the Estimated Q
Assume the robot is in state s_1, with some current estimates of Q̂ as shown (in a figure not reproduced here), and it executes a rightward move.
Notice that if rewards are non-negative, then the Q̂ values only increase from 0 and approach the true Q.
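Since the slide's figure is not reproduced, here is a single update with made-up numbers that illustrates the rule; the values below are hypothetical, not taken from the slide.

```python
# Hypothetical single update: the robot moves right from s1 to s2, receives
# r = 0, and the current estimates Q-hat(s2, a') for the available actions
# are 63, 81, and 100 (made-up values).
gamma = 0.9
r = 0
q_s2 = [63, 81, 100]                  # hypothetical Q-hat(s2, a') values
q_s1_right = r + gamma * max(q_s2)    # 0 + 0.9 * 100 = 90.0
print(q_s1_right)
```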

Q Learning: Summary
The training set consists of a series of intervals (episodes): sequences of (state, action, reward) triples, each ending at an absorbing state.
Each executed action a results in a transition from state s_i to s_j; the algorithm updates Q̂(s_i, a) using the learning rule.
Intuition for a simple grid world with reward only upon entering the goal state: the Q̂ estimates improve backwards from the goal state:
1. All Q̂(s, a) start at 0
2. The first episode only updates Q̂(s, a) for the transition leading to the goal state
3. In the next episode, if we go through this next-to-last transition, we will update its Q̂(s, a) one more step back
4. Eventually the information from transitions with non-zero reward propagates throughout the state-action space