Fundamentals of Reinforcement Learning


Fundamentals of Reinforcement Learning December 9, 2013 - Techniques of AI Yann-Michaël De Hauwere - ydehauwe@vub.ac.be

Course material Slides online. T. Mitchell, Machine Learning, chapter 13, McGraw Hill, 1997. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998 (available on-line for free!). Reinforcement Learning - 2/33

Why reinforcement learning? Based on ideas from psychology: Edward Thorndike's law of effect (satisfaction strengthens behavior, discomfort weakens it) and B.F. Skinner's principle of reinforcement (Skinner Box: train animals by providing positive feedback). Learning by interacting with the environment. Reinforcement Learning - 3/33

Why reinforcement learning? Control learning: a robot learning to dock on a battery charger, learning to choose actions to optimize factory output, learning to play Backgammon and other games. Reinforcement Learning - 4/33

The RL setting Learning from interactions Learning what to do - how to map situations to actions - so as to maximize a numerical reward signal Reinforcement Learning - 5/33

Key features of RL The learner is not told which action to take (trial-and-error approach). Possibility of delayed reward: sacrifice short-term gains for greater long-term gains. Need to balance exploration and exploitation. States may be only partially observable. The agent may need to learn multiple tasks with the same sensors. RL sits in between supervised and unsupervised learning. Reinforcement Learning - 6/33

The agent-environment interface The agent interacts at discrete time steps t = 0, 1, 2, ...: it observes state s_t ∈ S, selects action a_t ∈ A(s_t), obtains immediate reward r_{t+1} ∈ R, and observes the resulting state s_{t+1}. [Figure: the agent-environment loop, with the agent sending action a_t to the environment and receiving reward r_{t+1} and state s_{t+1} in return.] Reinforcement Learning - 7/33
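This interaction protocol maps naturally onto a small control loop. Below is a minimal Python sketch of one episode of that loop; the env and agent objects and their reset/step/select_action/update methods are illustrative assumptions, not something defined in the slides.

    # Minimal sketch of the agent-environment loop described above. The env and
    # agent objects and their method names are illustrative assumptions.
    def run_episode(env, agent, max_steps=1000):
        s = env.reset()                      # observe initial state s_0
        for t in range(max_steps):
            a = agent.select_action(s)       # select action a_t in A(s_t)
            s_next, r, done = env.step(a)    # obtain reward r_{t+1} and state s_{t+1}
            agent.update(s, a, r, s_next)    # let the learner use the transition
            s = s_next
            if done:
                break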

Elements of RL Time steps need not refer to fixed intervals of real time. Actions can be low level (voltages to motors), high level (go left, go right), or mental (shift focus of attention). States can be low-level sensations (temperature, (x, y) coordinates), high-level abstractions, symbolic, or subjective and internal ("surprised", "lost"). The environment is not necessarily known to the agent. Reinforcement Learning - 8/33

Elements of RL State transitions are changes to the internal state of the agent or changes in the environment as a result of the agent's action, and can be nondeterministic. Rewards can encode goals, subgoals, duration, ... Reinforcement Learning - 9/33

Learning how to behave The agent's policy π at time t is a mapping from states to action probabilities: π_t(s, a) = P(a_t = a | s_t = s). Reinforcement learning methods specify how the agent changes its policy as a result of experience. Roughly, the agent's goal is to get as much reward as it can over the long run. Reinforcement Learning - 10/33

The objective Use discounted return instead of total reward: R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^∞ γ^k r_{t+k+1}, where γ ∈ [0, 1] is the discount factor (γ near 0: shortsighted; γ near 1: farsighted). Reinforcement Learning - 11/33
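As a quick numeric illustration of the discounted return (the reward sequence and γ below are arbitrary example values):

    # Discounted return R_t = sum_k gamma^k * r_{t+k+1} for a finite reward list.
    def discounted_return(rewards, gamma):
        return sum(gamma**k * r for k, r in enumerate(rewards))

    rewards = [1, 0, 0, 10]                        # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}
    print(discounted_return(rewards, gamma=0.9))   # 1 + 0.9^3 * 10 = 8.29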

Example: backgammon Learn to play backgammon. Immediate reward: +100 if win, -100 if lose, 0 for all other states. Trained by playing 1.5 million games against itself. Now approximately equal to the best human players. Reinforcement Learning - 12/33

Example: pole balancing A continuing task with discounted return: reward = -1 upon failure, 0 otherwise; return = -γ^k for k steps before failure. The return R_t = Σ_{k=0}^∞ γ^k r_{t+k+1} is maximized by avoiding failure for as long as possible. Reinforcement Learning - 13/33

Examples: pole balancing (movie) Reinforcement Learning - 14/33

Markov decision processes It is often useful to assume that all relevant information is present in the current state: the Markov property P(s_{t+1}, r_{t+1} | s_t, a_t) = P(s_{t+1}, r_{t+1} | s_t, a_t, r_t, s_{t-1}, a_{t-1}, ..., r_1, s_0, a_0). If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP). Assuming finite state and action spaces, it is a finite MDP. Reinforcement Learning - 15/33

Markov decision processes An MDP is defined by state and action sets, a transition function P^a_{ss'} = P(s_{t+1} = s' | s_t = s, a_t = a), and a reward function R^a_{ss'} = E(r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'). [Figure: the agent-environment loop, as on slide 7.] Reinforcement Learning - 16/33

Value functions Goal: learn π : S → A, given observations of s, a, r. When following a fixed policy π we can define the value of a state s under that policy as V^π(s) = E_π(R_t | s_t = s) = E_π(Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s). Similarly we can define the value of taking action a in state s as Q^π(s, a) = E_π(R_t | s_t = s, a_t = a). The optimal policy is π* = argmax_π V^π(s). Reinforcement Learning - 17/33


Value functions The value function has a particular recursive relationship, expressed by the Bellman equation: V^π(s) = Σ_{a ∈ A(s)} π(s, a) Σ_{s' ∈ S} P^a_{ss'} [R^a_{ss'} + γ V^π(s')]. The equation expresses the recursive relation between the value of a state and the values of its successor states, and averages over all possibilities, weighting each by its probability of occurring. Reinforcement Learning - 19/33
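The Bellman equation can be used directly as an update rule, which gives iterative policy evaluation. The sketch below does this on a toy two-state MDP; the states, transitions, rewards and policy are invented purely for illustration.

    # Iterative policy evaluation: repeatedly apply the Bellman equation as an
    # assignment. The toy MDP and policy below are invented for illustration.
    states, actions, gamma = ["s0", "s1"], ["a0", "a1"], 0.9

    # P[s][a] is a list of (probability, next_state, reward) triples.
    P = {
        "s0": {"a0": [(1.0, "s0", 0.0)], "a1": [(1.0, "s1", 1.0)]},
        "s1": {"a0": [(1.0, "s0", 0.0)], "a1": [(1.0, "s1", 0.5)]},
    }
    pi = {s: {"a0": 0.5, "a1": 0.5} for s in states}   # fixed stochastic policy

    V = {s: 0.0 for s in states}
    for _ in range(100):                               # sweep until roughly converged
        V = {s: sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                                   for p, s2, r in P[s][a])
                    for a in actions)
             for s in states}
    print(V)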

Learning an optimal policy online Often the transition and reward functions are unknown. Using temporal difference (TD) methods is one way of overcoming this problem: learn directly from raw experience, with no model of the environment required (model-free). E.g. Q-learning: update predicted state values based on new observations of immediate rewards and successor states. Reinforcement Learning - 20/33

Q-function Q(s, a) = r(s, a) + γ V*(δ(s, a)), with s_{t+1} = δ(s_t, a_t). If we know Q, we do not have to know δ: π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))] = argmax_a Q(s, a). Reinforcement Learning - 21/33

Training rule to learn Q Q and V are closely related: V*(s) = max_{a'} Q(s, a'), which allows us to write Q as Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a'). So if Q̂ represents the learner's current approximation of Q, the training rule is Q̂(s, a) ← r + γ max_{a'} Q̂(s', a'). Reinforcement Learning - 22/33

Q-learning Q-learning updates state-action values based on the immediate reward and the optimal expected return: Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]. It directly learns the optimal value function, independent of the policy being followed. Proven to converge to the optimal policy given sufficient updates for each state-action pair and a decreasing learning rate α [Watkins92, Tsitsiklis94]. Reinforcement Learning - 23/33
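Putting the update rule together with ε-greedy exploration (introduced two slides further on) gives the usual tabular algorithm. The sketch below is a minimal version; the environment interface (reset/step, an actions list) and the hyperparameter values are assumptions for the example, not part of the slides.

    import random
    from collections import defaultdict

    # Minimal tabular Q-learning sketch implementing the update rule above.
    def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)                               # Q[(state, action)], default 0

        def greedy(s):
            return max(env.actions, key=lambda a: Q[(s, a)])

        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy action selection (see the action-selection slides)
                a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
                s_next, r, done = env.step(a)
                # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
                target = r + gamma * max(Q[(s_next, a2)] for a2 in env.actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next
        return Q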

Q-learning Reinforcement Learning - 24/33

Action selection How do we select an action based on the values of the states or state-action pairs? The success of RL depends on a trade-off between exploration and exploitation: exploration is needed to prevent getting stuck in local optima, while exploitation is needed to ensure convergence. Reinforcement Learning - 25/33

Action selection Two common choices. ε-greedy: choose the best action with probability 1 − ε, and a random action with probability ε. Boltzmann exploration (softmax) uses a temperature parameter τ to balance exploration and exploitation: π_t(s, a) = e^{Q_t(s, a)/τ} / Σ_{a' ∈ A} e^{Q_t(s, a')/τ}, where τ → 0 gives pure exploitation and τ → ∞ gives pure exploration. Reinforcement Learning - 26/33
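Both selection rules are a few lines of code. In the sketch below, Q is assumed to be a table (e.g. a defaultdict(float)) indexed by (state, action) pairs and actions is the list of available actions; these names are assumptions for the example.

    import math
    import random

    # Sketch of the two action-selection rules above; Q is assumed to be a
    # dict-like table over (state, action) pairs.
    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(actions)                    # explore
        return max(actions, key=lambda a: Q[(s, a)])         # exploit

    def boltzmann(Q, s, actions, tau=1.0):
        prefs = [math.exp(Q[(s, a)] / tau) for a in actions]
        probs = [p / sum(prefs) for p in prefs]              # pi_t(s, a)
        return random.choices(actions, weights=probs, k=1)[0]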

Updating Q: in practice Reinforcement Learning - 27/33

Convergence of deterministic Q-learning Q̂ converges to Q when each (s, a) is visited infinitely often. Proof: Let a full interval be an interval during which each (s, a) is visited. Let Q̂_n be the Q-table after n updates, and let Δ_n be the maximum error in Q̂_n: Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|. Reinforcement Learning - 28/33

Convergence of deterministic Q-learning For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is
|Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|
= γ |max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a')|
≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|
≤ γ max_{s'',a'} |Q̂_n(s'', a') − Q(s'', a')|
so |Q̂_{n+1}(s, a) − Q(s, a)| ≤ γ Δ_n < Δ_n. Reinforcement Learning - 29/33

Extensions Multi-step TD Instead of observing one immediate reward, use n consecutive rewards for the value update Intuition: your current choice of action may have implications for the future Eligibility traces State-action pairs are eligible for future rewards, with more recent states getting more credit Reinforcement Learning - 30/33
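The slides only sketch the idea; one common way to bookkeep eligibility traces is shown below as a single SARSA(λ)-style update step. The slides do not name a specific algorithm, and the table layout and hyperparameters are assumptions for the sketch.

    from collections import defaultdict

    # One illustrative eligibility-trace update (a SARSA(lambda)-style step).
    Q = defaultdict(float)   # Q[(state, action)]
    E = defaultdict(float)   # eligibility trace e[(state, action)]

    def sarsa_lambda_step(s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, lam=0.8):
        delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]  # TD error
        E[(s, a)] += 1.0                                     # mark (s, a) as eligible
        for key in list(E):
            Q[key] += alpha * delta * E[key]                 # recent pairs get more credit
            E[key] *= gamma * lam                            # traces decay over time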

Extensions Reward shaping: incorporate domain knowledge to provide additional rewards during an episode, guiding the agent to learn faster; (optimal) policies are preserved given a potential-based shaping function [Ng99]. Function approximation: so far we have used a tabular notation for value functions, which becomes intractable for large state and action spaces; function approximators can be used to generalize over large or even continuous state and action spaces. Reinforcement Learning - 31/33
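For the reward-shaping extension, the potential-based form from [Ng99] adds F(s, s') = γΦ(s') − Φ(s) to the environment reward. The sketch below illustrates this; the potential function (negative distance to a goal position) is an invented example.

    # Potential-based reward shaping [Ng99]: shaped reward = r + gamma*phi(s') - phi(s).
    def shaped_reward(r, s, s_next, phi, gamma=0.9):
        return r + gamma * phi(s_next) - phi(s)

    # Example potential on a 1-D corridor with the goal at position 10 (assumption):
    phi = lambda s: -abs(10 - s)
    print(shaped_reward(0.0, s=3, s_next=4, phi=phi))        # 0 + 0.9*(-6) - (-7) = 1.6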

Demo http://wilma.vub.ac.be:3000 Reinforcement Learning - 32/33

Questions? Reinforcement Learning - 33/33