
Introduction to Reinforcement Learning MAL Seminar 2013-2014

RL Background
Learning by interacting with the environment: reward good behavior, punish bad behavior. Combines ideas from psychology and control theory.

The Problem
"Reinforcement learning is learning what to do--how to map situations to actions--so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards." (Sutton & Barto)

Some Examples
Mountain Car: accelerate an (underpowered) car to the top of a hill.
State observations: position (1d), velocity (1d).
Actions: apply a force of -40 N, 0 N, or +40 N.

Some Examples
Pole balancing: keep a pole upright on a moving cart.
State observations: pole angle, angular velocity.
Actions: apply force to the cart.

Some Examples
Helicopter hovering: stable hovering in the presence of wind.
Observed states: positions (3d), velocities (3d), angular rates (3d).
Actions: pitch controls (4d).

Formal Problem Definition: Markov Decision Process
A Markov Decision Process consists of:
a set of states S = {s1,...,sn} (for now: finite and discrete)
a set of actions A = {a1,...,am} (for now: finite and discrete)
a transition function T: T(s,a,s') = P(s(t+1) = s' | s(t) = s, a(t) = a)
a reward function r: r(s,a,s') = E[r(t+1) | s(t) = s, a(t) = a, s(t+1) = s']
This is the formal definition of the reinforcement learning problem. Note: it assumes the Markov property (the next state and reward are independent of history, given the current state and action).
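To make this concrete, here is a minimal Python sketch (not from the slides; the states, actions, probabilities, and rewards are purely illustrative) of a finite MDP stored as a transition table:

```python
import random

# A minimal finite MDP sketch (illustrative, not from the slides).
# States and actions are plain strings; T maps (s, a) to a list of
# (next_state, probability, reward) triples.
STATES = ["s1", "s2"]
ACTIONS = ["a1", "a2"]

T = {
    ("s1", "a1"): [("s1", 0.8, 0.0), ("s2", 0.2, 1.0)],
    ("s1", "a2"): [("s2", 1.0, 0.0)],
    ("s2", "a1"): [("s1", 1.0, 0.0)],
    ("s2", "a2"): [("s2", 1.0, 5.0)],
}

def step(state, action):
    """Sample a transition: returns (next_state, reward)."""
    outcomes = T[(state, action)]
    probs = [p for (_, p, _) in outcomes]
    next_state, _, reward = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward
```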

Goal
The goal of RL is to maximize the expected long-term future return R_t. Usually the discounted sum of rewards is used:
R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0..∞} γ^k r_{t+k+1}, with discount factor 0 ≤ γ < 1.
Note: this is not the same as maximizing the immediate reward r(s,a,s'); it takes the future into account. Other measures exist (e.g. total or average reward over time).
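As a small illustration (assuming the γ = 0.9 used later in the grid-world example), the discounted return can be computed directly from a reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted sum of a reward sequence r_{t+1}, r_{t+2}, ... (illustrative)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: a +10 reward received on the third step from now is worth
# 0.9**2 * 10 = 8.1 today, matching the value grid shown later.
print(discounted_return([0, 0, 10]))  # ~8.1
```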

Note on reward functions
RL treats the reward function as an unknown part of the environment, external to the learning agent. In practice, reward functions are typically chosen by the system designer and are therefore known. Knowing the reward function, however, does not mean we know how to maximize long-term rewards: this also depends on the system dynamics (T), which are unknown. A typical reward function gives a positive reward on reaching a goal state and zero elsewhere (as in the grid-world example below).

Policies
The agent's goal is to learn a policy π, which determines the probability of selecting each action in a given state, in order to maximize future rewards.
π(s,a) gives the probability of selecting action a in state s under policy π.
For deterministic policies we use π(s) to denote the action a for which π(s,a) = 1.
In finite MDPs it can be shown that a deterministic optimal policy always exists.

Example: grid world
(The slide shows a 5x5 grid with a START cell and a GOAL cell; transitions into the goal are labeled +10.)
States: locations 1...25.
Actions: move N, E, S, W.
Transitions: move 1 step in the selected direction (except at borders).
Rewards: +10 if the next location is the goal, 0 otherwise.
Task: find the shortest path to the goal.
Rewards can be delayed: a reward is only received when reaching the goal.
The environment is unknown: the consequences of an action can only be discovered by trying it and observing the result (new state s', reward r).
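A minimal sketch of this environment as a Python class; the exact start and goal coordinates are assumptions for illustration:

```python
class GridWorld:
    """5x5 grid-world sketch: +10 on reaching the goal, 0 otherwise (illustrative)."""

    MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

    def __init__(self, start=(4, 0), goal=(0, 2), size=5):
        self.size, self.start, self.goal = size, start, goal
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.state
        # Stay in place when a move would cross the border.
        nr = min(max(r + dr, 0), self.size - 1)
        nc = min(max(c + dc, 0), self.size - 1)
        self.state = (nr, nc)
        reward = 10.0 if self.state == self.goal else 0.0
        done = self.state == self.goal
        return self.state, reward, done
```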

Value Functions
State values (V-values): the expected future (discounted) reward when starting from state s and following policy π:
V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0..∞} γ^k r_{t+k+1} | s_t = s ]

Optimal values
A policy π is better than or equal to a policy π' (π ≥ π') iff V^π(s) ≥ V^{π'}(s) for all states s.
A policy π* is optimal iff it is better than or equal to all other policies. The associated optimal value function, denoted V*, is defined as V*(s) = max_π V^π(s).
Multiple optimal policies can exist, but they all share the same value function V*.

Optimal values example
V*(s) for the grid-world example (goal in the middle of the top row, γ = 0.9):
 9.0  10.0   0.0  10.0   9.0
 8.1   9.0  10.0   9.0   8.1
 7.2   8.1   9.0   8.1   7.2
 6.3   7.2   8.1   7.2   6.3
 5.4   6.3   7.2   6.3   5.4
(The slide also shows π*(s): arrows pointing along the shortest path to the goal.)

Q-values
Often it is easier to use state-action values (Q-values) rather than state values:
Q^π(s,a) = E_π[ R_t | s_t = s, a_t = a ]
The optimal Q-values can be expressed as:
Q*(s,a) = Σ_{s'} T(s,a,s') [ r(s,a,s') + γ max_{a'} Q*(s',a') ]
Given Q*, the optimal policy can be obtained as follows:
π*(s) = argmax_a Q*(s,a)
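A one-line illustration (variable names are assumptions) of extracting the greedy deterministic policy from a tabular Q-function stored as a dict keyed by (state, action):

```python
def greedy_policy(Q, states, actions):
    """Return the greedy (deterministic) policy pi(s) = argmax_a Q[(s, a)]."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```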

Dimensions of RL
Some issues when selecting algorithms:
Policy-based vs. value-based
On-policy vs. off-policy learning
Exploration vs. exploitation
Monte Carlo vs. bootstrapping

Policy iteration vs. value iteration
Policy iteration algorithms alternate between policy evaluation and policy improvement. Value iteration algorithms directly construct a series of estimates in order to learn the optimal value function immediately.

RL Taxonomy
Value-based: learn a value function; the policy is implicit (e.g. greedy).
Policy-based: explicitly store a policy and update it directly (e.g. using gradients).
Actor-critic: learn both a policy and a value function, and update the policy using the value function.
(Diagram on the slide: value-based ~ value iteration; actor-critic ~ policy iteration; policy-based ~ policy gradient.)

Sarsa & Q-learning
Two algorithms for online Temporal Difference (TD) control: they learn Q-values while actively controlling the system.
Both use the TD error to update value function estimates:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ V(s_{t+1}) - Q(s_t, a_t) ]
Both algorithms use bootstrapping: Q-value estimates are updated using estimates for the next state. They differ in the estimate used for the next state value V(s_{t+1}).
SARSA is on-policy: it learns the value Q^π of the active control policy π.
Q-learning is off-policy: it learns Q*, regardless of the control policy that is used.

Q-Learning

Q-Learning
Off-policy: V(s_{t+1}) = max_{a'} Q(s_{t+1}, a')
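A tabular Q-learning sketch, reusing the GridWorld class above and an ε-greedy behavior policy (covered later in these slides); the hyperparameters are illustrative, not prescribed by the slides:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.9, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning sketch (illustrative hyperparameters)."""
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            # Off-policy target: max over next-state actions.
            target = r + gamma * max(Q[(s2, x)] for x in actions) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

# Usage sketch: Q = q_learning(GridWorld(), ["N", "E", "S", "W"])
```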

SARSA

SARSA
On-policy: V(s_{t+1}) = Q(s_{t+1}, a_{t+1})
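The corresponding SARSA sketch differs only in the update target, which uses the action actually selected in the next state (same illustrative assumptions as the Q-learning sketch):

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=1000, alpha=0.9, gamma=0.9, epsilon=0.1):
    """Tabular SARSA sketch (illustrative hyperparameters)."""
    Q = defaultdict(float)

    def select(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda x: Q[(s, x)])

    for _ in range(episodes):
        s = env.reset()
        a = select(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = select(s2)
            # On-policy target: value of the action actually selected next.
            target = r + gamma * Q[(s2, a2)] * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```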

Actor-Critic
A policy iteration method consisting of two learners: the actor and the critic.
The critic learns an evaluation (values) for the current policy.
The actor updates the policy based on the critic's feedback.

Actor-critic

Actor-critic
Actor: updates the policy using the critic's estimate.
Critic: on-policy TD update of the value function.
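A minimal tabular actor-critic sketch, assuming one classic variant: a TD(0) critic for V and an actor that keeps softmax action preferences. The preference update and step sizes are illustrative choices, not taken from the slides:

```python
import math
import random
from collections import defaultdict

def actor_critic(env, actions, episodes=1000, alpha=0.1, beta=0.1, gamma=0.9):
    """Tabular actor-critic sketch: TD(0) critic, softmax-over-preferences actor."""
    V = defaultdict(float)     # critic: state values
    pref = defaultdict(float)  # actor: action preferences p(s, a)

    def policy(s):
        weights = [math.exp(pref[(s, a)]) for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            # Critic: on-policy TD error and value update.
            delta = r + gamma * V[s2] * (not done) - V[s]
            V[s] += alpha * delta
            # Actor: strengthen/weaken the taken action based on the TD error.
            pref[(s, a)] += beta * delta
            s = s2
    return pref, V
```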

Exploration vs. Exploitation
In online learning, where the system is actively controlled during learning, it is important to balance exploration and exploitation.
Exploration means trying new actions in order to observe their results; it is needed to discover good actions.
Exploitation means using what has already been learnt: selecting actions known to be good in order to obtain high rewards.
Common choices: greedy, ε-greedy, softmax.

Greedy Action Selection
Always select the action with the highest Q-value: a = argmax_a Q(s,a).
Pure exploitation, no exploration.
Will immediately converge to an action if its observed value is higher than the initial Q-values.
Can be made to explore by initializing the Q-values optimistically.

ε-greedy
With probability ε select a random action, otherwise select the greedy action.
Gives a fixed rate of exploration for fixed ε; ε can be reduced over time to decrease the amount of exploration.
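A sketch of ε-greedy selection over the tabular Q used in the earlier sketches (greedy selection is the special case ε = 0):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """epsilon-greedy selection over a tabular Q[(s, a)] (sketch)."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit (greedy)
```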

Softmax
Assign each action a probability based on its Q-value (Boltzmann distribution):
P(a|s) = e^{Q(s,a)/T} / Σ_b e^{Q(s,b)/T}
The temperature parameter T determines the amount of exploration. Large T: play more randomly; small T: play greedily. (T can also be reduced over time.)
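A corresponding softmax (Boltzmann) selection sketch; the temperature default is arbitrary:

```python
import math
import random

def softmax_action(Q, s, actions, temperature=1.0):
    """Boltzmann/softmax selection over a tabular Q[(s, a)] (sketch)."""
    weights = [math.exp(Q[(s, a)] / temperature) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```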

Bootstrapping vs. Monte Carlo
Q-learning and SARSA use bootstrapping updates:
R_t = r_{t+1} + γ V(s_{t+1})
Future returns are estimated using the value of the next state.
Monte Carlo updates use the complete return over the remainder of the episode:
R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... + γ^{T-t-1} r_T, where T is the final time step of the episode.

Bootstrapping vs. Monte Carlo
Bootstrapping: take one step (a_{t+1}), then update s_t using V(s_{t+1}).
Monte Carlo: complete the episode (a_{t+1}, a_{t+2}, a_{t+3}, ...), then update s_t using the rewards over the remainder of the episode.

Monte Carlo vs. Bootstrapping: experiment
25 x 25 grid world, +100 reward for reaching the goal, 0 otherwise, discount 0.9.
Q-learning with learning rate 0.9: Monte Carlo updates vs. bootstrapping.
(The slide shows the grid with marked start and goal positions.)

Optimal Value function

Monte Carlo vs Bootstrapping Episode 1

Monte Carlo vs Bootstrapping Episode 2

Monte Carlo vs Bootstrapping Episode 5

Monte Carlo vs Bootstrapping Episode 10

Monte Carlo vs Bootstrapping Episode 50

Monte Carlo vs Bootstrapping Episode 100

Monte Carlo vs Bootstrapping Episode 1000

Monte Carlo vs Bootstrapping Episode 10000

N-step returns
An n-step return uses n real rewards before bootstrapping:
R_t^{(n)} = r_{t+1} + γ r_{t+2} + ... + γ^{n-1} r_{t+n} + γ^n V(s_{t+n})
n = 1 gives the bootstrapping target; n reaching the end of the episode gives the Monte Carlo return.
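A sketch of computing an n-step return from stored rewards and value estimates (the indexing conventions are assumptions for illustration):

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """n-step return sketch: n rewards from step t, then bootstrap on V(s_{t+n}).

    rewards[k] holds r_{k+1}; values[k] holds the current estimate V(s_k).
    Falls back to the Monte Carlo return when t + n reaches the episode end.
    """
    T = len(rewards)               # episode length
    steps = min(n, T - t)          # rewards actually available
    G = sum((gamma ** k) * rewards[t + k] for k in range(steps))
    if t + n < T:                  # bootstrap only if the episode continues
        G += (gamma ** n) * values[t + n]
    return G
```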

Eligibility Traces
Idea: after receiving a reward, states (or state-action pairs) are updated depending on how recently they were visited.
A trace value e(s,a) is kept for each (s,a) pair. This value is increased when (s,a) is visited and decays otherwise. The TD update for a state-action pair is weighted by e(s,a).

Eligibility traces (2): accumulating traces
The trace is incremented by 1 when s is visited: e_t(s) = γλ e_{t-1}(s) + 1 if s = s_t
The trace decays when s is not visited: e_t(s) = γλ e_{t-1}(s) otherwise
λ determines the trace decay (together with γ).

Replacing Traces
The trace is set to 1 when s is visited: e_t(s) = 1 if s = s_t
The trace decays when s is not visited: e_t(s) = γλ e_{t-1}(s) otherwise

sarsa(λ)
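The slide shows the Sarsa(λ) pseudocode; below is a compact tabular sketch under the same illustrative assumptions as the earlier sketches, with a switch between replacing and accumulating traces:

```python
import random
from collections import defaultdict

def sarsa_lambda(env, actions, episodes=1000, alpha=0.9, gamma=0.9,
                 lam=0.5, epsilon=0.1, replacing=True):
    """Tabular Sarsa(lambda) sketch with replacing or accumulating traces."""
    Q = defaultdict(float)

    def select(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda x: Q[(s, x)])

    for _ in range(episodes):
        e = defaultdict(float)          # eligibility traces, reset each episode
        s = env.reset()
        a = select(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = select(s2)
            delta = r + gamma * Q[(s2, a2)] * (not done) - Q[(s, a)]
            # Bump the trace of the visited pair.
            e[(s, a)] = 1.0 if replacing else e[(s, a)] + 1.0
            # Update all traced pairs, weighted by their trace, then decay.
            for key in list(e):
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam
            s, a = s2, a2
    return Q
```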

Q(λ)

Q(λ)
Reset the traces when a non-greedy action is selected.

Q(0.5) Episode 1

Q(0.5) Episode 2

Q(0.5) Episode 5

Q(0.5) Episode 10

Q(0.5) Episode 50

Q(0.5) Episode 100

Q(0.5) Episode 1000

Q(0.5) Episode 10000

Using traces
Setting λ allows a full range of backups, from Monte Carlo (λ = 1) to bootstrapping (λ = 0).
Intermediate values of λ are often more efficient than the extremes (0 or 1).
It is often easier to reason about the number of steps a trace will last (the trace shrinks by a factor γλ each step).
Traces offer a method to apply Monte Carlo-style updates in non-episodic tasks.

Optimal λ values

Next Lecture
Read Ch. 8 in the Sutton & Barto book.
Look at the example code.
Install RLPark.
Try different trace settings in the grid-world example.