Introduction to Reinforcement Learning


How Can I...?

- Move around in the physical world (e.g. driving, navigation)
- Play and win a game
- Retrieve information over the web
- Do medical diagnosis and treatment
- Maximize the throughput of a factory
- Optimize the performance of a rescue team

All of these are problems of sequential decision making under uncertainty.

Reinforcement learning

[Diagram: agent-environment interaction loop: the agent sends an Action to the Environment and receives back a State and a Reward.]

RL: a class of learning problems in which an agent interacts with an unfamiliar, dynamic, and stochastic environment.
Goal: learn a policy that maximizes some measure of long-term reward.
Interaction: modeled as an MDP or a POMDP.

Markov decision processes

An MDP is defined as a 5-tuple (X, A, p, q, p0):
- X: state space of the process
- A: action space of the process
- p(·|x, a): probability distribution over the next state, x_{t+1} ~ p(·|x_t, a_t)
- q(·|x, a): probability distribution over the reward, R(x_t, a_t) ~ q(·|x_t, a_t)
- p0: initial state distribution

Policy: a mapping from states to actions, µ(x) ∈ A, or from states to distributions over actions, µ(·|x) ∈ Pr(A).
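As a concrete illustration, here is a minimal Python sketch of this 5-tuple for a small finite MDP. The array names (P, R, p0, policy) and the random numbers are purely illustrative, not part of the tutorial:

import numpy as np

n_states, n_actions = 4, 2
rng = np.random.default_rng(0)
# p(x'|x,a): one categorical distribution over next states per (x, a) pair
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.random((n_states, n_actions))              # expected reward R(x, a)
p0 = np.ones(n_states) / n_states                  # uniform initial state distribution
policy = np.ones((n_states, n_actions)) / n_actions  # uniform stochastic policy mu(a|x)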

Example: Backgammon

States: board configurations (about 10^20)
Actions: permissible moves
Rewards: win +1, lose -1, else 0

RL applications

- Backgammon (Tesauro, 1994)
- Inventory Management (Van Roy, Bertsekas, Lee, & Tsitsiklis, 1996)
- Dynamic Channel Allocation (e.g. Singh & Bertsekas, 1997)
- Elevator Scheduling (Crites & Barto, 1998)
- Robocup Soccer (e.g. Stone & Veloso, 1999)
- Many robots (navigation, bi-pedal walking, grasping, switching between skills, ...)
- Helicopter Control (e.g. Ng, 2003; Abbeel & Ng, 2006)
- More applications: http://neuromancer.eecs.umich.edu/cgi-bin/twiki/view/main/successesofrl

Value Function

State value function:
V^µ(x) = E^µ[ Σ_{t=0}^∞ γ^t R(x_t, µ(x_t)) | x_0 = x ]

State-action value function:
Q^µ(x, a) = E^µ[ Σ_{t=0}^∞ γ^t R(x_t, a_t) | x_0 = x, a_0 = a ]
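To make the expectation concrete, the quantity inside E^µ is the discounted return of a single trajectory. A small sketch (the function name is illustrative):

def discounted_return(rewards, gamma=0.95):
    # Sum of gamma^t * R_t along one sampled trajectory
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Averaging discounted_return over many trajectories that start at x
# (and follow mu) gives a Monte Carlo estimate of V^mu(x).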

Policy Evaluation

Finding the value function of a given policy.

Bellman equations:
V^µ(x) = Σ_{a∈A} µ(a|x) [ R(x, a) + γ Σ_{x'∈X} p(x'|x, a) V^µ(x') ]
Q^µ(x, a) = R(x, a) + γ Σ_{x'∈X} p(x'|x, a) Σ_{a'∈A} µ(a'|x') Q^µ(x', a')
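A minimal sketch of iterative policy evaluation using the first Bellman equation, assuming numpy imported as np and the tabular arrays P, R, and policy from the earlier sketch:

def evaluate_policy(P, R, policy, gamma=0.95, tol=1e-8):
    # Expected reward and transition kernel under the policy mu
    R_mu = (policy * R).sum(axis=1)              # sum_a mu(a|x) R(x, a)
    P_mu = np.einsum('xa,xay->xy', policy, P)    # sum_a mu(a|x) p(x'|x, a)
    V = np.zeros(R.shape[0])
    while True:
        V_new = R_mu + gamma * P_mu @ V          # Bellman backup for V^mu
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new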

Policy Optimization

Finding a policy µ* that maximizes V^µ(x) for all x ∈ X.

Bellman optimality equations:
V*(x) = max_{a∈A} [ R(x, a) + γ Σ_{x'∈X} p(x'|x, a) V*(x') ]
Q*(x, a) = R(x, a) + γ Σ_{x'∈X} p(x'|x, a) max_{a'∈A} Q*(x', a')

Note: if Q*(x, a) is available, then an optimal action for state x is given by any a ∈ argmax_a Q*(x, a).

Policy Optimization: Value Iteration

V_0(x) = 0
V_{t+1}(x) = max_{a∈A} [ R(x, a) + γ Σ_{x'∈X} p(x'|x, a) V_t(x') ]

Note: the backup requires the system dynamics p(x'|x, a), which are unknown in the RL setting.
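A corresponding sketch of value iteration over the same tabular model (again assuming np, P, and R from the earlier sketch); note that it needs the full transition array P, which is exactly what the RL setting lacks:

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    V = np.zeros(R.shape[0])                          # V_0(x) = 0
    while True:
        Q = R + gamma * np.einsum('xay,y->xa', P, V)  # one-step lookahead
        V_new = Q.max(axis=1)                         # max over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)            # V* and a greedy (optimal) policy
        V = V_new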

Reinforcement Learning (RL)

[Diagram: agent-environment interaction loop: Action, Reward, State.]

RL problem: solve the MDP when the transition and/or reward models are unknown.
Basic idea: use samples obtained from the agent's interaction with the environment to solve the MDP.

Model-Based vs. Model-Free RL

What is the model? The state transition distribution and the reward distribution.
- Model-based RL: the model is not available, but it is explicitly learned.
- Model-free RL: the model is not available and is not explicitly learned.

[Diagram: model-based RL (planning) loops Experience -> Model Learning -> Model -> Planning -> Value Function / Policy -> Acting -> Experience; model-free (direct) RL goes from Experience directly to the Value Function / Policy.]

Reinforcement learning solutions

- Value function algorithms: SARSA, Q-learning, Value Iteration
- Policy search algorithms: policy gradient algorithms, PEGASUS, genetic algorithms
- Actor-critic algorithms: combine a value function with policy search (Sutton et al. 2000; Konda & Tsitsiklis 2000; Peters et al. 2005; Bhatnagar, Ghavamzadeh & Sutton 2007)
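As one concrete instance from the value-function family above, here is a hedged sketch of tabular Q-learning. It assumes numpy imported as np; env is a hypothetical object with a Gym-style reset()/step() interface, and all hyperparameters are illustrative:

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()
    for _ in range(episodes):
        x, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[x].argmax())
            x_next, r, done = env.step(a)
            # off-policy update: bootstrap from the max over next actions
            target = r + gamma * Q[x_next].max() * (not done)
            Q[x, a] += alpha * (target - Q[x, a])
            x = x_next
    return Q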

Learning Modes

- Offline learning: learning while interacting with a simulator
- Online learning: learning while interacting with the environment

Offline Learning

- Agent interacts with a simulator
- Rewards/costs do not matter: no exploration/exploitation tradeoff
- Computation time between actions is not critical
- The simulator can produce as much data as we wish

Main challenge: how to minimize the time to converge to an optimal policy.

Online Learning

- No simulator: direct interaction with the environment
- Agent receives a reward/cost for each action

Main challenges:
- Exploration/exploitation tradeoff: should actions be picked to maximize immediate reward or to maximize information gain and thereby improve the policy?
- Real-time execution of actions
- Limited amount of data, since interaction with the environment is required

Bayesian Learning

The Bayesian approach

Z: hidden process; Y: observable.
Goal: infer Z from measurements of Y.
Known: the statistical dependence between Z and Y, i.e. P(Y|Z).
Place a prior P(Z) over Z, reflecting our uncertainty.
Observe Y = y.
Compute the posterior of Z:
P(Z | Y = y) = P(y | Z) P(Z) / ∫ P(y | Z') P(Z') dZ'
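A worked instance of this posterior computation using the conjugate Beta-Bernoulli pair, where the normalizing integral has a closed form (the numbers are illustrative):

# Prior P(Z) = Beta(alpha0, beta0) over an unknown success probability Z
alpha0, beta0 = 1.0, 1.0          # uniform prior
y = [1, 0, 1, 1, 0, 1]            # observed Bernoulli data Y = y
# Conjugacy: posterior P(Z | Y = y) = Beta(alpha0 + successes, beta0 + failures)
alpha_post = alpha0 + sum(y)
beta_post = beta0 + len(y) - sum(y)
posterior_mean = alpha_post / (alpha_post + beta_post)   # here 5/8 = 0.625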

Bayesian Learning

Pros:
- Principled treatment of uncertainty
- Conceptually simple
- Resistant to overfitting (the prior serves as a regularizer)
- Facilitates encoding of domain knowledge (via the prior)

Cons:
- Mathematically and computationally complex; e.g. the posterior may not have a closed form
- How do we pick the prior?

Bayesian RL

+ Systematic method for inclusion and update of prior knowledge and domain assumptions:
  - Encode uncertainty about the transition function, reward function, value function, policy, etc. with a probability distribution (belief)
  - Update the belief based on evidence (e.g. state, action, reward)
+ Appropriately reconciles exploration with exploitation:
  - Select actions based on the belief
+ Provides full distributions, not just point estimates:
  - Measure of uncertainty for performance predictions (e.g. value function, policy gradient)

Bayesian RL

- Model-based Bayesian RL: distribution over transition probabilities
- Model-free Bayesian RL: distribution over the value function, policy, or policy gradient
- Bayesian inverse RL: distribution over the reward function
- Bayesian multi-agent RL: distribution over other agents' policies
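For the model-based case, one common concrete choice is a Dirichlet belief over each transition row p(·|x, a), updated by counting observed transitions. A minimal sketch (names are illustrative; the posterior-sampling use is one option, not the tutorial's prescribed method):

import numpy as np

n_states, n_actions = 4, 2
# Dirichlet(1, ..., 1) prior over each row p(.|x, a)
counts = np.ones((n_states, n_actions, n_states))

def update_belief(counts, x, a, x_next):
    counts[x, a, x_next] += 1      # conjugate update from one observed transition
    return counts

def sample_model(counts, rng=None):
    # Draw one full transition model from the posterior (as in posterior
    # sampling / Thompson-sampling approaches to Bayesian RL)
    rng = rng or np.random.default_rng()
    return np.array([[rng.dirichlet(counts[x, a]) for a in range(counts.shape[1])]
                     for x in range(counts.shape[0])])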