CMU 15-889e Real Life Reinforcement Learning


CMU 15-889e Real Life Reinforcement Learning Emma Brunskill Fall 2015

Class Logistics
Instructor: Emma Brunskill
TA: Christoph Dann
Time: Monday/Wednesday 1:30-2:50pm
Website: http://www.cs.cmu.edu/~ebrun/15889e/index.html
We will be using Piazza for class discussions and communication: please use this to pose all standard questions.
Office hours will be announced.

Prerequisites
Assume basic familiarity with probability, machine learning, sequential decision making under uncertainty, and programming.
It is useful but not required to have taken one or more of: Machine Learning, Stat Techniques in Robotics, Graduate AI.
Enthusiasm and creativity are required!

Class Requirements & Policy Grading Homeworks (30%) Midterm (20%) Final project (40%) Participation (10%) Late policy 4 late days to use without penalty on homeworks only across the semester. See website for full details. Collaboration: unless otherwise specified, written homeworks can be discussed with others but must be written up individually. You must write the names of the other students you collaborated with on your homework. 4

Reinforcement Learning
Learn a behavior strategy (policy) that maximizes the long-term sum of rewards in an unknown & stochastic environment

RL Examples: Intelligent Tutoring Systems

RL Examples: Robotics

RL Examples: Playing Atari (image from David Silver)

RL Examples: Healthcare decision support

Go through background knowledge check

Why is RL Different Than Other AI and Machine Learning?
(Figure: "optimization + ..."; image from Ben Van Roy)

RL: Designer Choices

RL: Designer Choices
Representation (how to represent the world, the space of actions/interventions, and the feedback signal/reward)
Algorithm for learning
Objective function
Evaluation

Common Restrictions / Constraints
Computation time

Common Restrictions / Constraints
Computation time
Data available
Restrictions on how we can act (policy class, constraints on which actions can be taken in which states)
Online vs. offline
Do we get to choose how to act, or does someone else (an expert, semi-expert; off-policy/on-policy learning)?

Desirable Properties in an RL Algorithm?

Desirable Properties in an RL Algorithm?
Convergence
Consistency
Small generalization error
Small estimation error
Small approximation error
High learning speed
Safety

Broad Classes of RL Approaches (image from David Silver)

3 Important Challenges in Real Life RL
1. From Old Data to Future Decisions
2. Quickly Learning to Act Well: Highly Sample Efficient RL
3. Beyond Expectation: Safety & Risk Sensitive RL
Most of the class will focus on these 3 topics.

Reasoning Under Uncertainty
(Slide organizes settings along two axes: whether we must learn a model of outcomes or are given a model of stochastic outcomes, and whether actions don't change the state of the world or do change it.)

Markov Decision Processes

MDP is a tuple (S, A, P, R, γ):
Set of states S
Start state s0
Set of actions A
Transitions P(s'|s,a) (or T(s,a,s'))
Rewards R(s,a,s') (or R(s) or R(s,a))
Discount γ
Policy = choice of action for each state
Utility / Value = sum of (discounted) rewards
Slide adapted from Klein
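To make the tuple concrete, here is a minimal sketch of how a finite MDP could be encoded in Python; the class name, field layout, and the toy two-state numbers are illustrative assumptions, not part of the lecture.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A finite MDP as a plain data container (illustrative sketch).
# States and actions are labelled by strings; P[(s, a)] maps each
# successor state s' to its probability P(s'|s, a).
@dataclass
class MDP:
    states: List[str]
    actions: List[str]
    start_state: str
    P: Dict[Tuple[str, str], Dict[str, float]]   # transition probabilities
    R: Dict[Tuple[str, str], float]              # rewards R(s, a)
    gamma: float                                 # discount factor

# Toy two-state example (hypothetical numbers, just to show the shape).
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    start_state="s0",
    P={("s0", "stay"): {"s0": 1.0},
       ("s0", "go"):   {"s1": 0.9, "s0": 0.1},
       ("s1", "stay"): {"s1": 1.0},
       ("s1", "go"):   {"s0": 1.0}},
    R={("s0", "stay"): 0.0, ("s0", "go"): 0.0,
       ("s1", "stay"): 1.0, ("s1", "go"): 0.0},
    gamma=0.9,
)
```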

Value of a Policy

Optimal Value & Optimal Policy
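The equations on these slides did not survive the transcription; for reference, the standard definitions they refer to are:

$$V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s,\ a_t = \pi(s_t)\right]$$

$$V^{*}(s) = \max_{\pi} V^{\pi}(s), \qquad \pi^{*} = \arg\max_{\pi} V^{\pi}(s)$$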

Bellman Equation
Holds for V*
Inspires an update rule
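Written out in the standard form consistent with the MDP notation above (the equation itself is not in the transcript):

$$V^{*}(s) = \max_{a} \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^{*}(s')\right]$$

and the update rule it inspires, applied repeatedly by value iteration:

$$V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V_{k}(s')\right]$$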

Value Iteration
1. Initialize V1(si) for all states si
2. k = 2
3. While k < desired horizon or, in the infinite-horizon case, until the values have converged:
   for all s, compute Vk(s) from Vk-1 via the Bellman update above, then increment k
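A compact implementation of this loop, reusing the illustrative MDP container sketched earlier; the stopping tolerance and function name are assumptions for the sketch, not part of the lecture.

```python
from typing import Dict, Optional

def value_iteration(mdp: MDP, horizon: Optional[int] = None,
                    tol: float = 1e-6) -> Dict[str, float]:
    """Run value iteration on a finite MDP.

    If `horizon` is given, performs Bellman backups up to that horizon;
    otherwise iterates until the max-norm change falls below `tol`.
    """
    V = {s: 0.0 for s in mdp.states}   # V1: initialize all state values
    k = 1
    while True:
        V_new = {}
        for s in mdp.states:
            # Bellman backup: best expected immediate reward plus discounted future value
            V_new[s] = max(
                sum(p * (mdp.R[(s, a)] + mdp.gamma * V[sp])
                    for sp, p in mdp.P[(s, a)].items())
                for a in mdp.actions
            )
        delta = max(abs(V_new[s] - V[s]) for s in mdp.states)
        V, k = V_new, k + 1
        if horizon is not None and k > horizon:
            break
        if horizon is None and delta < tol:
            break
    return V
```

On the toy MDP sketched above, value_iteration(toy) converges to roughly V(s1) = 10 and V(s0) ≈ 8.9.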

Will Value Iteration Converge?
Yes, if the discount factor is < 1, or if we end up in a terminal state with probability 1.
The Bellman equation is a contraction: if we apply it to two different value functions, the distance between them shrinks after applying the Bellman update to each.

Bellman Operator is a Contraction
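The property this slide refers to, stated in the max norm (a standard result; the proof is not reproduced in the transcript): writing B for the Bellman backup operator defined by the update rule above,

$$\|BV_1 - BV_2\|_{\infty} \le \gamma \, \|V_1 - V_2\|_{\infty}$$

so any two value functions move closer together by a factor of γ with each backup.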

Properties of Contraction
Only has 1 fixed point: if it had two, they would not get closer when applying the contraction function, violating the definition of a contraction.
When applying the contraction function to any argument, the value must get closer to the fixed point:
the fixed point doesn't move, and
repeated function applications yield the fixed point.

Value Iteration Converges
If the discount factor is < 1, the Bellman backup is a contraction, so value iteration converges to a unique solution, which is the optimal value function.
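Combining the contraction property with the fixed-point argument gives the usual geometric convergence rate, stated here for completeness (standard result, not on the slide):

$$\|V_k - V^{*}\|_{\infty} \le \gamma^{k}\, \|V_0 - V^{*}\|_{\infty}$$

so the error shrinks by a factor of γ with every Bellman backup.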