Intro to Reinforcement Learning. Part 2: Ideas and Examples

Reinforcement learning lies at the intersection of psychology, artificial intelligence, neuroscience, and control theory.

Reinforcement learning
- The engineering endeavor most closely related to natural learning in animals and people
- A new (~30-year-old) class of learning algorithms, inspired by animal learning psychology and developed within machine learning and AI, for approximately solving large optimal-control problems
- RL methods have outperformed previous solution methods in many cases: game playing, robot control, auto-pilots, efficient management of queues, inventories, power systems...
- RL ideas provide a computational theory that deepens our understanding of natural learning behavior and mechanisms

Reinforcement learning is learning from interaction to achieve a goal. (Diagram: the agent sends actions to the environment; the environment returns states and rewards.)
- complete agent
- temporally situated
- continual learning & planning
- object is to affect the environment
- environment is stochastic & uncertain
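The interaction loop in the diagram can be sketched in a few lines. Everything here is illustrative: `run_episode`, the toy environment, and the random policy are hypothetical stand-ins, not any particular library's API.

```python
import random

def run_episode(env_step, env_reset, policy, max_steps=100):
    """Generic agent-environment loop: the agent observes a state,
    picks an action, and the environment returns the next state,
    a reward, and whether the episode is done."""
    state = env_reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env_step(state, action)
        total_reward += reward
        if done:
            break
    return total_reward

# Toy environment: walk along positions 0..5; reward on reaching 5.
def reset():
    return 0

def step(state, action):
    next_state = max(0, state + action)       # action in {-1, +1}
    reward = 1.0 if next_state == 5 else 0.0  # goal at position 5
    return next_state, reward, next_state == 5

policy = lambda s: random.choice([-1, 1])     # a (bad) random policy
total = run_episode(step, reset, policy)
```

The same loop structure underlies every example in these slides; only the environment and the policy change.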

States, Actions, and Rewards

Hajime Kimura's RL robots (video stills: before and after learning; backward locomotion; a new robot, same algorithm).

Devilsticking: Model-based Reinforcement Learning of Devilsticking. Stefan Schaal & Chris Atkeson (Univ. of Southern California); Finnegan Southey (University of Alberta).

The RoboCup Soccer Competition

Autonomous Learning of Efficient Gait Kohl & Stone (UTexas) 2004

Policies
- A policy maps each state to an action to take, like a stimulus-response rule
- We seek a policy that maximizes cumulative reward
- The policy is a subgoal to achieving reward
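Concretely, a tabular policy is just a state-to-action mapping; a policy can also be read off an action-value table by picking the highest-valued action in each state. The states, actions, and numbers below are purely illustrative.

```python
# A tabular policy: an explicit state -> action mapping ("stimulus-response rule").
policy = {"low_battery": "recharge", "ok_battery": "search"}

# A policy can also be derived from action values: in each state, pick the
# action with the highest estimated long-term reward (a greedy policy).
Q = {("low_battery", "recharge"): 1.2, ("low_battery", "search"): -0.4,
     ("ok_battery", "recharge"): 0.1, ("ok_battery", "search"): 0.9}

def greedy(state, actions=("recharge", "search")):
    return max(actions, key=lambda a: Q[(state, a)])
```

Most value-based RL methods work by improving a table like `Q` and then acting (mostly) greedily with respect to it.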

The reward hypothesis: that all of what we mean by goals and purposes can be well thought of as the maximization of the cumulative sum of a received scalar signal (reward). A sort of null hypothesis! Probably ultimately wrong, but so simple we have to disprove it before considering anything more complicated. (R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction)

Brain reward systems: what signal does this neuron carry? (Figure: honeybee brain, VUM neuron; Hammer & Menzel.)

Value

Value systems are hedonism with foresight
- We value situations according to how much reward we expect will follow them
- All efficient methods for solving sequential decision problems determine (learn or compute) value functions as an intermediate step
- Value systems are a means to reward, yet we care more about values than rewards
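In the standard notation of Sutton and Barto, "how much reward we expect will follow" a state s under a policy π is the state-value function:

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s \right], \qquad 0 \le \gamma \le 1,
```

where γ is the discount rate (γ = 1 in undiscounted problems such as Mountain Car below). Estimating this quantity is the intermediate step the slide refers to.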

Pleasure = immediate reward; good = long-term reward. "Even enjoying yourself you call evil whenever it leads to the loss of a pleasure greater than its own, or lays up pains that outweigh its pleasures.... Isn't it the same when we turn back to pain? To suffer pain you call good when it either rids us of greater pains than its own or leads to pleasures that outweigh them." Plato, Protagoras

Backgammon
STATES: configurations of the playing board (about 10^20)
ACTIONS: moves
REWARDS: win: +1, lose: -1, else: 0
A big game.

TD-Gammon (Tesauro, 1992-1995)
- A neural-network value function; action selection by 2-3 ply search
- TD error: V_{t+1} - V_t
- Start with a random network
- Play millions of games against itself
- Learn a value function from this simulated experience
- Six weeks later it is the best backgammon player in the world

The Mountain Car Problem (Moore, 1990). Goal at the top of the hill; gravity wins over thrust.
SITUATIONS: the car's position and velocity
ACTIONS: three thrusts: forward, reverse, none
REWARDS: always -1 until the car reaches the goal
No discounting: a minimum-time-to-goal problem.
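A minimal simulation of the mountain-car dynamics. The constants below are the ones popularized by later benchmark versions of the task; they are an assumption here, not taken from Moore's original paper.

```python
import math

def mountain_car_step(pos, vel, action):
    """One step of the classic mountain-car dynamics.
    action: -1 (reverse thrust), 0 (none), +1 (forward thrust).
    Reward is -1 per step until the goal is reached."""
    vel += 0.001 * action - 0.0025 * math.cos(3 * pos)  # thrust vs gravity
    vel = max(-0.07, min(0.07, vel))                    # velocity bounds
    pos = max(-1.2, min(0.6, pos + vel))                # position bounds
    if pos == -1.2:            # inelastic collision with the left wall
        vel = max(0.0, vel)
    done = pos >= 0.5          # goal at the top of the right hill
    return pos, vel, -1.0, done

# Constant full forward thrust from the valley floor: with these constants
# the underpowered car typically oscillates instead of reaching the goal,
# which is why it must first back up the left slope to build momentum.
pos, vel = -0.5, 0.0
for _ in range(200):
    pos, vel, r, done = mountain_car_step(pos, vel, +1)
    if done:
        break
```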

Value functions learned while solving the Mountain Car problem. Minimize time to goal: value = estimated time to goal. (Figure: learned value surface with goal region.)

Temporal-difference (TD) error: do things seem to be getting better or worse, in terms of long-term reward, at this instant in time?

What everybody should know about temporal-difference (TD) learning
- Used to learn value functions without human input
- Learns a guess from a guess
- Applied by Samuel to play checkers (1959) and by Tesauro to beat humans at backgammon (1992-5) and Jeopardy! (2011)
- Explains (accurately models) the brain reward systems of primates, rats, bees, and many other animals (Schultz, Dayan & Montague, 1997)
- Arguably solves Bellman's curse of dimensionality
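"Learning a guess from a guess" is a one-line update rule. A sketch of tabular TD(0) on a random-walk chain (the chain, constants, and function name are illustrative):

```python
import random

def td0_random_walk(n=7, alpha=0.1, gamma=1.0, episodes=2000, seed=0):
    """Tabular TD(0) on a random-walk chain: states 0..n-1, both ends
    terminal, reward +1 only on reaching the right end. After each
    transition s -> s2 with reward r, V(s) is nudged toward the
    bootstrapped target r + gamma * V(s2): a guess learned from a guess."""
    rng = random.Random(seed)
    V = [0.0] * n
    for _ in range(episodes):
        s = n // 2                          # start in the middle
        while 0 < s < n - 1:
            s2 = s + rng.choice([-1, 1])    # random-walk policy
            r = 1.0 if s2 == n - 1 else 0.0
            v_next = 0.0 if s2 in (0, n - 1) else V[s2]  # terminals are worth 0
            td_error = r + gamma * v_next - V[s]  # better or worse than expected?
            V[s] += alpha * td_error
            s = s2
    return V

values = td0_random_walk()
```

For this chain the true values of the interior states are i/(n-1), i.e. 1/6, 2/6, ..., 5/6, and the learned estimates converge toward them. Note that no update waits for the episode's final outcome: each guess is trained on the next guess, which is the key difference from Monte Carlo methods.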

Brain reward systems seem to signal TD error (Wolfram Schultz, et al.).

World models

Autonomous helicopter flight via Reinforcement Learning Ng (Stanford), Kim, Jordan, & Sastry (UC Berkeley) 2004

Reason as RL over Imagined Experience
1. Learn a predictive model of the world's dynamics: transition probabilities, expected immediate rewards
2. Use the model to generate imaginary experience: internal thought trials, mental simulation (Craik, 1943)
3. Apply RL as if the experience had really happened: vicarious trial and error (Tolman, 1932)
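A minimal sketch of these three steps in the style of Dyna-Q (the function name and constants are illustrative, and the model here is simplified to a deterministic one-step lookup rather than full transition probabilities): learn a model from real transitions, then replay imagined transitions with the very same update.

```python
import random
from collections import defaultdict

def dyna_q_step(Q, model, s, a, r, s2, actions, alpha=0.1, gamma=0.95,
                planning_steps=10, rng=random):
    """One Dyna-Q step: (1) Q-learning update from the real transition,
    (2) record it in a deterministic one-step world model,
    (3) replay imagined transitions drawn from the model with the
    same update, as if they had really happened."""
    def update(s, a, r, s2):
        best = max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

    update(s, a, r, s2)                  # 1) learn from real experience
    model[(s, a)] = (r, s2)              # 2) learn the predictive model
    for _ in range(planning_steps):      # 3) RL over imagined experience
        (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
        update(ps, pa, pr, ps2)

# One observed transition is replayed ten extra times in imagination:
Q, model = defaultdict(float), {}
dyna_q_step(Q, model, s=0, a="right", r=1.0, s2=1, actions=["left", "right"])
```

The point of the replay loop is data efficiency: each real interaction is squeezed for many updates, which is the computational analogue of the "mental simulation" the slide describes.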

GridWorld Example

Summary: RL's Computational Theory of Mind. Reward, policy, value function, predictive model: it's all created from the scalar reward signal, together with the causal structure of the world.

Personal perspective
- There is a science of mind that is neither natural science nor applications technology
- In the future, most minds will be designed rather than evolved
- Reinforcement learning is the beginning of an interdisciplinary, computational theory of mind

The great divisions, or dimensions, of RL
- Prediction vs control problems
- Tabular methods vs function approximation
- Temporal-difference learning vs Monte Carlo
- Model-based vs model-free
- Value-based vs explicitly representing the policy
And yet there is an amazing unity, a convergence.