Introduction to RL. Robert Platt Northeastern University. (some slides/material borrowed from Rich Sutton)


What is reinforcement learning? RL is learning through trial-and-error without a model of the world.

[Diagram: standard control system vs. reinforcement learning]

This is different from standard control/planning systems, which:
- require a model of the world, i.e. you need to hand-code the successor function
- often require the world to be expressed in a certain way, e.g. symbolic planners assume a symbolic representation and optimal control assumes an algebraic representation

RL doesn't require any of this:
- RL intuitively resembles natural learning
- RL is harder than planning because you don't get the model
- RL can be less efficient than control/planning because of its generality

The RL Setting

[Diagram: the Agent and the World, connected by Action, Observation, and Reward]

On a single time step, the agent does the following:
1. observe some information
2. select an action to execute
3. take note of any reward

Goal of the agent: select actions that maximize cumulative reward in the long run
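A minimal sketch of this interaction loop in Python (the ToyWorld and RandomAgent classes are invented for illustration; they are not part of any RL library):

import random

class ToyWorld:
    """A toy world: reward is +1 when the action matches a hidden target."""
    def __init__(self):
        self.target = random.choice(["left", "right", "up", "down"])
    def observe(self):
        return "start"                                 # 1. something to observe
    def step(self, action):
        return 1.0 if action == self.target else 0.0   # 3. reward for the action

class RandomAgent:
    def select_action(self, observation):
        return random.choice(["left", "right", "up", "down"])   # 2. pick an action
    def note_reward(self, observation, action, reward):
        pass                                           # a real agent would learn here

def run_episode(agent, world, num_steps=10):
    total = 0.0
    for _ in range(num_steps):
        obs = world.observe()
        action = agent.select_action(obs)
        reward = world.step(action)
        agent.note_reward(obs, action, reward)
        total += reward
    return total       # the agent's goal: maximize this cumulative reward

print(run_episode(RandomAgent(), ToyWorld()))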

Example: rat in a maze
Action: move left/right/up/down
Observation: position in the maze
Reward: +1 if the rat gets cheese
Goal: maximize cheese eaten

Example: robot makes coffee
Action: move robot joints
Observation: camera image
Reward: +1 if coffee is in the cup
Goal: maximize coffee produced

Example: agent plays pong
Action: joystick command
Observation: screen pixels
Reward: game score
Goal: maximize game score

Think-Pair-Share Question
How would you express the problem of playing online Texas hold'em as an RL problem?
Action = ?
Observation = ?
Reward = ?
Goal: ?

RL example
Let's say you want to program the computer to play tic-tac-toe. How might you do it?
1. Search: minimax tree search plans against the optimal opponent, not the actual opponent
2. Evolutionary computation: start with a population of random policies and have them play each other; this can be viewed as hill-climbing in policy space with respect to a fitness function
3. RL, using a value function:
- estimate a value function V(s) over states s (examples of states: ...)
- V(s) denotes the expected reward from state s (+1 win, -1 lose, 0 draw)
- Game play: the agent selects actions that lead to states with high values V(s); the agent gradually gets lots of experience of the results of executing various actions from different states
But how do we estimate the value function?
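The slide leaves this question open before turning to MENACE. One standard answer, sketched here as an assumption about where the course is headed (it is the temporal-difference update from Sutton and Barto's own tic-tac-toe example), is to nudge V(s) toward the value of the state s' that follows it: V(s) ← V(s) + α [V(s') − V(s)]. A minimal Python sketch, where the 9-character board strings and helper names are invented for illustration:

from collections import defaultdict

ALPHA = 0.1
V = defaultdict(lambda: 0.5)     # value estimate for each state, neutral start

def td_update(state, next_state=None, terminal_reward=None):
    """Move V(state) toward V(next_state), or toward the final game reward."""
    target = terminal_reward if terminal_reward is not None else V[next_state]
    V[state] += ALPHA * (target - V[state])

# After a game the agent won (+1), update every visited state in turn.
visited = [".........", "X........", "XO..X....", "XO..XX.O."]
for s, s_next in zip(visited, visited[1:]):
    td_update(s, s_next)
td_update(visited[-1], terminal_reward=1.0)      # win = +1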

RL example: MENACE
Donald Michie teaching MENACE to play tic-tac-toe (1960)
Can a machine composed only of matchboxes learn to play tic-tac-toe?


RL example: MENACE
How it works:

Bead initialization:
- first-move boxes: 4 beads per move
- second-move boxes: 3 beads per move
- third-move boxes: 2 beads per move
- fourth-move boxes: 1 bead per move

Gameplay:
- each tic-tac-toe board position corresponds to a matchbox
- at the beginning of play, each matchbox is filled with beads of different colors
- there are nine bead colors: one for each board position
- when it is MENACE's turn, open the drawer corresponding to the board configuration and select a bead randomly; make the corresponding move; leave the bead on the table and leave the matchbox open

Reward: play an entire game to its conclusion until it ends in a win, loss, or draw
- if MENACE loses the game, remove the beads from the table and throw them away
- if MENACE draws, replace each bead back into the box it came from and add an extra bead of the same color to each box
- if MENACE wins, replace each bead back into the box it came from and add THREE extra beads of the same color to each box
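A minimal Python sketch of MENACE's update rules as described above (the Menace class and its method names follow the slide's description but are otherwise invented for illustration; the tic-tac-toe game engine itself is omitted):

import random

INIT_BEADS = {1: 4, 2: 3, 3: 2, 4: 1}    # beads per move for MENACE's 1st-4th moves

class Menace:
    def __init__(self):
        self.boxes = {}      # one "matchbox" per board state: move -> bead count

    def choose(self, state, legal_moves, menace_move_number):
        # open the drawer for this board configuration, filling it if new
        if state not in self.boxes:
            n = INIT_BEADS.get(menace_move_number, 1)
            self.boxes[state] = {m: n for m in legal_moves}
        box = self.boxes[state]
        moves, counts = zip(*box.items())
        return random.choices(moves, weights=counts)[0]   # draw a bead at random

    def learn(self, history, result):
        # history: (state, move) pairs for the beads left on the table this game
        for state, move in history:
            if result == "loss":
                self.boxes[state][move] = max(self.boxes[state][move] - 1, 0)  # bead thrown away
            elif result == "draw":
                self.boxes[state][move] += 1   # bead returned plus one extra
            elif result == "win":
                self.boxes[state][move] += 3   # bead returned plus three extra

In play, choose() would be called on each of MENACE's turns and learn() once at the end of the game with the win/loss/draw result.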

Think-Pair-Share Question
Questions:
- Why did Michie use that particular bead initialization?
- Why add an extra bead when the game ends in a draw?
- How might this learning algorithm fail? How would you fix it? What tradeoff do you face?

Where does RL live?

Key challenges in RL
- no model of the environment
- the agent only gets a scalar reward signal
- delayed feedback
- the need to balance exploration of the world against exploitation of learned knowledge
- real-world problems can be non-stationary

Major historical RL successes
- Learned the world's best player of backgammon (Tesauro 1995)
- Learned acrobatic helicopter autopilots (Ng, Abbeel, Coates et al. 2006+)
- Widely used in the placement and selection of advertisements and pages on the web (e.g., A/B tests)
- Used to make strategic decisions in Jeopardy! (IBM's Watson 2011)
- Achieved human-level performance on Atari games from pixel-level visual input, in conjunction with deep learning (Google DeepMind 2015)
In all these cases, performance was better than could be obtained by any other method, and was obtained without human instruction.

Example: TD-Gammon

RL + Deep Learning on Atari Games



The singularity

The singularity
At some point, humankind will probably create a machine that is pretty smart, smarter than us in many ways. This event is generally known as the singularity, although it means slightly different things to different people.

Advances in AI abilities are coming faster (last 5 yrs)
- IBM's Watson beats the best human players of Jeopardy! (2011)
- Deep neural networks greatly improve the state of the art in speech recognition and computer vision (2012)
- Google's self-driving car becomes a plausible reality (2013)
- DeepMind's DQN learns to play Atari games at the human level, from pixels, with no game-specific knowledge (2014, Nature)
- University of Alberta's Cepheus solves poker (2015, Science)
- Google DeepMind's AlphaGo defeats the world Go champion, vastly improving over all previous programs (2016)

The singularity
At some point, humankind will probably create a machine that is pretty smart, smarter than us in many ways. This event is generally known as the singularity, although it means slightly different things to different people.
It's hard to know what would happen after that event. One thought: our new inventions might better be modeled as new species rather than new machines.

Think-Pair-Share Question
When will we understand the principles of intelligence well enough to create, using technology, artificial minds that rival our own in skill and generality? Which of the following best represents your current views?
A. Never
B. Not during your lifetime
C. During your lifetime, but not before 2045
D. Before 2045
E. Before 2035
F. It's already happened and we're all living in a simulation of reality
What do you think happens after that?

This course
Content: Most of the material comes from Sutton and Barto's book, Reinforcement Learning: An Introduction, second edition. We will also cover selected topics in deep RL not covered in that book.
Objectives:
- understand the theoretical underpinnings of RL
- gain practical knowledge of how to solve problems using RL

This course
Workload:
- written/programming assignments approximately weekly (60% of grade)
- end-of-semester project (40% of grade)
Prerequisites:
- you need to be able to write Python code and install TensorFlow
- you need to be mathematically mature, i.e. able to understand concepts explained mathematically
- background in probability and linear algebra

What do you want to learn?