Deep Reinforcement Learning. Sargur N. Srihari

Deep Reinforcement Learning
Sargur N. Srihari (srihari@cedar.buffalo.edu)

Topics in Deep RL
1. Q-learning target function as a table
2. Learning Q as a function
3. Simple versus deep reinforcement learning
4. Deep Q Network for Atari Breakout
5. The Gym framework for RL
6. Research frontiers of RL

Definitions for Q-Learning and Grid World

r(s,a): immediate reward
Q(s,a): Q-values
V*(s): maximum discounted cumulative reward

Discounted cumulative reward of a policy π:
V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... = Σ_{i=0}^∞ γ^i r_{t+i}

Recurrent definition:
Q(s,a) = r(s,a) + γ max_{a'} Q(δ(s,a), a')
Q(s,a) = r(s,a) + γ V*(δ(s,a))
V*(s) = max_{a'} Q(s,a')

One optimal policy:
π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))] = argmax_a Q(s,a)
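To make the recurrence concrete, here is a minimal Python sketch (not from the slides; the three states, two actions, and the reward of 100 are hypothetical) that iterates Q(s,a) = r(s,a) + γ max_{a'} Q(δ(s,a), a') on a tiny deterministic world and then reads off V*(s) and π*(s):

GAMMA = 0.9
STATES = [0, 1, 2]                     # state 2 is the absorbing "goal" state
ACTIONS = ['L', 'R']

def delta(s, a):                       # deterministic transition function delta(s, a)
    if s == 2:
        return 2                       # goal state is absorbing
    return max(s - 1, 0) if a == 'L' else s + 1

def r(s, a):                           # immediate reward: 100 for entering the goal
    return 100 if s != 2 and delta(s, a) == 2 else 0

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for _ in range(50):                    # iterate the recurrent definition of Q
    Q = {(s, a): r(s, a) + GAMMA * max(Q[(delta(s, a), b)] for b in ACTIONS)
         for s in STATES for a in ACTIONS}

V_star = {s: max(Q[(s, a)] for a in ACTIONS) for s in STATES}          # V*(s) = max_a Q(s,a)
pi_star = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}   # pi*(s) = argmax_a Q(s,a)
print(Q)                               # e.g. Q[(0,'R')] = 90, Q[(1,'R')] = 100
print(V_star, pi_star)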

Q-Learning → Table Updates

The target function is a lookup table, with a distinct table entry for every state-action pair.

Training rule (deterministic case), with s' = δ(s,a):
Q̂(s,a) ← r(s,a) + γ max_{a'} Q̂(s',a')

Q(s,a) = r + γ max_{a'} Q(s',a') is called Bellman's equation, which says: the maximum future reward is the immediate reward plus the maximum future reward for the next state.

Training rule (non-deterministic case):
Q̂_n(s,a) ← (1 - α_n) Q̂_{n-1}(s,a) + α_n [r + γ max_{a'} Q̂_{n-1}(s',a')]
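As a tiny worked example (hypothetical numbers, not from the slides), suppose γ = 0.9, the immediate reward is 0, and the current estimates for the three actions available in the next state are 66, 81, and 100:

gamma, r = 0.9, 0
q_next = {'left': 66, 'right': 81, 'up': 100}      # current estimates Qhat(s', a')
det = r + gamma * max(q_next.values())             # deterministic rule: 0 + 0.9 * 100 = 90
alpha, q_old = 0.5, 72                             # old estimate Qhat_{n-1}(s,a), hypothetical
nondet = (1 - alpha) * q_old + alpha * det         # non-deterministic rule: 0.5*72 + 0.5*90 = 81
print(det, nondet)                                 # 90.0 81.0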

Iterative Q-learning using the Bellman equation

initialize Q[num_states, num_actions] arbitrarily
observe initial state s
repeat
    select and carry out an action a
    observe reward r and new state s'
    Q[s,a] = Q[s,a] + α(r + γ max_a' Q[s',a'] - Q[s,a])
    s = s'
until terminated

α is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. When α = 1, the two Q[s,a] terms cancel and the update is exactly the Bellman equation Q(s,a) = r + γ max_a' Q(s',a').
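Below is a minimal Python sketch of this loop (not from the slides). It assumes an environment object in the classic Gym style, where reset() returns a state and step(a) returns (next_state, reward, done, info); the function name and hyperparameter values are illustrative, and terminal states are not special-cased.

import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                    # Q[(s, a)], initialized arbitrarily (here: 0)
    for _ in range(episodes):
        s = env.reset()                       # observe initial state s
        done = False
        while not done:                       # repeat ... until terminated
            # select and carry out an action a (epsilon-greedy, so pairs keep being visited)
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[(s, x)])
            s_next, r, done, _ = env.step(a)  # observe reward r and new state s'
            target = r + gamma * max(Q[(s_next, x)] for x in range(n_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # the update rule above
            s = s_next                        # s = s'
    return Q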

Q-Learning is Rote Learning

The target function is an explicit entry for each state-action pair. It makes no attempt to estimate the Q-value for unseen state-action pairs by generalizing from those that have been seen.

The rote learning is inherent in the convergence theorem, which relies on every (s,a) pair being visited infinitely often. This is an unrealistic assumption for large or infinite spaces.

More practical RL systems combine ML function approximation methods with Q-learning rules.

Learning Q as a function

Replace the Q̂ table with a neural net or other generalizer, using each Q̂(s,a) update as a training example: encode s and a as inputs and train the network to output the target values of Q given by the training rules.

Deterministic: Q̂(s,a) ← r(s,a) + γ max_{a'} Q̂(s',a')
Non-deterministic: Q̂_n(s,a) ← (1 - α_n) Q̂_{n-1}(s,a) + α_n [r + γ max_{a'} Q̂_{n-1}(s',a')]

Loss function:
L = 1/2 [r + γ max_{a'} Q(s',a') - Q(s,a)]^2
where r + γ max_{a'} Q(s',a') is the target and Q(s,a) is the prediction.
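A minimal sketch of such a generalizer (not from the slides; the slides do not name a framework, so PyTorch is used here purely for illustration, and the names q_net/train_step, layer sizes, and learning rate are arbitrary choices). The state is a feature vector and the network outputs one Q-value per action, trained on the squared loss above:

import torch
import torch.nn as nn

n_features, n_actions, gamma = 8, 3, 0.9
q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def train_step(s, a, r, s_next):
    # target = r + gamma * max_a' Qhat(s', a'); held fixed (no gradient flows through it)
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max()
    prediction = q_net(s)[a]                   # Qhat(s, a) for the action actually taken
    loss = 0.5 * (target - prediction) ** 2    # L = 1/2 [target - prediction]^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage with dummy tensors: train_step(torch.randn(8), 1, 0.0, torch.randn(8))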

Simple ML vs. Deep Learning

1. Simple machine learning (e.g., SVM)
2. Deep learning (e.g., a neural net using CNNs)

Gradient descent, with backward error propagation used to compute the gradients.
http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-1.pdf

Simple RL vs. Deep RL

1. Simple reinforcement learning: Q-table learning
2. Deep reinforcement learning: Q-function learning

Deep Q Network for Atari Breakout

The game: you control a paddle at the bottom of the screen and bounce the ball back to clear all the bricks in the upper half of the screen. Each time you hit a brick, it disappears and you get a reward.
https://arxiv.org/abs/1312.5602

Neural network to play Breakout

Input to the network: screen images. Output: three actions, left, right, or press fire (to launch the ball).

We could treat this as a classification problem: given a game screen, decide left, right, or fire, and record game sessions from human players to learn from. But that is not how we learn. We don't need to be told a million times which move to choose at each screen; we just need occasional feedback that we did the right thing, and we can then figure out everything else ourselves.

This is the task of reinforcement learning.

What is state in Atari Breakout?

Game-specific representation: location of the paddle, location and direction of the ball, existence of each individual brick.

More general representation: the screen pixels contain all relevant information except the speed and direction of the ball; two consecutive screens cover these as well.

Role of deep learning

If we take the last four screen images, resize them to 84 × 84, and convert them to grayscale with 256 gray levels, we would have 256^(84·84·4) ≈ 10^67970 possible game states.

Deep learning to the rescue: deep networks are exceptionally good at coming up with good features for highly structured data.
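A minimal sketch of this preprocessing (not from the slides): resize each frame to 84 × 84, convert it to grayscale, and stack the last four frames into one state. NumPy and Pillow are used here as one possible choice, and the padding at episode start is an illustrative detail.

import numpy as np
from PIL import Image
from collections import deque

def preprocess(frame_rgb):
    # convert a raw RGB screen to an 84 x 84 grayscale image (256 gray levels)
    return np.asarray(Image.fromarray(frame_rgb).convert('L').resize((84, 84)))

frames = deque(maxlen=4)                     # the last four screens

def state_from(frame_rgb):
    frames.append(preprocess(frame_rgb))
    while len(frames) < 4:                   # at episode start, repeat the first frame
        frames.append(frames[0])
    return np.stack(frames, axis=0)          # state of shape (4, 84, 84)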

Alternative architectures for Breakout

[Figure comparing a naive architecture with a more optimal architecture; input: the four game screens, outputs: Left, Right, Fire]
https://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/

Loss Function

Q-values can be any real values, which makes this a regression task that can be optimized with a simple squared error loss:
L = 1/2 [r + γ max_{a'} Q(s',a') - Q(s,a)]^2
where r + γ max_{a'} Q(s',a') is the target and Q(s,a) is the prediction.

Deep Q Network for Breakout

Q Table Update Rule

Given a transition <s, a, r, s'>:
1. Do a feedforward pass for the current state s to get predicted Q-values for all actions.
2. Do a feedforward pass for the next state s' and calculate the maximum over all network outputs, max_{a'} Q(s',a').
3. Set the Q-value target for action a to r + γ max_{a'} Q(s',a') (using the max calculated in step 2). For all other actions, set the Q-value target to the value originally returned in step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation.
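A minimal sketch of these four steps for a minibatch of transitions (not from the slides; PyTorch is again used purely for illustration, and the network, names, shapes, and hyperparameters are assumptions; terminal states, which would use just r as the target, are not special-cased here):

import torch
import torch.nn as nn

n_features, n_actions, gamma = 8, 3, 0.99
q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def dqn_update(s, a, r, s_next):
    # s, s_next: (batch, n_features); a: (batch,) long; r: (batch,) float
    q_pred = q_net(s)                                   # step 1: Q-values for all actions in s
    with torch.no_grad():
        max_next = q_net(s_next).max(dim=1).values      # step 2: max_a' Q(s', a')
        q_target = q_pred.detach().clone()              # step 3: keep predictions for other actions...
        q_target[torch.arange(len(a)), a] = r + gamma * max_next   # ...and overwrite the taken action
    loss = nn.functional.mse_loss(q_pred, q_target)     # error is 0 for the untouched outputs
    optimizer.zero_grad()
    loss.backward()                                     # step 4: update weights by backpropagation
    optimizer.step()
    return loss.item()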

Experience Replay

Approximation of Q-values using non-linear functions is not very stable; a bag of tricks is needed for convergence. Training also takes a long time, about a week on a single GPU.

The most important trick is experience replay: during gameplay, all experiences <s, a, r, s'> are stored in a replay memory, and during training, random samples from the memory are used instead of the most recent transition. This breaks the similarity of subsequent training samples.

Human gameplay experiences can also be used.
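A minimal sketch of a replay memory (not from the slides; the class name, capacity, and batch size are illustrative): store <s, a, r, s'> transitions during play and sample random minibatches during training, so consecutive training samples are no longer similar.

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded first
    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))
    def sample(self, batch_size=32):
        # random transitions, not the most recent ones
        return random.sample(self.buffer, batch_size)
    def __len__(self):
        return len(self.buffer)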

Q-learning using experience replay

initialize replay memory D
initialize action-value function Q with random weights
observe initial state s
repeat
    select an action a
        with probability ε select a random action
        otherwise select a = argmax_a' Q(s,a')
    carry out action a
    observe reward r and new state s'
    store experience <s, a, r, s'> in replay memory D

    sample random transitions <ss, aa, rr, ss'> from replay memory D
    calculate target for each minibatch transition
        if ss' is a terminal state then tt = rr
        otherwise tt = rr + γ max_a' Q(ss', aa')
    train the Q network using (tt - Q(ss, aa))^2 as loss

    s = s'
until terminated

Gym

Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball. It is compatible with any numerical computation library, such as TensorFlow or Theano.

To get started, you'll need to have Python 3.5+ installed. Simply install gym using pip:
pip install gym
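A minimal usage sketch (not from the slides), assuming the classic Gym API in which reset() returns an observation and step() returns (obs, reward, done, info); newer releases of the library changed these signatures. The environment id is just an example.

import gym

env = gym.make('CartPole-v1')                 # any installed environment id works here
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()        # pick a random action
    obs, reward, done, info = env.step(action)
    total_reward += reward
print('episode reward:', total_reward)
env.close()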

Other research topics in RL

The case where the state is only partially observable
Designing optimal exploration strategies
Extending to continuous actions and states (https://arxiv.org/abs/1509.02971)
Learning and using an estimate of the transition function, δ̂ : S × A → S
Double Q-learning, Prioritized Experience Replay, Dueling Network Architecture

Final comments on Deep RL

Because our Q-function is initialized randomly, it initially outputs complete garbage, and we use this garbage (the maximum Q-value of the next state) as the target for the network, only occasionally folding in a tiny reward. How could it learn anything meaningful at all? The fact is that it does.

Watching them figure it out is like observing an animal in the wild, a rewarding experience by itself.