Deep Learning - Mohammad Ali Keyvanrad - Lecture 19: Deep Reinforcement Learning


OUTLINE
- Introduction
- Reinforcement Learning examples
- Mathematical formulation of the RL problem
- Deep Q-learning
- Deep Q-learning examples
1/3/2018 - M.A Keyvanrad - Deep Learning (Lecture17-Deep RL)


Introduction: Supervised Learning
Training info: desired (target) outputs
Inputs -> Supervised Learning System -> Outputs
Error = (target output - actual output)

Introduction: Unsupervised Learning
Inputs -> Unsupervised Learning System -> Outputs
Uses a measure of similarity, likelihood, etc.

Introduction: Reinforcement Learning
Training info: evaluations (rewards / penalties)
Inputs -> RL System -> Outputs (actions)


Reinforcement Learning examples:
- Cart-Pole Problem
- Robot Locomotion
- Atari Games
- Architecture diagram from the UCL Course on RL by David Silver


Mathematical formulation of the RL problem
The basic reinforcement learning setting is modeled as a Markov decision process (MDP).
Markov property: the current state completely characterizes the state of the world.
1. S: a set of environment and agent states
2. A: a set of actions of the agent
3. R: R_a(s, s') is the immediate reward after the transition from s to s' under action a
4. P: P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a) is the probability of transitioning from state s to state s' under action a
5. gamma: the discount factor, representing the difference in importance between future rewards and present rewards
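As a concrete illustration of the (S, A, R, P, gamma) tuple, here is a toy MDP in Python; the states, actions, rewards, and transition probabilities below are invented purely for illustration.

```python
import random

# A toy MDP matching the (S, A, R, P, gamma) definition above.
S = ["s0", "s1", "terminal"]
A = ["left", "right"]
gamma = 0.9  # discount factor

# P[(s, a)] -> list of (next_state, probability) pairs
P = {
    ("s0", "right"): [("s1", 0.9), ("s0", 0.1)],
    ("s0", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("terminal", 1.0)],
    ("s1", "left"):  [("s0", 1.0)],
}

# R[(s, a, s')] -> immediate reward for that transition
R = {
    ("s0", "right", "s1"): 0.0,
    ("s0", "right", "s0"): 0.0,
    ("s0", "left", "s0"): 0.0,
    ("s1", "right", "terminal"): 1.0,
    ("s1", "left", "s0"): 0.0,
}

def step(s, a):
    """Sample s' ~ P_a(s, .) and return (s', immediate reward)."""
    next_states, probs = zip(*P[(s, a)])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R[(s, a, s_next)]
```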

Algorithm:
At time step t = 0, the environment samples the initial state s_0 ~ p(s_0).
Then, for t = 0 until done:
- The agent selects action a_t
- The environment samples reward r_t ~ R(. | s_t, a_t)
- The environment samples next state s_{t+1} ~ P(. | s_t, a_t)
- The agent receives reward r_t and next state s_{t+1}
A policy pi is a function from S to A that specifies what action to take in each state.
Objective: find the policy pi* that maximizes the cumulative discounted reward sum_{t>=0} gamma^t r_t.
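The interaction loop above can be sketched as follows; the two-state chain environment and the fixed policy are invented for illustration.

```python
# Sketch of the agent-environment loop: the environment returns
# (reward, next state), the agent follows a fixed policy pi, and we
# accumulate the discounted reward sum_{t>=0} gamma^t * r_t.

def policy(state):             # pi: S -> A (here a trivial policy)
    return "advance"

def env_step(state, action):   # returns (r_t, s_{t+1}), deterministic toy env
    if state == "start":
        return 0.0, "middle"
    return 1.0, "done"         # reaching "done" ends the episode

gamma = 0.9
state, total, discount = "start", 0.0, 1.0
while state != "done":
    action = policy(state)
    reward, state = env_step(state, action)
    total += discount * reward   # cumulative discounted reward
    discount *= gamma
print(total)  # 0 + 0.9 * 1 = 0.9
```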

A simple MDP: Grid World

We want to find the optimal policy pi* that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probabilities, ...)? Maximize the expected sum of rewards! Formally: pi* = argmax_pi E[ sum_{t>=0} gamma^t r_t | pi ].

Following a policy produces sample trajectories (or paths) s_0, a_0, r_0, s_1, a_1, r_1, ...
How good is a state? The value function at state s is the expected cumulative reward from following the policy starting from state s.
How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy.
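In symbols, these two standard definitions read:

```latex
V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, s_{0} = s, \pi\right]
\qquad
Q^{\pi}(s, a) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\; a_{0} = a, \pi\right]
```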

Bellman equation
The optimal Q-value function Q* gives the maximum expected cumulative reward achievable from a given (state, action) pair. Q* satisfies a Bellman equation: the value of (s, a) equals the expected immediate reward plus the discounted value of the best action available in the next state.
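Written out, this is the standard Bellman optimality equation for Q*:

```latex
Q^{*}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right]
```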

Solving for the optimal policy: use the Bellman equation as an iterative update. Q_i converges to Q* as i goes to infinity.
What is the problem with this? It is not scalable: we must compute Q(s, a) for every state-action pair. If the state is, e.g., the current game's screen pixels, it is computationally infeasible to compute Q for the entire state space!
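A minimal sketch of this iterative update on a tiny tabular MDP (states, rewards, and transitions invented for illustration); each sweep applies Q_{i+1}(s, a) = r + gamma * max_{a'} Q_i(s', a'):

```python
gamma = 0.9
# Deterministic toy transitions: (state, action) -> (reward, next_state).
# State 2 is terminal.
model = {
    (0, "go"):   (0.0, 1),
    (0, "stay"): (0.0, 0),
    (1, "go"):   (1.0, 2),
    (1, "stay"): (0.0, 1),
}
Q = {sa: 0.0 for sa in model}

def max_q(s):
    vals = [Q[(s2, a)] for (s2, a) in Q if s2 == s]
    return max(vals) if vals else 0.0  # terminal states have value 0

for _ in range(50):  # Q_i converges toward Q* as i grows
    Q = {(s, a): r + gamma * max_q(s2) for (s, a), (r, s2) in model.items()}

print(Q[(0, "go")])  # converges to gamma * 1.0 = 0.9
```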

Think about the Breakout game.
State: screen pixels. Image size 84 x 84 (resized), 4 consecutive images, grayscale with 256 gray levels.
Solution: use a function approximator to estimate Q(s, a), e.g. a neural network!
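The 84x84x4 state construction can be sketched as frame stacking; the preprocess step below is a placeholder (a real implementation would do the grayscale conversion and the resize to 84x84):

```python
from collections import deque

# DQN-style frame stacking: the state is the 4 most recent
# preprocessed frames. Frames here are placeholder strings.
STACK = 4
frames = deque(maxlen=STACK)

def preprocess(raw_frame):
    # Placeholder: a real implementation would convert the raw pixels
    # to 256-level grayscale and resize to 84x84.
    return raw_frame

def observe(raw_frame):
    frames.append(preprocess(raw_frame))
    while len(frames) < STACK:      # pad by repetition at episode start
        frames.append(frames[-1])
    return list(frames)             # the stacked (84x84x4-shaped) state

state = observe("frame0")
```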


Deep Q-learning
Use a function with parameters to approximate the Q-function. Deep Q-learning is Q-learning in which the function approximator is a deep neural network; the function parameters are the network weights.
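To make "a function with parameters" concrete without pulling in a deep-learning library, here is a linear function approximator Q(s, a; w) trained by gradient descent on the squared TD error. The sizes and learning rate are invented for illustration; the slides' actual approximator is a deep network, but the parameters-as-weights idea is the same.

```python
# Linear Q-function: Q(s, a; w) = w[a] . s, one weight vector per action.
STATE_DIM, ACTIONS, GAMMA, LR = 4, 2, 0.9, 0.1
w = [[0.0] * STATE_DIM for _ in range(ACTIONS)]

def q_value(s, a):
    return sum(wi * si for wi, si in zip(w[a], s))

def td_update(s, a, r, s_next, done):
    # TD target: r + gamma * max_a' Q(s', a'), or just r at episode end
    target = r if done else r + GAMMA * max(q_value(s_next, b) for b in range(ACTIONS))
    error = target - q_value(s, a)
    for i in range(STATE_DIM):      # gradient of Q w.r.t. w[a] is s
        w[a][i] += LR * error * s[i]
    return error

s = [1.0, 0.0, 0.0, 0.0]
td_update(s, 0, 1.0, s, done=True)
print(round(q_value(s, 0), 2))  # one step of size LR toward target 1.0
```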


Experience Replay
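A minimal sketch of an experience-replay buffer (capacity and the toy transitions are illustrative): transitions (s, a, r, s', done) are stored in a ring buffer, and training draws random minibatches from it, which breaks the correlation between consecutive samples.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        # deque with maxlen acts as a ring buffer: old transitions
        # are evicted once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform random minibatch of stored transitions
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for t in range(100):                 # store 100 dummy transitions
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)
```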

Fixed Target Q-Network
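A sketch of the fixed-target trick: TD targets are computed with a frozen copy of the Q-network parameters, and that copy is synced to the live parameters only every C steps, which keeps the regression target from moving on every update. The "parameters" here are a plain dict and all numbers are illustrative.

```python
import copy

SYNC_EVERY = 1000
q_params = {"w": 0.0}                        # live network parameters
target_params = copy.deepcopy(q_params)     # frozen copy for TD targets

for step in range(3000):
    q_params["w"] += 0.001                   # stand-in for a gradient update
    if step % SYNC_EVERY == 0:
        target_params = copy.deepcopy(q_params)  # periodic sync
# between syncs, target_params lags behind q_params
```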

Reward / Value Range
DQN clips the reward to [-1, +1]. This prevents Q-values from becoming too large and ensures the gradients are well-conditioned.
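In code, the clipping is a one-liner:

```python
# DQN-style reward clipping to [-1, +1], keeping TD-error scale
# comparable across games with very different score magnitudes.
def clip_reward(r):
    return max(-1.0, min(1.0, r))

print(clip_reward(400.0), clip_reward(-7.0), clip_reward(0.5))  # 1.0 -1.0 0.5
```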

Deep Q-learning: Stable Deep RL

Train the Deep Q-Network
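Putting the pieces together, here is a dependency-free sketch of the training loop on a toy 5-state chain (environment, rewards, and hyperparameters all invented for illustration): epsilon-greedy exploration, experience replay, and a periodically-synced target network, with a tabular Q standing in for the deep network.

```python
import random
from collections import deque

N_STATES, ACTIONS = 5, 2                 # action 1 = right, 0 = left
GAMMA, LR, EPS, SYNC_EVERY, BATCH = 0.9, 0.5, 0.1, 50, 8

def env_step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = (s2 == N_STATES - 1)          # rightmost state is the goal
    return s2, (1.0 if done else -0.01), done

random.seed(0)
Q = [[0.0] * ACTIONS for _ in range(N_STATES)]
target_Q = [row[:] for row in Q]
replay = deque(maxlen=1000)
step = 0

for episode in range(200):
    s = 0
    for t in range(100):                 # cap episode length
        # epsilon-greedy action selection
        if random.random() < EPS:
            a = random.randrange(ACTIONS)
        else:
            a = max(range(ACTIONS), key=lambda b: Q[s][b])
        s2, r, done = env_step(s, a)
        replay.append((s, a, r, s2, done))
        # learn from a random minibatch of stored transitions,
        # computing TD targets with the frozen target network
        if len(replay) >= BATCH:
            for bs, ba, br, bs2, bdone in random.sample(replay, BATCH):
                target = br if bdone else br + GAMMA * max(target_Q[bs2])
                Q[bs][ba] += LR * (target - Q[bs][ba])
        if step % SYNC_EVERY == 0:
            target_Q = [row[:] for row in Q]   # refresh target network
        step += 1
        s = s2
        if done:
            break

greedy = [max(range(ACTIONS), key=lambda b: Q[s][b]) for s in range(N_STATES - 1)]
print(greedy)  # the learned greedy policy should move right toward the goal
```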


Deep Q-learning examples

A visualization of the learned action-value function on the game Pong.

Google's DeepMind used a deep learning technique to teach a computer to play Atari games: the agent controls the keyboard while watching the score, and its goal is to maximize the score.

Beating people in dozens of computer games
A computer program playing Doom using only raw pixel data.

References
- Stanford Convolutional Neural Networks for Visual Recognition course, Lecture 14.
- Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
- Bowen Xu, "Human-level control through deep reinforcement learning," Vehicle Intelligence Lab.
