Deep Reinforcement Learning: An Overview

Transcription:

Deep Reinforcement Learning: An Overview. PhD student, CISE department. July 10, 2018.

Background: Motivation. What is a good framework for studying intelligence? What are the necessary and sufficient ingredients for building agents that learn and act like people?

Background: Reinforcement Learning. Source: Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Vol. 1. No. 1. Cambridge: MIT Press, 1998.

Background: Reinforcement Learning, My Perspective. Reinforcement learning is necessary but not sufficient for general (strong) artificial intelligence. Source: Yann LeCun, NIPS 2016.

Background: Markov Decision Processes. Definition: A Markov decision process (MDP) is a formal way to describe the sequential decision-making problems encountered in RL. In the simplest RL setting, an MDP is specified by states S, actions A, an episode length H, and a reward function r(s, a).
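
To make these pieces concrete, here is a minimal toy sketch of such an MDP in Python. The one-dimensional track, its dynamics, and all names are illustrative assumptions, not from the slides.

```python
# A toy finite-horizon MDP: states S, actions A, horizon H, reward r(s, a).
import random

S = list(range(5))   # states: positions on a 1-D track
A = [-1, +1]         # actions: step left or step right
H = 10               # episode length

def r(s, a):
    """Reward: +1 whenever the agent sits at the right end of the track."""
    return 1.0 if s == len(S) - 1 else 0.0

def step(s, a):
    """Deterministic transition: move, clipped to the track boundaries."""
    return max(0, min(len(S) - 1, s + a))

def rollout(policy, s0):
    """Run one length-H episode from s0 and return the total reward."""
    s, total = s0, 0.0
    for _ in range(H):
        a = policy(s)
        total += r(s, a)
        s = step(s, a)
    return total

print(rollout(lambda s: random.choice(A), s0=0))  # a uniformly random policy
```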

Background: Policies and Value Functions. A policy π(a | s) is a behavior function for selecting an action given the current state. The action-value function is the expected total reward accumulated from starting in state s, taking action a, and following policy π until the end of the length-H episode:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{H} r_t \,\middle|\, s_0 = s, a_0 = a\right]$$

In other words: what is the utility of doing action a when I'm in state s?
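
The expectation in this definition can be approximated directly. Below is a minimal Monte Carlo sketch, reusing the hypothetical toy MDP above: fix (s0, a0), follow π for the rest of the episode, and average the total reward over many rollouts.

```python
def q_estimate(pi, s0, a0, n_rollouts=1000):
    """Monte Carlo estimate of Q^pi(s0, a0) on the toy MDP (H, r, step) above."""
    total = 0.0
    for _ in range(n_rollouts):
        s, a, ret = s0, a0, 0.0   # the first action is forced to a0
        for _ in range(H):
            ret += r(s, a)
            s = step(s, a)
            a = pi(s)             # afterwards, follow pi
        total += ret
    return total / n_rollouts
```

Given such estimates, the maximization on the next slide reduces, for a single state, to max(A, key=lambda a: q_estimate(pi, s, a)).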

Background: Big Picture. Find the policy π* that maximizes expected total reward, i.e.,

$$\pi^* = \operatorname{argmax}_\pi Q^\pi(s, a)$$

In particular, for any start state s_0 ∈ S, the agent can use π* to select the action a_0 that will maximize its expected total reward.

Source: http://people.csail.mit.edu/hongzi/

Playing Atari (2013). Source: Mnih, Volodymyr, et al. Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602 (2013).

Deep Q-Network (DQN)

Just Apply Gradient Descent. Represent Q^π(s, a) by a deep Q-network with weights w:

$$Q(s, a, w) \approx Q^\pi(s, a)$$

Define the objective function by the mean-squared Bellman error:

$$L(w) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w)\right)^2\right]$$

leading to the following gradient:

$$\frac{\partial L(w)}{\partial w} = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w)\right) \frac{\partial Q(s, a, w)}{\partial w}\right]$$

Optimize with stochastic gradient descent.
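
As a hedged illustration (not the authors' code), the loss above might look as follows in PyTorch, where q_net is any network mapping a batch of states to one Q-value per action, and the batch tensors are assumed to come from interaction with the environment.

```python
import torch
import torch.nn as nn

def bellman_loss(q_net, s, a, r, s_next, gamma=0.99):
    """Mean-squared Bellman error for a batch of (s, a, r, s') transitions."""
    # Q(s, a, w) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Target r + gamma * max_a' Q(s', a', w); no_grad() treats the target
    # as fixed, matching the gradient expression above
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```

A call to loss.backward() followed by a torch.optim.SGD step then implements the stochastic gradient descent update.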

Stability Issues with Deep RL. Naive Q-learning with non-linear function approximation oscillates or diverges:
- Experiences from episodes generated during training are correlated, non-i.i.d.
- The policy can change rapidly with slight changes to Q-values.
- Q-learning gradients can be large and unstable when backpropagated.

Stabilizing Deep RL.
- Maintain a replay buffer of experiences and uniformly sample from it to compute gradients for the Q-network. This decorrelates samples and improves sample efficiency.
- Hold the parameters of the target Q-values fixed in the Bellman error with a target Q-network; periodically update the parameters of the target network.
- Clip rewards, and potentially clip gradients as well.
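
A minimal sketch of the first two stabilizers, with all names and sizes illustrative: a uniformly sampled replay buffer, and a target network that is a periodically refreshed frozen copy of the Q-network.

```python
import copy
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of (s, a, r, s') transitions."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def push(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive experiences
        return random.sample(self.buf, batch_size)

def make_target(q_net):
    """Create a frozen copy of q_net (any nn.Module) for Bellman targets."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def sync(target_net, q_net):
    """Every `sync_every` steps, copy the online weights into the target net."""
    target_net.load_state_dict(q_net.state_dict())
```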

Source: Mnih, Volodymyr, et al. Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602 (2013).

Rainbow (2017). Source: Hessel, Matteo, et al. Rainbow: Combining Improvements in Deep Reinforcement Learning. arXiv:1710.02298 (2017).

AlphaGo (2015). It was thought that AI was a decade away from beating humans at Go.

Key Ingredients: tree search augmented with policy and value deep networks that intelligently control exploration and exploitation.

Monte Carlo Tree Search. Source: Silver, David, et al. Mastering the game of Go with deep neural networks and tree search. Nature 529.7587 (2016): 484-489.

AlphaZero (2017). Source: arXiv:1712.01815.

Continuous Control.
1. Locomotion Behaviors: https://www.youtube.com/watch?v=g59nsurxygk
2. Learning to Run: https://www.youtube.com/watch?v=mbjuarg DI
3. Robotics: https://www.youtube.com/watch?v=q4bmcuk6pcw&t=56s

Continuous Control. Can we learn Q^π by minimizing the expected Bellman error? For continuous action spaces $\mathcal{A} \subseteq \mathbb{R}^n$:

$$\mathbb{E}\left[\left(r + \gamma \max_{a' \in \mathcal{A}} Q(s', a', w) - Q(s, a, w)\right)^2\right]$$

The max over a continuous action space requires solving a non-convex optimization problem!
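
To see why, here is a hypothetical sketch: with a ∈ R^n there is nothing to enumerate, so even evaluating the Bellman target means running an inner optimization over the action, e.g. gradient ascent on a. Here q_net is an assumed network taking a state and an action.

```python
import torch

def approx_max_q(q_net, s, action_dim, steps=50, lr=0.1):
    """Inner-loop search for max_a Q(s, a, w) over a continuous action space."""
    a = torch.zeros(action_dim, requires_grad=True)
    opt = torch.optim.Adam([a], lr=lr)
    for _ in range(steps):
        loss = -q_net(s, a)   # ascend Q by descending -Q
        opt.zero_grad()
        loss.backward()
        opt.step()
    return a.detach()         # at best a local maximizer: Q is non-convex in a
```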

Policy Gradient Algorithms: REINFORCE.

$$\nabla_w J(w) = \sum_{i=1}^{N} \nabla_w \log \pi_w(a_i \mid s_i)(R - b)$$

R can be the sum of rewards for the episode or the discounted sum of rewards for the episode. b is a baseline, or control variate, for reducing the variance of this gradient estimator. How does this work? Ascend the policy gradient! Source: Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8.3-4 (1992): 229-256.
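
A minimal PyTorch sketch of this estimator (policy_net, mapping states to action logits, and the episode tensors are assumptions, not from the slides): minimizing the negated sum of log-probabilities weighted by (R − b) ascends the policy gradient.

```python
import torch
import torch.nn.functional as F

def reinforce_loss(policy_net, states, actions, R, b=0.0):
    """Surrogate loss whose gradient is the negated REINFORCE estimator."""
    log_probs = F.log_softmax(policy_net(states), dim=1)      # (T, num_actions)
    logp = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(logp.sum() * (R - b))   # one episode's term of the sum over i
```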

Deep Deterministic Policy Gradient (2016).
1. Let the policy be a deterministic function π(s, θ): S → A, with S ⊆ R^m and A ⊆ R^n, parameterized as a deep network.
2. Still maximize expected total reward, except now we need to compute the deterministic policy (actor) gradient and the (critic) action-value gradient, as sketched below.
3. Train both the policy and action-value networks with an actor-critic approach.
Source: Lillicrap, Timothy P., et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
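
A hedged sketch of the actor update under these assumptions (actor and critic are assumed networks; this is not the paper's reference implementation): the deterministic policy is trained to ascend the critic's value at π(s).

```python
def actor_loss(actor, critic, states):
    """DDPG-style actor objective: maximize Q(s, pi(s)) over the batch."""
    actions = actor(states)                  # deterministic a = pi(s, theta)
    # Descending -Q(s, pi(s)) ascends it; gradients flow through the critic
    # into the actor's parameters (the deterministic policy gradient)
    return -critic(states, actions).mean()
```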

Actor-Critic. Source: Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Vol. 1. No. 1. Cambridge: MIT Press, 1998.

Future Research Directions.
1. Sample-efficient learning by embedding priors about the world, e.g., intuitive physics.
2. Low-variance, unbiased policy gradient estimators.
3. Multi-agent RL (Dota 2 and StarCraft).
4. Safe RL.
5. Meta-learning and transfer learning.
6. Reinforcement learning on combinatorial action spaces.
Source: Emami, Patrick, and Sanjay Ranka. Learning Permutations with Sinkhorn Policy Gradient. arXiv preprint arXiv:1805.07010 (2018).

Source: http://blog.otoro.net/2017/11/12/evolving-stable-strategies/

Fin.