Trust Region Policy Optimization


Trust Region Policy Optimization. Tingwu Wang, Machine Learning Group, University of Toronto

Contents
1. Introduction
   1. Problem Domain: Locomotion
   2. Related Work
2. TRPO Step-by-step
   1. The Preliminaries
   2. Find the Lower Bound for General Stochastic Policies
   3. Optimization of the Parameterized Policies
   4. From Math to a Practical Algorithm
   5. Tricks and Efficiency
   6. Summary
3. Misc
   1. Results and Problems of TRPO

Introduction

Problem Domain: Locomotion
1. The two action domains in reinforcement learning:
   1. Discrete action space
      1. Only a few actions are available (up, down, left, right)
      2. Q-value based methods (DQN [1], or DQN + MCTS [2])

Problem Domain: Locomotion
1. The two action domains in reinforcement learning:
   1. Discrete action space
   2. Continuous action space
      1. One of the most interesting problems: locomotion
      2. MuJoCo: a physics engine for model-based control [3]
      3. TRPO [4] (today's focus)
         1. One of the most important baselines in model-free continuous control [5]
         2. It works for discrete action spaces too

Problem Domain: Locomotion
1. The two action domains in reinforcement learning:
   1. Discrete action space
   2. Continuous action space
   3. Differences between discrete & continuous
      1. Raw-pixel input
         1. Control versus perception
      2. Dynamical model
         1. Game dynamics versus physical models
      3. Reward shaping
         1. Zero-one reward versus continuous reward at every time step

Related Work
1. REINFORCE algorithm [6]
2. Deep Deterministic Policy Gradient [7]
3. TNPG method [8]
   1. Very similar to TRPO
   2. TRPO uses a fixed KL divergence constraint rather than a fixed penalty coefficient
   3. Similar performance according to Duan et al. [5]

TRPO Step-by-step

The Preliminaries
1. The objective function to optimize.
2. Can we express the expected return of another policy in terms of the advantage over the original policy? Yes, as originally proven in [8] (see whiteboard 1). The identity below shows that a guaranteed increase in performance is possible.
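The slides defer the derivation to the whiteboard; for reference, the identity in the notation of the TRPO paper [4], with $\eta$ the expected discounted return and $A_\pi$ the advantage function, reads

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\left[ \sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t) \right].$$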

The Preliminaries
3. Can we remove the dependency on the discounted visitation frequencies under the new policy?
   1. The local approximation
   2. The lower bound from conservative policy iteration [8]
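In the paper's notation [4], the local approximation replaces the new policy's visitation frequencies with those of the old policy,

$$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a),$$

and conservative policy iteration bounds the true return of the mixture policy $\pi_{\mathrm{new}} = (1-\alpha)\,\pi_{\mathrm{old}} + \alpha\,\pi'$ from below:

$$\eta(\pi_{\mathrm{new}}) \ge L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{2 \epsilon \gamma}{(1-\gamma)^2}\,\alpha^{2}, \qquad \epsilon = \max_{s} \left| \mathbb{E}_{a \sim \pi'(\cdot \mid s)}\left[ A_{\pi}(s, a) \right] \right|.$$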

Find the Lower Bound for General Stochastic Policies
1. Can the bound be extended to general stochastic policies, rather than just mixture policies? (see whiteboard)
2. Can we make the equation even simpler? (later we make it easier to work with by approximating the maximum of the KL divergence with the average KL)

Find the Lower Bound for General Stochastic Policies
3. Now, what is the objective function we are trying to maximize? Guaranteed improvement! (a minorization-maximization algorithm; see the bound below)
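For general stochastic policies, the paper [4] replaces $\alpha^2$ with the maximum KL divergence between the two policies, which gives the minorant that the MM procedure maximizes at each iteration:

$$\eta(\tilde{\pi}) \ge L_{\pi}(\tilde{\pi}) - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \qquad C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}, \quad \epsilon = \max_{s, a} \left| A_{\pi}(s, a) \right|.$$

Maximizing this lower bound at each iteration guarantees that the true return $\eta$ does not decrease.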

Optimization of the Parameterized Policies
1. In practice, if we used the penalty coefficient C recommended by the theory above, the step sizes would be very small.
2. One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint (written out below).
   1. Use the average KL instead of the maximum KL (a heuristic approximation)
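The resulting constrained problem from the paper [4], with $\theta$ the policy parameters and $\delta$ the trust region size:

$$\max_{\theta}\; L_{\theta_{\mathrm{old}}}(\theta) \quad \text{subject to} \quad \bar{D}_{\mathrm{KL}}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \le \delta,$$

where $\bar{D}_{\mathrm{KL}}$ is the KL divergence averaged over the states visited under the old policy.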

From Math to a Practical Algorithm
1. Sample-based estimation of the objective and constraint (a code sketch follows)
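Replacing the sums over states and actions with Monte Carlo estimates gives the importance-sampled surrogate $\mathbb{E}\big[\frac{\pi_{\theta}(a \mid s)}{q(a \mid s)}\, Q_{\theta_{\mathrm{old}}}(s, a)\big]$ subject to the average-KL constraint. Below is a minimal PyTorch-style sketch of the two sample estimates; the function and argument names are illustrative, not from the slides or the paper.

import torch
from torch.distributions import kl_divergence

def surrogate_loss(new_log_probs, old_log_probs, advantages):
    # Importance-sampled estimate of L(theta): the mean over sampled
    # (s, a) pairs of pi_theta(a|s) / pi_old(a|s) * A(s, a).
    ratio = torch.exp(new_log_probs - old_log_probs)
    return (ratio * advantages).mean()

def average_kl(old_dist, new_dist):
    # Sample estimate of the trust-region constraint: the KL divergence
    # between old and new action distributions, averaged over states.
    return kl_divergence(old_dist, new_dist).mean()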

Tricks and Efficiency
1. Searching for the next parameter (a sketch follows):
   1. Compute a search direction, using a linear approximation to the objective and a quadratic approximation to the constraint
   2. Use the conjugate gradient algorithm to solve for the direction
   3. Compute the maximal step length, then decay it exponentially (backtracking line search)
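A minimal NumPy sketch of these two steps, assuming a hypothetical helper fvp(v) that returns the Fisher-vector product F v and a function surrogate(theta) that evaluates the sampled objective (both names are illustrative):

import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    # Solve F x = g for the search direction x without ever forming the
    # Fisher matrix F explicitly; only Fisher-vector products are needed.
    x = np.zeros_like(g)
    r = g.copy()      # residual g - F x (x starts at zero)
    p = r.copy()      # CG search direction
    r_dot = r.dot(r)
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / p.dot(Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r.dot(r)
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def line_search(theta, full_step, surrogate, expected_improve,
                decay=0.5, max_backtracks=10):
    # Start from the maximal step and shrink it exponentially until the
    # surrogate objective improves by enough of the predicted amount.
    f_old = surrogate(theta)
    frac = 1.0
    for _ in range(max_backtracks):
        theta_new = theta + frac * full_step
        if surrogate(theta_new) - f_old > 0.1 * frac * expected_improve:
            return theta_new
        frac *= decay
    return theta  # no acceptable step found; keep the old parameters

The maximal step length comes from the quadratic model of the constraint: if s is the conjugate gradient solution, scaling it by sqrt(2 * delta / (s^T F s)) makes the modeled KL divergence equal the trust region size delta. A full implementation would also re-check the KL constraint on each candidate step during the line search.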

Summary
1. The original objective
2. The objective of another policy in terms of the advantage over the original policy
3. Remove the dependency on the trajectories of the new policy

Summary
4. Find the lower bound that guarantees improvement
5. Sample-based estimation
6. Use a line search (approximation, Fisher matrix, conjugate gradient)

Misc

Results and Problems of TRPO
1. Results
   1. One of the most successful baselines in locomotion
2. Problems
   1. Sample inefficiency
   2. Unable to scale to large networks

References
[1] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529.
[2] Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484.
[3] Erez, Tom, Yuval Tassa, and Emanuel Todorov. "Simulation tools for model-based robotics: Comparison of Bullet, Havok, MuJoCo, ODE and PhysX." Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015.
[4] Schulman, John, et al. "Trust region policy optimization." Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015.
[5] Duan, Yan, et al. "Benchmarking deep reinforcement learning for continuous control." Proceedings of the 33rd International Conference on Machine Learning (ICML). 2016.
[6] Williams, Ronald J. "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine Learning 8.3-4 (1992): 229-256.
[7] Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
[8] Kakade, Sham. "A natural policy gradient." Advances in Neural Information Processing Systems 14 (2002): 1531-1538.

Q&A Thanks for listening ;P