Learning Policies by Imitating Optimal Control CS 294-112: Deep Reinforcement Learning Week 3, Lecture 2 Sergey Levine

Overview
1. Last time: learning models of system dynamics and using optimal control to choose actions
   - Global models and model-based RL
   - Local models and model-based RL with constraints
2. What if we want a policy?
   - Much quicker to evaluate actions at runtime
   - Potentially better generalization
3. Can we just backpropagate into the policy?
4. How does this relate to imitation learning?

Today's Lecture
1. Backpropagating into a policy with learned models
2. How this becomes equivalent to imitating optimal control
3. The guided policy search algorithm
4. Imitating optimal control with DAgger
5. Limitations & considerations

Goals:
- Understand how to train policies using optimal control
- Understand tradeoffs between various methods

So how can we train policies?
So far we saw how we can:
- Train global models (e.g. GPs)
- Train local models (e.g. linear models)
- Combine global and local models (e.g. using Bayesian linear regression)
But what if we want a policy?
- Don't need to replan (faster)
- Potentially better generalization (e.g. the gaze heuristic)

Backpropagate directly into the policy?
[Diagram: the policy and the learned dynamics model are unrolled over time into a single computation graph, with backprop arrows through every step.]
Easy for deterministic policies, but also possible for stochastic policies (more on this later).
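To make the picture concrete, here is a minimal sketch of backpropagating a trajectory cost through a learned dynamics model and into the policy parameters. The network sizes, cost, and horizon are hypothetical, not from the lecture; the point is only that the whole rollout is one differentiable graph.

    import torch

    # Hypothetical learned dynamics f(x, u) -> x_next and policy pi(x) -> u,
    # both assumed differentiable and trained/initialized elsewhere.
    dynamics = torch.nn.Sequential(torch.nn.Linear(6, 64), torch.nn.Tanh(), torch.nn.Linear(64, 4))
    policy = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2))

    def cost(x, u):
        # Toy quadratic cost; the real cost is task-specific.
        return (x ** 2).sum() + 0.01 * (u ** 2).sum()

    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    x = torch.zeros(4)        # initial state (hypothetical dimension)
    total_cost = 0.0

    optimizer.zero_grad()
    for t in range(50):       # unroll the learned model for the whole horizon
        u = policy(x)                        # action chosen by the policy
        x = dynamics(torch.cat([x, u]))      # predicted next state from the learned model
        total_cost = total_cost + cost(x, u)
    total_cost.backward()     # gradients flow back through every dynamics step into the policy
    optimizer.step()

This is exactly the BPTT-style computation criticized on the next slides: the gradient of the total cost reaches the policy only by passing backward through every dynamics step.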

What's the problem with backprop into the policy?
[Diagram: the same unrolled graph, annotated with "big gradients here" and "small gradients here" at different time steps.]
- Similar parameter sensitivity problems as shooting methods
- But we no longer have a convenient second-order LQR-like method, because the policy parameters couple all the time steps, so there is no dynamic programming
- Similar problems to training long RNNs with BPTT: vanishing and exploding gradients
- Unlike an LSTM, we can't just choose simple dynamics; the dynamics are chosen by nature

What's the problem? What about collocation methods?

Even simpler: generic trajectory optimization, solved however you want. How can we impose constraints on trajectory optimization?
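One natural way to write the constraint down (a sketch, using the course's usual cost notation; the policy constraint is the new ingredient):

    \min_{u_1,\dots,u_T,\; x_1,\dots,x_T,\; \theta} \;\sum_{t=1}^{T} c(x_t, u_t)
    \quad \text{s.t.} \quad x_t = f(x_{t-1}, u_{t-1}), \qquad u_t = \pi_\theta(x_t)

Here the trajectory and the policy parameters are separate optimization variables, tied together only through the constraints, which is what makes a dual-gradient-descent treatment possible.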

Review: dual gradient descent
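For reference, dual gradient descent for a generic constrained problem \min_x f(x) s.t. C(x) = 0 alternates between minimizing the Lagrangian and taking a gradient step on the dual variable:

    \mathcal{L}(x, \lambda) = f(x) + \lambda\, C(x), \qquad g(\lambda) = \min_x \mathcal{L}(x, \lambda)

    \text{repeat:} \quad x \leftarrow \arg\min_x \mathcal{L}(x, \lambda), \qquad \lambda \leftarrow \lambda + \alpha \frac{dg}{d\lambda} = \lambda + \alpha\, C(x)

The inner minimization is the expensive step; the rest of the lecture is about choosing it so that it corresponds to trajectory optimization plus supervised learning.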

A small tweak to DGD: the augmented Lagrangian
- Still converges to the correct solution
- When far from the solution, the quadratic term tends to improve stability
- Closely related to the alternating direction method of multipliers (ADMM)
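Concretely, the augmented Lagrangian just adds a quadratic penalty on the constraint to the same alternating procedure:

    \bar{\mathcal{L}}(x, \lambda) = f(x) + \lambda\, C(x) + \rho\, \|C(x)\|^2

with the dual update unchanged; when C(x) is far from zero, the quadratic term dominates and keeps the inner minimization well behaved.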

Constraining trajectory optimization with dual gradient descent
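Applied to the policy-constrained trajectory problem above, the (augmented) Lagrangian looks roughly like the following sketch; the lecture's version may differ in details such as which penalty terms are included:

    \bar{\mathcal{L}}(\tau, \theta, \lambda) = c(\tau) + \sum_{t=1}^{T} \lambda_t \big(\pi_\theta(x_t) - u_t\big) + \sum_{t=1}^{T} \rho_t\, \|\pi_\theta(x_t) - u_t\|^2

Dual gradient descent then alternates three steps: (1) minimize over the trajectory \tau with any trajectory optimizer, (2) minimize over \theta, which is just supervised regression of \pi_\theta(x_t) onto the optimized actions u_t, and (3) update \lambda_t \leftarrow \lambda_t + \alpha\,(\pi_\theta(x_t) - u_t).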

Guided policy search discussion
- Can be interpreted as a constrained trajectory optimization method
- Can be interpreted as imitation of an optimal control expert, since step 2 is just supervised learning
- The optimal control teacher adapts to the learner, and avoids actions that the learner can't mimic

General guided policy search scheme

Stochastic (Gaussian) GPS

Stochastic (Gaussian) GPS with local models

Robotics example
[Diagram: trajectory-centric RL generates the training trajectories; supervised learning trains the final policy from them.]

Input remapping trick
[Diagram: at training time the optimal control teacher uses the full state; at test time the policy runs from observations alone.]
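In symbols (a sketch): the teacher has access to the full state x_t at training time, but the final policy only conditions on the observation o_t, so the supervised step becomes something like

    \theta \leftarrow \arg\min_\theta \sum_t \big\|\pi_\theta(o_t) - u_t\big\|^2, \qquad u_t \text{ computed by the teacher from } x_t

At test time only o_t is needed, which is what lets the same machinery produce, for example, vision-based policies.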

CNN Vision-Based Policy

Case study: vision-based control with GPS

Imitating optimal control with DAgger
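The DAgger loop itself is unchanged when the expert is an optimal controller: run the current learner, relabel the visited states with the controller, aggregate, and retrain. The toy example below (a hypothetical 1-D problem with a linear policy and a known feedback law standing in for the optimal controller) is only meant to show the structure of the loop, not the lecture's actual setup.

    import numpy as np

    K = 1.5
    def expert_action(x):                        # stands in for the optimal control / MPC expert
        return -K * x

    def step(x, u):                              # simple dynamics, unknown to the learner
        return 0.9 * x + u + np.random.normal(0.0, 0.01)

    def rollout(w, T=30):
        """Run the current learner policy u = w * o; record states and observations."""
        x = np.random.normal(0.0, 1.0)
        xs, obs = [], []
        for _ in range(T):
            o = x + np.random.normal(0.0, 0.1)   # noisy observation seen by the policy
            xs.append(x); obs.append(o)
            x = step(x, w * o)
        return np.array(xs), np.array(obs)

    data_o, data_u = np.array([]), np.array([])
    w = 0.0                                      # initial (bad) policy
    for it in range(10):
        xs, obs = rollout(w)                     # 1. run the learner to get on-policy states
        labels = expert_action(xs)               # 2. expert labels the states the learner visited
        data_o = np.concatenate([data_o, obs])   # 3. aggregate the dataset
        data_u = np.concatenate([data_u, labels])
        w = np.dot(data_o, data_u) / np.dot(data_o, data_o)   # 4. supervised learning (least squares)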

A problem with DAgger: the data-collection rollouts come from running the current, possibly poor, learner policy, which can drive the system into high-cost or unsafe states; this is what PLATO (next) is designed to avoid.

Imitating MPC: the PLATO algorithm (Kahn, Zhang, Levine, Abbeel, 2016)

[Animation frames: at each step the MPC expert replans the path, steering the system away from high-cost states while data is collected for the learner.]
Input substitution trick: we need the state at training time, but not at test time!
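The key design choice in PLATO is which actions to execute while collecting data: rather than running either the raw MPC controller or the partially trained learner, it runs an MPC controller regularized toward the learner. Roughly (a sketch of the objective; see Kahn et al. 2016 for the exact form):

    \hat{\pi}(u_t \mid x_t) = \arg\min_{\hat{\pi}} \; \mathbb{E}_{\hat{\pi}}\Big[\sum_{t'=t}^{T} c(x_{t'}, u_{t'})\Big] + \lambda\, D_{\mathrm{KL}}\big(\hat{\pi}(u_t \mid x_t)\,\|\,\pi_\theta(u_t \mid o_t)\big)

so the data-collection policy stays close to what the learner would do (keeping the training distribution realistic), while MPC replanning keeps the system away from high-cost states.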

DAgger vs. GPS
- DAgger does not require an adaptive expert
  - Any expert will do, as long as states from the learned policy can be labeled
  - Assumes it is possible to match the expert's behavior up to bounded loss
  - Not always possible (e.g. partially observed domains)
- GPS adapts the expert's behavior
  - Does not require bounded loss on the initial expert (the expert will change)

Why imitate optimal control?
- Relatively stable and easy to use
  - Supervised learning works very well
  - Optimal control (usually) works very well
  - The combination of the two (usually) works very well
- Input remapping trick: can exploit the availability of additional information at training time to learn the policy from raw observations
- Overcomes the optimization challenges of backpropagating into the policy directly
- Usually sample-efficient and viable for real physical systems

Limitations of model-based RL
- Need some kind of model
  - Not always available
  - Sometimes harder to learn than the policy
- Learning the model takes time & data
  - Sometimes expressive model classes (neural nets) are not fast
  - Sometimes fast model classes (linear models) are not expressive
- Some kind of additional assumptions are needed
  - Linearizability/continuity
  - Ability to reset the system (for local linear models)
  - Smoothness (for GP-style global models)
  - Etc.

Model-free RL: trial-and-error learning
What if we didn't need a model?
Intuition: trial-and-error learning
- Much slower
- Often more general
Coming up next!