Asynchronous & Parallel Algorithms. Sergey Levine UC Berkeley


Overview
1. We learned about a number of policy search methods
2. These algorithms have all been sequential
3. Is there a natural way to parallelize RL algorithms?
   - Experience sampling vs. learning
   - Multiple learning threads
   - Multiple experience collection threads

Today's Lecture
1. High-level schematic of a generic RL algorithm
2. What can we parallelize?
3. Case studies: specific parallel RL methods
4. Tradeoffs & considerations

Goals
- Understand the high-level anatomy of reinforcement learning algorithms
- Understand standard strategies for parallelization
- Understand the tradeoffs of different parallel methods

REMINDER: PROJECT GROUPS DUE TODAY! SEND TITLE & GROUP MEMBERS TO berkeleydeeprlcourse@gmail.com

High-level RL schematic (a repeating loop): generate samples (i.e. run the policy) -> fit a model / estimate the return -> improve the policy -> back to generating samples
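To make the three boxes concrete, here is a minimal, self-contained Python sketch of that loop on a toy two-armed bandit. The bandit, the softmax policy, and all constants are illustrative stand-ins, not anything from the lecture.

```python
import numpy as np

# Toy instantiation of the loop: generate samples, estimate the return,
# improve the policy. The two-armed bandit and all constants are illustrative.
rng = np.random.default_rng(0)
ARM_MEANS = np.array([0.2, 0.8])        # hidden reward means of the toy "environment"

def generate_samples(theta, n=64):
    """Run the softmax policy and observe rewards."""
    probs = np.exp(theta) / np.exp(theta).sum()
    actions = rng.choice(2, size=n, p=probs)
    rewards = rng.normal(ARM_MEANS[actions], 0.1)
    return actions, rewards

def estimate_return(rewards):
    """Here, just baseline-subtracted Monte Carlo returns."""
    return rewards - rewards.mean()

def improve_policy(theta, actions, advantages, lr=0.1):
    """One REINFORCE-style policy gradient step."""
    probs = np.exp(theta) / np.exp(theta).sum()
    grad = np.zeros_like(theta)
    for a, adv in zip(actions, advantages):
        grad += (np.eye(2)[a] - probs) * adv      # grad log pi(a) * advantage
    return theta + lr * grad / len(actions)

theta = np.zeros(2)                               # policy parameters (logits)
for _ in range(50):                               # the (sequential) RL loop
    actions, rewards = generate_samples(theta)
    advantages = estimate_return(rewards)
    theta = improve_policy(theta, actions, advantages)

print("learned action probabilities:", np.exp(theta) / np.exp(theta).sum())
```

The next slides ask which of these three steps can be farmed out to multiple workers.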

Which parts are slow?
- Generate samples (i.e. run the policy): on a real robot/car/power grid/whatever, 1x real time, until we invent time travel; in the MuJoCo simulator, up to 10,000x real time
- Fit a model / estimate the return: depending on the algorithm, anywhere from trivial and fast to expensive but nontrivial to parallelize
- Improve the policy: depending on the algorithm, anywhere from trivial (nothing to do) to expensive but nontrivial to parallelize

Which parts can we parallelize?
- Generate samples (i.e. run the policy): multiple workers can collect samples in parallel
- Fit a model / estimate the return: parallel SGD
- Improve the policy: parallel SGD
It helps to group data generation and training: each worker generates data and computes gradients, and the gradients are pooled.

High-level decisions
1. Online or batch-mode? (batch-mode: each worker generates whole batches of samples, followed by a policy gradient update; online: each worker generates one step at a time while continually fitting a Q-value)
2. Synchronous or asynchronous?

Relationship to parallelized SGD (Dai et al. '15), applied to fitting the model/critic and improving the policy
1. Parallelizing model/critic/actor training typically means parallelizing SGD.
2. Simple parallel SGD:
   1. Each worker has a different slice of data
   2. Each worker computes gradients on its slice, sums them, and sends the result to the parameter server
   3. The parameter server sums the gradients from all workers and sends back new parameters
3. This is mathematically equivalent to SGD, but it is not asynchronous, so communication delays limit the speedup.
4. Async SGD typically does not achieve perfect parallelism either, but the lack of locks can make it much faster.
5. Which is better is somewhat problem dependent.
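A minimal sketch of the synchronous parameter-server scheme in item 2, on a toy least-squares problem. The objective, the worker count, and the use of a ProcessPoolExecutor standing in for "workers plus parameter server" are illustrative assumptions, not the setup of Dai et al.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def worker_gradient(args):
    """Each worker holds a slice of data and computes the gradient on it."""
    theta, X, y = args
    residual = X @ theta - y
    return X.T @ residual / len(y)          # gradient of 0.5 * mean squared error

def parameter_server_step(theta, grads, lr=0.1):
    """The server pools worker gradients (here averaged, i.e. summed with a
    scaled step size) and returns new parameters."""
    return theta - lr * np.mean(grads, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, true_theta = rng.normal(size=(1000, 5)), np.arange(5.0)
    y = X @ true_theta + 0.01 * rng.normal(size=1000)

    n_workers = 4
    X_slices = np.array_split(X, n_workers)   # each worker gets a different data slice
    y_slices = np.array_split(y, n_workers)
    theta = np.zeros(5)

    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        for step in range(200):
            # synchronous step: wait for every worker's gradient, then update
            grads = list(pool.map(worker_gradient,
                                  [(theta, Xs, ys) for Xs, ys in zip(X_slices, y_slices)]))
            theta = parameter_server_step(theta, grads)

    print("recovered parameters:", np.round(theta, 2))
```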

Simple example: sample parallelism with PG
(1) Workers generate samples in parallel
(2, 3, 4) Policy gradient (evaluate rewards, compute the gradient, apply it) handled centrally

Simple example: sample parallelism with PG
(1) Workers generate samples in parallel
(2) Workers evaluate rewards in parallel
(3, 4) Policy gradient computed and applied centrally

Simple example: sample parallelism with PG (Dai et al. '15)
(1) Workers generate samples in parallel
(2) Workers evaluate rewards in parallel
(3) Workers compute gradients in parallel
(4) A central process sums & applies the gradient
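A compact sketch of this last variant, again on a toy softmax bandit (an illustrative stand-in): each worker performs steps (1)-(3) locally, and only the sum-and-apply step (4) is central. Threads are used purely for brevity; a real system would use processes or machines.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

ARM_MEANS = np.array([0.2, 0.8])   # toy environment, illustrative only

def worker(theta, n=64, seed=None):
    rng = np.random.default_rng(seed)
    probs = np.exp(theta) / np.exp(theta).sum()
    actions = rng.choice(2, size=n, p=probs)          # (1) generate samples
    rewards = rng.normal(ARM_MEANS[actions], 0.1)     # (2) evaluate rewards
    advantages = rewards - rewards.mean()
    grad = np.zeros_like(theta)                       # (3) compute the gradient
    for a, adv in zip(actions, advantages):
        grad += (np.eye(2)[a] - probs) * adv
    return grad / n

theta, n_workers, lr = np.zeros(2), 4, 0.1
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    for it in range(50):
        futures = [pool.submit(worker, theta, seed=it * n_workers + i)
                   for i in range(n_workers)]
        grads = [f.result() for f in futures]
        theta += lr * np.mean(grads, axis=0)          # (4) sum & apply the gradient

print("final action probabilities:", np.exp(theta) / np.exp(theta).sum())
```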

What if we add a critic? (See John's actor-critic lecture for what the options here are.)
(1, 2) Workers collect samples & rewards
(3) Workers compute critic gradients; a central process sums & applies the critic gradient (costly synchronization)
(4, 5) Workers compute policy gradients; a central process sums & applies the policy gradient (costly synchronization)

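As a toy illustration of this layout (illustrative, not from the lecture): a contextual bandit with two states, where workers return critic gradients and policy gradients, and a central step first sums & applies the critic gradient and then the policy gradient, the two synchronization points flagged above.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Two states, two arms; in each state a different arm is better. Illustrative only.
REWARD = np.array([[0.9, 0.1],
                   [0.1, 0.9]])

def worker(theta, v, n=64, seed=None):
    rng = np.random.default_rng(seed)
    states = rng.integers(2, size=n)                        # (1, 2) samples & rewards
    probs = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
    actions = np.array([rng.choice(2, p=probs[s]) for s in states])
    rewards = rng.normal(REWARD[states, actions], 0.1)
    errors = rewards - v[states]                            # advantage = r - V(s)
    grad_v = np.zeros_like(v)                               # (3) critic gradient
    grad_theta = np.zeros_like(theta)                       # (4) policy gradient
    for s, a, e in zip(states, actions, errors):
        grad_v[s] += e
        grad_theta[s] += (np.eye(2)[a] - probs[s]) * e
    return grad_v / n, grad_theta / n

theta, v = np.zeros((2, 2)), np.zeros(2)
with ThreadPoolExecutor(max_workers=4) as pool:
    for it in range(100):
        results = list(pool.map(lambda i: worker(theta, v, seed=it * 4 + i), range(4)))
        v += 0.5 * np.mean([g_v for g_v, _ in results], axis=0)      # sum & apply critic gradient
        theta += 0.5 * np.mean([g_t for _, g_t in results], axis=0)  # sum & apply policy gradient

print(np.round(np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True), 2))
```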

What if we run online? Only the parameter update (actor + critic params) requires synchronization.
(1, 2) Each worker collects samples & rewards online
(3) Workers compute critic gradients; sum & apply the critic gradient
(4, 5) Workers compute policy gradients; sum & apply the policy gradient

Actor-critic algorithm: A3C (Mnih et al. '16)
Some differences vs. DQN, DDPG, etc.:
- No replay buffer; instead, rely on the diversity of samples from different workers to decorrelate
- Some variability in exploration between workers
- Pro: generally much faster in terms of wall-clock time
- Con: generally much slower in terms of the number of samples (more on this later)
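A rough sketch of the A3C structure, again on the toy bandit stand-in: each worker thread interacts with the environment and applies lock-free updates to shared actor and critic parameters as soon as its gradients are ready, with no replay buffer and no synchronization barrier. This is only a structural illustration (a single state, one-step advantages), not the n-step Atari setup of Mnih et al.

```python
import numpy as np
import threading

ARM_MEANS = np.array([0.2, 0.8])        # toy one-state environment (illustrative)
theta = np.zeros(2)                     # shared actor parameters (logits)
v = np.zeros(1)                         # shared critic (single state value)

def a3c_worker(worker_id, n_updates=500, lr=0.05):
    # Workers here differ only by random seed; real A3C also varies
    # exploration settings between workers.
    rng = np.random.default_rng(worker_id)
    for _ in range(n_updates):
        probs = np.exp(theta) / np.exp(theta).sum()
        a = rng.choice(2, p=probs)                      # act with the current shared policy
        r = rng.normal(ARM_MEANS[a], 0.1)               # environment step
        adv = r - v[0]                                  # advantage from the shared critic
        v[0] += lr * adv                                # critic update, no locks, no barrier
        theta[:] += lr * (np.eye(2)[a] - probs) * adv   # actor update, no locks, no barrier

threads = [threading.Thread(target=a3c_worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned action probabilities:", np.exp(theta) / np.exp(theta).sum())
```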

Actor-critic algorithm: A3C. [Figure: learning curves comparing A3C with DDPG (more on this later): roughly 1,000,000 steps for DDPG vs. roughly 20,000,000 steps for A3C to reach comparable performance.]

Model-based algorithms: parallel GPS (Yahya, Li, Kalakrishnan, Chebotar, Levine, '16)
(1) Rollout execution: parallelize sampling
(2, 3) Local policy optimization: parallelize dynamics fitting and LQR
(4) Global policy optimization: parallelize SGD


Real-world model-free deep RL: parallel NAF (Gu*, Holly*, Lillicrap, Levine, '16)

Simplest example: sample parallelism with off-policy algorithms: multiple workers collect samples in parallel, all feeding the training of a single grasp success predictor.
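A sketch of this data flow under simple assumptions: a few sampler threads push (features, success) pairs into a shared buffer while a single training thread fits a logistic "success predictor" on whatever data has arrived. The labeling rule, the model, and the threading setup are all illustrative placeholders, not the actual grasping system.

```python
import numpy as np
import threading, collections, time

buffer = collections.deque(maxlen=10000)   # shared experience buffer
stop = threading.Event()
TRUE_W = np.array([1.5, -2.0])             # hidden rule deciding "grasp success" (toy)

def sampler(seed):
    """Collect experience in parallel; off-policy training just logs it."""
    rng = np.random.default_rng(seed)
    while not stop.is_set():
        x = rng.normal(size=2)                             # observed features
        success = rng.random() < 1 / (1 + np.exp(-x @ TRUE_W))
        buffer.append((x, float(success)))

def trainer(n_steps=2000, lr=0.1, batch=32):
    """Fit a logistic success predictor on whatever the samplers have produced."""
    rng = np.random.default_rng(0)
    w = np.zeros(2)
    for _ in range(n_steps):
        if len(buffer) < batch:
            time.sleep(0.001)
            continue
        idx = rng.integers(len(buffer), size=batch)
        X, y = zip(*[buffer[i] for i in idx])
        X, y = np.array(X), np.array(y)
        p = 1 / (1 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / batch                    # logistic-regression SGD step
    print("fitted weights:", np.round(w, 2), "vs. true", TRUE_W)

threads = [threading.Thread(target=sampler, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
trainer()
stop.set()
for t in threads:
    t.join()
```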

Break

Challenges in Deep Reinforcement Learning Sergey Levine UC Berkeley

Today's Lecture
1. High-level summary of deep RL challenges
2. Stability
3. Sample complexity
4. Scaling up & generalization
5. Reward specification

Goals
- Understand the open problems in deep RL
- Understand tradeoffs between different algorithms

Some recent work on deep RL, spanning stability, efficiency, and scale:
- RL on raw visual input (Lange et al. 2009)
- Guided policy search (Levine et al. 2013)
- Deep Q-Networks (Mnih et al. 2013)
- Deep deterministic policy gradients (Lillicrap et al. 2015)
- Trust region policy optimization (Schulman et al. 2015)
- End-to-end visuomotor policies (Levine*, Finn* et al. 2015)
- AlphaGo (Silver et al. 2016)
- Supersizing self-supervision (Pinto & Gupta 2016)

Stability and hyperparameter tuning
Devising stable RL algorithms is very hard.
- Q-learning / value function estimation
  - Fitted Q / fitted value methods with deep network function approximators are typically not contractions, hence no guarantee of convergence
  - Lots of parameters for stability: target network delay, replay buffer size, clipping, sensitivity to learning rates, etc.
- Policy gradient / likelihood ratio / REINFORCE
  - Very high variance gradient estimator
  - Lots of samples, complex baselines, etc.
  - Parameters: batch size, learning rate, design of the baseline
- Model-based RL algorithms
  - Choice of model class and fitting method
  - Optimizing the policy w.r.t. the model is non-trivial due to backpropagation through time
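To make the Q-learning knobs concrete, here is a toy fitted-Q loop on a 5-state chain (an illustrative example, not from the lecture) showing where target network delay, replay buffer size, error clipping, and the learning rate enter.

```python
import numpy as np
import collections

N_STATES, GAMMA = 5, 0.9
TARGET_DELAY = 50                 # refresh the target network every 50 steps
BUFFER_SIZE = 500                 # replay buffer size
CLIP = 1.0                        # clip TD errors
LR = 0.1                          # learning rate

def step(s, a):
    """Move left (a=0) or right (a=1); reward 1 for reaching the right end."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, float(s2 == N_STATES - 1), s2 == N_STATES - 1

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, 2))
Q_target = Q.copy()
buffer = collections.deque(maxlen=BUFFER_SIZE)
s = 0
for t in range(5000):
    a = int(rng.integers(2))                       # random behavior policy (off-policy)
    s2, r, done = step(s, a)
    buffer.append((s, a, r, s2, done))
    s = 0 if done else s2
    batch_idx = rng.integers(len(buffer), size=min(32, len(buffer)))
    for bs, ba, br, bs2, bdone in (buffer[i] for i in batch_idx):
        target = br + (0.0 if bdone else GAMMA * Q_target[bs2].max())
        td = np.clip(target - Q[bs, ba], -CLIP, CLIP)   # clipped TD error
        Q[bs, ba] += LR * td
    if t % TARGET_DELAY == 0:
        Q_target = Q.copy()                        # delayed target network update

print("greedy actions in non-terminal states (1 = right):", Q.argmax(axis=1)[:-1])
```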

Tuning hyperparameters
- Get used to running multiple hyperparameter settings, e.g. learning_rate = [0.1, 0.5, 1.0, 5.0, 20.0]
- A grid layout for hyperparameter sweeps is OK when sweeping 1 or 2 parameters
- A random layout generally works better, and is the only viable option in higher dimensions
- Don't forget the random seed!
  - RL is self-reinforcing and very likely to get stuck in local optima
  - Don't assume it works well until you test a few random seeds
  - Remember that the random seed is not a hyperparameter!
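A small sketch of a random-layout sweep with multiple seeds per configuration, following the advice above. The `train` function is a hypothetical stand-in that just returns a noisy synthetic score; in practice it would launch a full training run.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(learning_rate, batch_size, seed):
    """Hypothetical stand-in for an RL run: noisy score peaked near lr = 1e-2."""
    noise = np.random.default_rng(seed).normal(scale=0.3)
    return -(np.log10(learning_rate) + 2.0) ** 2 + 0.1 * np.log2(batch_size) + noise

results = []
for _ in range(20):                                   # 20 random configurations
    config = {
        "learning_rate": 10 ** rng.uniform(-5, 0),    # sample on a log scale
        "batch_size": int(rng.choice([32, 64, 128, 256])),
    }
    # evaluate every configuration with a few seeds; RL results vary a lot per seed
    scores = [train(**config, seed=s) for s in range(3)]
    results.append((np.mean(scores), np.std(scores), config))

best_mean, best_std, best_config = max(results, key=lambda r: r[0])
print(f"best config: {best_config}, mean score {best_mean:.2f} +/- {best_std:.2f}")
```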

The challenge with hyperparameters
- Can't run hyperparameter sweeps in the real world
- How representative is your simulator? Usually the answer is "not very"
- Actual sample complexity = (time to run the algorithm) x (number of runs in the sweep); for example, a 3-hour run swept over 50 configurations with 3 seeds each is already more than two weeks of compute
- In effect, the overall procedure is stochastic search (over hyperparameters) wrapped around gradient-based optimization
- Can we develop more stable algorithms that are less sensitive to hyperparameters?

What can we do?
- Algorithms with favorable improvement and convergence properties
  - Trust region policy optimization [Schulman et al. '16]
  - Safe reinforcement learning, high-confidence policy improvement [Thomas '15]
- Algorithms that adaptively adjust parameters
  - Q-Prop [Gu et al. '17]: adaptively adjust the strength of the control variate/baseline
- More research needed here! Not great for beating benchmarks, but absolutely essential to make RL a viable tool for real-world problems.

Sample Complexity

[Figure: rough sample-complexity ladder, plotted on a log scale, with roughly a 10x gap between adjacent rungs]
- Gradient-free methods (e.g. NES, CMA, etc.)
- 10x more efficient: fully online methods (e.g. A3C); half-cheetah (slightly different version): about 100,000,000 steps (100,000 episodes, ~15 days real time)
- 10x: policy gradient methods (e.g. TRPO); TRPO+GAE (Schulman et al. '16) on half-cheetah: about 10,000,000 steps (10,000 episodes, ~1.5 days real time)
- 10x: replay buffer value estimation methods (Q-learning, DDPG, NAF, etc.); Gu et al. '16, Wang et al. '17 on half-cheetah: about 1,000,000 steps (1,000 episodes, ~3 hours real time)
- 10x: model-based deep RL (e.g. guided policy search); Chebotar et al. '17: about 20 minutes of experience on a real robot
- 10x: model-based "shallow" RL (e.g. PILCO)

What about more realistic tasks?
- Big cost paid for dimensionality
- Big cost paid for using raw images
- Big cost in the presence of real-world diversity (many tasks, many situations, etc.)

The challenge with sample complexity
- You need to wait a long time for your homework to finish running
- Real-world learning becomes difficult or impractical
- Precludes the use of expensive, high-fidelity simulators
- Limits applicability to real-world problems

What can we do?
- Better model-based RL algorithms
- Design faster algorithms
  - Q-Prop (Gu et al. '17): a policy gradient algorithm that is as fast as value estimation
  - Learning to play in a day (He et al. '17): a Q-learning algorithm that is much faster on Atari than DQN
- Reuse prior knowledge to accelerate reinforcement learning
  - RL2: Fast reinforcement learning via slow reinforcement learning (Duan et al. '17)
  - Learning to reinforcement learn (Wang et al. '17)
  - Model-agnostic meta-learning (Finn et al. '17)

Scaling up deep RL & generalization
- Large-scale: emphasizes diversity, evaluated on generalization
- Small-scale: emphasizes mastery, evaluated on performance
- Where is the generalization?

Generalizing from massive experience Pinto & Gupta, 2015 Levine et al. 2016

Generalizing from multi-task learning
- Train on multiple tasks, then try to generalize or finetune
  - Policy distillation (Rusu et al. '15)
  - Actor-mimic (Parisotto et al. '15)
  - Model-agnostic meta-learning (Finn et al. '17)
  - many others
- Unsupervised or weakly supervised learning of diverse behaviors
  - Stochastic neural networks (Florensa et al. '17)
  - Reinforcement learning with deep energy-based policies (Haarnoja et al. '17)
  - many others

Generalizing from prior knowledge & experience
- Can we get better generalization by leveraging off-policy data?
- Model-based methods: perhaps a good avenue, since the model (e.g. physics) is more task-agnostic
- What does it mean to have a "feature" of decision making, in the same sense that we have features in computer vision?
  - Options framework (mini-behaviors)
    - Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning (Sutton et al. '99)
    - The option-critic architecture (Bacon et al. '16)
  - Muscle synergies & low-dimensional spaces
    - Unsupervised learning of sensorimotor primitives (Todorov & Ghahramani '03)

Reward specification
- If you want to learn from many different tasks, you need to get those tasks somewhere!
- Learn objectives/rewards from demonstration (inverse reinforcement learning)
- Generate objectives automatically?