Scaling Up RL Using Evolution Strategies. Tim Salimans, Jonathan Ho, Peter Chen, Szymon Sidor, Ilya Sutskever


Reinforcement Learning = AI? The definition of RL is broad enough to capture all that is needed for AGI. Increased interest and improved algorithms; large investments are being made. [Diagram: agent-environment loop of action and observation]

Still a long way to go

What's keeping us? Credit assignment. Compute. Many other things we will not discuss right now.

Credit assignment is difficult for general MDPs

Credit assignment is difficult for general MDPs. At state s_t, take action a_t; next, get state s_{t+1}. Receive return R after taking T actions. No precisely timed rewards, no discounting, no value functions. Currently this seems true for our hardest problems, like meta-learning. Duan et al. (2016) "RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning." Wang et al. (2016) "Learning to reinforcement learn."

Vanilla policy gradients. Stochastic policy P(a | s, θ). Estimate the gradient of the expected return F = E[R] using REINFORCE: ∇_θ F = E[ R · Σ_t ∇_θ log P(a_t | s_t, θ) ].
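
A minimal sketch of this estimator, assuming a toy state-independent softmax policy and a hypothetical env_step function (neither is from the talk): the single return R multiplies a sum of T per-step log-probability gradients, which is exactly where the variance problem on the next slide comes from.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def reinforce_gradient(theta, env_step, T, rng):
        # One-episode REINFORCE estimate of grad_theta E[R]:
        # R times the sum of T per-step log-prob gradients.
        grad_logp = np.zeros_like(theta)
        R, s = 0.0, 0
        for _ in range(T):
            probs = softmax(theta)                 # toy state-independent policy
            a = rng.choice(len(theta), p=probs)
            one_hot = np.zeros_like(theta)
            one_hot[a] = 1.0
            grad_logp += one_hot - probs           # gradient of log softmax
            s, r = env_step(s, a)
            R += r
        return R * grad_logp                       # a sum of T terms scaled by R

    rng = np.random.default_rng(0)
    env_step = lambda s, a: (s + 1, 1.0 if a == 0 else 0.0)  # toy environment
    g = reinforce_gradient(np.zeros(3), env_step, T=100, rng=rng)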

Vanilla policy gradients. Correlation between the return and individual actions is typically low. The gradient of the log-probability is a sum of T uncorrelated terms, and the variance of a sum of T uncorrelated terms is the sum of T individual variances. This means the variance grows linearly with T!

We can do only very little sequential computation

CPU clock speed stopped improving long ago. Source: https://smoothspan.com/2007/09/06/a-picture-of-the-multicore-crisis/

But increased parallelism keeps us going. [Plot: supercomputer GFLOPS over time. Source: Wikipedia]

Communication is the eventual bottleneck. Clock speed stays constant; as the number of cores grows, communication bandwidth between cores becomes the bottleneck.

Thought experiment: what's the optimal algorithm to calculate a policy gradient if the sequence length is T, we cannot do credit assignment, and communication is the only computational bottleneck?

Thought experiment: what's the optimal algorithm to calculate a policy gradient if the sequence length is T, we cannot do credit assignment, and communication is the only computational bottleneck? Finite differences!

Finite differences and other black-box optimizers. Each function evaluation only requires communicating a scalar result. Variance independent of sequence length. No credit assignment required.
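
A sketch of the naive version, under the assumption that F(θ) runs a full episode and returns the total reward (F here is a stand-in, not the talk's code): each probe reports back one scalar, and the sequence length T never enters the estimator.

    import numpy as np

    def finite_difference_gradient(F, theta, delta=1e-2):
        # Central differences: 2 * dim(theta) episode returns, each a scalar.
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            e = np.zeros_like(theta)
            e[i] = delta
            grad[i] = (F(theta + e) - F(theta - e)) / (2 * delta)
        return grad

The coordinate-wise probing costs 2 * dim(theta) evaluations per gradient; the randomized version on the next slide replaces it with random directions that can be sampled in parallel.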

Evolution Strategies. Old technique, known under many other names. Randomized finite differences: add a noise vector ε to the parameters; if the result improves, keep the change; repeat.
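
One common form of the resulting update, sketched with illustrative hyperparameter names (sigma, npop, lr are assumptions, not the talk's values): sample noise vectors, evaluate the perturbed parameters, and step along the return-weighted noise. Standardizing the returns is a simple stand-in for the rank-based fitness shaping used in the paper.

    import numpy as np

    def es_step(F, theta, sigma=0.1, npop=50, lr=0.01, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        eps = rng.standard_normal((npop, theta.size))
        returns = np.array([F(theta + sigma * e) for e in eps])
        # Standardized returns as a stand-in for rank-based fitness shaping.
        weights = (returns - returns.mean()) / (returns.std() + 1e-8)
        grad_est = eps.T @ weights / (npop * sigma)
        return theta + lr * grad_est

    F = lambda th: -np.sum(th ** 2)     # toy objective; optimum at theta = 0
    theta = np.ones(10)
    for _ in range(200):
        theta = es_step(F, theta)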

Parallelization. You have a bunch of workers. They all try different random noise. Then they report how good the random noise was. But they don't need to communicate the noise vector, because they know each other's seeds!
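
A sketch of the seed trick, with hypothetical helper names: workers broadcast only (seed, return) pairs, and each worker regenerates everyone else's noise vector locally from the seed, so full parameter vectors never cross the network.

    import numpy as np

    def noise(seed, dim):
        # Deterministic: every worker regenerates the same vector from the seed.
        return np.random.default_rng(seed).standard_normal(dim)

    def worker_evaluate(F, theta, seed, sigma):
        # Runs on one worker; only (seed, scalar return) is broadcast.
        return seed, F(theta + sigma * noise(seed, theta.size))

    def update(theta, results, sigma, lr):
        # Every worker runs this identically after receiving all (seed, R) pairs.
        returns = np.array([R for _, R in results])
        weights = (returns - returns.mean()) / (returns.std() + 1e-8)
        grad = sum(w * noise(seed, theta.size)
                   for (seed, _), w in zip(results, weights))
        return theta + lr * grad / (len(results) * sigma)

    F = lambda th: -np.sum(th ** 2)     # toy objective
    theta, sigma, lr = np.ones(5), 0.1, 0.05
    results = [worker_evaluate(F, theta, seed, sigma) for seed in range(100)]
    theta = update(theta, results, sigma, lr)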

Parallelization

Distributed Deep Learning. [Diagram: six workers, nodes 1-6]

Distributed Deep Learning. Each worker sends big vectors. [Diagram: six workers performing an ALL-REDUCE]

Distributed Evolution Strategies. Each worker broadcasts tiny scalars. (For a policy with a million parameters, an all-reduce moves megabytes of gradient per worker per step; ES broadcasts a single scalar return per episode.) [Diagram: six workers exchanging scalars]

Does it work in practice? Surprisingly competitive with popular RL techniques in terms of data efficiency: ES needs 3-10x more data than TRPO / A3C on MuJoCo and Atari. No backward pass, no need to store activations in memory. Near-perfect scaling.

MuJoCo results. ES needs more data, but it achieves nearly the same result. With 1440 cores we need 10 minutes to solve the humanoid task, which takes 1 day with TRPO on a single machine.

Distributed Evolution Strategies. [Table: quantitative results on the Humanoid MuJoCo task]

Distributed Evolution Strategies. Networking requirements are very limited. Cheap! $12 to rent 1440 cores for an hour on Amazon EC2 with spot pricing. We can run the experiment 6 times for $12!

MuJoCo Results. [Clip: humanoid walker]

Atari Results. We can match one-day A3C on Atari games on average (better on 50% of games, worse on 50%) in 1 hour with our distributed implementation on 720 cores.

Long Horizons. Long horizons are hard for RL. RL is sensitive to action frequency: a higher frequency of actions makes the RL problem more difficult. Not so for Evolution Strategies.

Long Horizons

How can it work in high dimensions? Fact: the speed of Evolution Strategies depends on the intrinsic dimensionality of the problem, not on the actual dimensionality of the neural net policy

Intrinsic Dimensionality. [Plot: loss as a function of relevant vs. irrelevant parameters] Evolution strategies doesn't care about the irrelevant parameters: it automatically discards the irrelevant dimensions, even when they live on a complicated subspace!
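
A toy numerical illustration of this claim (not from the talk; the 5-of-1000 split and all constants are made up): the return depends on only 5 of 1000 parameters, and the ES estimate averages the other 995 directions away, so progress is governed by the 5 relevant dimensions.

    import numpy as np

    def F(theta):
        return -np.sum(theta[:5] ** 2)   # only the first 5 dims matter

    rng = np.random.default_rng(0)
    theta = np.ones(1000)
    sigma, lr, npop = 0.1, 0.2, 200
    for _ in range(300):
        eps = rng.standard_normal((npop, theta.size))
        R = np.array([F(theta + sigma * e) for e in eps])
        grad_est = eps.T @ (R - R.mean()) / (npop * sigma)
        theta = theta + lr * grad_est
    print(F(theta))   # climbs from -5.0 toward 0 despite the 1000-dim search space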

Intrinsic Dimensionality. One explanation for how hill-climbing can succeed in a million-dimensional space! Parameterization of the policy matters more than the number of parameters. Virtual batch normalization helps a lot. Salimans et al. (2016) "Improved techniques for training GANs." Future advances to be made?

Backprop vs Evolution Strategies. Evolution strategies does not use backprop. So scale of initialization, vanishing gradients, etc., are not important?

Backprop vs Evolution Strategies. Counterintuitive result: every trick that helps backprop also helps evolution strategies: scale of random init, batch norm, ResNets. Why? Because evolution strategies tries to estimate the gradient! If the gradient is vanishing, we won't get much by estimating it!

Conclusion: pros. Thought experiment: black-box methods are optimal if the horizon is long, there is no credit assignment, and bandwidth is limited. Scales extremely well. Competitive with other RL techniques. Possibility proof for evolution of intelligence: us.

Conclusion: cons. Natural evolution seems much more sophisticated. Better parameterization? Evolution of evolvability? The assumption that we cannot solve credit assignment / communication may be pessimistic. We should not give up on improvements in credit assignment, value functions, hierarchical RL, networking, and communication strategies!