Meta-Learning CS 294-112: Deep Reinforcement Learning Sergey Levine

Class Notes
1. Two weeks until the project milestone!
2. Guest lectures start next week, be sure to attend!
3. Today: part 1: meta-learning
4. Today: part 2: parallelism

How can we frame transfer learning problems? No single solution! Survey of various recent research papers.
1. Forward transfer: train on one task, transfer to a new task
   a) Just try it and hope for the best
   b) Finetune on the new task
   c) Architectures for transfer: progressive networks
   d) Randomize source task domain
2. Multi-task transfer: train on many tasks, transfer to a new task
   a) Model-based reinforcement learning
   b) Model distillation
   c) Contextual policies
   d) Modular policy networks
3. Multi-task meta-learning: learn to learn from many tasks
   a) RNN-based meta-learning
   b) Gradient-based meta-learning

So far
- Forward transfer: source domain to target domain. Diversity is good! The more varied the training, the more likely transfer is to succeed.
- Multi-task learning: even more variety. No longer training on the same kind of task, but more variety = more likely to succeed at transfer.
- How do we represent transfer knowledge?
  - Model (as in model-based RL): rules of physics are conserved across tasks
  - Policies: require finetuning, but closer to what we want to accomplish
  - What about learning methods?

What is meta-learning?
- If you've learned 100 tasks already, can you figure out how to learn more efficiently? Now having multiple tasks is a huge advantage!
- Meta-learning = learning to learn
- In practice, very closely related to multi-task learning
- Many formulations:
  - Learning an optimizer
  - Learning an RNN that ingests experience
  - Learning a representation
(image credit: Ke Li)

Why is meta-learning a good idea?
- Deep reinforcement learning, especially model-free, requires a huge number of samples
- If we can meta-learn a faster reinforcement learner, we can learn new tasks efficiently!
- What can a meta-learned learner do differently?
  - Explore more intelligently
  - Avoid trying actions that are known to be useless
  - Acquire the right features more quickly

Meta-learning with supervised learning (image credit: Ravi & Larochelle '17)

Meta-learning with supervised learning
[diagram: a few-shot training set of input-label pairs (e.g., images and labels) plus a test input; the learner reads in the training set and outputs the test label]
- How to read in the training set? Many options; RNNs can work (see the sketch below)
- More on this later
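
A minimal sketch of the RNN option, in PyTorch: an LSTM reads the few-shot training pairs as a sequence, then the test input paired with a blank label, and predicts the test label from its final state. This only illustrates the idea on the slide, not any particular paper's architecture; the class name, sizes, and data below are all made up.

    import torch
    import torch.nn as nn

    class FewShotRNN(nn.Module):
        def __init__(self, in_dim, n_classes, hidden=64):
            super().__init__()
            # Each sequence step is an input concatenated with its (one-hot) label.
            self.rnn = nn.LSTM(in_dim + n_classes, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, support_x, support_y, query_x):
            # support_x: (tasks, shots, in_dim); support_y: (tasks, shots, n_classes), one-hot
            # query_x:   (tasks, in_dim)
            support = torch.cat([support_x, support_y], dim=-1)
            blank_y = torch.zeros(query_x.size(0), 1, support_y.size(-1))
            query = torch.cat([query_x.unsqueeze(1), blank_y], dim=-1)
            seq = torch.cat([support, query], dim=1)   # read the training set, then the test input
            out, _ = self.rnn(seq)
            return self.head(out[:, -1])               # predicted label for the test input

    # Shape check on random data: 8 tasks, 5 shots, 16-dim inputs, 5 classes.
    model = FewShotRNN(in_dim=16, n_classes=5)
    x_s, y_s = torch.randn(8, 5, 16), torch.eye(5).repeat(8, 1, 1)
    logits = model(x_s, y_s, torch.randn(8, 16))       # -> shape (8, 5)

Meta-training would then minimize the classification loss on the query prediction across many sampled tasks, so that "reading in the training set" becomes the learned learning algorithm.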

The meta-learning problem in RL
[diagram: the learned policy ingests recent experience together with the current state and outputs an action; the resulting new experience and new state feed back in]

Meta-learning in RL with memory
[figure: water maze task; first, second, and third attempts, with memory vs. without memory]
Heess et al., Memory-based control with recurrent neural networks.

RL²
Duan et al., RL²: Fast Reinforcement Learning via Slow Reinforcement Learning

Connection to contextual policies: just contextual policies, with experience as context.

Back to representations: is pretraining a type of meta-learning? Better features = faster learning of the new task!

Preparing a model for faster learning
Finn et al., Model-Agnostic Meta-Learning (MAML)

What did we just do??
- Just another computation graph
- Can implement with any autodiff package (e.g., TensorFlow)
- But has favorable inductive bias
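
That computation graph is easy to write down directly in an autodiff framework. Below is a minimal MAML-style sketch in PyTorch on toy sine-wave regression tasks; everything here (task distribution, network sizes, hyperparameters) is illustrative, and the setting is supervised, whereas in RL the squared-error loss would be replaced by a policy-gradient surrogate. The inner update is an ordinary gradient step kept differentiable, so the meta-optimizer backpropagates through it.

    import torch

    def net(params, x):
        # Small MLP applied functionally so we can differentiate through
        # the adapted (post-inner-step) parameters.
        w1, b1, w2, b2 = params
        h = torch.tanh(x @ w1 + b1)
        return h @ w2 + b2

    def adapt(params, x, y, inner_lr=0.01):
        # One inner gradient step on a single task; create_graph=True keeps
        # the step differentiable so the outer (meta) update can backprop through it.
        loss = ((net(params, x) - y) ** 2).mean()
        grads = torch.autograd.grad(loss, params, create_graph=True)
        return [p - inner_lr * g for p, g in zip(params, grads)]

    # Meta-parameters: the initialization that MAML is learning.
    params = [torch.nn.Parameter(0.1 * torch.randn(1, 40)),
              torch.nn.Parameter(torch.zeros(40)),
              torch.nn.Parameter(0.1 * torch.randn(40, 1)),
              torch.nn.Parameter(torch.zeros(1))]
    meta_opt = torch.optim.Adam(params, lr=1e-3)

    for step in range(1000):
        meta_loss = 0.0
        for _ in range(4):                              # batch of tasks per meta-update
            amp = 0.1 + 4.9 * torch.rand(1)             # task = random sine wave
            phase = 3.14 * torch.rand(1)
            x_tr, x_te = 10 * torch.rand(10, 1) - 5, 10 * torch.rand(10, 1) - 5
            y_tr, y_te = amp * torch.sin(x_tr + phase), amp * torch.sin(x_te + phase)
            fast = adapt(params, x_tr, y_tr)            # adapted parameters for this task
            meta_loss = meta_loss + ((net(fast, x_te) - y_te) ** 2).mean()
        meta_opt.zero_grad()
        meta_loss.backward()                            # backprop through the inner step
        meta_opt.step()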

Model-agnostic meta-learning: accelerating PG
[figure: behavior after MAML training, and after one gradient step on the forward-reward task and on the backward-reward task]

Meta-learning summary & open problems
- Meta-learning = learning to learn
- Supervised meta-learning = supervised learning with datapoints that are entire datasets
- RL meta-learning with RNN policies
  - Ingest past experience with RNN
  - Simply run forward pass at test time to learn
  - Just contextual policies (no actual learning)
- Model-agnostic meta-learning
  - Use gradient descent (e.g., policy gradient) learning rule
  - Conceptually not that different, but can accelerate standard RL algorithms (e.g., learn in one iteration of PG)

Meta-learning summary & open problems
- The promise of meta-learning: use past experience to simply acquire a much more efficient deep RL algorithm
- The reality of meta-learning: mostly works well on smaller problems, but getting better all the time
- Main limitations:
  - RNN policies are extremely hard to train, and likely not scalable
  - Model-agnostic meta-learning presents a tough optimization problem
  - Designing the right task distribution is hard
  - Generally very sensitive to task distribution (meta-overfitting)

Parallelism in RL

Overview
1. We learned about a number of policy search methods
2. These algorithms have all been sequential
3. Is there a natural way to parallelize RL algorithms?
   - Experience sampling vs. learning
   - Multiple learning threads
   - Multiple experience collection threads

Today's Lecture
1. What can we parallelize?
2. Case studies: specific parallel RL methods
3. Tradeoffs & considerations
Goals:
- Understand the high-level anatomy of reinforcement learning algorithms
- Understand standard strategies for parallelization
- Tradeoffs of different parallel methods

High-level RL schematic
[diagram: generate samples (i.e., run the policy) -> fit a model / estimate the return -> improve the policy, in a loop]

Which parts are slow?
- Generate samples (i.e., run the policy): on a real robot/car/power grid/whatever, 1x real time, until we invent time travel; in the MuJoCo simulator, up to 10000x real time
- Fit a model / estimate the return: trivial and fast for some algorithms; expensive, but nontrivial to parallelize, for others
- Improve the policy: trivial (nothing to do) for some algorithms; expensive, but nontrivial to parallelize, for others

Which parts can we parallelize?
- Generate samples (i.e., run the policy)
- Fit a model / estimate the return: parallel SGD
- Improve the policy: parallel SGD
Helps to group data generation and training (worker generates data, computes gradients, and gradients are pooled).

High-level decisions
1. Online or batch-mode?
2. Synchronous or asynchronous?
[diagram: batch-mode, e.g. several workers generate samples, then a policy gradient step; online, e.g. several workers each generate one step, then fit Q-values]

Relationship to parallelized SGD (Dai et al. '15)
1. Parallelizing model/critic/actor training typically involves parallelizing SGD
2. Simple parallel SGD (sketched below):
   1. Each worker has a different slice of data
   2. Each worker computes gradients, sums them, and sends them to the parameter server
   3. Parameter server sums gradients from all workers and sends back new parameters
3. Mathematically equivalent to SGD, but not asynchronous (communication delays)
4. Async SGD typically does not achieve perfect parallelism, but lack of locks can make it much faster
5. Somewhat problem dependent
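
A minimal sketch of the synchronous scheme in item 2, on a toy linear regression problem: threads stand in for workers, and a shared NumPy array stands in for the parameter server; all names and numbers here are illustrative. An asynchronous variant would let each worker push its gradient and pull parameters without waiting for the others.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def worker_grad(theta, x_shard, y_shard):
        # Each worker owns a slice of the data and computes the mean
        # squared-error gradient on that slice only.
        err = x_shard @ theta - y_shard
        return x_shard.T @ err / len(y_shard)

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(1024, 5)), rng.normal(size=1024)
    theta = np.zeros(5)                               # parameters held by the "server"
    shards = np.array_split(np.arange(1024), 4)       # one data slice per worker

    with ThreadPoolExecutor(max_workers=4) as pool:
        for step in range(100):
            # Workers compute gradients in parallel; the server averages them
            # and "broadcasts" the new parameters (here, just the shared array).
            grads = list(pool.map(lambda idx: worker_grad(theta, X[idx], y[idx]), shards))
            theta -= 0.1 * np.mean(grads, axis=0)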

Simple example: sample parallelism with PG
[diagram: step (1), several workers generate samples in parallel; steps (2, 3, 4), a single process computes the policy gradient update]

Simple example: sample parallelism with PG
[diagram: step (1), several workers generate samples; step (2), the same workers evaluate rewards; steps (3, 4), a single process computes and applies the policy gradient]

Simple example: sample parallelism with PG (Dai et al. '15)
[diagram: steps (1)-(3), each worker generates samples, evaluates rewards, and computes its gradient; step (4), the gradients are summed and applied centrally]
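
A minimal sketch of this fully parallel variant on a hypothetical two-armed bandit: each worker runs steps (1)-(3), generating samples, evaluating rewards, and computing a REINFORCE gradient for a softmax policy, and only step (4), averaging and applying the gradients, is centralized. A real implementation would roll out full trajectories in separate processes or on separate machines; everything below is illustrative.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    TRUE_REWARDS = np.array([0.2, 0.8])               # hypothetical 2-armed bandit

    def worker(theta, seed, n_samples=64):
        # Steps (1)-(3) on one worker: generate samples, evaluate reward,
        # and compute the REINFORCE gradient for a softmax policy.
        rng = np.random.default_rng(seed)
        probs = np.exp(theta) / np.exp(theta).sum()
        grad = np.zeros_like(theta)
        for _ in range(n_samples):
            a = rng.choice(2, p=probs)
            r = rng.normal(TRUE_REWARDS[a], 0.1)
            grad += r * (np.eye(2)[a] - probs)        # r * grad log pi(a)
        return grad / n_samples

    theta = np.zeros(2)
    with ThreadPoolExecutor(max_workers=4) as pool:
        for it in range(200):
            # Step (4), the only synchronization point: average the worker
            # gradients and apply the update centrally.
            grads = list(pool.map(lambda s: worker(theta, seed=s), range(it * 4, it * 4 + 4)))
            theta += 0.5 * np.mean(grads, axis=0)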

What if we add a critic?
(see John's actor-critic lecture for what the options here are)
[diagram: steps (1, 2), workers collect samples & rewards; step (3), workers compute critic gradients, which are summed and applied centrally; steps (4, 5), workers compute policy gradients, which are summed and applied centrally; the two sum-and-apply steps are costly synchronization points]

What if we run online?
Only the parameter update requires synchronization (actor + critic params).
[diagram: same structure as above, but each worker collects samples and rewards online; the sum-and-apply steps for the critic and policy gradients remain the synchronization points]

Actor-critic algorithm: A3C (Mnih et al. '16)
Some differences vs. DQN, DDPG, etc.:
- No replay buffer; instead rely on diversity of samples from different workers to decorrelate
- Some variability in exploration between workers
- Pro: generally much faster in terms of wall clock
- Con: generally much slower in terms of # of samples (more on this later)
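
This is not A3C itself (no critic, no n-step returns, no neural network), but a minimal sketch of the asynchronous, lock-free update pattern it relies on, reusing the toy bandit above: each worker acts with the current shared parameters and applies its own gradient immediately, with no replay buffer, and decorrelation comes from the workers behaving differently. Threads here stand in for separate processes or machines; all values are made up.

    import numpy as np
    import threading

    TRUE_REWARDS = np.array([0.2, 0.8])            # hypothetical 2-armed bandit
    theta = np.zeros(2)                            # shared parameters, no lock

    def actor_learner(seed, n_updates=200):
        # Each worker repeatedly acts with the *current* shared policy and
        # applies its gradient immediately (asynchronously).
        rng = np.random.default_rng(seed)
        for _ in range(n_updates):
            probs = np.exp(theta) / np.exp(theta).sum()
            a = rng.choice(2, p=probs)
            r = rng.normal(TRUE_REWARDS[a], 0.1)
            theta[:] = theta + 0.05 * r * (np.eye(2)[a] - probs)  # lock-free write

    workers = [threading.Thread(target=actor_learner, args=(s,)) for s in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()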

Actor-critic algorithm: A3C
[figure: learning curves for A3C after 1,000,000 and 20,000,000 steps; DDPG: more on this later]

Model-based algorithms: parallel GPS
[diagram: (1) rollout execution (parallelize sampling); (2, 3) local policy optimization (parallelize dynamics, parallelize LQR); (4) global policy optimization (parallelize SGD)]
Yahya, Li, Kalakrishnan, Chebotar, Levine, '16

Real-world model-free deep RL: parallel NAF
Gu*, Holly*, Lillicrap, Levine, '16

Simplest example: sample parallelism with off-policy algorithms
[diagram: multiple workers collect samples in parallel; the pooled data feeds a central grasp success predictor training process]