Automated Curriculum Learning for Neural Networks

Alex Graves, Marc G. Bellemare, Jacob Menick, Remi Munos, Koray Kavukcuoglu (DeepMind), ICML 2017
Presenter: Jack Lanchantin

Outline
1 Introduction: Curriculum Learning Task; Multi-Armed Bandits
2 Learning Progress Signals: Loss-driven Progress; Complexity-driven Progress
3 Experiments (3 tasks): N-gram; Repeat Copy; bAbI

Curriculum Learning (CL)
The importance of starting small (Elman, 1993).
CL is highly sensitive to the mode of progression through the tasks.
Previous methods assume tasks can be ordered by difficulty; in reality they may vary along multiple axes of difficulty, or have no predefined order at all.
This paper: treat the decision about which task to study next as a stochastic policy, continuously adapted to optimise some notion of learning progress.

Curriculum Learning Task
Each example $x \in \mathcal{X}$ contains an input $a$ and a target $b$.
Task: a distribution $D$ over sequences from $\mathcal{X}$.
Curriculum: an ensemble of tasks $D_1, \ldots, D_N$.
Sample: an example drawn from one of the tasks of the curriculum.
Syllabus: a time-varying sequence of distributions over tasks.
The expected loss of the network on the $k$-th task is
$L_k(\theta) := \mathbb{E}_{x \sim D_k} L(x, \theta)$  (1)
where $L(x, \theta) := -\log p_\theta(x)$ is the sample loss on $x$.

Curriculum Learning: Two related settings
1. Multiple tasks setting: perform well on all tasks in $\{D_k\}$:
$L_{MT} := \frac{1}{N} \sum_{k=1}^{N} L_k$  (2)
2. Target task setting: only interested in minimizing the loss on the final task $D_N$:
$L_{TT} := L_N$  (3)
The other tasks act as a series of stepping stones to the real problem.
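
As a minimal illustration (the loss values and variable names below are hypothetical, not from the paper), the two objectives differ only in how the per-task expected losses are aggregated:

    import numpy as np

    # Hypothetical per-task expected losses L_1 .. L_N
    per_task_loss = np.array([2.3, 1.7, 0.9, 0.4])

    L_MT = per_task_loss.mean()    # multiple-tasks objective: average over all N tasks
    L_TT = per_task_loss[-1]       # target-task objective: loss on the final task D_N
    print(L_MT, L_TT)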

Multi-Armed Bandits for CL
Model a curriculum containing N tasks as an N-armed bandit.
Syllabus: an adaptive policy which seeks to maximize payoffs from the bandit.
An agent selects a sequence of actions $a_1 \ldots a_T$ over $T$ rounds of play ($a_t \in \{1, \ldots, N\}$).
After each round, the selected arm yields a reward $r_t$.

Exp3 Algorithm for Multi-Armed Bandits
On round $t$, the agent selects an arm stochastically according to a policy $\pi_t$, defined by a set of weights $w_{t,i}$:
$\pi_t^{EXP3}(i) := \frac{e^{w_{t,i}}}{\sum_{j=1}^{N} e^{w_{t,j}}}$  (4)
The weights are the sum of importance-sampled rewards:
$w_{t,i} := \eta \sum_{s<t} \tilde{r}_{s,i}$  (5)
$\tilde{r}_{s,i} := \frac{r_s \, \mathbb{1}_{[a_s = i]}}{\pi_s(i)}$  (6)
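
A minimal NumPy sketch of the Exp3 rule above (the class layout, the value of eta, and the omission of exploration-mixing and reward-rescaling details used in practice are my simplifications, not the paper's implementation):

    import numpy as np

    class Exp3:
        def __init__(self, n_arms, eta=0.01):
            self.eta = eta
            self.weights = np.zeros(n_arms)           # w_{t,i}

        def policy(self):
            z = np.exp(self.weights - self.weights.max())
            return z / z.sum()                        # pi_t(i): softmax of the weights (Eq. 4)

        def select(self, rng):
            return int(rng.choice(len(self.weights), p=self.policy()))

        def update(self, arm, reward):
            pi = self.policy()
            r_hat = np.zeros_like(self.weights)
            r_hat[arm] = reward / pi[arm]             # importance-sampled reward (Eq. 6)
            self.weights += self.eta * r_hat          # accumulate eta * r_hat (Eq. 5)

Usage would look like: bandit = Exp3(n_arms=10); k = bandit.select(np.random.default_rng(0)); bandit.update(k, reward).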

Learning Progress Signals for CL
Goal: use the policy output by Exp3 as a syllabus for training our models.
Ideally, the policy should maximize the rate at which we minimize the loss, and the reward should reflect this rate.
However, it is hard to measure the effect of a single training sample on the target objective.
Method: introduce surrogate measures of progress:
Loss-driven: equate reward with a decrease in some loss.
Complexity-driven: equate reward with an increase in model complexity.

Training for Intrinsically Motivated Curriculum Learning
[Algorithm listing on the slide: T rounds, N tasks.]
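
A rough sketch of how the bandit-driven syllabus ties together, assuming the Exp3 bandit sketched above; sample_from_task and train_step_and_progress are hypothetical stand-ins for the model-specific parts, passed in by the caller:

    # Hypothetical sketch of the bandit-driven curriculum loop (T rounds, N tasks).
    def train_with_syllabus(bandit, tasks, T, rng, sample_from_task, train_step_and_progress):
        # sample_from_task(task, rng) -> example; train_step_and_progress(x) -> progress reward
        for t in range(T):
            k = bandit.select(rng)                    # pick a task via the Exp3 policy
            x = sample_from_task(tasks[k], rng)       # draw a training example from task k
            reward = train_step_and_progress(x)       # train on x; return a learning-progress signal
            bandit.update(k, reward)                  # feed progress back as the bandit reward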

Loss-driven Progress
Loss-driven progress signals compare the predictions made by the model before and after training on some sample $x$.
1. Prediction Gain (PG):
$V_{PG} := L(x, \theta) - L(x, \theta')$  (7)
2. Gradient Prediction Gain (GPG): a first-order approximation of PG,
$L(x, \theta') \approx L(x, \theta) + [\nabla L(x, \theta)]^T \Delta\theta$  (8)
where $\Delta\theta$ is the descent step, $\Delta\theta \propto -\nabla_\theta L(x, \theta)$, giving
$V_{GPG} := \|\nabla L(x, \theta)\|_2^2$  (9)
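
A sketch (not the paper's code) of computing PG and GPG for one sample with PyTorch autograd; model and loss_fn(model, x) are hypothetical placeholders for the network and its sample loss, and the plain SGD step is my simplification:

    import torch

    def prediction_gains(model, loss_fn, x, lr=0.1):
        loss = loss_fn(model, x)                                    # L(x, theta)
        grads = torch.autograd.grad(loss, list(model.parameters()))
        gpg = sum((g ** 2).sum() for g in grads)                    # V_GPG = ||grad L(x, theta)||^2
        with torch.no_grad():                                       # SGD step: theta -> theta'
            for p, g in zip(model.parameters(), grads):
                p.sub_(lr * g)
        with torch.no_grad():
            loss_after = loss_fn(model, x)                          # L(x, theta')
        pg = loss.detach() - loss_after                             # V_PG = L(x, theta) - L(x, theta')
        return pg.item(), gpg.item()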

Loss-driven Progress
These gains compare the loss on a separately drawn evaluation sample $x'$ before and after training on $x$:
3. Self Prediction Gain (SPG): $V_{SPG} := L(x', \theta) - L(x', \theta'), \quad x' \sim D_k$  (10)
4. Target Prediction Gain (TPG): $V_{TPG} := L(x', \theta) - L(x', \theta'), \quad x' \sim D_N$  (11)
5. Mean Prediction Gain (MPG): $V_{MPG} := L(x', \theta) - L(x', \theta'), \quad x' \sim D_k, \; k \sim U_N$  (12)
where $U_N$ is the uniform distribution over $\{1, \ldots, N\}$.
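
A sketch showing that SPG, TPG and MPG share the same before/after loss difference and differ only in where the evaluation sample x' is drawn from; sample_from_task(k) and the two loss callables are hypothetical placeholders supplied by the caller:

    import random

    def loss_driven_gain(draw_eval_sample, loss_before, loss_after):
        x_eval = draw_eval_sample()
        return loss_before(x_eval) - loss_after(x_eval)

    def spg(k, sample_from_task, loss_before, loss_after):        # x' ~ D_k
        return loss_driven_gain(lambda: sample_from_task(k), loss_before, loss_after)

    def tpg(N, sample_from_task, loss_before, loss_after):        # x' ~ D_N
        return loss_driven_gain(lambda: sample_from_task(N), loss_before, loss_after)

    def mpg(N, sample_from_task, loss_before, loss_after):        # x' ~ D_k, k ~ Uniform{1..N}
        return loss_driven_gain(lambda: sample_from_task(random.randint(1, N)),
                                loss_before, loss_after)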

Complexity-driven Progress
So far we have considered gains that gauge the network's learning progress directly, by observing the rate of change in its predictive ability.
We now turn to a set of gains that instead measure the rate at which the network's complexity increases.

Minimum Description Length (MDL) principle
To best generalize from a particular dataset, one should minimize:
(number of bits required to describe the model parameters) + (number of bits required for the model to describe the data).
That is, increasing the model complexity by a certain amount is only worthwhile if it compresses the data by a greater amount.
Therefore, complexity should increase most in response to the training examples from which the network is best able to generalize.
These examples are exactly what we seek when attempting to maximize learning progress.

Background: Variational Inference (from David Blei)
[Two figure slides; diagrams not reproduced in this transcript.]

Minimum Description Length (MDL) principle
MDL training in neural nets uses a variational posterior $P_\phi(\theta)$ over the network weights during training, with a single weight sample drawn for each training example.
The parameters $\phi$ of the posterior are optimized rather than $\theta$ itself.

Variational Loss in Neural Nets
$L_{VI}(\phi, \psi) = KL(P_\phi \,\|\, Q_\psi) + \sum_k \sum_{x \in D_k} \mathbb{E}_{\theta \sim P_\phi} L(x, \theta)$  (13)
Per-sample loss, with $S$ the number of samples in the dataset:
$L_{VI}(x, \phi, \psi) = \frac{1}{S} KL(P_\phi \,\|\, Q_\psi) + \mathbb{E}_{\theta \sim P_\phi} L(x, \theta)$  (14)
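
A sketch of the per-sample variational loss (Eq. 14), assuming diagonal Gaussian posterior P_phi and prior Q_psi over the weights; S is the dataset size and model_loss(x, theta) is a hypothetical placeholder for the data term, approximated here with a single weight sample:

    import torch

    def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
        # KL(P_phi || Q_psi) for diagonal Gaussians, summed over all weights
        return 0.5 * torch.sum(
            logvar_q - logvar_p
            + (logvar_p.exp() + (mu_p - mu_q) ** 2) / logvar_q.exp()
            - 1.0
        )

    def variational_sample_loss(x, phi, psi, S, model_loss):
        mu_p, logvar_p = phi
        mu_q, logvar_q = psi
        kl = gaussian_kl(mu_p, logvar_p, mu_q, logvar_q)
        theta = mu_p + torch.randn_like(mu_p) * (0.5 * logvar_p).exp()   # one weight sample
        return kl / S + model_loss(x, theta)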

Complexity-driven Progress for Variational Inference
Variational Complexity Gain (VCG):
$V_{VCG} := KL(P_{\phi'} \,\|\, Q_{\psi'}) - KL(P_\phi \,\|\, Q_\psi)$  (15)
Gradient Variational Complexity Gain (GVCG):
$V_{GVCG} := [\nabla_{\phi,\psi} KL(P_\phi \,\|\, Q_\psi)]^T \nabla_\phi \mathbb{E}_{\theta \sim P_\phi} L(x, \theta)$  (16)

Complexity-driven Progress for Maximum Likelihood
L2 Gain (L2G): train with an L2-regularized loss
$L_{L2}(x, \theta) := L(x, \theta) + \frac{\alpha}{2} \|\theta\|_2^2$  (17)
$V_{L2G} := \|\theta'\|_2^2 - \|\theta\|_2^2$  (18)
$V_{GL2G} := [\theta]^T \nabla_\theta L(x, \theta)$  (19)
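
A sketch of the L2 gains around one parameter update; theta_before, theta_after and grad_loss (the gradient of L(x, theta) with respect to theta) are hypothetical NumPy arrays supplied by the caller:

    import numpy as np

    def l2_gain(theta_before, theta_after):
        # V_L2G = ||theta'||^2 - ||theta||^2  (Eq. 18)
        return np.sum(theta_after ** 2) - np.sum(theta_before ** 2)

    def gradient_l2_gain(theta, grad_loss):
        # V_GL2G = theta^T grad_theta L(x, theta)  (Eq. 19)
        return float(theta @ grad_loss)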

Experiments
The previously defined gains were applied to 3 tasks, using the same LSTM model for each:
1. synthetic language modelling on text generated by n-gram models
2. repeat copy (Graves et al., 2014)
3. bAbI tasks (Weston et al., 2015)

N-Gram Language Modelling
Trained character-level Kneser-Ney n-gram models on the King James Bible data from the Canterbury corpus, with the maximum depth parameter n ranging from 0 to 10.
Each model was used to generate a separate dataset of 1M characters, divided into disjoint sequences of 150 characters.
Since entropy decreases with n, learning progress should be higher for larger n, and we therefore expect the gain signals to be drawn towards higher n.

N-Gram Language Modelling
[Figure slide; plots not reproduced in this transcript.]

Repeat Copy
The network receives an input sequence of random bit vectors, and is then asked to output that sequence a given number of times.
Sequence length varies from 1 to 13 and the number of repeats varies from 1 to 13 (169 tasks in total).
The target task is length-13 sequences with 13 repeats.
NTMs are able to learn a for-loop-like algorithm on simple examples that generalises directly to much harder examples; LSTMs require significant retraining for harder tasks.
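
A sketch of a repeat-copy example generator matching the description above (the bit-vector width and RNG handling are my own choices, not taken from the paper):

    import numpy as np

    def repeat_copy_example(seq_len, n_repeats, vec_size=8, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        seq = rng.integers(0, 2, size=(seq_len, vec_size))    # random bit vectors
        target = np.tile(seq, (n_repeats, 1))                 # the sequence repeated n_repeats times
        return seq, n_repeats, target

    # The 169 tasks correspond to seq_len in 1..13 crossed with n_repeats in 1..13;
    # the target task is seq_len = 13, n_repeats = 13.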

Repeat Copy
[Figure slide; plots not reproduced in this transcript.]

bAbI
20 synthetic question-answering tasks.
Some of the tasks follow a natural ordering of complexity (e.g. Two Arg Relations, Three Arg Relations) and all are based on a consistent probabilistic grammar, leading us to hope that an efficient syllabus could be found for learning the whole set.
The usual performance measure for bAbI is the number of tasks completed by the model, where completion is defined as getting less than 5% of the test-set questions wrong.
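
A minimal sketch of the completion metric described above (the error values in the usage line are made up for illustration):

    def tasks_completed(test_error_per_task, threshold=0.05):
        # A task counts as completed when its test-set error rate is below 5%.
        return sum(err < threshold for err in test_error_per_task)

    print(tasks_completed([0.01, 0.12, 0.04]))   # -> 2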

bAbI
[Figure slide; plots not reproduced in this transcript.]

Conclusion
Using a stochastic syllabus to maximise learning progress can lead to significant gains in curriculum learning efficiency, so long as a suitable progress signal is used.
Uniformly sampling from all tasks is a surprisingly strong benchmark: learning is dominated by gradients from the tasks on which the network is making fastest progress, inducing a kind of implicit curriculum, albeit with the inefficiency of unnecessary samples.