Supervised Learning of Unsupervised Learning Rules

Similar documents
Lecture 1: Machine Learning Basics

Python Machine Learning

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks

Artificial Neural Networks written examination

(Sub)Gradient Descent

arxiv: v1 [cs.lg] 15 Jun 2015

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Generative models and adversarial training

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Assignment 1: Predicting Amazon Review Ratings

A study of speaker adaptation for DNN-based speech synthesis

HIERARCHICAL DEEP LEARNING ARCHITECTURE FOR 10K OBJECTS CLASSIFICATION

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

WHEN THERE IS A mismatch between the acoustic

arxiv: v1 [cs.lg] 7 Apr 2015

arxiv: v1 [cs.cv] 10 May 2017

Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures

CSL465/603 - Machine Learning

Second Exam: Natural Language Parsing with Neural Networks

A Neural Network GUI Tested on Text-To-Phoneme Mapping

Knowledge Transfer in Deep Convolutional Neural Nets

Phonetic- and Speaker-Discriminant Features for Speaker Recognition. Research Project

Distributed Learning of Multilingual DNN Feature Extractors using GPUs

Challenges in Deep Reinforcement Learning. Sergey Levine UC Berkeley

Georgetown University at TREC 2017 Dynamic Domain Track

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS

Course Outline. Course Grading. Where to go for help. Academic Integrity. EE-589 Introduction to Neural Networks NN 1 EE

A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation

Attributed Social Network Embedding

Introduction to Ensemble Learning Featuring Successes in the Netflix Prize Competition

Model Ensemble for Click Prediction in Bing Search Ads

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model

Axiom 2013 Team Description Paper

Modeling function word errors in DNN-HMM based LVCSR systems

Deep search. Enhancing a search bar using machine learning. Ilgün Ilgün & Cedric Reichenbach

Training a Neural Network to Answer 8th Grade Science Questions Steven Hewitt, An Ju, Katherine Stasaski

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks

BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING

Robust Speech Recognition using DNN-HMM Acoustic Model Combining Noise-aware training with Spectral Subtraction

Semantic Segmentation with Histological Image Data: Cancer Cell vs. Stroma

Evolutive Neural Net Fuzzy Filtering: Basic Description

Dialog-based Language Learning

An empirical study of learning speed in backpropagation

arxiv: v2 [cs.cv] 30 Mar 2017

Reinforcement Learning by Comparing Immediate Reward

Truth Inference in Crowdsourcing: Is the Problem Solved?

Deep Neural Network Language Models

Comment-based Multi-View Clustering of Web 2.0 Items

Human Emotion Recognition From Speech

Modeling function word errors in DNN-HMM based LVCSR systems

Probabilistic Latent Semantic Analysis

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling

arxiv: v4 [cs.cl] 28 Mar 2016

A Review: Speech Recognition with Deep Learning Methods

Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models

IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, VOL XXX, NO. XXX,

arxiv: v2 [cs.ir] 22 Aug 2016

Глубокие рекуррентные нейронные сети для аспектно-ориентированного анализа тональности отзывов пользователей на различных языках

Test Effort Estimation Using Neural Network

TRANSFER LEARNING OF WEAKLY LABELLED AUDIO. Aleksandr Diment, Tuomas Virtanen

Learning Methods for Fuzzy Systems

arxiv:submit/ [cs.cv] 2 Aug 2017

Testing A Moving Target: How Do We Test Machine Learning Systems? Peter Varhol Technology Strategy Research, USA

Calibration of Confidence Measures in Speech Recognition

Semi-Supervised Face Detection

A Simple VQA Model with a Few Tricks and Image Features from Bottom-up Attention

THE enormous growth of unstructured data, including

Cultivating DNN Diversity for Large Scale Video Labelling

Machine Learning and Data Mining. Ensembles of Learners. Prof. Alexander Ihler

Lecture 1: Basic Concepts of Machine Learning

Using Deep Convolutional Neural Networks in Monte Carlo Tree Search

Learning to Schedule Straight-Line Code

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

Artificial Neural Networks

Softprop: Softmax Neural Network Backpropagation Learning

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages

CS Machine Learning

Dual-Memory Deep Learning Architectures for Lifelong Learning of Everyday Human Behaviors

INPE São José dos Campos

UNIDIRECTIONAL LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK WITH RECURRENT OUTPUT LAYER FOR LOW-LATENCY SPEECH SYNTHESIS. Heiga Zen, Haşim Sak

Residual Stacking of RNNs for Neural Machine Translation

Exploration. CS : Deep Reinforcement Learning Sergey Levine

Likelihood-Maximizing Beamforming for Robust Hands-Free Speech Recognition

ADVANCES IN DEEP NEURAL NETWORK APPROACHES TO SPEAKER RECOGNITION

Discriminative Learning of Beam-Search Heuristics for Planning

Online Updating of Word Representations for Part-of-Speech Tagging

Learning Methods in Multilingual Speech Recognition

Issues in the Mining of Heart Failure Datasets

Analysis of Hybrid Soft and Hard Computing Techniques for Forex Monitoring Systems

Twitter Sentiment Classification on Sanders Data using Hybrid Approach

TRANSFER LEARNING IN MIR: SHARING LEARNED LATENT REPRESENTATIONS FOR MUSIC AUDIO CLASSIFICATION AND SIMILARITY

Speech Emotion Recognition Using Support Vector Machine

An Online Handwriting Recognition System For Turkish

Word Segmentation of Off-line Handwritten Documents

Continual Curiosity-Driven Skill Acquisition from High-Dimensional Video Inputs for Humanoid Robots

A deep architecture for non-projective dependency parsing

Learning From the Past with Experiment Databases

Speaker Identification by Comparison of Smart Methods. Abstract

arxiv: v1 [math.at] 10 Jan 2016

Transcription:

Supervised Learning of Unsupervised Learning Rules Luke Metz 1, Brian Cheung 2, and Jascha Sohl-dickstein 1 1 Google Brain 2 Berkeley {lmetz, jascha}@google.com, bcheung@berkeley.edu 1 Introduction Supervised learning has proven extremely effective for many problems where large amounts of labeled training data are available. There is a common hope that unsupervised learning will prove similarly powerful in situations where labels are expensive or impractical to collect, or where the prediction target is unknown during training. However, unsupervised learning has yet to fulfill this promise. One explanation for this failure is that unsupervised training rules are typically mismatched to the target task. Ideally, learned representations should linearly expose high level attributes of data (e.g. object identity) and perform well in semi-supervised settings. However, many current unsupervised objectives optimize for objectives such as log-likelihood of a generative model or reconstruction error, and produce useful representations only as a side effect. Unsupervised representation learning seems uniquely suitable for metalearning (Hochreiter et al., 2001; Schmidhuber, 1995). Unlike most tasks to which metalearning has been applied, in unsupervised representation learning we are unable even to directly express the desired objective function. However, it is possible to directly express a meta-objective that captures the quality of an unsupervised update rule. In this work, we propose to meta-learn an unsupervised update rule, by metatraining on a meta-objective that directly rewards the usefulness of the unsupervised representation. Unlike hand-tuned proxies, this meta-objective directly measures the usefulness of a representation generated from unlabeled data. By re-casting unsupervised learning as metalearning, we treat creating the unsupervised update rule as a transfer learning problem. Unlike previous work which transfers feature extractors between different data domains, we transfer the learning rule between different domains, and between different network architectures. Instead of learning transferable features, we learn a transferable learning rule which does not depend on labels or dataset. Much previous work on meta-learning using gradient descent on the meta-objective has focused on improving the optimization process. Maclaurin et al. (2015); Andrychowicz et al. (2016); Ravi & Larochelle (2016); Wichrowska et al. (2017) learn an update rule that We leverage their insights here. By treating the entire learning process as a differentiable function of the loss, derivatives with respect to the update rule can be computed which in turn allow parameters within the update rule to be updated with backpropagation. 2 Methods We consider a multilayer perceptron (MLP) f( ; φ t ), with parameters φ t, as the base model which an update rule targets. In standard supervised learning, that update rule is Stochastic Gradient Descent (SGD). A supervised loss l (x, y) is associated with this model, where x is a minibatch of inputs, and y are the corresponding labels. The parameters φ t are then updated iteratively until convergence, by performing SGD using the gradient l(x,y) φ t. This supervised update rule can be written as φ t+1 = SupervisedUpdate(φ t, x t, y t ; θ), (1)

where θ are the parameters of the optimizer (e.g. learning rate), which we will refer to as the metaparameters. In this work, we instead define a parametric update process which does not depend on label information, φ t+1 = UnsupervisedUpdate(φ t, x t ; θ). (2) We then train the UnsupervisedUpdate function by performing SGD on a meta-loss, in terms of the meta-parameters θ, θ = argmin MetaObjective(φ t (θ)). (3) θ t In the following sections, we briefly review the main functional pieces to this model, the base model (f(, φ)), the UnsupervisedUpdate, and the MetaObjective. 2.1 Base Model: f( ; φ) We restrict our attention to training a family of feed forward MLPs. The weight matrix and bias vector for layer l are W l and b l respectively. The network has pre-activations z 1..z L, and postactivations x 0..x L, where L is the number of layers, and x 0 x is the network input. Batch norm (Ioffe & Szegedy, 2015) is applied at each layer. To encourage generalization of the update rule across model architectures, during meta-training we sample the layer width and number of layers from a distribution (except where otherwise noted for specific experiments). 2.2 Learned update rule: U nsupervisedu pdate( ; θ) In order to build an unsupervised update rule that generalizes across architectures, we design our unsupervised update rule to be neuron-local. That is, every neuron i in every layer l in the base model has an update network h l i ( ; θ), itself an MLP, associated with it. All update networks share parameters θ, and h l i ( ; θ) is evaluated during unsupervised training only. Evaluating the statistics of unit activation over a batch of data has proven helpful in supervised learning, in the case of batch norm. It has similarly proven helpful in hand designed unsupervised learning rules, for instance in sparse coding. We therefore allow h l i ( ; θ) to accumulate statistics across training minibatches. During an unsupervised training step, the base-model is first run in a standard feedforward fashion, populating x l ib, zl ib, where b is the training minibatch index. As in supervised learning, an error signal δib l is then propagated backwards through the network. Unlike in supervised backprop however, this error signal is generated by the corresponding update network for each unit, δib l hl i ( ; θ). Again as in supervised learning, the weight updates are a product of pre-and-postsynaptic signals. Unlike in supervised learning however, those signals are also generated from the update networks, Wij l = b cl ib dl 1 jb, {cl ib, dl ib } hl i ( ; θ). The input to the update network consists ( of unit[ pre- and post-activations, and backwards propagated error signal, i.e. it can be written (W x l i, zl l+1 i, ) ] ) T δ l+1 ; θ. Additional lateral interaction terms, not described in this short h l i i submission, enable units within a layer to remain decorrelated with each other. The internal architecture of the update networks h l i ( ; θ) is similarly beyond the scope of this submission, but involves repeated convolution along the batch dimension. 2.3 Meta-loss: M etaobjective (φ) The meta-loss we use in this work is based on few shot classification on the outputs of the base model, X L. This classification task is performed via linear ridge regression on one-hot feature label vectors. The benefits of ridge regression over e.g. cross entropy loss are that it is differentiable, with a closed form expression for the coefficients. In order to encourage the learning of features which generalize well, we estimate the regression coefficients on one minibatch {x a, y a } of data, and evaluate the classification performance on a second minibatch {x b, y b }, ˆv = argmin v ( ya x L a v 2 + λ v 2) ; MetaObjective = yb x L b ˆv 2, (4) 2

where xl is a function of the base-model parameters φ, and through φ is a function of the learned update rule s parameters θ. The meta-loss therefore encourages a learning rule that leads to good performance on a test set after semi-supervised training. 3 3.1 Experiments Training setup To explore transfer of our learned update rule, we use a variety of datasets. We train on 14x14 black and white images containing characters of the English alphabet and evaluate our technique on datasets such as 14x14 MNIST. We sample 10 way classification problems from the 26 letters of the alphabet. We train θ, by estimating [M etaobjective] via truncated backprop. During training θ gradients are accumulated from 512 CPU workers, and θ is updated with mini batch asynchronous SGD. By batching workers as they complete we eliminate most gradient staleness while retaining the compute efficiency of asynchronous SGD. 3.2 Application of the Learned Unsupervised Update Rule 1.6 0.9 0.7 1.2 Accuracy MetaObjective 1.4 1.0 0.3 0.2 0.5 0.1 2000 4000 6000 phi steps 8000 10000 0.0 LearnedUpdate Trained LearnedUpdate Initialization BetaVAE(B=0.1) VAE Supervised 2000 4000 6000 phi steps 8000 10000 Figure 1: Learning curves of our method for a test task of 14x14 MNIST (both before and after training), as well as comparisons with existing unsupervised learning techniques (VAE and β VAE(Higgins et al., 2016)). In addition, we show supervised learning baseline. In all settings, only 10 labeled examples per class are used. Despite being meta-trained on a different task, our learned update rule is capable of minimizing M etaobjective and creating useful features. We do not, however, perform as well as VAE base methods on accuracy. We evaluate our learned update rule s performance by iteratively applying it on a held out task. Results for 10 shot classification performance on a dataset of 14x14 MNIST digits can be found in Figure 1. Our learned update rule is able to reach higher performance when measured on the M etaobjective than several existing unsupervised methods, but does not perform as well as existing unsupervised techniques when evaluating classification accuracy. It does train faster than existing techniques in early learning. Next we evaluate performance on a much larger domain gap. We construct a dataset consisting of 3 28x28 MNIST digits stacked in the channel dimension. Results can be found in Figure 2. Early in meta-training, the learned update rule is able to learn some independent features for each of the color channels. This is surprising as there is no hand coded tendency to learn independent features (or any features). However, late in meta-training, the learned updater collapses to a single color channel. We hypothesize this an effect of overfitting to 10-way classification problems. Finally, we test base model generalization, by varying depth and number of hidden units. In both settings, we are able to generalize and learn useful representations early in φ iterations. We find that increasing the number of hidden units beyond the range used in meta-training leads to better performance. See Figure 3. 3.3 Analyzing Learning Strategy Our learned algorithm is capable of learning interesting features. We visualize first layer filters of f over the course of meta-training (updates of θ). Base-models appear to learn to create templates of input examples as filters, and then refine those input filters into more compositional features. 3

Figure 2: Filters from φ training obtained by applying our learned update rule to a dataset consisting of 3 concatenated MNIST digits. Left: Early in meta-training the filters show variation and some compositionallity. Top row corresponds to color filters, bottom three rows break out each color channel. Right: Later in meta-training the filters collapse to model only one color (in this case red). We believe this is a form of overfitting to the 10 way classification task. Different number of hidden units in base model 0 J 5 0 0.55 0.50 1.2 1.0 5 0 1 layers 2 layers 3 layers 4 layers 5 layers 6 layers 1.4 MetaObjective 0.70 Different number of layers in base model 1.6 32 units 64 units 128 units 160 units 192 units 256 units 320 units 384 units 448 units 512 units 0.75 0 20000 40000 60000 phi update steps 80000 100000 0 20000 40000 60000 phi update steps 80000 100000 Figure 3: Generalization of the unsupervised update rule to new base-model architectures. Left: different numbers of hidden units with a fixed layer size. Right: diferent numbers of layers with a fixed hidden size. The unsupervised update rule used here was meta-trained only on base models with 2 hidden layers, and < 200 units per layer. Figure 4: From left to right we show filters extracted from φ after every 10k steps of θ training. Each image represents a different point in θ training. At meta-initialization, the process is unstable and collapses. After some meta-training, the learned update rule first learns templates of increasing diversity, before finally transitioning to slightly more compositional, and less interpretable, features. 4 Limitations Optimization posed an enormous challenge over the course of this work. We used truncated backprop during training. Careful balancing was required between short truncation windows, which lead to a non-stationary meta-training distribution, and biased and non-stationary gradients, and long truncation windows, which often lead to gradients with diverging variance. This is similar to the issues encountered in off-policy reinforcement learning. Changes to the learned policy change the state visitation distribution. Unlike RL, it is not tractable for us to compute these probabilities, and thus we cannot easily use importance sampling to yield unbiased gradients. Currently, we encounter a performance gap when transitioning to more complex or noisier data. Upon inspection, we appear to have discovered a local minimum in algorithm space which consists of learning input templates. The computation and capacity available in higher layers is largely unused. Diagnosing this phenomenon is ongoing work. 5 Conclusion In this work we introduce a method of meta-learning an unsupervised representation learning algorithm. We show that this algorithm behaves reasonably in both performance, on a restricted setting, and generalization across task and model architecture. 4

Acknowledgments We would like to thank Hugo Larochelle, Nando de Freitas, Niru Maheswaranathan, Olga Wichrowska, Samy Bengio, Pavel Sountsov, and Alex Toshev for extremely helpful conversations and feedback on this work. References Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981 3989, 2016. Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. 2016. Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pp. 87 94. Springer, 2001. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 448 456, Lille, France, 07 09 Jul 2015. PMLR. URL http://proceedings. mlr.press/v37/ioffe15.html. Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113 2122, 2015. Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016. Juergen Schmidhuber. On learning how to learn learning strategies. 1995. Olga Wichrowska, Niru Maheswaranathan, Matthew W Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize. arxiv preprint arxiv:1703.04813, 2017. 5