Learning to Learn by Gradient Descent by Gradient Descent. Andrychowicz et al. Presented by Yarkın D. Cetin

Introduction What does machine learning try to achieve? Learning the model parameters. What do optimizers try to achieve? Finding a good point in the space of possible parameters.

Concrete Examples of Optimizers Stochastic gradient descent (SGD) is probably the most well-known technique. We have improvements of SGD such as: SGD with momentum, Rprop, RMSProp, Adagrad, and Adam.
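
As a refresher (not part of the slides), a minimal NumPy sketch of two of these hand-designed update rules; the function names and the lr and beta hyperparameters are illustrative:

    import numpy as np

    def sgd_step(theta, grad, lr=0.01):
        # Vanilla SGD: step against the gradient, scaled by a fixed learning rate.
        return theta - lr * grad

    def momentum_step(theta, grad, velocity, lr=0.01, beta=0.9):
        # SGD with momentum: accumulate a decaying sum of past gradients and
        # step along that accumulated direction instead of the raw gradient.
        velocity = beta * velocity + grad
        return theta - lr * velocity, velocity

    theta = np.zeros(3)
    grad = np.array([0.5, -1.0, 0.25])
    theta = sgd_step(theta, grad)

Every such rule is fixed by hand; the learned optimizer discussed below replaces exactly this kind of update function.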

No Free Lunch (Theorem) As demonstrated by Wolpert and Macready, all optimizers perform equally well when their performance is averaged across all possible problems.[6] This means we cannot have an optimizer that is better in general for every problem; we must specialize. In other words, either we handcraft a better optimizer for each problem, or we accept suboptimal performance from an already existing optimization technique. Or something else?

Why not learn the optimizer? Deep learning has already demonstrated that learned features can outperform handcrafted ones. The same idea applies to optimizers: an optimizer is, in essence, a function mapping gradients (and their history) to parameter updates. The idea of this paper is simply: learn this function!
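
In symbols (a reconstruction of the paper's update rule, where g is the learned update function and \phi its parameters):

    \theta_{t+1} = \theta_t + g_t\left( \nabla f(\theta_t), \phi \right)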

Some History and Related Works Meta-learning, as it is called in the literature, has been around since the 90s.[4] Bengio et al.[2] use a parametric function, i.e. a simple algorithm, to search for learning rules; they search a wider region of rule space than hand-crafted rules cover when looking for the optimal learning rule. Schmidhuber (the LSTM guy) in 1993[3] theorizes about recurrent neural networks that can use their own weights as inputs and modify them to learn new algorithms; the paper mentions that the weight-modifying weights can themselves be updated, ad infinitum. Learning to learn is proposed as a building block of artificial intelligence in 2016 by Lake et al.[1] The work of Hochreiter et al.[5] demonstrates the possibility of feeding the gradients to a recurrent network to train an optimizer; this paper is mainly based on their work.

Hochreiter et al. Pros: uses a recurrent neural network, so the optimizer's error is differentiable; the optimization of the optimizer is done through gradient descent. Cons: not coordinate-wise, meaning it scales poorly to models with a high number of parameters; transferability between domains is not tested/not available.

What is the loss function? The loss function of the optimizer can be written as

    L(\phi) = \mathbb{E}_f\left[ f\left( \theta^*(f, \phi) \right) \right]

i.e. the expected value of the optimizee loss f evaluated at the final optimizee parameters \theta^*, which depend on the problem f and on the optimizer parameters \phi. In other words, the optimizer is trained so that the expected final loss of the optimizee is as small as possible.

Loss over time However, for the optimizer to have a sense of the whole optimization trajectory, the loss can be written over time:

    L(\phi) = \mathbb{E}_f\left[ \sum_{t=1}^{T} w_t \, f(\theta_t) \right],
    \qquad \theta_{t+1} = \theta_t + g_t,
    \qquad [\, g_t, h_{t+1} \,] = m(\nabla_t, h_t, \phi)

where w_t are the weights of the different time steps, g_t are the gradient updates, h_t are the hidden states, and m is the model (the LSTM).
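
A minimal PyTorch sketch of this unrolled objective, assuming optimizer_rnn is a module mapping (gradient, hidden state) to (update, new hidden state) as in the equation above; names are illustrative and the handling of second derivatives is simplified relative to the paper:

    import torch

    def meta_loss(optimizer_rnn, f, theta, hidden, T=20, weights=None):
        # Unroll the optimizee for T steps and accumulate the weighted losses.
        weights = weights or [1.0] * T
        total = torch.zeros(())
        for t in range(T):
            loss = f(theta)                                # f(theta_t)
            total = total + weights[t] * loss              # sum_t w_t f(theta_t)
            grad, = torch.autograd.grad(loss, theta, create_graph=True)
            update, hidden = optimizer_rnn(grad, hidden)   # [g_t, h_{t+1}] = m(grad_t, h_t, phi)
            theta = theta + update                         # theta_{t+1} = theta_t + g_t
        return total   # backpropagating through this trains the optimizer parameters phi

One detail hidden here: theta must be created with requires_grad=True, and the paper actually drops the second-derivative terms (it does not backpropagate through the optimizee gradients) for efficiency.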

What is m? It is a multi-layer (two-layer) long short-term memory (LSTM) network.

Briefly on LSTMs LSTMs are essentially recurrent neural networks. They were introduced in 1997 by Hochreiter and Schmidhuber.[7] Their purpose is to capture time dependencies over long periods. A recurrent neural network (RNN)

Briefly on LSTMs The LSTM can forget and add information to its memory based on previous outputs. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. An LSTM has three of these gates, to protect and control the cell state. [8] An LSTM diagram (the top horizontal line is the flow of the cell state)
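
For reference, the standard LSTM cell equations behind such a diagram (common textbook notation, not taken from the slides; \sigma is the sigmoid and \odot the elementwise product):

    f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)              % forget gate
    i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)              % input gate
    o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)              % output gate
    \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)       % candidate memory
    C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t     % cell state update
    h_t = o_t \odot \tanh(C_t)                          % hidden state / output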

Why is m an LSTM? 1. It has been demonstrated that past information about the gradients leads to faster convergence [9] (e.g. Nesterov's momentum, Adam). 2. LSTMs are good with long-time dependencies. 3. They can use the same optimizer parameters for all optimizee parameters.

Coordinatewise LSTM There are thousands of optimizee parameters, and the optimizer should be model-free, i.e. it should not depend on how many parameters the model has. It therefore operates coordinatewise, as sketched below: the parameters of the LSTM are shared across all coordinates, while the hidden states are not shared.
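
A minimal sketch of the coordinatewise setup, assuming a single-layer PyTorch LSTMCell (the paper uses a two-layer LSTM and preprocesses the gradients; class, attribute, and size names here are illustrative):

    import torch
    import torch.nn as nn

    class CoordinatewiseOptimizer(nn.Module):
        # One small LSTM applied independently to every optimizee coordinate.
        def __init__(self, hidden_size=20):
            super().__init__()
            self.cell = nn.LSTMCell(input_size=1, hidden_size=hidden_size)  # weights shared by all coordinates
            self.out = nn.Linear(hidden_size, 1)                            # maps the hidden state to an update g_t

        def forward(self, grad, state):
            # grad: 1-D tensor with one entry per optimizee parameter. Each
            # coordinate is treated as its own batch element, so the LSTM
            # weights are shared but every coordinate keeps its own (h, c).
            h, c = self.cell(grad.view(-1, 1), state)
            update = self.out(h).view(-1)
            return update, (h, c)

The initial state would simply be a pair of zero tensors of shape (num_params, hidden_size), and the returned update plays the role of g_t in the earlier sketch.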

Experimental Results Regression on random polynomials and the MNIST dataset

Different Layer Widths

Experimental Results As we can see, the LSTM optimizer fails with ReLU activations, since it was trained on networks with sigmoid activations.

Transferability

Some Neural Art

Conclusion The paper demonstrates the possibility of training networks that specialize in training other networks. The trained LSTM optimizer can outperform state-of-the-art hand-designed optimizers and generalize from small to larger networks.

References
[1] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. arXiv preprint arXiv:1604.00289, 2016.
[2] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. Université de Montréal, Département d'informatique et de recherche opérationnelle, 1990.
[3] J. Schmidhuber. A neural network that embeds its own meta-levels. In International Conference on Neural Networks, pages 407-412. IEEE, 1993.
[4] S. Thrun and L. Pratt. Learning to Learn. Springer Science & Business Media, 1998.
[5] S. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87-94. Springer, 2001.
[6] D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67-82, 1997.
[7] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[8] C. Olah. Understanding LSTM Networks. colah.github.io. Accessed 3 Mar. 2017.
[9] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pages 372-376, 1983.