Learning to Learn by Gradient Descent by Gradient Descent (Andrychowicz et al.), presented by Yarkın D. Cetin
Introduction What does machine learning try to achieve? Finding good model parameters. What do optimizers try to achieve? Searching the possible parameter space efficiently.
Concrete Examples of Optimizers Stochastic gradient descent (SGD) is probably the most well-known technique. We have improvements of SGD such as: SGD with momentum, Rprop, RMSProp, Adagrad, and ADAM.
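To make the comparison concrete, here is a rough sketch of two of these hand-designed rules, written as plain NumPy functions of the gradient (illustrative only, not any library's API):

import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr=0.01, mu=0.9):
    # Classical momentum: accumulate a velocity and move along it.
    velocity = mu * velocity - lr * grad
    return theta + velocity, velocity

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # ADAM: keep running estimates of the first and second moments of the gradient.
    # t is the step counter, starting at 1.
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad**2       # second-moment estimate
    m_hat = m / (1 - b1**t)               # bias correction
    v_hat = v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

Both are fixed functions of the gradient history; the point of the paper is to replace such hand-designed rules with a learned one.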
No Free Lunch (Theorem) As demonstrated by Wolpert and Macready, all optimizers are equal when their performance is averaged across all possible problems.[6] Which means: we cannot have a generally better optimizer for every problem; we must specialize. In other words, either we handcraft a better optimizer for each problem, or we accept suboptimal performance with an already existing optimization technique. Or is there something else?
Why not learn the optimizer? Deep learning has already demonstrated that learned features can perform better than handcrafted ones. Apply the same idea to optimizers: an optimizer is, in essence, a function mapping gradients to parameter updates. The idea of this paper is simply: learn this function!
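In the paper's notation, the learned optimizer g, with its own parameters \phi, maps the gradient of the optimizee f directly to an additive parameter update:

\theta_{t+1} = \theta_t + g_t\left(\nabla f(\theta_t), \phi\right)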
Some History and Related Works Meta-learning, as it is called in the literature, has been around since the 90s.[4] Bengio et al.[2] use a parametric function, i.e. a simple algorithm, to search for learning rules; however, they search a wider region than hand-crafted rules for the optimal learning rule. Schmidhuber (the LSTM guy) in 1993[3] theorizes about recurrent neural networks that can use their own weights as inputs and modify them to learn new algorithms; the paper mentions that the weight-modifying weights can be updated as well, ad infinitum. Learning to learn is proposed as a building block of artificial intelligence in 2016 by Lake et al.[1] The work of Hochreiter et al.[5] demonstrates the possibility of feeding the gradients to a recurrent network to train an optimizer. This paper is mainly based on their work.
Hochreiter et al.[5] Pros: Uses a recurrent neural network, so the optimizer's error is differentiable. Optimization of the optimizer is done through gradient descent. Cons: Not coordinatewise, so it scales poorly to models with a large number of parameters. Transferability between domains is not tested / not available.
What is the loss function? The loss of the optimizer is the expected value of the optimizee's loss at the final parameters:

\mathcal{L}(\phi) = \mathbb{E}_f\left[ f\bigl(\theta^*(f, \phi)\bigr) \right]

where \theta^* denotes the final optimizee parameters produced by an optimizer with parameters \phi. In other words, the expected value of the optimizee loss is minimized when the optimizer parameters are optimal.
Loss over time However, for the optimizer to have a sense of the whole optimization trajectory, the loss is instead written as a weighted sum over time:

\mathcal{L}(\phi) = \mathbb{E}_f\left[ \sum_{t=1}^{T} w_t \, f(\theta_t) \right]

with

\theta_{t+1} = \theta_t + g_t, \qquad \begin{bmatrix} g_t \\ h_{t+1} \end{bmatrix} = m(\nabla_t, h_t, \phi), \qquad \nabla_t = \nabla_\theta f(\theta_t)

Here the w_t are the weights of the different time steps, the g_t are the gradient updates, the h_t are the hidden states, and m is the model (the LSTM).
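A toy sketch of how this meta-loss is evaluated by unrolling the optimizer, assuming a simple quadratic optimizee f(theta) = ||A theta - b||^2 and a stand-in "optimizer" that is just a learned scalar step size phi instead of the LSTM (all names here are my own, for illustration only):

import numpy as np

def f(theta, A, b):
    r = A @ theta - b
    return float(r @ r)

def grad_f(theta, A, b):
    return 2.0 * A.T @ (A @ theta - b)

def meta_loss(phi, A, b, T=20):
    # Unroll T optimization steps and accumulate the weighted sum of losses.
    theta = np.zeros(A.shape[1])
    total = 0.0
    for t in range(T):
        g = -phi * grad_f(theta, A, b)   # optimizer output g_t (no hidden state in this toy)
        theta = theta + g                # theta_{t+1} = theta_t + g_t
        total += 1.0 * f(theta, A, b)    # w_t = 1 for every step, as in the paper
    return total

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 10))
b = rng.normal(size=10)
print(meta_loss(0.005, A, b))            # meta-objective L(phi), minimized w.r.t. phi

In the paper, phi are the LSTM weights and the meta-loss is minimized by backpropagating through the unrolled trajectory, i.e. gradient descent on the optimizer itself.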
What is m? It is a multi-layer (two-layer) long short-term memory (LSTM) network.
Briefly on LSTMs LSTMs are essentially recurrent neural networks. They were created in 1997 by Hochreiter and Schmidhuber.[7] Their purpose is to capture dependencies over long time spans. (Figure: a recurrent neural network (RNN))
Briefly on LSTMs The LSTM can forget and add information to its memory from previous outputs. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. An LSTM has three of these gates, to protect and control the cell state. [6] An LSTM diagram (top horizontal line is the flow of the hidden state)
Why an LSTM? 1. It has been demonstrated that past information about the gradients leads to faster convergence[9] (e.g. Nesterov's momentum, ADAM). 2. LSTMs are good with long-time dependencies. 3. They can use the same optimizer parameters for all optimizee parameters.
Coordinatewise LSTM There are thousands of optimizee parameters, and the optimizer should be model-free, i.e. it should not depend on how many parameters the model has. The optimizer therefore operates on each parameter coordinate separately: the parameters of the LSTM are shared across coordinates, while the hidden states are not shared (see the sketch below).
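A minimal PyTorch sketch of the coordinatewise idea (my own naming and sizes, not the paper's code): every parameter coordinate is fed through the same two-layer LSTM as if it were a batch element, so the LSTM weights are shared across coordinates while each coordinate keeps its own hidden state.

import torch
import torch.nn as nn

class CoordinatewiseOptimizer(nn.Module):
    # Hypothetical sketch: a shared two-layer LSTM applied independently
    # to every optimizee parameter coordinate.
    def __init__(self, hidden_size=20):
        super().__init__()
        self.hidden_size = hidden_size
        self.cell1 = nn.LSTMCell(1, hidden_size)            # input: one gradient coordinate
        self.cell2 = nn.LSTMCell(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, 1)                 # output: one update coordinate

    def init_state(self, n_params):
        z = lambda: torch.zeros(n_params, self.hidden_size)
        return ((z(), z()), (z(), z()))

    def forward(self, grad, state):
        # grad: (n_params, 1). Each coordinate is treated as a batch element,
        # so the LSTM weights are shared across coordinates while every
        # coordinate keeps its own hidden state.
        (h1, c1), (h2, c2) = state
        h1, c1 = self.cell1(grad, (h1, c1))
        h2, c2 = self.cell2(h1, (h2, c2))
        return self.out(h2), ((h1, c1), (h2, c2))

optimizer = CoordinatewiseOptimizer()
theta = torch.randn(1000, 1)                  # 1000 optimizee parameters
state = optimizer.init_state(theta.shape[0])
grad = torch.randn_like(theta)                # stand-in for the optimizee gradient
update, state = optimizer(grad, state)
theta = theta + update                        # theta_{t+1} = theta_t + g_t

Because the same small LSTM is reused for every coordinate, the optimizer's size does not grow with the optimizee's parameter count.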
Experimental Results Regression to Random Polynomials and the MNIST dataset
Different Layer Widths
Experimental Results As we can see, the LSTM optimizer fails with ReLU activations, since it was trained on networks with sigmoid activations.
Transferability
Some Neural Art
Conclusion The paper demonstrates the possibility of training networks that specialize in training other networks. The trained LSTM optimizer can outperform hand-designed state-of-the-art optimizers and generalize from small networks to larger ones.
References [1] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. arXiv preprint arXiv:1604.00289, 2016. [2] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. Université de Montréal, Département d'informatique et de recherche opérationnelle, 1990. [3] J. Schmidhuber. A neural network that embeds its own meta-levels. In International Conference on Neural Networks, pages 407–412. IEEE, 1993. [4] S. Thrun and L. Pratt. Learning to Learn. Springer Science & Business Media, 1998. [5] S. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001. [6] D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997. [7] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. [8] C. Olah. Understanding LSTM Networks. colah.github.io. Accessed 3 Mar. 2017. [9] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.