Session 4: Regularization (Chapter 7)

Session 4: Regularization (Chapter 7) Tapani Raiko Aalto University 30 September 2015 Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 1 / 27

Table of Contents Background Regularization methods Exercises Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 2 / 27

Goal of Regularization Neural networks are very powerful (universal appr.). Easy to perform great on the training set (overfitting). Regularization improves generalization to new data at the expense of increased training error. Use held-out validation data to choose hyperparameters (e.g. regularization strength). Use held-out test data to evaluate performance. Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 3 / 27

Example Without regularization training error goes to zero and learning stops. With noise regularization, test error keeps dropping. Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 4 / 27

Expressivity demo: Training first layer only No regularization, training W (1) and b (1) only. 0.2% error on training set, 2% error on test set. Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 5 / 27

What is overfitting? Probability theory states how we should make predictions (of y test) using a model with unknowns θ and data X = {x train, y train, x test}: P(y test X) = P(y test, θ X)dθ = P(y test θ, X)P(θ X)dθ. Probability of observing y test can be acquired by summing or integrating over all different explanations θ. The term P(y test θ, X) is the probability of y test given a particular explanation θ and it is weighted with the probability of the explanation P(θ X). However, such computation is intractable. If we want to choose a single θ to represent all the probability mass, it is better not to overfit to the highest probability peak, but to find a good representative of the mass. Posterior probability mass matters Center of gravity maximum Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 6 / 27

Table of Contents Background Regularization methods Exercises Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 7 / 27

Regularization methods Limited size of network Early stopping Weight decay Data augmentation Injecting noise Parameter sharing (e.g. convolutional) Sparse representations Ensemble methods Auxiliary tasks (e.g. unsupervised) Probabilistic treatment (e.g. variational methods) Adversarial training,... Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 8 / 27

Limited size of network Rule of thumb: When #parameters is ten times less than #outputs #examples, overfitting will not be severe. Reducing input dimensionality (e.g. by PCA) helps in reducing parameters Easy. Low computational complexity Other methods give better accuracy Data augmentation increases #examples Parameter sharing decreases #parameters Auxiliary tasks increases #outputs Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 9 / 27

Early stopping Monitor validation performance during training Stop when it starts to deteriorate With other regularization, it might never start Keeps solution close to the initialization Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 10 / 27

Weight decay (Tikhonov, 1943) Add a penalty term to the training cost C = + Ω(θ) Note: only a function of parameters θ, not data. L 2 regularization: Ω(θ) = λ 2 θ 2 hyperparameter λ for strength. Gradient: Ω(θ) θ i = λθ i. L 1 regularization: Ω(θ) = λ/2 θ 1 Gradient: Ω(θ) θ i = λ sign(θ i ). Induces sparsity: Often many params become zero. Max-norm: Constrain row vectors w i of weight matrices to w i 2 c. Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 11 / 27

Weight decay L2 (left) and L1 (right). w unregularized solution, w regularized solution. Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 12 / 27

Weight decay as Bayesian prior Consider the maximum a posteriori solution Bayes rule: P(θ X) = P(X θ)p(θ) written on -log scale: C = log P(X θ) log P(θ) Assuming Gaussian prior P(θ) = N (0, λ 1 I) we get Ω(θ) = i log exp θ2 i 2λ = λ 1 2 θ 2 L 2 regularization Gaussian prior L 1 regularization Laplace prior Max-norm regularization Uniform prior with finite support Ω = 0 Maximum likelihood Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 13 / 27

Data augmentation Image from (Dosovitskiy et al., 2014) Augmented data by image-specific transformations. E.g. cropping just 2 pixels gets you 9 times the data! Infinite MNIST: http://leon.bottou.org/projects/infimnist Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 14 / 27

Injecting noise (Sietsma and Dow, 1991) Inject random noise during training separately in each epoch Can be applied to input data, to hidden activations, or to weights Can be seen as data augmentation Simple end effective Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 15 / 27

Injecting noise to inputs (analysis) Inject small additive Gaussian noise at inputs Assume least squares error at output y Taylor series expansion around x Corresponds to penalizing the Jacobian J 2 y 1 J = dy x dx = 1.... y 1 x d y c x 1. y c x d For linear networks, this reduces to L 2 penalty Rifai et al. (2011) penalize the Jacobian directly Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 16 / 27

Parameter sharing Force sets of parameters to be equal Reduces the number of (unique) parameters Important in convolutional networks (next week) Auto-encoders sometimes share weights between encoder and decoder (Oct 28 session) Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 17 / 27

Sparse representations Penalize representation h using Ω(h) to make it sparse L 1 penalty on weights makes W sparse Similarly L 1 penalty can make h sparse Also possible to set a desired sparsity level Sparse coding is common in image processing Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 18 / 27

Ensemble methods Train several models and take average of their outputs Also known as bagging or model averaging It helps to make individual models different by varying models or algorithms varying hyperparameters varying data (dropping examples or dimensions) varying random seed It is possible to train a single final model to mimick the performance of the ensemble, for test-time computational efficiency (Hinton et al., 2015) Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 19 / 27

Dropout (Hinton et al., 2012) Each time we present data example x, randomly delete each hidden node with 0.5 probability Can be seen as injecting noise or as ensemble: Multiplicative binary noise Training an ensemble of 2 h networks with weight sharing At test time, use all nodes but divide weights by 2 Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 20 / 27

Dropout training Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 21 / 27

Dropout as bagging Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 22 / 27

Auxiliary tasks Multi-task learning: Parameter sharing between multiple tasks E.g. speech recognition and speaker identification could share low-level representations Layer-wise pretraining (Hinton and Salakhutdinov, 2006) can be seen as using unsupervised learning as an auxiliary task (Nov 4 session) Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 23 / 27

Probabilistic treatment Variational methods starting to appear in deep learning research T-61.5140 Machine Learning: Advanced Probabilistic Methods Jyri Kivinen might discuss these on Nov 11 session Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 24 / 27

Adversarial training (Szegedy et al., 2014) Search for an input x near a datapoint x that would have very different output y from y Adversaries can be found surprisingly close! Miato et al. (2015) build a very effective regularizer Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 25 / 27

Table of Contents Background Regularization methods Exercises Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 26 / 27

Exercises Read Chapter 7 (Regularization) and Chapter 9 (Convolutional Networks) Read the Theano tutorial on Regularization: http://deeplearning.net/tutorial/gettingstarted.html#regularization Extend your MNIST classifier to include regularization. Consider at least L2 weight decay and additive Gaussian noise injected in the inputs. Choose a good regularization strength using a held-out validation set. Tapani Raiko (Aalto University) Session 4: Regularization (Chapter 7) 30 September 2015 27 / 27