Feature Transfer and Knowledge Distillation in Deep Neural Networks

Feature Transfer and Knowledge Distillation in Deep Neural Networks
(Two Interesting Papers at NIPS 2014)
LU Yangyang (luyy11@sei.pku.edu.cn)
KERE Seminar, Dec. 31, 2014

Deep Learning F4 (at NIPS 1 2014)
From left to right: Yann LeCun (http://yann.lecun.com/), Geoffrey Hinton (http://www.cs.toronto.edu/~hinton/), Yoshua Bengio (http://www.iro.umontreal.ca/~bengioy), Andrew Ng (http://www-cs-faculty.stanford.edu/people/ang/)
1 NIPS: Advances in Neural Information Processing Systems

Outline
How transferable are features in deep neural networks? (NIPS 14)
- Introduction
- Generality vs. Specificity Measured as Transfer Performance
- Experiments and Discussion
Distilling the Knowledge in a Neural Network (NIPS 14 DL workshop)
- Introduction
- Distillation
- Experiments: Distilled Models and Specialist Models
- Discussion

Authors: How transferable are features in deep neural networks? (NIPS 14)
Jason Yosinski 1, Jeff Clune 2, Yoshua Bengio 3, and Hod Lipson 4
1 Dept. Computer Science, Cornell University
2 Dept. Computer Science, University of Wyoming
3 Dept. Computer Science & Operations Research, University of Montreal
4 Dept. Mechanical & Aerospace Engineering, Cornell University

Introduction
A common phenomenon in many deep neural networks trained on natural images: on the first layer they learn features similar to Gabor filters and color blobs. This occurs not only for different datasets, but even with very different training objectives. [Figure: a 2-dimensional Gabor filter]
Features at different layers of a neural network:
- First-layer features: general. Finding standard features on the first layer seems to occur regardless of the exact cost function and natural image dataset.
- Last-layer features: specific. The features computed by the last layer of a trained network must depend greatly on the chosen dataset and task.
If first-layer features are general and last-layer features are specific, then there must be a transition from general to specific somewhere in the network.
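For reference, one standard textbook parameterization of the 2-D Gabor filter mentioned in the figure caption (the exact form and symbols below are an assumption, not taken from the paper):

```latex
g(x, y; \lambda, \theta, \psi, \sigma, \gamma) =
  \exp\!\left(-\frac{x'^{2} + \gamma^{2} y'^{2}}{2\sigma^{2}}\right)
  \cos\!\left(2\pi \frac{x'}{\lambda} + \psi\right),
\qquad
x' = x\cos\theta + y\sin\theta, \quad
y' = -x\sin\theta + y\cos\theta
```

Here λ is the wavelength, θ the orientation, ψ the phase offset, σ the width of the Gaussian envelope, and γ the spatial aspect ratio; first-layer filters of image CNNs tend to resemble such oriented, localized sinusoids.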

Introduction (cont.)
First-layer features (general) → transition → last-layer features (specific)
- Can we quantify the degree to which a particular layer is general or specific?
- Does the transition occur suddenly at a single layer, or is it spread out over several layers?
- Where does this transition take place: near the first, middle, or last layer of the network?
We are interested in the answers to these questions because, to the extent that features within a network are general, we will be able to use them for transfer learning.
Transfer learning 2:
- first train a base network on a base dataset and task
- then repurpose or transfer the learned features to a second target network to be trained on a target dataset and task
This process will tend to work if the features are general, meaning suitable to both base and target tasks, instead of specific to the base task.
2 http://www1.i2r.a-star.edu.sg/~jspan/surveytl.htm, http://www.cs.ust.hk/~qyang/publications.html

Introduction (cont.)
The usual transfer learning question: should the first n layers be fine-tuned or frozen? The choice depends on the size of the target dataset and the number of parameters in the first n layers.
This paper:
- compares results from fine-tuned features and frozen features
- asks: how transferable are features in deep neural networks?

Generality vs. Specificity Measured as Transfer Performance
This paper defines the degree of generality of a set of features learned on task A as the extent to which the features can be used for another task B.
Task A and Task B: image classification tasks
- randomly split 1,000 ImageNet classes into 500 (Task A) + 500 (Task B)
- train one N-layer (N = 8 here) convolutional network on A (baseA) and another on B (baseB)
- define new networks, for n ∈ {1, 2, ..., 7}:
  - selffer networks BnB: first n layers copied from baseB and frozen; higher (N − n) layers randomly initialized
  - transfer networks AnB: first n layers copied from baseA and frozen; higher (N − n) layers randomly initialized
  - selffer networks BnB+ and transfer networks AnB+: same construction, but all layers are fine-tuned
See the sketch below for how these treatments could be constructed in code.
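A minimal sketch of the four treatments (assuming a PyTorch nn.Sequential stand-in for the network; the paper itself uses Caffe and an 8-layer AlexNet-style architecture, so module names and granularity here are illustrative only):

```python
# Build a BnB / AnB / BnB+ / AnB+ treatment network: copy the first n layers
# from a trained base network, freeze or fine-tune them, and randomly
# re-initialize the higher layers for training on the target task.
import copy
import torch.nn as nn

def make_treatment_net(base: nn.Sequential, n: int, fine_tune: bool) -> nn.Sequential:
    """base=baseB -> selffer BnB / BnB+;  base=baseA -> transfer AnB / AnB+.

    fine_tune=False: copied layers stay frozen (BnB, AnB)
    fine_tune=True : all layers are trained on the target task (BnB+, AnB+)
    """
    net = copy.deepcopy(base)
    for i, layer in enumerate(net):
        if i < n:
            # first n layers: copied weights, frozen unless fine-tuning
            for p in layer.parameters():
                p.requires_grad = fine_tune
        else:
            # higher (N - n) layers: random re-initialization, always trained
            if hasattr(layer, "reset_parameters"):
                layer.reset_parameters()
    return net
```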

Overview of the experimental treatments and controls
CNN for image classification: using Caffe 3
3 Jia Y, Shelhamer E, Donahue J, et al. Caffe: Convolutional Architecture for Fast Feature Embedding. Proceedings of the ACM International Conference on Multimedia. ACM, 2014: 675-678.

Three Sets of Experiments
Hypothesis: if A and B are similar, the authors expect transferred features to perform better than when A and B are less similar.
Similar datasets: random A/B splits
- ImageNet contains clusters of similar classes, particularly dogs and cats. 4
- On average, A and B will each contain approximately 6 or 7 of these felid classes, meaning that base networks trained on each dataset will have features at all levels that help classify some types of felids.
Dissimilar datasets: man-made/natural splits
- ImageNet also provides a hierarchy of parent classes.
- Create a special split of the dataset into two halves that are as semantically different from each other as possible: A contains only man-made entities (551 classes); B contains only natural entities (449 classes).
Random weights: to ask whether or not the nearly optimal performance of random filters reported on small networks 5 carries over to a deeper network trained on a larger dataset.
4 {tabby cat, tiger cat, Persian cat, Siamese cat, Egyptian cat, mountain lion, lynx, leopard, snow leopard, jaguar, lion, tiger, cheetah}
5 Jarrett K, Kavukcuoglu K, Ranzato M, et al. What Is the Best Multi-Stage Architecture for Object Recognition? Proceedings of the 2009 IEEE 12th International Conference on Computer Vision. IEEE, 2009: 2146-2153.

Similar Datasets (Random A/B splits)
- BnB: performance drop when n = 4, 5. The original network contained fragile co-adapted features on successive layers, i.e. features that interact with each other in such a complex or fragile way that the co-adaptation could not be relearned by the upper layers alone.
- BnB+: fine-tuning prevents the performance drop seen in BnB.
- AnB: the combination of the drop from lost co-adaptation and the drop from features that are less and less general.
- AnB+: transferring features plus fine-tuning generalizes better than directly training on the target dataset.

Dissimilar Datasets (Man-made/Natural splits) & Random Weights

Summary
If first-layer features are general and last-layer features are specific, then there must be a transition from general to specific somewhere in the network.
Experiments: fine-tuned vs. frozen features on selffer and transfer networks
- On similar datasets (random A/B splits):
  - performance degradation when using transferred features without fine-tuning comes from (i) the specificity of the features themselves and (ii) the fragile co-adaptation between neurons on neighboring layers
  - initializing a network with transferred features from almost any number of layers can produce a boost to generalization performance after fine-tuning to a new dataset
- On dissimilar datasets (man-made/natural splits): the more dissimilar the base task and target task are, the more performance drops.
- On random weights: performance on the relatively large ImageNet dataset is lower than previously reported for smaller datasets when using features computed from random lower-layer weights vs. trained weights.

Outline
How transferable are features in deep neural networks? (NIPS 14)
Distilling the Knowledge in a Neural Network (NIPS 14 DL workshop)
- Introduction
- Distillation
- Experiments: Distilled Models and Specialist Models
- Discussion

Authors: Distilling the Knowledge in a Neural Network (NIPS 14 Deep Learning and Representation Learning Workshop 6)
Geoffrey Hinton 1,2, Oriol Vinyals 1, Jeff Dean 1 (equal contribution)
1 Google Inc. (Mountain View)
2 University of Toronto and the Canadian Institute for Advanced Research
6 http://www.dlworkshop.org/

Introduction
The story, for analogy: insects
- a larval form: optimized for extracting energy and nutrients from the environment
- an adult form: optimized for the very different requirements of traveling and reproduction
In large-scale machine learning (e.g. speech and object recognition):
- the training stage: extracts structure from very large, highly redundant datasets; not in real time; can use a huge amount of computation
- the deployment stage: has much more stringent requirements on latency and computational resources
We should be willing to train very cumbersome models if that makes it easier to extract structure from the data.

Introduction (cont.)
Cumbersome models:
- an ensemble of separately trained models
- a single very large model trained with a very strong regularizer
Distillation: once the cumbersome model has been trained, use a different kind of training to transfer the knowledge from the cumbersome model to a small model that is more suitable for deployment.
- Previous work 7 demonstrates convincingly that the knowledge acquired by a large ensemble of models can be transferred to a single small model.
- Knowledge in a trained model: a learned mapping from input vectors to output vectors.
7 Bucilua C, Caruana R, Niculescu-Mizil A. Model Compression. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006: 535-541.

Introduction (cont.)
Models are usually trained to optimize performance on the training data, when the real objective is to generalize well to new data. Information about the correct way to generalize is not normally available.
Distillation: transfer the generalization ability of the cumbersome model to a small model
- use the class probabilities produced by the cumbersome model as soft targets for training the small model 8
- use the same training set or a separate transfer set for the transfer stage
8 Caruana et al., SIGKDD 06: using the logits (the inputs to the final softmax) for transferring

Distillation
Neural networks for multi-class classification use a softmax output layer:
q_i = exp(z_i / T) / Σ_j exp(z_j / T)
- z_i: the logit, i.e. the input to the softmax layer
- q_i: the class probability computed by the softmax layer
- T: a temperature that is normally set to 1
Cumbersome model → distilled model: train the distilled model on a transfer set, using a soft target distribution for each case in the transfer set that is produced by running the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1.
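A minimal sketch (assumed PyTorch implementation, not the authors' code) of the temperature softmax above and of producing soft targets from an already-trained cumbersome model:

```python
# q_i = exp(z_i / T) / sum_j exp(z_j / T)
import torch
import torch.nn.functional as F

def softmax_with_temperature(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Class probabilities q from logits z at temperature T (T = 1 is the usual softmax)."""
    return F.softmax(logits / T, dim=-1)

# Hypothetical usage on the transfer set (`cumbersome_model` and `x` are placeholders):
# with torch.no_grad():
#     soft_targets = softmax_with_temperature(cumbersome_model(x), T=20.0)
```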

Distillation (cont.)
When the correct labels are known for all or some of the transfer set, this method can be significantly improved by also training the distilled model to produce the correct labels.
- use the correct labels (hard targets) to modify the soft targets
- simply use a weighted average of two different objective functions:
  - Objective 1: the cross entropy with the soft targets (cumbersome and distilled models using the same high temperature)
  - Objective 2: the cross entropy with the correct labels (using exactly the same logits in the softmax of the distilled model, but at a temperature of 1)
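A sketch of this weighted objective (assumed PyTorch implementation; alpha is the relative weight on the soft-target term, and the T**2 factor follows the paper's remark that gradients from the soft targets scale as 1/T^2):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=20.0, alpha=0.5):
    # Objective 1: cross entropy with the soft targets, both softmaxes at temperature T
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_q = F.log_softmax(student_logits / T, dim=-1)
    soft_ce = -(soft_targets * log_q).sum(dim=-1).mean()
    # Objective 2: cross entropy with the correct (hard) labels at temperature 1
    hard_ce = F.cross_entropy(student_logits, labels)
    return alpha * (T ** 2) * soft_ce + (1.0 - alpha) * hard_ce
```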

Distilled Models: on MNIST 10
A single large neural net: 2 hidden layers, 1200 rectified linear hidden units per layer, trained on all 60,000 training cases, strongly regularized using dropout and weight constraints 9 → 67 test errors
A small model: 2 hidden layers, 800 hidden units per layer, no regularization → 146 test errors
- additionally matching the soft targets of the large net at T = 20 → 74 test errors
- with 300 or more units per hidden layer: T > 8 gives fairly similar results
- with 30 units per hidden layer: T ∈ [2.5, 4] works significantly better than higher or lower temperatures
Omitting all examples of the digit 3 from the transfer set → 206 test errors (133 of the 1010 threes)
- after fine-tuning the bias for the 3 class → 109 test errors (14 of the 1010 threes)
Transfer set containing only the digits 7 and 8 from the training set → 47.3% test errors
- after fine-tuning the biases for the 7 and 8 classes → 13.2% test errors
9 G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
10 MNIST: a handwritten digit recognition dataset

Distilled Models: on Speech Recognition
The objective of automatic speech recognition (ASR): 11
A single large neural net:
- 8 hidden layers, 2560 rectified linear units per layer
- a final softmax layer with 14,000 labels (HMM targets h_t)
- total number of parameters: about 85M
- training set: 2000 hours of spoken English data, about 700M training examples
Distilled models:
- distilled under different temperatures: {1, 2, 5, 10}
- using a relative weight of 0.5 on the cross entropy for the hard targets
A single model distilled from an ensemble of models works significantly better than a model of the same size that is learned directly from the same training data.
11 Map a (short) temporal context of features derived from the waveform to a probability distribution over the discrete states of a Hidden Markov Model (HMM).

Specialist Models: on Image Annotation
Training an ensemble of models:
- An ensemble requires too much computation at test time; this can be dealt with by using distillation.
- If the individual models are large neural networks and the dataset is very large, the amount of computation required at training time is excessive, even though it is easy to parallelize.
Learning specialist models:
- to show how learning specialist models that each focus on a different confusable subset of the classes can reduce the total amount of computation required to learn an ensemble
- to show how the overfitting of training specialist models may be prevented by using soft targets

Specialist Models: on Image Annotation (cont.)
JFT: an internal Google dataset, 100 million labeled images, 15,000 labels
A generalist model: Google's baseline model for JFT
- a deep CNN that had been trained for about six months using asynchronous stochastic gradient descent on a large number of cores
- used two types of parallelism
Specialist models: trained on data that is highly enriched in examples from a very confusable subset of the classes (e.g. different types of mushroom)
- The softmax of this type of specialist can be made much smaller by combining all of the classes it does not care about into a single dustbin class (see the sketch below).
- To reduce overfitting and share the work of learning lower-level feature detectors, each specialist model is initialized with the weights of the generalist model, and its training examples are drawn 1/2 from its special subset and 1/2 sampled randomly from the remainder of the training set.
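An illustrative sketch of the dustbin-class relabeling (JFT and the generalist model are not available, so the class IDs and helper name below are placeholders, not the authors' code):

```python
# Merge all classes outside a specialist's confusable subset into one
# "dustbin" class, so its softmax covers only k + 1 outputs instead of 15,000.
def make_specialist_label_map(special_classes, num_classes):
    """Map original class IDs to specialist IDs: 0..k-1 for the k special
    classes, and k for the dustbin class that absorbs everything else."""
    special = {c: i for i, c in enumerate(sorted(special_classes))}
    dustbin = len(special)
    return [special.get(c, dustbin) for c in range(num_classes)]

# Example: a "mushrooms" specialist over a toy 10-class problem
label_map = make_specialist_label_map(special_classes={2, 5, 7}, num_classes=10)
# label_map == [3, 3, 0, 3, 3, 1, 3, 2, 3, 3]
```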

Soft Targets as Regularizers
Using soft targets instead of hard targets: a lot of helpful information can be carried in soft targets that could not possibly be encoded with a single hard target.
- Using far less data to fit the 85M parameters of the baseline speech model: soft targets allow a new model to generalize well from only 3% of the training set.
- Using soft targets to prevent specialists from overfitting: if specialists use a full softmax over all classes, soft targets may be a much better way to prevent overfitting than early stopping. If a specialist is initialized with the weights of the generalist, we can make it retain nearly all of its knowledge about the non-special classes by training it with soft targets for the non-special classes in addition to training it with hard targets.

Summary
The training stage and the deployment stage have different requirements.
Distillation: cumbersome models → transferring knowledge by matching soft targets (and hard targets) → a small, distilled model
- On MNIST: distillation works well even when the transfer set used to train the distilled model lacks any examples of one or more of the classes.
- On speech recognition: nearly all of the improvement achieved by training an ensemble of deep neural nets can be distilled into a single neural net of the same size, which is far easier to deploy.
- On image annotation (specialist models): the performance of a single really big net that has been trained for a very long time can be significantly improved by learning a large number of specialist nets. The authors have not yet shown that this method can distill the knowledge in the specialists back into the single large net.
Distillation vs. Mixture of Experts, Transfer Learning

Appendix 12
12 http://www1.i2r.a-star.edu.sg/~jspan/surveytl.htm

Appendix (cont.)

Appendix (cont.)