DNN Low Level Reinitialization: A Method for Enhancing Learning in Deep Neural Networks through Knowledge Transfer


Lyndon White (20361362)

Index Terms: Deep Belief Networks, Deep Neural Networks, Neural Networks, Knowledge Transfer, Image Recognition, Digit Recognition, Handwriting Recognition, Representation Learning.

Abstract: It is a common problem for there to be a shortage of training data for the target domain (e.g. letter recognition), but plenty of data for a related domain (e.g. digit recognition). This paper presents a novel approach, Deep Neural Network Low Level Reinitialization, for making use of auxiliary, out-of-domain, unlabelled training data to enhance the performance of a deep neural network in cases such as this. The new method makes use of high quality features learnt in the related domain to aid training and classification. The application of this approach is shown here for digit recognition, where improvement is found over merely using the limited target domain data alone.

Lyndon White 32B Pollard Street Glendalough, WA 6016 W/Prof John Dell Faculty of Engineering, Computing and Mathematics The University of Western Australia 35 Stirling Highway Crawley, WA 6009 Dear Professor Dell, I submit to you this paper entitled DNN Low Level Reinitialization: A Method for Enhancing Learning in Deep Neural Networks through Knowledge Transfer, together with its copious appendices, in partial fulfillment of the requirements for the award of Bachelor of Engineering. Yours Sincerely Lyndon White October 26, 2014

Acknowledgments

I would like to express my sincere gratitude toward my supervisor, Dr Roberto Togneri, for his ongoing support and advice throughout this project, and for offering the chance to research in this fascinating area. In particular, for his prompt responses to emails and for helping me out with meetings on short notice.

I must give credit to those who provided the hardware for this project. The final collection of data alone involved training 98,630 neural networks. This was only possible because of the vast amount of computing power I was given access to by the UWA Signals Processing Lab, and by the University Computer Club (UCC). UCC provided numerous services, beyond just the computational power, that made the completion of this project much smoother. ivec@uwa provided me with access to their high resolution tiled display. I was able to get more data analysis done in 3 days on this tremendous monitor than I had done in 3 weeks on a desktop PC. The ivec visualization lab is a fantastic resource to have access to for anyone doing a project involving a large amount of data.

I acknowledge the great introduction I had to the field of machine learning via the online, archived copy of Geoffrey Hinton's course Neural Networks for Machine Learning. This course remains freely available online from https://www.coursera.org/course/neuralnets. It is approximately equivalent in content to a 6 point unit at UWA, and I strongly recommend it to anyone interested in learning more about neural networks. I am grateful for the support of the Stack-Exchange online communities, in particular TEX.SE (http://tex.stackexchange.com/), for invaluable tips and tricks that have seen this work through to its final presented form.

I would like to show my appreciation for my peers completing their projects simultaneously with me, in particular: Sam Moore, David Gow, Rowan Ashwin and Varun Gobole. I have been in good company throughout the year. I would like to highlight Varun, who is the only person who did not give me a blank look when I explained my project to him, but rather asked for more details. I must also thank my close friend Roland Kerr, for sharing his experiences from completing his honors project over the last 18 months, and for the constant advice and assistance which has stemmed from him.

I would like to thank all my friends and family who have kept me happy and well. Most of all, my very special thanks to my beautiful wife Isobel, who has kept me sane-ish through another year, who looked after me through long days, and who has actual grammatical skills. My love is for her always.

CONTENTS
1 Introduction
  1.1 The Deep Neural Network Low Level Reinitialization Algorithm
2 Related Work
  2.1 Improving knn and SVM Accuracy by Training on Auxiliary Data Sources
  2.2 Zero Shot Learning
  2.3 One Shot Learning
  2.4 Domain Adaptation on Amazon Product Reviews
  2.5 Multimodal Deep Learning
3 Background
  3.1 Deep Neural Networks
  3.2 Deep Belief Networks
  3.3 Other Deep Neural Networks
4 The DNN Low Level Reinitialization Algorithm
  4.1 Method
  4.2 Transfer of Features for Better Structure
  4.3 Theoretical Justification for Reinitializing Layers
  4.4 The Importance of Backpropagation
5 Empirical Evaluation Methods
  5.1 Experimental Setup and Evaluation
  5.2 The Control Experiments
6 Empirical Results
  6.1 Improvement Frequency
  6.2 Expected Improvement
  6.3 The Requirement for Sufficient Target Domain Training Data
  6.4 Performance in Deeper and Wider Topologies
  6.5 DNN LLR acts as a Superior Regularizer
  6.6 Best and Worst Domain Transitions
  6.7 The Consequences of Adding a Target Dataset Pretraining Step to the Reinitialization Process
7 Conclusion
  7.1 Further Work
  7.2 Applications
  7.3 Closing Remarks
References

Appendix A: Nomenclature
  A.1 Abbreviations
  A.2 Terms
Appendix B: Detailed Experimental Setup
  B.1 Source and Target Domain Datasets
  B.2 Experimental Parameters
  B.3 Incremental Training and Evaluation
  B.4 Evaluation
  B.5 MNIST Subdivisions Used
Appendix C: The Transformation of Subdatasets such that the Whole Dataset is Standardized
  C.1 Motivation
  C.2 Derivation of a Method
  C.3 Conclusion
Appendix D: On the Performance of the Linear Classifier Control
Appendix E: DBN Reuse: Is it Necessary to Reinitialize the Bottom Layer?
  E.1 Method
  E.2 Results
  E.3 Conclusion
Appendix F: msda Reinitialization
  F.1 Introduction
Appendix G: Cross Re-representation Based Techniques
  G.1 Introduction
  G.2 Experimental Setup
  G.3 DBN Re-representation
  G.4 msda Re-representation
  G.5 Conclusion
Appendix H: The YADLF Framework
  H.1 Features
Appendix I: Data Analysis and Presentation Tools

1 INTRODUCTION

Primates, including humans, are known to excel at learning new tasks that are similar to tasks they already know[1], and their brains are made up of multiple layers of linked neurons[2]. Deep Neural Networks (DNNs 1) are machine learning models which imitate these structures. Such machine learning models learn to solve problems within a particular domain: particular types of problems or tasks requiring a particular skill set. It is expected that these deep artificial neural networks will be able to transfer knowledge from one domain to a related one, just as humans can[3][4][5]: learning to read numbers should help with learning to read letters, and so on.

1. While each abbreviation is introduced in turn, for the reader's reference a summary of abbreviations and terminology has been prepared and can be found in section A.

It is often the case that there is plenty of training data for one domain, but little for another. For example, large corpora of training data have been collected for use in training machine learners to recognize handwritten digits; this was collected to train machine classifiers for sorting mail by postcode. Much less data is available for handwritten letters. Due to this lack of data, many optical character recognition systems offer recognition of all handwritten or typed digits, but recognition of text only when it is typed.

A new model called Deep Neural Network Low Level Reinitialization (DNN LLR) was developed to help in these cases. DNN LLR is useful where there is limited training data for the targeted task, but plenty for a related task. The training data from the related domain forms the additional source dataset for the DNN LLR model. The limited training data for the task actually being solved forms the target dataset. DNN LLR makes use of the auxiliary source data, together with the limited quantity of target data, to solve problems in the target domain.

Suitable source data is even more plentiful for DNN LLR than for some other knowledge transfer algorithms, because DNN LLR does not require labelled source domain training cases. Being able to utilize unlabelled data is a highly desirable trait in a machine learning algorithm[6]. There are huge quantities of publicly available unlabelled data, such as the over 97 million appropriately creative-commons licensed images on Flickr[7]. Contrast this to the 14 million labelled images available from ImageNet[8], the largest comparable labelled dataset[9]. Being able to make use of knowledge from so many large and readily available sources makes DNN LLR a particularly useful learning algorithm.

1.1 The Deep Neural Network Low Level Reinitialization Algorithm

The Deep Neural Network Low Level Reinitialization (DNN LLR) algorithm improves performance via the transfer of high level abstract knowledge. Low level learning about the source domain inputs is discarded; the higher level relationships, however, are maintained. This can be explained through an analogy with sports learning. When teaching children sports it is important to help them transfer tactical solutions between different games[10]. Low level skills like dribbling and kicking are not as transferable as higher level skills such as tactics and decision-making[10]. Difficulties in transferring knowledge are

attributed to students being put off by the low level differences, such as using a stick instead of kicking[10]. In a neural network, the capacity to forget the distracting low level differences exists. This information is stored in the lowest layer of the network, so it can be removed by reinitializing the bottom layer, while preserving the cross-applicable strategic knowledge stored in higher layers.

1.1.1 DNN LLR as a Curriculum Learner

This research presents work toward one of Yoshua Bengio's Open Questions in deep neural architectures: "Is a curriculum needed to learn the kinds of high-level abstractions that humans take years or decades to learn?"[11] The DNN LLR method allows for a training curriculum to be created. This curriculum contains two courses: first a course containing the source dataset, followed by a second course containing the target dataset. Using this curriculum results in better high-level abstractions than are found by merely training on the target dataset.

1.1.2 Other Related Models

During the course of this research several other models with similar goals to DNN LLR were investigated. Consideration was given to reinitializing (or not reinitializing) the other layers, and to transferring knowledge by transferring a feature detecting model for simple feature classification. Summaries of the most interesting results from these other investigations can be found in appendices E, F and G. DNN LLR was the most promising of the algorithms investigated.

2 RELATED WORK

2.1 Improving knn and SVM Accuracy by Training on Auxiliary Data Sources

The idea that out-of-domain data could be used to improve learning accuracy is not new. The work of [12] presents a method for making use of out-of-domain data when training Support Vector Machines (SVMs) and k-nearest-neighbors (knn) models. SVMs and knn models are quite different machine learning algorithms compared to neural networks[13][14]. Thus the algorithm in that paper cannot be applied to neural networks, but it was suggested that such algorithms could be found[12].

The goals and reasoning behind improving SVM accuracy in this way are very similar to those for DNN LLR. The source dataset, containing related auxiliary data, provides additional information which helps to solve the target domain problem[12]. This source domain data is used to provide additional structure to help in the classifications. In knn this takes the form of additional neighbors[12]. In an SVM this takes the form of additional potential support vectors[12]. In DNN LLR this takes the form of additional feature detectors. Common to all three algorithms is that the extra training from the source dataset provides additional tools to describe the input target domain cases, thus allowing for easier classification.

2.2 Zero Shot Learning

Zero shot learning algorithms are focused on learning to accomplish a task without complete[15], or in some cases any[16], target domain data. Zero shot learning algorithms are applied in a variety of cases: from mapping between brain patterns and the words being thought of[15] (trained only with mappings for some words), through to learning to recognize objects trained only using textual descriptions of the objects[16]. DNN LLR has a lot in common with many zero shot learning algorithms in that it seeks to make use of common abstract structures within the greater domain that the source and target domains are part of. Unlike a zero shot learning algorithm, DNN LLR does require some target domain data for retraining.

2.3 One Shot Learning

One shot learning algorithms use a very small number of target domain training cases[16] to learn classifications (and other functions). Often just a single target domain example per output classification is required[17]. In [17] the task was to classify a letter-like symbol into one of 20 classes, of which there was only a single training case for each. While the work of [17] does not use neural networks, some notions are very similar to those used by DNN LLR. In [17], one shot learning was accomplished by pretraining the classifier to be able to dissect the symbols into strokes. These strokes were learnt from a large number of different alphabets[17], a related auxiliary source. This allowed the classifier to learn much faster from the single target cases: the high quality abstract representation allowed it to relate to the test data much better. DNN LLR seeks to do similarly: to use high quality representations based on abstract features learnt from outside the target domain training set.

In [17], the features learnt from the source dataset were hand selected: they were pen strokes. DNN LLR learns the abstract features from the source domain without guidance. Dependence on hand-engineered features adds human effort to the process, and relies on experts being consciously aware of the best features[18]. Learning the features, as done in DNN LLR, bypasses these issues. DNN LLR does however require significantly more than one target domain training case per class.

2.4 Domain Adaptation on Amazon Product Reviews

In [19] and [20], a domain adaptation technique was applied to Amazon Product Reviews. The goal of this work was to take textual reviews of Amazon products and learn to predict the rating they gave. Further domain adaptation techniques were used to learn from reviews of one product area (e.g. Electronic Devices) and apply that knowledge to predict the rating associated with reviews from a different product area (e.g. DVDs). These works used Stacked Denoising Autoencoders (SDAs) and Marginalizing Stacked Denoising Autoencoders (msdas) respectively. SDAs and msdas are models closely related to the Deep Belief Networks[21] which form the basis of DNN LLR. In these works, the autoencoders were used to get a representation of their inputs from the source domain training data. These representations were then used to train a support vector machine to perform the final classification.

No target domain retraining data was allowed. This approach was possible because the source and target domains were very close: they had the same inputs (text reviews) and outputs (positive or negative score). The difference was only in the product area (DVDs or Electronics etc.). Such an approach is not possible in the cases being considered for DNN LLR, as there are different output classes between the source and target domains[16]. A method based around learning a linear classifier on an improved representation from a DBN or an msda was investigated as part of this project. The performance gain from transfer was found to be much less than in DNN LLR. A summary of the better results can be found in section G.

2.5 Multimodal Deep Learning

Novel work has been done on multimodal deep learning for video and audio recognition, using a variation on a DBN architecture called a bimodal deep autoencoder[22]. In [22], the goal was to use both audio and video of speech to recognize spoken letters and digits. The supplementary data came from additional source datasets: multiple different training datasets were used, some of audio only and some of both video and audio. Most interesting was the use of the TIMIT dataset[23]. The TIMIT dataset is not a dataset of spoken letters or numbers; it is a corpus of spoken English sentences. The use of TIMIT is thus knowledge transfer from a related domain, as in DNN LLR. The TIMIT data is clearly not within the target domain of the problem, though it is related. The paper does not comment further on its use beyond stating that it was used for unsupervised audio feature pretraining[22]. Its impact on the learner is expected to be significant: TIMIT contains 6300 training cases of sentences[23], while, combined, the other audio datasets used for training had just 2638 training cases of spoken letters and numbers. Thus most of the audio knowledge in the model came from the source domain of spoken sentences, rather than from the spoken letter/number target domain. Transfer learning from a related domain is demonstrated in that paper, although the use of out-of-domain data was not the focus of the work. Surprisingly, the paper does not go into any further detail about precisely how or why this was done, nor its effects[22]. It seems likely that it was similar to the methods discussed in section E.

3 BACKGROUND

3.1 Deep Neural Networks

A neural network has a number of layers of neurons. Raw data is fed into the bottom layer, is processed through a number of intermediate hidden layers, and the final layer produces the desired output. A key parameter of the neural network is how many hidden layers to have and how many neurons to have in each, i.e. how deep and wide to make the network. The number of neurons in the input and output layers is fixed by the problem domain (e.g. 784 pixels input, 10 output classifications). Conventionally, every neuron's output is an input to each neuron in the layer above: the network is layer-wise fully connected[24]. These parameters, the connectedness and the hidden layer sizing, define the neural network's topology.

[Figure 1: diagram of a typical DNN for image classification: an input layer taking a vector of pixel intensities; three sigmoid hidden layers annotated (bottom to top) as detecting relationships between pixels (e.g. lines), relationships between those relationships (e.g. line features such as corners, intersections and relative positions), and third order relationships between pixels (e.g. relationships between line features, such as the relative position of corners); and a softmax output layer giving a vector of probabilities for each class.]

Figure 1: A fairly typical DNN, for image classification. Each node is a neuron; each edge (connecting arrow) between neurons has an associated weight, including those from the bias neuron (shown dashed) that always outputs 1. The examples on the right are an analogy for the logic which may be occurring on those layers. For readability, only 5 neurons have been shown per layer.

A traditional neural net has only one or two hidden layers; neural networks with more are called deep neural networks (DNNs). These have the potential to generalize better than shallow neural networks[25]. While a sufficiently wide shallow neural net can achieve any task 2 [26][27], a deep neural network can be a more efficient solution[3]. This is the premise on which deep neural architectures are based.

2. The Universal Approximation Theorem[26][27] states that a sufficiently wide shallow neural net can approximate any function to arbitrary accuracy. It does however assume that ideal weights and biases can be discovered. It makes no requirement that existing (or any) algorithms can train a network to have those ideal weights, merely that such weights exist to approximate any function.

A deep neural network is shown in figure 1. Its inputs are pixel intensities, and the output is a vector of classification probabilities. Each neuron in the hidden layer has an output value, which is determined by applying the activation function σ to a weighted sum of its inputs. This

activation function allows each neuron to make a fuzzy binary 3 decision based on its inputs. The decision is a value between 0 and 1, which is output by the neuron. In the experiments used to validate DNN LLR, the sigmoid function (see figure 2) was used in the hidden layers. The output layer neurons have a softmax activation function[28] 4. This gives a discrete probability of the input belonging to a particular output class. Training the neural network is the process of determining the correct weights for each input to each neuron. This can be done with the backpropagation algorithm[29].

3.1.1 Backpropagation

Backpropagation[29] is the most well known algorithm for training neural networks. The rules of multivariate calculus are used to determine the slope of the error surface with respect to the inter-neuron weights. Once the error/weight derivatives are calculated, the weights are adjusted using gradient descent down the error slope. That is, for some learning rate ε, each weight w_ij is changed by Δw_ij = -ε ∂E/∂w_ij, and similarly for the biases. The learning rate ε is how much significance is placed on each training case. Thus the neural net being trained moves through the weight-space, towards values with low error.

Figure 2: The Sigmoid Activation Function σ(t) = 1/(1 + e^(-t)). For a neuron with a vector of inputs x, with weights w and bias b, this becomes, element-wise, h = σ(w·x + b) = 1/(1 + e^(-(w·x + b))).

3.1.2 Deep Neural Networks

Once trained appropriately, each neuron acts as a feature detector for a relationship between its inputs[31]. The deeper the network, the more abstract the relationships which can be described[3]. Consider identifying the number 7 (as shown in figure 3). It could be described by first identifying lines of pixels, and then recognizing a 7 if two lines meet in the upper right corner of the image; or by identifying all possible combinations of pixels that make up a 7. The latter example is a shallow architecture, whereas the former is deep.

Figure 3: Two examples of the digit seven from MNIST[30]. It is easier to describe their similarities in terms of lines and corners than in terms of pixels.

3. Some specialized neural networks do not use 0-1 bounded outputs, such as the linear neuron which outputs the linear weighted sum of its inputs.

4. The softmax neuron has another specialized activation function. For K neurons in the layer, with w_k and b_k being the weights and bias of the kth neuron, the output of the jth neuron is y_j = e^(w_j·x + b_j) / Σ_{k<K} e^(w_k·x + b_k) [28]. The effect of using this activation function is that each neuron gives a discrete probability of the output being in the class that that neuron reflects. The total probability will sum to 1 across all K neurons.
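To make the forward pass and the weight update concrete, the following is a minimal numpy sketch of a small sigmoid/softmax network trained by gradient descent. It is illustrative only: the layer sizes, learning rate and toy data are assumptions made for the example, not the settings or code used in the experiments reported here (which were run on the YADLF framework described in section H).

import numpy as np

def sigmoid(t):
    # The sigmoid activation of figure 2: sigma(t) = 1 / (1 + e^(-t))
    return 1.0 / (1.0 + np.exp(-t))

def softmax(z):
    # Softmax over each row; subtracting the max is for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
layer_sizes = [784, 50, 200, 5]      # illustrative topology: 784 pixels in, 5 classes out
weights = [rng.normal(0.0, 0.01, (a, b)) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(b) for b in layer_sizes[1:]]

def forward(x):
    """Forward pass: sigmoid hidden layers, softmax output layer."""
    activations = [x]
    for W, b in zip(weights[:-1], biases[:-1]):
        activations.append(sigmoid(activations[-1] @ W + b))
    activations.append(softmax(activations[-1] @ weights[-1] + biases[-1]))
    return activations

def backprop_update(x, y_onehot, lr=0.1):
    """One gradient descent step: each weight changes by -lr * dE/dw (section 3.1.1),
    where E is the cross-entropy error of the softmax output."""
    acts = forward(x)
    delta = acts[-1] - y_onehot                # error signal at the output layer
    for i in reversed(range(len(weights))):
        grad_W = acts[i].T @ delta / len(x)
        grad_b = delta.mean(axis=0)
        if i > 0:                              # push the error through the sigmoid below
            delta = (delta @ weights[i].T) * acts[i] * (1.0 - acts[i])
        weights[i] -= lr * grad_W
        biases[i] -= lr * grad_b

# Toy usage with random data standing in for 28x28 pixel intensities
x = rng.random((32, 784))
y = np.eye(5)[rng.integers(0, 5, 32)]
backprop_update(x, y)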

This allows for a more compact neural network, with more abstract descriptions allowing better generalizations. Furthermore, in a deep architecture these feature detectors can be reused[3]; for example, detecting a line at the top of the image is also used for recognizing a 5. However, there are issues with training deep neural networks directly with backpropagation, which can be overcome by using a Deep Belief Network (DBN)[32].

3.2 Deep Belief Networks

In 2006, a new method for training deep neural networks was devised[32]. This method functions by first training a deep belief network (DBN) to learn the structure of the domain's input elements. This allowed knowledge to be gained from unlabelled domain data, which is much more available than labelled data. As discussed above, there are many more images of letters than there are images of letters paired with a digital label saying which letter they are. The DBN technique also allowed the training of deeper networks[32]. The deep belief network is a stack of Restricted Boltzmann Machines (RBMs). It can be used to initialize a deep neural network[25]. The algorithm for training a DBN is known as greedy layer-wise training[32], or greedy layer-wise pretraining when it is used to initialize a deep neural network[25]. This refers to it greedily training each layer without considering the larger network.

3.2.1 Restricted Boltzmann Machines

A Restricted Boltzmann Machine (RBM) is a generative, stochastic, energy based model for learning the probability distributions of its inputs[34]. As a generative model it is unsupervised: it is trained to recreate its input. The nodes in an RBM are in two layers: the visible layer (input/output) and the hidden layer. RBMs are trained using Contrastive Divergence[35], learning a weight matrix between the layers which allows each layer to be used to reconstruct the other. The hidden layer can encode the input, and from this encoding the most likely input (visible layer values) can be reconstructed. This means that, through learning the best values for the weights, the hidden layer is being trained to encode the most important features of the input layer. Depending on the distribution expected in the visible and hidden layers, different variations of the RBM are used. For the Bernoulli-Bernoulli and Gaussian-Bernoulli RBMs[31] used in this research, the probability distribution of the hidden layer h given the visible layer x is P(h | x) = σ(b + Wx)[36][37]. The similarity in form of P(h | x) = σ(b + Wx) to the neural network activation function h = σ(b + Wx) is important: it is why a DBN can be used to initialize a DNN. The difference is the change from a probability vector for a Bernoulli distribution to a vector of values between 0 and 1.
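As an illustration of the contrastive divergence training just described, the following is a minimal numpy sketch of a Bernoulli-Bernoulli RBM updated with CD-1. The class name, learning rate and toy data are assumptions made for the example; the Gaussian-Bernoulli variant used on the bottom layer in this research differs only in how the visible layer is modelled, and is omitted for brevity.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

class BernoulliRBM:
    """A Bernoulli-Bernoulli RBM trained with one-step contrastive divergence (CD-1)."""
    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.normal(0.0, 0.01, (n_visible, n_hidden))
        self.b_hid = np.zeros(n_hidden)   # hidden (recognition) biases, kept when initializing a DNN
        self.b_vis = np.zeros(n_visible)  # visible (generative) biases, later discarded

    def hidden_probs(self, v):
        # P(h | v) = sigma(b + W v): the same form as the DNN activation function
        return sigmoid(v @ self.W + self.b_hid)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_vis)

    def cd1_update(self, v0, lr=0.05):
        """One CD-1 step: encode the data, reconstruct it, re-encode the reconstruction,
        and move the weights toward the data statistics and away from the reconstruction."""
        ph0 = self.hidden_probs(v0)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)   # stochastic hidden sample
        v1 = self.visible_probs(h0)                             # reconstruction
        ph1 = self.hidden_probs(v1)
        n = len(v0)
        self.W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
        self.b_hid += lr * (ph0 - ph1).mean(axis=0)
        self.b_vis += lr * (v0 - v1).mean(axis=0)

# Toy usage: binary "images" standing in for thresholded pixel data
rng = np.random.default_rng(1)
data = (rng.random((64, 784)) > 0.8).astype(float)
rbm = BernoulliRBM(784, 50)
for _ in range(10):
    rbm.cd1_update(data)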

3.2.2 Generative Model Greedy Layer-Wise Pretraining (Unsupervised)

A deep belief network is equivalent to a stack of RBMs[32]. Each layer is trained to reconstruct the layer below (see figure 4). As discussed above, the hidden layer of the RBM is a set of feature detectors for the visible layer. This means the third layer reconstructs the second layer, which reconstructs the input. Each layer is more abstract, which is desired for a deep neural network[38]. This greedy layer-wise pretraining is used to learn from the supplementary source dataset in the DNN LLR algorithm.

[Figure 4: three-panel diagram. Left: greedy layer-wise training, with a GB-RBM trained on the training cases and two BB-RBMs each trained on the hidden layer below. Center: the trained RBMs stacked to form a DBN. Right: the DBN with an output layer appended, initializing a DNN.]

Figure 4: Left: Greedy layer-wise pretraining: each RBM is trained fully, then the next is trained to learn the hidden layer of the one below. Center: The RBMs are combined to form a DBN; the input distribution can be sampled by randomly initializing the top layer, then running the top two layers as an RBM until it reaches equilibrium, then generating all the layers below[32]. Right: An output layer is appended to initialize a DNN using the DBN. (Diagram adapted from [33].)

3.2.3 DBN to DNN Backpropagation (Supervised)

Once the DBN has been trained, it can be used to initialize a deep neural network by discarding the generative biases, appending an output layer, and then training with back-propagation[33][31] (see figure 4). This makes intuitive sense: in a DBN each layer must encode the data required to reconstruct the layer below, thus it must detect the most important features. The weight matrix for the added output layer establishes the relationships between the features in the DBN top layer and a particular output. The back-propagation trains this top level, and also adjusts the other levels[39].
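The two steps just described, greedy layer-wise pretraining and the DBN-to-DNN conversion, can be sketched as follows, reusing the BernoulliRBM class from the sketch above. The function names, epoch counts and the all-Bernoulli stack are assumptions made for illustration; the actual experiments use a Gaussian-Bernoulli RBM on the bottom layer and the YADLF framework of section H.

import numpy as np

def pretrain_dbn(data, hidden_sizes, epochs=10, lr=0.05):
    """Greedy layer-wise pretraining (section 3.2.2): train each RBM fully, then feed its
    hidden-layer probabilities upward as the 'visible' data for the next RBM."""
    rbms, layer_input = [], data
    for n_hidden in hidden_sizes:
        rbm = BernoulliRBM(layer_input.shape[1], n_hidden)
        for _ in range(epochs):
            rbm.cd1_update(layer_input, lr=lr)
        rbms.append(rbm)
        layer_input = rbm.hidden_probs(layer_input)   # re-represent the data for the layer above
    return rbms

def dbn_to_dnn(rbms, n_classes, seed=0):
    """DBN to DNN conversion (section 3.2.3): keep the weights and hidden biases, discard the
    generative (visible) biases, and append a randomly initialised softmax output layer."""
    rng = np.random.default_rng(seed)
    weights = [rbm.W.copy() for rbm in rbms]
    biases = [rbm.b_hid.copy() for rbm in rbms]
    weights.append(rng.normal(0.0, 0.01, (rbms[-1].W.shape[1], n_classes)))
    biases.append(np.zeros(n_classes))
    return weights, biases   # ready to be fine-tuned with backpropagation on labelled data

# Toy usage: a [50, 200]-hidden-layer DBN initialising a 5-class DNN
rng = np.random.default_rng(2)
unlabelled = (rng.random((128, 784)) > 0.8).astype(float)
rbms = pretrain_dbn(unlabelled, hidden_sizes=[50, 200])
dnn_weights, dnn_biases = dbn_to_dnn(rbms, n_classes=5)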

As the normal DBN greedy layer-wise pretraining algorithm only considers the layer below, this global training algorithm can be useful to facilitate better overall performance[32]. A DNN is initialized using these stacked feature detectors (the DBN) and trained to create a function approximator (such as a classifier). Backpropagation is used during DNN LLR to refit the DBN trained on the source dataset into a classifier that functions on the target domain.

3.3 Other Deep Neural Networks

Not all deep neural networks are initialized with a deep belief network. By a strict definition, a deep neural network is any neural network with more than 2 hidden layers. These can be trained directly with backpropagation, however performance is generally worse[25]. There are also other methods that can be used to create useful features in deeper nets, such as convolutional neural networks, which force structure onto the neural network to give rise to translational invariance[24]. There are also other methods of training a DBN, including some supervised methods which can be used to output a classification[32][40]. Throughout this paper, the phrase deep neural network (DNN) refers to a neural network which has been initialized with an unsupervised DBN and fine-tuned with backpropagation, unless specifically stated otherwise.

4 THE DNN LOW LEVEL REINITIALIZATION ALGORITHM

4.1 Method

[Figure 5: block diagram of the DNN LLR training process: (1) greedily layer-wise train on the source unlabelled training dataset; (2) reinitialize the bottom layer; (3) append the output layer; (4) backpropagation train on the target labelled training dataset; (5) evaluate on the target labelled test dataset.]

Figure 5: Block diagram showing the training process of a deep neural net using the DNN LLR algorithm.

The DNN LLR algorithm is performed as shown in the block diagram in figure 5. DBNs are not the only generative model that could be used in step 1; another approach, based on using msdas, is discussed in section F.

4.1.1 Reinitialization

In the second step of the DNN LLR algorithm, the lowest level RBM is reinitialized. In this step all the weights and biases in that RBM are reset to small random values. Equivalently, the process could be described as resetting all the weights and biases leading from the input layer to the lowest hidden layer. The reinitialized values were Gaussian distributed with mean 0.00

and standard deviation 0.01. This reinitialization allows new weights to be learnt more easily, as the neuron connection weights will no longer be set to any large values. Large values would otherwise take many iterations of gradient descent to change. This requirement has been confirmed empirically; results can be found in section E.

4.2 Transfer of Features for Better Structure

It has been suggested that initializing a DNN with a DBN puts the neural net into a state from which it is better able to generalize from its training data to different test data[25]. The DBN pretraining causes useful feature detectors to be created, and learning an output mapping based on these features is more generalizable than directly training a DNN with backpropagation alone. This idea of more abstract features makes intuitive sense based on the example of how to recognize the number 7 given in section 3.1.2. In the work on one-shot learning discussed in section 2.3, significant benefits were found by having the classifier able to reason in terms of high level features[17]. It was also shown that these high level features do not have to be based only on the target domain.

Any additional features learnt by the DBN, but not needed in the final DNN, will simply be ignored during final supervised training[25]. Further, by transferring the feature detectors from another, related domain, new knowledge is added, which may repair flaws in the target training dataset. For example, if the source training dataset contained 7s with the top line having an angle of ±10°, and the smaller target training dataset only contained 5s with the top line having an angle of ±5°, then when the ±10° top-line feature detector is transferred the final network will be able to recognize 5s with a ±10° top line, even though no such 5s existed in either training dataset. Thus new knowledge is added which allows the network to overcome limitations in its target training dataset. Small datasets are more likely to have such flaws, and this leads to poor generalization. However, some desirable features for the target domain will not be found, since the feature detectors are coming from the source. As the target dataset is smaller than the source, it is expected that there will be fewer desirable features missed, compared to those gained by transfer. This does depend on the closeness 5 of the domains and on the relative sizes of the source and target datasets. These expectations are confirmed by the results found.

4.3 Theoretical Justification for Reinitializing Layers

A DBN is a set of feature detectors. Those features are tied to features in the input space, but the direct ties are released by reinitializing the lowest layer.

5. Closeness of domains is a rather difficult concept to define. It is hard to describe what makes two domains particularly good (or bad) for knowledge transfer. See 7.1.2.
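Putting the pieces together, the following sketch shows the five DNN LLR steps from figure 5, reusing the pretrain_dbn helper sketched earlier. It is a simplified illustration rather than the project's implementation: the N(0, 0.01²) reinitialization follows section 4.1.1, while the supervised fine-tuning and evaluation (steps 4 and 5) are ordinary backpropagation training of the returned network on the labelled target data, as described in sections 4.4 and 5.

import numpy as np

def dnn_llr(source_unlabelled, n_classes, hidden_sizes=(50, 50, 200), seed=0):
    """DNN Low Level Reinitialization, following the block diagram in figure 5."""
    rng = np.random.default_rng(seed)

    # 1. Greedily layer-wise train a DBN on the unlabelled *source* dataset.
    rbms = pretrain_dbn(source_unlabelled, list(hidden_sizes))
    weights = [rbm.W.copy() for rbm in rbms]
    biases = [rbm.b_hid.copy() for rbm in rbms]

    # 2. Reinitialize the bottom layer: reset every weight and bias leading from the
    #    input layer to the lowest hidden layer to small Gaussian values
    #    (mean 0.00, standard deviation 0.01), discarding low level source knowledge.
    weights[0] = rng.normal(0.0, 0.01, weights[0].shape)
    biases[0] = rng.normal(0.0, 0.01, biases[0].shape)

    # 3. Append a randomly initialised softmax output layer for the target classes.
    weights.append(rng.normal(0.0, 0.01, (hidden_sizes[-1], n_classes)))
    biases.append(np.zeros(n_classes))

    # 4./5. The returned network is then trained with backpropagation on the labelled
    #       target training data (with no greedy pretraining on the target, per
    #       section 4.4) and evaluated on the target test data.
    return weights, biases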

The features are not, in the DBNs considered, tied to the output space (classification). The DBNs are trained unsupervised, without using output labels 6. The features only become tied to the output space once the output layer is appended and used to train the DNN. Thus, after the low-level reinitialization and appending of the output layer, the neural net contains knowledge in the form of feature detectors which are not closely tied to its inputs, nor to its outputs. It is thus largely independent of the specifics of the domain, and therefore transferable.

The bottom and top layers are responsible for this transfer. During the backpropagation training of the DNN, new relationships are learnt. The output layer maps between the abstract features and the output classification probabilities. The bottom layer learns to link the raw pixel values to the most useful features described by the central layers. Simultaneously, the central features are adjusted to be most applicable to the new domain.

4.4 The Importance of Backpropagation

It should be noted that the final stage of training is to train the Deep Neural Net with backpropagation on the target data, without first pretraining it using greedy layer-wise training on the target. This is because backpropagation tunes the whole network; whereas, as its name suggests, greedy layer-wise training trains each layer greedily without taking into account the optimal solution for the whole network. The bottom layer, which has now been reinitialized, must be trained to utilize the feature detectors maintained in the higher layers for the new inputs. Greedy layer-wise training would result in the bottom layer modeling the new (target) input domain, and then, as it is applied upwards, that model would be placed over the maintained relationships in higher layers, destroying a lot of the information that DNN LLR is designed to keep. This has been confirmed experimentally: introducing a greedy layer-wise training step using the target training data after the reinitialization was evaluated, and it decreased performance as expected (results are presented in section 6.7).

Using backpropagation alone is a more suitable algorithm for the retraining. Backpropagation will retrain the bottom layer based on the state of the whole network. The change in the weights of each layer trained with back-propagation is based both on its inputs and on the calculated error vectors of the relationships above. Thus the bottom layer is trained to create new connections between the inputs and the existing maintained knowledge. Backpropagation also fine-tunes the transferred relationships for the new domain and for the desired output, and trains the output layer[25]. Sufficient training data is, however, needed to allow the bottom layer to make these new connections.

6. A number of alternative supervised fine-tuning algorithms exist for deep belief networks[31][32], other than converting to a DNN with the appending of an output layer. They are shown to produce significant improvement over purely unsupervised pretraining[31]. These supervised DBN training algorithms would tie the feature detectors to the output classification, so would not be suitable for use with DNN LLR without adaptation being made to the algorithm.

5 EMPIRICAL EVALUATION METHODS

5.1 Experimental Setup and Evaluation

To evaluate DNN LLR, as well as other knowledge transfer techniques, a multitude of experiments were carried out. Each experiment involved training a model with a particular quantity of target training data and a fixed amount of source data. The quantity of target data was varied to investigate how well the algorithm performed at each stage. All target and source datasets were created by partitioning the MNIST[30] dataset into subdatasets (the source and target datasets) with 5 different classes in each. The MNIST dataset is a collection of labelled greyscale images of handwritten digits[30]. It is large enough that splitting it as discussed still leaves viable training and evaluation datasets. One such partitioning is shown in figure 6. 40 such divisions into source and target datasets were evaluated to test the algorithms. A full list of the target and source domains/datasets evaluated, together with their performance, can be found in table 2. As well as the different domain transitions, four different network topologies were evaluated in order to highlight how well the algorithm scales.

Figure 6: A small sample of the partitioned MNIST dataset. In this example it is partitioned into 2 domains, either of which could be the source or the target. Using the full dataset sampled on the left as source data, to aid in training a classifier for the target domain problem of recognizing digits like those on the right, was one of the highest performing transitions with DNN LLR.

In the empirical evaluations a variety of metrics were considered, to highlight the algorithm's strengths and weaknesses. They are detailed with each experiment as evaluated in section 6. The full details of these methods and the rationale behind their use may be found in section B. All experiments reported here were carried out on software developed for this project (detailed in section H).

5.2 The Control Experiments

To assess the advantages gained through the use of the supplementary source data and the knowledge transfer models, control experiments were carried out. Two types of models were trained for use as a control: a Deep Neural Net (DNN) and a Linear Classifier, each trained on the target dataset alone. These provide a baseline to compare the knowledge transfer methods against.
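As a concrete illustration of the dataset construction described in section 5.1, the following sketch partitions a labelled digit dataset into a 5-class source subdataset (used unlabelled) and a 5-class target subdataset (used with its labels). The function name and the stand-in random arrays are assumptions made for the example; a real MNIST loader would be substituted to reproduce the actual partitioning (see section B.5).

import numpy as np

def partition_by_classes(images, labels, source_classes, target_classes):
    """Split a labelled digit dataset into a source subdataset (labels discarded, since
    DNN LLR uses the source data unlabelled) and a target subdataset (labels kept for
    the supervised fine-tuning and evaluation)."""
    source_mask = np.isin(labels, source_classes)
    target_mask = np.isin(labels, target_classes)
    source_images = images[source_mask]               # labels intentionally dropped
    return source_images, (images[target_mask], labels[target_mask])

# Toy usage for the "02468 to 13579" transition, with random arrays standing in for MNIST
rng = np.random.default_rng(3)
images = rng.random((1000, 784))
labels = rng.integers(0, 10, 1000)
source, (target_x, target_y) = partition_by_classes(images, labels, [0, 2, 4, 6, 8], [1, 3, 5, 7, 9])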

5.2.1 DNN Control

[Figure 7: block diagram of the Control DNN training process: greedily layer-wise train on the target unlabelled training dataset, append a softmax output layer, backpropagation train on the target labelled training dataset, and evaluate on the target labelled test dataset.]

Figure 7: Block diagram showing the training process for the Control DNN.

The Deep Neural Network Control experiment is a conventional approach to classifying digits: a DNN trained only using the target domain data. In the control, a deep neural net was trained as shown in figure 7. Throughout this process all the same standard techniques, such as early-stopping and L2 weight decay (see section B), were applied as in the knowledge transfer experiments. A DNN Control model was trained for each domain transition, at each training set size increment, and for each topology, matching the cases for the experimental models (such as DNN LLR) being investigated. The corresponding control experiments are used for comparison throughout the rest of this paper.

5.2.2 Linear Classifier Control

[Figure 8: block diagram of the Control Linear Classifier training process: backpropagation train on the target labelled training dataset and evaluate on the target labelled test dataset.]

Figure 8: Block diagram showing the training process for the Control Linear Classifier.

A secondary control, referred to as the linear classifier, was also trained. This Linear Classifier Control is a neural network with no hidden layers and a softmax output layer. It is the same for all topologies, as it is not, by its very definition, scalable in that way. As it does not have any hidden layers, it does not meet the requirements of the universal approximation theorem[27][26]; it is only able to solve linearly separable problems. This means there are significant numbers of input images that it cannot correctly classify even when trained ideally.
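For illustration, the Linear Classifier Control of figure 8 amounts to softmax regression: a single softmax output layer trained directly with gradient descent on the labelled target data, with no hidden layers and no pretraining. The following sketch makes that concrete; the learning rate and epoch count are illustrative assumptions, not the experimental settings of section B.

import numpy as np

def train_linear_classifier(x, y_onehot, lr=0.1, epochs=100, seed=0):
    """The Linear Classifier Control: no hidden layers, just a softmax output layer
    trained with gradient descent on the labelled target data, with no pretraining."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.01, (x.shape[1], y_onehot.shape[1]))
    b = np.zeros(y_onehot.shape[1])
    for _ in range(epochs):
        z = x @ W + b
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)              # softmax probabilities
        delta = (p - y_onehot) / len(x)                # cross-entropy gradient at the output
        W -= lr * x.T @ delta
        b -= lr * delta.sum(axis=0)
    # Only linearly separable structure can be captured, so some inputs can never be
    # classified correctly, no matter how well it is trained.
    return W, b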

As shown in figure 8, it was trained in a similar fashion to the DNN Control, but without any pretraining. There was no need, or capacity, for a DBN to be used to initialize the weights[25]. This control case allows for additional validation of the results.

6 EMPIRICAL RESULTS

6.1 Improvement Frequency

The key question in evaluating a technique that may gain an improvement in this way is how often that improvement will be realized. In the case of domains as close as different subsets of the classes of handwritten digits, improvement is very likely. The additional training time taken to train on the extra related data (the source dataset) is not significant, particularly since the DBN training algorithm is quite fast[32]. The improvement gained for free by making use of the freely available source data with DNN LLR is shown in table 1.

Target Set Size    Portion of Transitions Improved, by hidden layer sizes
                   [50, 200]   [50, 50, 200]   [100, 100, 400]   [100, 100, 100, 400]
50                   25.64%        82.05%           76.92%             46.15%
100                   5.13%        76.92%           56.41%             46.15%
250                  33.33%        64.10%           20.51%             46.15%
300                  35.90%        61.54%           15.38%             46.15%
500                  15.38%        74.36%           23.08%             58.97%
1000                 74.36%        97.44%           38.46%             51.28%
2000                 89.74%       100.00%           82.05%             61.54%
4000                 92.31%        94.87%           76.92%             76.92%
8000                 89.74%        94.87%           69.23%             82.05%
16000                87.18%        87.18%           71.79%             69.23%
22530                82.05%        89.74%           51.28%             64.10%

Table 1: The portion of domain transitions where using DNN LLR, to learn from both the source and the target domain, results in a lower error rate when evaluated on target domain test data than the DNN Control (which is trained on the target domain training data alone). A result of 0% would indicate that, for that quantity of target data and that topology, no DNN LLR transfer cases did better than the DNN Control. Conversely, a result of 100% indicates that in all source-target domain transitions investigated, the neural net trained with DNN LLR performed better than the control.

6.2 Expected Improvement

Figure 9: The improvement of DNN LLR vs the DNN Control case, Improvement = (Control Error Rate - DNN LLR Error Rate) / Control Error Rate, averaged over all domain transitions, shown for all four topologies considered. A single standard deviation is shown around the means. This plot corresponds to figure 10.

The second question is how much improvement in classification accuracy is expected to be gained by using DNN LLR. The absolute performance of the DNN Control and of the DNN LLR algorithm is shown in figure 11. The mean relative difference in error rate from the control (i.e. the improvement) when trained with 2000 target domain cases is shown in table 2. This is close to the point where DNN LLR provides maximum benefit, as can be seen in figure 9 and figure 10. While the improvement that can be gained is determined by which domains are being transferred between, there are some trends based on the neural network topology and the quantity of target domain training data used. There was a greater spread of results for neural networks with the hidden layers sized [100,100,400] and [100,100,100,400], and the average improvements were smaller.

The use of DNN LLR helps to get the neural network to a high standard of performance earlier than not making use of the additional data. This is shown in figure 11. The initial performance gain is quite large, but its lead over the control shrinks as more target data is added. As the target dataset size becomes large, performance is very similar with or without DNN LLR, confirming the expectation that there would be less gain in transferring feature detectors when the datasets are of similar size. For the wider networks evaluated, the average gain when the source and target datasets were the same size was actually marginally negative, though with a high degree of variance depending on the particular transition being considered. This is as expected: given enough data for the target task, there is no need to transfer knowledge from elsewhere. With plenty of training data, high quality background knowledge can be extracted from the target dataset alone. In general it can be seen that the utilization of the extra source domain data better initializes the network for learning, resulting in the early improvements.
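The improvement metric used in figure 9, table 2 and figure 10 is a relative reduction in error rate. A small helper, with purely illustrative numbers, makes the scale explicit:

def improvement(control_error, llr_error):
    """Relative improvement of DNN LLR over the DNN Control:
    (Control Error Rate - DNN LLR Error Rate) / Control Error Rate.
    0% means the two do equally well; 50% means DNN LLR halved the control's error
    rate; negative values mean DNN LLR did worse than the control."""
    return (control_error - llr_error) / control_error

# Illustrative values only: a control error rate of 10% and a DNN LLR error rate of 5%
# give an improvement of 50%.
assert abs(improvement(0.10, 0.05) - 0.5) < 1e-12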

Transition          [50, 200]   [50, 50, 200]   [100, 100, 400]   [100, 100, 100, 400]     Mean
(source to target)  Improvement  Improvement     Improvement       Improvement           Improvement
24589 to 01367        71.22%       83.16%          70.62%            81.07%               76.52%
02589 to 13467        65.77%       71.54%          69.01%            25.08%               57.85%
12348 to 05679        50.75%       59.83%          56.94%            49.14%               54.17%
02468 to 13579        27.63%       66.92%          50.90%            66.56%               53.00%
23578 to 01469        75.40%       78.98%          72.09%           -15.59%               52.72%
45679 to 01238        63.24%       67.38%          60.02%            19.37%               52.50%
23679 to 01458        35.04%       73.59%          74.68%            21.26%               51.14%
05679 to 12348        66.11%       54.55%          58.56%            25.19%               51.10%
03458 to 12679        33.84%       69.27%          50.82%            38.43%               48.09%
12345 to 06789        60.47%       57.30%          66.92%             1.16%               46.46%
03689 to 12457        20.99%       64.03%          36.71%            60.24%               45.49%
02456 to 13789        48.83%       62.11%          35.15%            19.45%               41.39%
45678 to 01239        75.65%       74.67%          50.16%           -39.86%               40.16%
12458 to 03679        53.31%       61.74%          36.27%             3.84%               38.79%
02378 to 14569        60.19%       58.54%          41.42%            -8.52%               37.91%
01567 to 23489        52.95%       58.93%          26.18%            11.37%               37.36%
14569 to 02378        58.35%       58.45%          38.46%            -7.50%               36.94%
01357 to 24689        38.43%       60.05%          51.80%           -10.67%               34.90%
23457 to 01689        42.78%       69.78%          35.10%           -11.35%               34.08%
12679 to 03458        42.63%       45.67%          22.70%            18.98%               32.50%
23456 to 01789        31.79%       39.32%          18.18%            40.53%               32.46%
01689 to 23457        26.54%       50.22%          34.60%             3.56%               28.73%
02367 to 14589        29.94%       28.99%          45.88%             6.71%               27.88%
01469 to 23578        12.60%       41.24%          25.66%            31.15%               27.66%
12457 to 03689        29.54%       53.11%          11.95%            10.53%               26.28%
01239 to 45678        -2.10%       47.53%          29.49%            20.81%               23.93%
02357 to 14689        46.30%       64.72%          19.24%           -36.29%               23.49%
01248 to 35679        41.94%       42.55%          12.73%            -3.29%               23.48%
01238 to 45679        43.16%       43.16%           4.30%            -8.06%               20.64%
13789 to 02456        45.26%       33.52%         -23.10%            25.49%               20.29%
01234 to 56789        -0.36%       34.40%          16.63%            -0.07%               12.65%
01367 to 24589       -18.83%       36.94%           9.67%             5.68%                8.37%
13579 to 02468        40.48%       46.01%           8.51%           -62.83%                8.04%
13467 to 02589        34.98%       19.31%          -9.98%           -42.18%                0.53%
14689 to 02357        10.26%       52.07%          -6.83%           -58.34%               -0.71%
06789 to 12345        32.27%       50.78%        -124.86%            29.74%               -3.02%
01458 to 23679         9.51%       28.32%         -72.01%            15.21%               -4.74%
14589 to 02367       -14.77%       17.86%         -11.94%           -57.05%              -16.48%
35679 to 01248         0.99%       10.70%         -70.88%           -45.06%              -26.06%

Table 2: The improvement found for each transition by using DNN LLR over the DNN Control case. Improvement = (Control Error Rate - DNN LLR Error Rate) / Control Error Rate, at 2000 training cases, across the 4 different topologies. For compactness, the transition has been written as: <digits in source> to <digits in target>.

Hidden Layer Sizes:
Target Set Size                 [50, 200]           [50, 50, 200]       [100, 100, 400]     [100, 100, 100, 400]
(Portion of Source Set Size)    Mean (std. dev.)    Mean (std. dev.)    Mean (std. dev.)    Mean (std. dev.)
50 (0.22%)                      -11.39% (19.29%)     5.17% (8.71%)       2.86% (6.31%)       0.63% (3.95%)
100 (0.44%)                     -35.32% (36.12%)     8.92% (10.95%)      1.41% (11.25%)      0.61% (7.24%)
150 (0.67%)                     -40.31% (42.7%)      6.98% (16.47%)    -13.89% (21.47%)     -2.44% (12.22%)
200 (0.89%)                     -31.85% (44.69%)     4.76% (16.24%)    -18.41% (26.6%)      -3.73% (13.83%)
250 (1.11%)                     -24.53% (51.84%)     3.94% (17.38%)    -17.18% (22.58%)     -2.66% (13.77%)
300 (1.33%)                     -19.8% (43.69%)      7.13% (20.64%)    -18.92% (36.13%)     -3.39% (15.37%)
500 (2.22%)                     -28.83% (45.53%)     7.33% (20.36%)    -23.65% (35.06%)     -4.59% (20.16%)
1000 (4.44%)                     12.8% (31.06%)     46.29% (18.6%)     -12.94% (39.17%)     -9.8% (28.15%)
2000 (8.88%)                     37.0% (24.01%)     52.24% (17.31%)     23.63% (41.77%)      5.74% (33.73%)
4000 (17.75%)                    31.86% (29.21%)    45.08% (20.12%)     17.02% (46.5%)      14.13% (42.96%)
8000 (35.51%)                    27.61% (22.24%)    31.07% (17.18%)      4.01% (52.18%)     10.23% (42.42%)
16000 (71.02%)                   14.22% (39.64%)    19.98% (19.73%)     -1.57% (65.44%)     -0.1% (48.42%)
22530 (100.0%)                   10.59% (33.36%)    18.09% (20.44%)     -6.8% (56.48%)      -2.63% (55.7%)

Figure 10: The Mean and Standard Deviation of the Improvement of DNN LLR over the DNN Control case. Improvement = (Control Error Rate - DNN LLR Error Rate) / Control Error Rate. A 0% Improvement occurs when the DNN LLR model does exactly as well as the DNN Control. A 50% Improvement occurs when the DNN LLR model has half the error rate of the DNN Control. This table corresponds to figure 9.

Figure 11: Mean performance of the Control and the DNN LLR algorithms across all domain transitions.

6.3 The Requirement for Sufficient Target Domain Training Data

DNN LLR, unlike single-shot learning algorithms, requires a moderate quantity of target domain training data to provide benefit. A sharp jump in the frequency of improvement (table 1) and in the quality of improvement (figure 11) is seen when the quantity of target data exceeds 1000 or 2000 cases (depending on topology). The exceptions to this are the [50,50,200] topology, and all topologies at extremely small quantities of target data, which always have a high likelihood of DNN LLR doing better than the DNN Control (table 1). However, while improvement is likely in these cases, it is very small in magnitude (see figure 11), and it certainly does not exceed the Linear Classifier Control (see section D). This delay, during which the DNN LLR model is out-performed by the control, can be linked to the quantity of target training data required to learn useful weights in the reinitialized bottom layer. The exact quantity of data required to reach this point is thus expected to be linked to the learning rate (see section 3.1.1) and to the width of the bottom layer. Verification of the exact effects of adjusting the learning rate remains future work (see section 7.1.1). The increased delay for wider networks was seen in evaluation.

6.4 Performance in Deeper and Wider Topologies

The deeper networks perform better under DNN LLR than their shallower counterparts. The network with hidden layers sized [50,50,200] saw better gains than the network with [50,200], and the network with hidden layers [100,100,100,400] performed better than that with [100,100,400]. This can be attributed to the deeper networks having higher-quality features available for transfer: their higher-level features are more abstract and thus generalize better to the new domain.

Conversely, the gain in wider networks is smaller, and arrives later, than in narrower topologies. As discussed above, the reason for the delay is the additional training data required to retrain the larger bottom layer. At the same time, the extra data needed to retrain that layer is also available to the control, directly improving its performance. Thus, by the time the DNN LLR model begins functioning, it must be compared against an already well-trained network, and gains are harder to obtain. The wider network also has more capacity, due to its extra neurons, to discover good features from the limited target dataset during the Control's DBN pretraining. In short, DNN LLR improvement is worse in wider networks and better in deeper networks.

6.5 DNN LLR acts as a Superior Regularizer

Regularization algorithms are techniques that reduce overfitting. Overfitting occurs when the model learns from coincidences (i.e. sampling noise) in the training data. Overfitting is directly opposed to good generalization: when the neural network is keying off facts that are only true in the training data, it is not going to generalize well to real-world (or test) data. Several techniques are used in these experiments (including the controls) to reduce overfitting, in particular early stopping[41] and L2 weight decay[42][43].

The use of DBNs to initialize DNNs is itself believed to act as a regularizer[25]. Initializing the DNN using DNN LLR is superior as a regularizer to initializing it with DBN training on the target data (as is done in the DNN Control). If the neural network generalizes very well, then it will perform as well on the test dataset as it does on the training dataset, or better. The portion of all experiments where this occurs, across all domain transitions, all topologies and all training set sizes, is shown in table 3. These results support the work of [25], as the DNN Control outperforms the Linear Classifier Control. The use of DNN LLR significantly outperforms the DNN Control in this regard.

Algorithm                 | Portion with test error <= training error
DNN LLR                   | 30.2%
DNN Control               | 20.5%
Linear Classifier Control | 8.7%

Table 3: A comparison of test error versus training error for the various algorithms, across all experiments.

This does not, itself, mean that it performs better on an absolute scale; the results shown in table 3 compare each model's test dataset performance against its own training dataset performance. It does, however, indicate that a notable improvement in the generalization capacity of the network can be gained by using DNN LLR.

6.6 Best and Worst Domain Transitions

It is expected that some transitions consistently do better than others. This expectation is confirmed in the results shown in table 4. It was also seen in the results shown in table 2, at training set size 2000. Here a transition from A to B refers to transferring knowledge from a source dataset containing the digits in A to the target domain problem of recognizing the digits in B. For brevity, target and source domains are expressed as a string of digits representing those that they contain; so 01234 is the domain containing the digits 0, 1, 2, 3, and 4.

The best performance on a given domain was fairly consistent across DNN topology and training set size: most of the transitions that are the best for one configuration are also the best for multiple other configurations. Experiments found this to be even more true for the worst transitions. For the majority of configurations, when applying DNN LLR the worst transition was 35679 to 01248. This makes sense, as few features are common between those domains, so suitable feature detectors could not be transferred.

Which transitions do well compared to others is determined almost exclusively by the information contained in their training datasets. The same transitions are expected to do well in all configurations because their source training dataset contains notions that are particularly reusable in the target domain. This is further confirmed by transitions that contain similar but slightly different items in their domains, such as 45678 to 01239 and 45679 to 01238. The strong performance of these overlapping transitions is indicative of common knowledge in the sources that is transferable to the targets. However, it is also worth noting that these transfers are not symmetric. It might be expected that if a particular transition does well (or poorly) then the reverse transition will also do well (or poorly).

However, it can be noted in the results in table 4 that no such flipped duplicates occur in the very best or very worst cases. For example, with hidden layers sized [100, 100, 400] and 16000 training cases, the worst transition, 35679 to 01248, performs 382% worse than the control, while the reverse transition, 01248 to 35679, shows a 29% improvement over the control.

The reason for this asymmetry in performance is the asymmetry in the algorithm: DNN LLR is not a symmetrical algorithm. As discussed earlier, and as can be seen in figure 5, a different learning algorithm is applied to the source and to the target. The source dataset is learnt from using greedy layer-wise training[32], which learns feature detectors for reconstruction. The target training dataset is learnt from using backpropagation[29], which adjusts the features to give a good classification. These are very different algorithms in purpose and implementation. Extensive discussion of the differences between what is learnt by DBN pretraining and by traditional backpropagation fine-tuning can be found in [5], with examples from the MNIST dataset used in this experiment.

Best Transition:

Hidden Layer Sizes  | 1000 (4.44%)   | 2000 (8.88%)   | 4000 (17.75%)  | 8000 (35.51%)  | 16000 (71.02%) | 22530 (100.0%)
50, 200             | 02378 to 14569 | 45678 to 01239 | 45678 to 01239 | 45678 to 01239 | 23679 to 01458 | 24589 to 01367
50, 50, 200         | 23457 to 01689 | 24589 to 01367 | 45679 to 01238 | 45679 to 01238 | 02589 to 13467 | 23578 to 01469
100, 100, 400       | 45678 to 01239 | 23679 to 01458 | 23679 to 01458 | 23457 to 01689 | 23457 to 01689 | 23457 to 01689
100, 100, 100, 400  | 13789 to 02456 | 24589 to 01367 | 45679 to 01238 | 24589 to 01367 | 23456 to 01789 | 23456 to 01789

Worst Transition:

Hidden Layer Sizes  | 1000 (4.44%)   | 2000 (8.88%)   | 4000 (17.75%)  | 8000 (35.51%)  | 16000 (71.02%) | 22530 (100.0%)
50, 200             | 01239 to 45678 | 01367 to 24589 | 35679 to 01248 | 35679 to 01248 | 35679 to 01248 | 35679 to 01248
50, 50, 200         | 01458 to 23679 | 35679 to 01248 | 35679 to 01248 | 01367 to 24589 | 35679 to 01248 | 35679 to 01248
100, 100, 400       | 12458 to 03679 | 06789 to 12345 | 06789 to 12345 | 35679 to 01248 | 35679 to 01248 | 35679 to 01248
100, 100, 100, 400  | 12348 to 05679 | 13579 to 02468 | 35679 to 01248 | 35679 to 01248 | 35679 to 01248 | 35679 to 01248

Table 4: The Best (top) and Worst (bottom) Transitions for a given Neural Net Topology, at varying quantities of target domain training data (Target Set Size, with the portion of the source set size in brackets). For compactness, the transition has been written as: <digits in source> to <digits in target>.

6.7 The Consequences of Adding a Target Dataset Pretraining Step to the Reinitialization Process

Figure 12: Block diagram showing the training process of a deep neural net using the DNN LLR algorithm with an extra step of pretraining on the target dataset added: greedily layer-wise train on the source unlabelled training dataset, reinitialize the bottom layer, greedily layer-wise train on the target unlabelled training dataset, append the output layer, backpropagation train on the target labelled training dataset, and evaluate on the target labelled test dataset.

Figure 13: The improvement of DNN LLR without (top) and with (bottom) Target Domain Pretraining. As in figure 9, Improvement = (Control Error Rate - DNN LLR Error Rate) / Control Error Rate, averaged over all domain transitions, shown for all four topologies considered. A single standard deviation is shown around the means.

As discussed in section 4.4, investigations were carried out on the effect of adding a pretraining step that applies the DBN training method to the target as well as the source training data, as shown in figure 12. The mean improvement of the algorithms is shown in figure 13. While applying the extra target pretraining initially helps for very low quantities of total target training data, it is not enough of a performance increase to allow it to exceed the Linear Classifier Control (not shown). For larger quantities of target training data, adding the target pretraining step causes the model to do worse than the control. This indicates that there is conflict between the features which would be learnt from the target dataset and those from the source dataset, and that a different approach to the target domain problem is used by networks trained with DNN LLR. These results support the expectation that the greedy layer-wise method is too destructive of the structures that are desired for transfer. Backpropagation is better suited, due to its capacity to adjust the whole network, particularly the reinitialized bottom layer (together with the just-appended output layer), to re-purpose those structures. Thus DNN LLR does not include the target pretraining step in the final algorithm.
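The ordering of steps being compared in this section can be summarised in code. The sketch below is only an outline of figure 12: every callable passed in is a placeholder for the greedy layer-wise (DBN) and backpropagation routines of appendix B, not an actual implementation.

```python
def train_dnn_llr(source_unlabelled, target_unlabelled, target_labelled,
                  greedy_layerwise_train, reinitialize_bottom_layer,
                  append_output_layer, backprop_train,
                  with_target_pretraining=True):
    """Outline of the DNN LLR training order, with the optional extra
    target-domain pretraining step examined in section 6.7 (figure 12).
    All callables are placeholders for the routines described in appendix B.
    """
    net = greedy_layerwise_train(None, source_unlabelled)   # DBN pretraining on the source domain
    net = reinitialize_bottom_layer(net)                     # forget low-level, source-specific detail
    if with_target_pretraining:                              # the extra step evaluated in this section
        net = greedy_layerwise_train(net, target_unlabelled)
    net = append_output_layer(net)                           # softmax output layer
    net = backprop_train(net, target_labelled)               # supervised fine-tuning on the target domain
    return net
```

The final DNN LLR algorithm corresponds to with_target_pretraining=False.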

7 CONCLUSION

7.1 Further Work

7.1.1 Verification of Results

These results are promising and have had some validation performed on them. They are the results of considering a large number of different domain transitions (see section B.5), across the 4 different topologies and 13 differing quantities of training data. However, other neural network parameters (see section B.2.2) have not had extensive variational testing applied to them. Also, while many different transitions were considered, each was only evaluated once for each set of parameters. Additional repetitions would increase confidence in the individual results, rather than the overarching trends.

7.1.2 Defining Similarity and Designing a Heuristic for Predicting How Well DNN LLR will Perform

As was discussed in section 6.6, some domain transitions consistently function better than others. This is because the feature detectors transferred are more cross-applicable. Deep analysis of which transitions are successful, and exactly why, has not been carried out. If this could be roughly determined without going to the full extent of training DNN LLR and a Control, then it would enhance the utility of the algorithm significantly: it would allow the source dataset to be selected optimally. Some research on defining similarities has been conducted for other learning models, and has resulted in superior, more transfer-aware models[44]. A similar line of research could be pursued for DNN LLR, potentially even making use of the already collected results.

7.1.3 Performance Loss in Wider Networks

As discussed in section 6.4, performance is worse in wider networks. Due to the heavily increased computational time in training wider networks, very wide networks have not been evaluated. The current cutting-edge DNNs for MNIST have hidden layers sized around [500,500,2000] 7. Further investigation of the performance of DNN LLR at similar sizes is necessary. Adaptations such as Convolutional Deep Neural Nets[45], which have been designed for better scaling to wider networks, may be required to see continued performance gain.

7. Comparing performance: Best DNN LLR, hidden layers [100,100,100,400], source: 02589, target: 13467, source training cases: 22530, target training cases: 22530, target validation cases: 5000, error rate 1.79%; vs Wake-Sleep DBN[32] with [500,500,2000], target: 0123456789, training cases: 54216[30], error rate 1.25%.

7.1.4 Mixing Source and Target Datasets

In section 6.7 and section E, an additional step of DBN pretraining on the target dataset, carried out after the source dataset pretraining was completed, was investigated.

It may be expected that a better result would be seen by mixing the target and source datasets, followed by a single run of mixed pretraining. This expectation is supported by Hinton's recommendations on how to train RBMs (and thus DBNs) using minibatches[42]: when the domain contains multiple different output classes, it is best to have them dispersed throughout the training set[42]. This could result in less destruction of knowledge, in feature detectors based on the features of all the data, and thus in a better result. However, this is difficult to test empirically, due to the limited capacity to reuse the trained networks by adding more training data; most of the techniques allowing for faster experimentation discussed in section B.3 are not applicable. As such, the additional computational time required for such investigations puts it beyond the scope of this project. It remains an interesting area for future research.

7.2 Applications

The algorithm has only been demonstrated here on digit recognition; however, it is expected to work in other deep learning application areas. It would be beneficial in any area with limited training data for the task at hand, but where related training data is available.

7.2.1 Applications in Natural Language Processing

A lot of interesting work has recently been done on natural language processing using neural networks, such as [46], where shallow neural networks were trained to parse English and Chinese. A deep neural network could be used instead, gaining the benefits ascribed to deep architectures[3]. Further, DNN LLR could be employed to use the Chinese language dataset to improve learning on the English language task, or vice versa. It has been shown that there are structural similarities between English and Chinese that can be accessed through machine learning techniques[47]. Thus the transfer techniques of DNN LLR could be expected to produce superior results.

7.3 Closing Remarks

A method has been presented to allow for cross-domain knowledge transfer in deep neural architectures. DNN Low Level Reinitialization allows the transfer of high-level strategic knowledge, while limiting the transfer of non-applicable knowledge of the source domain. It provides significant enhancement of generalization over conventional deep neural networks. It is shown here to improve recognition of MNIST sub-domains; however, it has applications in many areas. The experiments performed show it works better in deeper neural networks and worse in wider networks; its scalability to very wide networks is thus questionable. It provides significant improvement in the narrower and faster-training networks. In suitable cases, as discussed, DNN LLR is a viable method to improve performance through the use of supplementary data.

In cases where the applicability is less certain, it may be worth training DNN LLR alongside a traditional model, and performing evaluation to find the preferred one. With the right choice of source domain, it is believed that any target domain will receive a performance boost from using this knowledge transfer technique. The DNN LLR algorithm demonstrates that knowledge can be transferred between deep neural networks, and provides one method to do so.

25 REFERENCES [1] D. A. Braun, C. Mehring, and D. M. Wolpert, Structure learning in action, Behavioural brain research, vol. 206, no. 2, pp. 157 165, 2010. [2] T. Serre, G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich, and T. Poggio, A quantitative theory of immediate visual recognition, Progress in brain research, vol. 165, pp. 33 56, 2007. [3] Y. Bengio, Learning deep architectures for AI. Now Publishers Inc., 2009, vol. 2, no. 1, ch. 2 Theoretical Advantages of Deep Architectures, pp. 13 18. [4] Y. Bengio, A. Courville, and P. Vincent, Representation learning: A review and new perspectives, 2013. [Online]. Available: http://arxiv.org/pdf/1206.5538 [5] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent, The difficulty of training deep architectures and the effect of unsupervised pre-training, in International Conference on Artificial Intelligence and Statistics, 2009, pp. 153 160. [6] Y. Bengio, Learning deep architectures for AI. Now Publishers Inc., 2009, vol. 2, no. 1, ch. 1 Introduction, pp. 2 12. [7] Flickr, Creative commons, Website, October 2014. [Online]. Available: https://www.flickr.com/creativecommons/ [8] P. Stanford Vision Lab, Stanford University, Imagenet, Website, October 2014. [Online]. Available: http://www. image-net.org/ [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248 255. [10] L. M. G. López, O. R. C. Jordán, D. Penney, and T. Chandler, The role of transfer in games teaching: Implications for the development of the sports curriculum, European Physical Education Review, vol. 15, no. 1, pp. 47 63, 2009. [Online]. Available: http://epe.sagepub.com/content/15/1/47.full.pdf [11] Y. Bengio, Learning deep architectures for AI. Now Publishers Inc., 2009, vol. 2, no. 1, ch. 9 Looking Forward, pp. 68 72. [12] P. Wu and T. G. Dietterich, Improving svm accuracy by training on auxiliary data sources, in Proceedings of the twenty-first international conference on Machine learning. ACM, 2004, p. 110. [13] N. S. Altman, An introduction to kernel and nearest-neighbor nonparametric regression, The American Statistician, vol. 46, no. 3, pp. 175 185, 1992. [Online]. Available: http://www.jstor.org/stable/pdfplus/2685209.pdf [14] C. Cortes and V. Vapnik, Support-vector networks, Machine learning, vol. 20, no. 3, pp. 273 297, 1995, what this paper calls a Support Vector Network, has now come to be commonly refered to as a Support Vector Machine. [15] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell, Zero-shot learning with semantic output codes, in Advances in neural information processing systems, 2009, pp. 1410 1418. [16] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng, Zero-shot learning through cross-modal transfer, in Advances in Neural Information Processing Systems, 2013, pp. 935 943. [17] B. M. Lake, R. Salakhutdinov, J. Gross, and J. B. Tenenbaum, One shot learning of simple visual concepts, in Proceedings of the 33rd Annual Conference of the Cognitive Science Society, 2011, pp. 2568 2573. [18] P. Tokarczyk, J. Wegner, S. Walk, and K. Schindler, Beyond hand-crafted features in remote sensing, ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 1, no. 1, pp. 35 40, 2013. [19] X. Glorot, A. Bordes, and Y. 
Bengio, Domain adaptation for large-scale sentiment classification: A deep learning approach, in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 513 520. [20] M. Chen, Z. Xu, K. Weinberger, and F. Sha, Marginalized denoising autoencoders for domain adaptation, arxiv preprint arxiv:1206.4683, 2012. [21] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in Proceedings of the 25th international conference on Machine learning. ACM, 2008, pp. 1096 1103. [22] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, Multimodal deep learning, in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 689 696. [Online]. Available: http: //machinelearning.wustl.edu/mlpapers/paper_files/icml2011ngiam_399.pdf [23] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, DARPA TIMIT acoustic phonetic continuous speech corpus CDROM, 1993. [Online]. Available: http://www.ldc.upenn.edu/catalog/ldc93s1.html [24] Y. LeCun and Y. Bengio, Convolutional networks for images, speech, and time series, The handbook of brain theory and neural networks, vol. 3361, 1995. [Online]. Available: http://www.iro.umontreal.ca/labs/neuro/pointeurs/handbook-convo.pdf [25] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, vol. 11, pp. 625 660, 2010. [26] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of control, signals and systems, vol. 2, no. 4, pp. 303 314, 1989.

[27] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural networks, vol. 4, no. 2, pp. 251 257, 1991. [28] J. S. Bridle, Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters, in Advances in Neural Information Processing Systems 2, D. Touretzky, Ed. Morgan-Kaufmann, 1990, pp. 211 217. [Online]. Available: http://papers.nips.cc/paper/ 195-training-stochastic-model-recognition-algorithms-as-networks-can-lead-to-maximum-mutual-information-estimation-of-paramet pdf [29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors, Nature, vol. 323, no. 6088, pp. 533 536, 1986. [30] C. J. B. Yann LeCun, Corinna Cortes, The mnist database of handwritten digits, November 1998. [Online]. Available: http://yann.lecun.com/exdb/mnist/ [31] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, Greedy layer-wise training of deep networks, Advances in neural information processing systems, vol. 19, p. 153, 2007. [Online]. Available: http://papers.nips.cc/paper/ 3048-greedy-layer-wise-training-of-deep-networks.pdf [32] G. E. Hinton, S. Osindero, and Y. Teh, A fast learning algorithm for deep belief nets, Neural computation, vol. 18, no. 7, pp. 1527 1554, 2006. [33] G. Hinton, Recent developments in deep learning, Lecture, The University of British Columbia, 2009. [Online]. Available: http://www.youtube.com/watch?v=vshmxxqtdds [34] Y. Freund and D. Haussler, Unsupervised learning of distributions of binary vectors using 2-layer networks, in Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, Eds. Morgan-Kaufmann, 1992, pp. 912 919. [Online]. Available: http://papers.nips.cc/paper/ 535-unsupervised-learning-of-distributions-of-binary-vectors-using-2-layer-networks.pdf [35] G. E. Hinton, Training products of experts by minimizing contrastive divergence, Neural computation, vol. 14, no. 8, pp. 1771 1800, 2002. [36] J. Melchior, Learning natural image statistics with gaussian-binary restricted boltzmann machines, Master s thesis, University of Bochum, Germany, 2012. [Online]. Available: http://www.ini.rub.de/data/documents/tns/masterthesis_ janmelchior.pdf [37] Y. Bengio, Learning deep architectures for AI. Now Publishers Inc., 2009, vol. 2, no. 1, ch. 5 Energy-Based Models and Boltzmann Machines, pp. 48 68. [38], Learning deep architectures for AI. Now Publishers Inc., 2009, vol. 2, no. 1, ch. 6 Greedy Layer-Wise Training of Deep Architecture, pp. 68 72. [39] Y. LeCun and M. Ranzato, Deep learning tutorial, ICML, 2013. [Online]. Available: http://www.cs.nyu.edu/~yann/ talks/lecun-ranzato-icml2013.pdf [40] D. P. F. B. J. Hinton, G. E. and R. Neal, The wake-sleep algorithm for unsupervised neural networks. Science, vol. 268, pp. 1158 1161, 1995. [41] L. Prechelt, Early stopping-but when? in Neural Networks: Tricks of the trade. Springer, 1998, pp. 55 69. [42] G. Hinton, A practical guide to training restricted boltzmann machines, Momentum, vol. 9, no. 1, 2010. [Online]. Available: http://www.cs.toronto.edu/~hinton/absps/guidetr.pdf [43] G. E. Hinton and D. Van Camp, Keeping the neural networks simple by minimizing the description length of the weights, in Proceedings of the sixth annual conference on Computational learning theory. ACM, 1993, pp. 5 13. [44] M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Dietterich, To transfer or not to transfer, in NIPS 2005 Workshop on Transfer Learning, vol. 898, 2005. [45] H. Lee, R. 
Grosse, R. Ranganath, and A. Y. Ng, Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 609 616. [46] D. Chen and C. D. Manning, A fast and accurate dependency parser using neural networks. [47] W. Y. Zou, R. Socher, D. M. Cer, and C. D. Manning, Bilingual word embeddings for phrase-based machine translation. in EMNLP, 2013, pp. 1393 1398. [48] Y. Bengio, Learning deep architectures for AI. Now Publishers Inc., 2009, vol. 2, no. 1, ch. 4 Neural Networks for Deep Architectures, pp. 30 48. [49] M. Chen, K. Weinberger, F. Sha, and Y. Bengio, Marginalized denoising auto-encoders for nonlinear representations, in Proceedings of The 31st International Conference on Machine Learning, 2014, pp. 1476 1484. [Online]. Available: http://jmlr.org/proceedings/papers/v32/cheng14.pdf 26

[50] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, Efficient backprop, in Neural networks: Tricks of the trade. Springer, 2012, pp. 9 48. [51] G. Van Rossum, Python 2.7 documentation, 2010. [52] T. E. Oliphant, Python for scientific computing, Computing in Science & Engineering, vol. 9, no. 3, pp. 10 20, 2007. [Online]. Available: http://scitation.aip.org/content/aip/journal/cise/9/3/10.1109/mcse.2007.58 [53] E. Jones, T. Oliphant, P. Peterson et al., SciPy: Open source scientific tools for Python, 2001. [Online]. Available: http://www.scipy.org/ [54] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. Seljebotn, and K. Smith, Cython: The best of both worlds, Computing in Science Engineering, vol. 13, no. 2, pp. 31 39, 2011. [55] F. Perez and B. E. Granger, Ipython: A system for interactive scientific computing, Computing in Science & Engineering, vol. 9, no. 3, pp. 21 29, 2007. [Online]. Available: http://scitation.aip.org/content/aip/journal/cise/9/3/10.1109/mcse. 2007.53 [56] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research, vol. 12, pp. 2825 2830, 2011. [57] R. Bilina and S. Lawford, Python for unified research in econometrics and statistics, Econometric Reviews, vol. 31, no. 5, pp. 558 591, 2012. [58] W. McKinney, Data structures for statistical computing in python, in Proceedings of the 9th Python in Science Conference, S. van der Walt and J. Millman, Eds., 2010, pp. 51 56. [59] J. D. Hunter, Matplotlib: A 2d graphics environment, Computing In Science & Engineering, vol. 9, no. 3, pp. 90 95, 2007. [60] M. Waskom, Seaborn 0.4.0, Software, 2012. [Online]. Available: http://web.stanford.edu/~mwaskom/software/seaborn/ [61] T. Tantau, The TikZ and PGF Packages. [Online]. Available: http://sourceforge.net/projects/pgf/ [62] M. Ettrich et al., The lyx document processor, 1995. [63] H. Hagen, Luatex: Howling to the moon, Communications of the Tex Users Group Tugboat, p. 152, 2005. 27

APPENDIX A: NOMENCLATURE

A.1 Abbreviations

All abbreviations and terms have been defined inline, above; however, the following is provided as a quick reference for the reader.

DBN: Deep Belief Network. A stacked generative model normally based on RBMs[32].
DNN: Deep Neural Network. A neural network containing two or more hidden layers[48].
DNN LLR: Deep Neural Network Low Level Reinitialization. The knowledge transfer algorithm which is the subject of this paper.
kNN: k-Nearest Neighbors. A machine learning model[13].
SDA: Stacked Denoising Autoencoder. A stacked generative model, based on denoising autoencoders[21].
mSDA: Marginalizing Stacked Denoising Autoencoder. A stacked generative model, based on marginalizing denoising autoencoders[49].
SVM: Support Vector Machine. A machine learning model[14].
RBM: Restricted Boltzmann Machine. An unsupervised learning algorithm that learns to regenerate its input[35].

A.2 Terms

Dataset: A collection of data for training or evaluating a learning model. It may be labelled or unlabelled.
Domain: A general class of problems requiring a single skill set.
Linear Classifier: A neural network without any hidden layers. It is capable only of solving a particular linearly separable class of problems.
MNIST: A machine learning dataset containing greyscale images of handwritten digits[30].
Neural Network: A class of machine learners based on the architecture of the human brain[29].
Target Domain/Dataset: The domain of the problem which the learning model is being trained to solve, and the training dataset containing information about it.
Source Domain/Dataset: A domain of problems, and a dataset for training a model to learn about it, which is not the same as the target domain/dataset but from which it is desirable to transfer knowledge.

APPENDIX B: DETAILED EXPERIMENTAL SETUP

B.1 Source and Target Domain Datasets

DNN LLR is an algorithm for transferring knowledge learnt from one domain to another. It can be used to improve performance when there is additional data from another domain available. For the empirical evaluation of the algorithm, partitions of MNIST[30] were used as the domains. The target domain task was to recognize 5 particular handwritten digits. The target domain training data was the portion of the MNIST training set corresponding to those digits. A variety of quantities of that training data was made available incrementally to allow for evaluation at the different sizes (see section B.3). The source domain problem used was the classification of the remaining 5 digits. To be precise, since the source domain is used to train a DBN, the source domain task is to be able to regenerate images of those 5 digits. The source domain training data was thus chosen to be the 22530 elements of the training dataset corresponding to the 5 digits not in the target domain 8.

8. To be precise: as the quantity of MNIST training data for each digit is different, for some source domains the dataset size differs by up to 3 training cases either side of 22530, depending on the exact digits in the source domain.

Across all experiments, the quantity of source data was kept constant. This matches the real-world use case of trying to take advantage of an existing dataset: as this dataset already exists, its size is fixed. Conversely, the quantity of target domain data was varied, to provide insight into how the algorithm's performance depends on it. In a real-world application, more target data can be collected if necessary to put the algorithm into its ideal condition.

B.1.1 Target Training Set Quantities

The following quantities of target training data were used for evaluation (the value in brackets is the target training quantity as a portion of the source training quantity): 50 (0.22%), 100 (0.44%), 150 (0.67%), 200 (0.89%), 250 (1.11%), 300 (1.33%), 500 (2.22%), 1000 (4.44%), 2000 (8.88%), 4000 (17.75%), 8000 (35.51%), 16000 (71.02%), 22530 (100.0%).

Up to 300 elements, the quantities were linearly spaced, with a net trained and evaluated every 50 cases; the reasoning behind this linear spacing was to investigate behaviour with very small amounts of data. From 500 to 16000 target training cases, the quantity was doubled at each step, to cover the full range of results. Finally, the full target training set of 22530 elements was used.
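As an illustration of this partitioning, the sketch below splits an already-loaded MNIST-style training set into source and target domains. The function name, and the assumption that the images and labels are NumPy arrays, are for the example only and do not describe the project's code.

```python
import numpy as np

def split_by_digits(images, labels, target_digits):
    """Split an MNIST-style training set into target and source domains.

    `target_digits` is a set such as {1, 3, 4, 6, 7}; the remaining five
    digits form the source domain. The source images are used unlabelled for
    DBN pretraining, while the target portion keeps its labels for fine-tuning.
    """
    target_digits = set(target_digits)
    source_digits = set(range(10)) - target_digits

    target_mask = np.isin(labels, list(target_digits))
    source_mask = np.isin(labels, list(source_digits))

    source_images = images[source_mask]                        # unlabelled source domain data
    target_images, target_labels = images[target_mask], labels[target_mask]
    return source_images, (target_images, target_labels)
```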

B.2 Experimental Parameters

Throughout all experiments, consistent algorithm parameters and techniques were used.

B.2.1 Dataset Preparation

The input training data was standardized, input-feature-wise 9, to zero mean and unit variance. When training a neural net with backpropagation this improves learning by avoiding saturation of the neuron activation functions[50]. An alternative would be to initialize the bias and weight values to achieve the same effect, as can be done for RBMs[42]. However, a much simpler implementation of the Gaussian-Bernoulli RBM is possible with pre-standardised inputs[36][37]. The scaling and shift factors of the transformation used to standardize the target domain training data are stored and applied to the validation and test data (see section B.4.1).

9. Input feature-wise, i.e. pixel-wise: for MNIST the input features are pixels. Higher-level features are learnt within the DNN.

B.2.2 Greedy Layer-wise Training Parameters

The Deep Belief Networks used were trained as described above. As the dataset was standardized (see section B.2.1), the bottom-layer RBM needed to be a Gaussian-Bernoulli RBM[31] in order to accept the Gaussian-distributed input features. The remaining layers were Bernoulli-Bernoulli RBMs[37]. The weights and biases were initialized using random values from a Gaussian distribution with mean 0 and standard deviation 0.01. This, together with standardizing the inputs, helps to avoid the neuron activations saturating, which would decrease the rate of learning[42].

A learning rate of 0.001 was used throughout the greedy layer-wise training. This is very small, which improves the stability of the Gaussian-Bernoulli bottom layer[42]; preliminary experiments with higher learning rates showed severe numerical instability. It is possible to use a higher learning rate for the Bernoulli-Bernoulli layers, but that was not done in these experiments; it is expected that doing so would result in similar improvements in both the control and the DNN LLR experiments. An L2 weight decay cost coefficient of 0.0002 was used. This is slightly larger than the 0.0001 suggested as a starting point in [42]; it was found to give an improvement in learning during preliminary investigations.

The Contrastive Divergence[35] training algorithm was used to train all layers, with a single step of sampling (CD-1), as described above. In the bottom Gaussian-Bernoulli layer, a mean-field method was used: the reconstructed input is found as v' = E[v | h] rather than by sampling v' ~ P(v | h). Preliminary investigative experiments found that this improved learning for the GB-RBM, and also improved stability, because it removes the spread in the reconstructed values.
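To make the above concrete, the following is a sketch of a single CD-1 update for the Gaussian-Bernoulli bottom layer, using the quoted learning rate, weight decay and mean-field visible reconstruction. It illustrates the update rule only; it is not the project's actual implementation, and the function and variable names are chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gbrbm_step(v, W, b_vis, b_hid, lr=0.001, weight_decay=0.0002,
                   rng=np.random):
    """One CD-1 update of a Gaussian-Bernoulli RBM on standardized inputs.

    v is a (batch, n_visible) minibatch; W is (n_visible, n_hidden).
    The visible reconstruction uses the mean-field value E[v | h] rather
    than a sample, as described above.
    """
    # Positive phase: hidden probabilities and a binary hidden sample.
    h_prob = sigmoid(v @ W + b_hid)
    h_sample = (rng.random_sample(h_prob.shape) < h_prob).astype(v.dtype)

    # Negative phase: mean-field visible reconstruction, then hidden probabilities.
    v_recon = h_sample @ W.T + b_vis        # E[v | h] for unit-variance Gaussian visibles
    h_recon = sigmoid(v_recon @ W + b_hid)

    n = v.shape[0]
    grad_W = (v.T @ h_prob - v_recon.T @ h_recon) / n
    W += lr * (grad_W - weight_decay * W)   # L2 weight decay applied to the weights only
    b_vis += lr * (v - v_recon).mean(axis=0)
    b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_vis, b_hid
```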

B.2.3 Backpropagation Parameters

The softmax output layer in all experiments had 10 neurons, based on the original labels. Though the final target dataset only had 5 distinct labels, the sparse output representation containing 5 extra values which were always zero does not affect the training; even the smallest training set is enough to bias those neurons to be always off. The weights and biases were normally initialized using the values from a DBN, as discussed above. In the case of the Linear Classifier Control (see section 5.2.2), and of the reinitialized bottom layer in DNN LLR, the weights and biases were again initialized from a Gaussian distribution with mean 0 and standard deviation 0.01. During backpropagation, a learning rate of 0.01 was used. This aligns with the order-of-magnitude increase over the greedy layer-wise pretraining learning rate suggested in [31]. As in the DBN pretraining, an L2 weight decay cost coefficient of 0.0002 was used.

B.3 Incremental Training and Evaluation

The models were trained and evaluated incrementally on more target training data. This allowed more data to be collected, and the change in particular networks to be viewed as more target training data was made available. The training increments were standardized such that, at the time of evaluation and at the time of DBN training completing, the learning algorithm had been presented with a fully standardized set of inputs; that is to say, the inputs appeared approximately Gaussian distributed with zero mean and unit variance. Derivation of the method for doing this can be found in section C.

B.3.1 Neural Network Reuse in Experimental Cases

In the DNN LLR experimental cases, as the DBN stage is trained only on the fixed-size source dataset, the addition of more target training data did not require repeating the DBN training stage. Further, the deep neural net could be reused: since training was done in minibatches, an additional increment of target training data is simply further minibatches to process. It should be kept in mind that the early stopping rollback (see section B.3.3) must be reverted prior to the next increment being trained, as otherwise the full amount of training data is not used.

B.3.2 Reuse of DBN in Control Case

The DNN cannot be reused in the control case, as more training data is now available for use in the DBN pretraining. To allow reuse of the DBN, the DBN trained on the last increment is further trained on the new training data, and is then used to initialize a fresh feed-forward deep neural network. This network is then trained, without early stopping, on the target training data from earlier training increments (but see below: early stopping is applied in post-processing), before being trained, as in the experimental case, on the latest training data increment using early stopping.
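Pulling the backpropagation settings of section B.2.3 together with the softmax error derivative noted later in section B.4.3, a single supervised update of the 10-neuron output layer might look like the sketch below. It illustrates the output layer only; the real fine-tuning also backpropagates through, and updates, every lower layer.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)             # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def output_layer_update(W, b, hidden, one_hot_labels, lr=0.01, weight_decay=0.0002):
    """One gradient step on a softmax output layer with a cross-entropy cost.

    `hidden` is the (batch, n_hidden) activation of the top hidden layer and
    `one_hot_labels` is (batch, 10). With this cost the error derivative at
    the output is simply (output - label).
    """
    outputs = softmax(hidden @ W + b)
    delta = (outputs - one_hot_labels) / hidden.shape[0]
    W -= lr * (hidden.T @ delta + weight_decay * W)   # L2 weight decay on the weights
    b -= lr * delta.sum(axis=0)
    return W, b
```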

B.3.3 Early Stopping

For each increment of new training data, early stopping was used[41] to help prevent over-fitting. A validation dataset containing 5000 instances of the target dataset was used. The viability of such a large validation dataset when the training dataset is very small is questionable in this context; however, this same advantage was extended to both the Control and the Experimental cases. Each element in the new training set increment was passed through the network once, rather than the alternative of cycling through the set until an early-stopping criterion was met. After each mini-batch was processed, the error rate on the validation data was checked. At the end of the training increment, the neural network was rolled back to its best state, prior to evaluation.

As early stopping was only used on the last increment of training data, this denies the opportunity to roll back to states found in earlier increments. However, this effect can be achieved in post-processing: rolling back to the best neural network state found in a previous increment amounts to using the score of the previous training increment whenever it performs better in validation than that found in this increment. As the validation scores were recorded during the experiments, this score substitution could be implemented, and it was done so in the evaluations 10.

10. As an interesting aside: with this post-processing applied to early stopping, monotonic improvement of results with more training data did not hold for individual neural nets on a particular dataset, but monotonicity was almost entirely maintained in the averages across all datasets. This is as expected of the early-stopping algorithm, as the validation error approximates the test error.

B.4 Evaluation

B.4.1 Test Dataset

The MNIST dataset has a separate test set[30]; the digits in the test set were written by different authors to those used in the training set. In all experiments performed this separation has been maintained. The test set for the experiments is the appropriate target domain half of the full test set. Depending on the exact digits in the target domain, this has close to 5000 test cases. The same test set is used throughout the experiments, no matter the size of the target dataset. Using a larger test set does not benefit the learner; it does, however, allow better evaluation of how well the classifier would perform in the real world.

The test dataset was not standardized with itself, or with the training sets. It was, however, transformed using the shift and scaling factors that were found to be needed to standardize the training set. This is a viable method for transforming incoming input data when training real-world classifiers, as it is easy to store the final transformation that was applied to the training data and reuse it. Doing this ensures the evaluation input features are similar in value to the training input features.
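The early-stopping scheme of section B.3.3 for a single training increment can be sketched as follows. The `train_on_batch` and `validation_error` callables stand in for the backpropagation step and the validation evaluation; they are placeholders, not the project's actual code.

```python
import copy

def train_increment_with_early_stopping(net, minibatches, train_on_batch, validation_error):
    """Train on one increment of new target data, one pass over its minibatches,
    checking the validation error after every minibatch and remembering the best
    state seen so it can be rolled back to before evaluation.
    """
    best_net, best_err = copy.deepcopy(net), validation_error(net)

    for batch in minibatches:                 # each new example is seen exactly once
        train_on_batch(net, batch)
        err = validation_error(net)
        if err < best_err:
            best_net, best_err = copy.deepcopy(net), err

    # net: carried forward into the next training increment (rollback reverted);
    # best_net / best_err: the rolled-back state used for evaluation.
    return net, best_net, best_err
```

As noted in section B.3.1, it is the un-rolled-back network that is carried forward into the next increment; the rolled-back state is used only for evaluation.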

B.4.2 Training Error Rate

The training error rate was also recorded. As the training set was fairly large (up to 22530 elements), the training error rate was found by evaluating a sample of the last training increment. For each increment of additional target training data of size T, a sample containing 5000T / (4T + 20000) target training cases was evaluated 11. Over the range of training increments considered, this was a bit greater than 10%.

11. This seemingly unintuitive sampling formula is the result of a convenient programming idiom to ensure the sample contains an equal number of each class (i.e. digit), while also ensuring a suitably large (but not too large) absolute quantity of data was used for evaluation.

B.4.3 Error Rate Function

The error rate for evaluation and early-stopping validation was determined with a winner-takes-all approach. In winner-takes-all, an error occurs when the output does not assign the highest probability to the correct output class. This approach was chosen for its real-world applicability: an output is either correct or incorrect; there is no middle ground of being almost correct. This contrasts with a sum-of-squared-errors or similar method, where fractional errors could occur if the output was fractionally correct. As is usual for a softmax output layer, a cross-entropy based error measure was used during backpropagation[28], such that the error derivative remained the usual difference between the labels and the actual outputs.
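The winner-takes-all error rate described above is straightforward to compute from the softmax outputs. A minimal sketch follows; the array shapes and function name are assumptions for the example.

```python
import numpy as np

def winner_takes_all_error_rate(output_probabilities, labels):
    """Fraction of examples where the highest-probability class is not the
    true class. `output_probabilities` is (n_examples, n_classes) and
    `labels` is (n_examples,) of integer class indices.
    """
    predictions = np.argmax(output_probabilities, axis=1)
    return float(np.mean(predictions != labels))
```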

B.5 MNIST Subdivisions Used

44 subdivisions of MNIST were used, to assess and ensure that the method worked predictably across various domains. They consisted of 6 chosen as interesting, and their reverses:

[0, 1, 2, 3, 4] [5, 6, 7, 8, 9]
[0, 1, 2, 4, 8] [3, 5, 6, 7, 9]
[0, 2, 3, 5, 7] [1, 4, 6, 8, 9]
[0, 2, 4, 6, 8] [1, 3, 5, 7, 9]
[1, 2, 3, 4, 5] [0, 6, 7, 8, 9]
[0, 3, 6, 8, 9] [1, 2, 4, 5, 7]

and 16 chosen randomly, as well as their reverses:

[0, 2, 5, 8, 9] [1, 3, 4, 6, 7]
[0, 1, 2, 3, 8] [4, 5, 6, 7, 9]
[2, 4, 5, 8, 9] [0, 1, 3, 6, 7]
[0, 1, 4, 6, 9] [2, 3, 5, 7, 8]
[0, 1, 5, 6, 7] [2, 3, 4, 8, 9]
[4, 5, 6, 7, 8] [0, 1, 2, 3, 9]
[2, 3, 4, 5, 6] [0, 1, 7, 8, 9]
[1, 2, 3, 4, 8] [0, 5, 6, 7, 9]
[1, 4, 5, 6, 9] [0, 2, 3, 7, 8]
[1, 2, 4, 5, 8] [0, 3, 6, 7, 9]
[0, 1, 6, 8, 9] [2, 3, 4, 5, 7]
[2, 3, 6, 7, 9] [0, 1, 4, 5, 8]
[1, 3, 7, 8, 9] [0, 2, 4, 5, 6]
[0, 1, 3, 5, 7] [2, 4, 6, 8, 9]
[1, 2, 6, 7, 9] [0, 3, 4, 5, 8]
[0, 2, 3, 6, 7] [1, 4, 5, 8, 9]

B.5.1 Non-convergent Experiments

Of those 44 experiments, 5 failed to converge during the back-propagation stage, for both the DNN Control and the DNN LLR experiments, hitting floating point under/overflow. Those results have been discarded from the evaluations. They do not affect the evaluation of the technique, as the instability occurs both in the control and in the experiment, resulting in neither getting accuracy above the 20% guess rate (i.e. a 0.8 error rate). The failed transitions were:

[2, 3, 4, 8, 9] [0, 1, 5, 6, 7]
[2, 4, 6, 8, 9] [0, 1, 3, 5, 7]
[5, 6, 7, 8, 9] [0, 1, 2, 3, 4]
[0, 1, 7, 8, 9] [2, 3, 4, 5, 6]
[0, 3, 6, 7, 9] [1, 2, 4, 5, 8]

It is, however, very interesting to note that the same 5 transitions failed in all 4 topologies, and for both the control and the experiment. This indicates that the target domain training set at its smallest size, after its preprocessing, causes this error.

APPENDIX C: THE TRANSFORMATION OF SUBDATASETS SUCH THAT THE WHOLE DATASET IS STANDARDIZED

C.1 Motivation

A machine learning algorithm learns from a dataset. This dataset has two components: the test dataset, which emulates real-world data that might be input to the network and is used for evaluating the performance of the network at the end; and the total training dataset. The total training dataset is divided into two portions: the training set and the validation set. The learning algorithm is performed using training data from the training set. Periodically (e.g. once every minibatch), it is evaluated against the validation set in order to check if it is overfitting (see section B.3.3). The validation set is a practice test set, used to see how well the machine learning algorithm is generalizing.

Many machine learning algorithms, including backpropagation[50], perform better if the training set has been feature-wise standardized, so that each feature has zero mean and unit standard deviation. Some algorithms, such as non-variance-learning GB-RBMs, require this. For this reason, the training set is standardized. After standardizing the training set, the parameters of the transformation are stored, and are then used to transform the validation and test datasets. Note that this does not mean the test and validation sets have zero mean and unit variance; transforming them just keeps them similar in magnitude to the training data.

To evaluate how well the algorithms perform with varying quantities of target domain data, they need to be trained on training sets of various sizes. These training sets still need to be standardized, to ensure the training algorithms perform optimally and thus keep the results relevant. Ideally, to minimize training time, the net would first be trained on the first training subset, then evaluated against the test set; then it would be further trained on the next training subset, and again evaluated on the test set; and so on. However, over the course of this training, before it is evaluated it should have been presented with a training set which is, overall, standardized. If this isn't possible then it may mean training from scratch for each training data size.

C.2 Derivation of a Method

C.2.1 Introduction and Definitions

Consider a dataset $X = A \cup B$. Let $X'$, $A'$, and $B'$ be standardized versions of $X$, $A$, and $B$ respectively. Scaling and shifting must be done to $A$ and $B$ such that, for $X' = A' \cup B'$,

$$\mu_{X'} = 0, \qquad \sigma_{X'} = 1.$$

Note that $|A| = |A'|$, $|B| = |B'|$ and $|X| = |X'|$, as standardization is just a transform of the data. Standardization maps each element as

$$x' = \frac{x - \mu_X}{\sigma_X},$$

where

$$\mu_X = \frac{1}{|X|}\sum_{x \in X} x, \qquad \sigma^2_X = \frac{1}{|X|}\sum_{x \in X}(x - \mu_X)^2.$$

C.2.2 RTP: Scaling A and B to have Zero Mean causes their Union to have Zero Mean

$$\sum_{x \in A'} x + \sum_{x \in B'} x = \sum_{x \in X'} x$$
$$|A'|\mu_{A'} + |B'|\mu_{B'} = |X'|\mu_{X'}$$
$$\frac{|A'|}{|X'|}\mu_{A'} + \frac{|B'|}{|X'|}\mu_{B'} = \mu_{X'}$$

With $\mu_{A'} = \mu_{B'} = 0$:

$$\mu_{X'} = 0.$$

C.2.3 RTP: Scaling A and B to have Zero Mean and Unit Variance causes their Union to have Unit Variance

$$\sigma^2_{A'} = \frac{1}{|A'|}\sum_{x \in A'}(x - \mu_{A'})^2, \qquad \sigma^2_{B'} = \frac{1}{|B'|}\sum_{x \in B'}(x - \mu_{B'})^2$$
$$|A'|\sigma^2_{A'} + |B'|\sigma^2_{B'} = \sum_{x \in A'}(x - \mu_{A'})^2 + \sum_{x \in B'}(x - \mu_{B'})^2$$

As $\mu_{X'} = \mu_{A'} = \mu_{B'} = 0$:

$$|A'|\sigma^2_{A'} + |B'|\sigma^2_{B'} = \sum_{x \in A'}x^2 + \sum_{x \in B'}x^2 = \sum_{x \in X'}x^2$$

Since $|X'|\sigma^2_{X'} = \sum_{x \in X'}x^2$ and $|B'| = |X'| - |A'|$:

$$|A'|\sigma^2_{A'} + (|X'| - |A'|)\sigma^2_{B'} = |X'|\sigma^2_{X'}$$

Which, with $\sigma^2_{A'} = \sigma^2_{B'} = 1$:

$$|A'|(1) + (|X'| - |A'|)(1) = |X'|\sigma^2_{X'}$$
$$|X'| = |X'|\sigma^2_{X'}$$
$$\sigma^2_{X'} = 1.$$

C.3 Conclusion

The union of two datasets which are each standardized to have zero mean and unit variance does itself have zero mean and unit variance. Ergo, the method for transforming subdatasets such that their union is standardized to zero mean and unit variance is simply to standardize each subdataset.
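This result is easy to verify numerically. The sketch below standardizes two differently-sized, differently-distributed sub-datasets independently and confirms that their union has (population) zero mean and unit variance; the data is synthetic, generated purely for the check.

```python
import numpy as np

def standardize(data):
    """Feature-wise standardization to zero mean and unit (population) variance."""
    return (data - data.mean(axis=0)) / data.std(axis=0)

rng = np.random.RandomState(0)
A = standardize(rng.normal(5.0, 3.0, size=(1000, 4)))   # sub-dataset A, standardized alone
B = standardize(rng.normal(-2.0, 0.5, size=(250, 4)))   # sub-dataset B, standardized alone

X = np.vstack([A, B])                                   # the union of the two sub-datasets
print(np.allclose(X.mean(axis=0), 0.0))                 # True: zero mean
print(np.allclose(X.std(axis=0), 1.0))                  # True: unit variance
```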

APPENDIX D: ON THE PERFORMANCE OF THE LINEAR CLASSIFIER CONTROL

It can be observed that the Linear Classifier Control has particularly good comparative performance for small quantities of training data. This may seem counter-intuitive, as the Linear Classifier can only classify linearly separable inputs, and thus can never correctly recognize all of the test cases. However, it trains very quickly: it is a simple model, and it fits to its training data almost instantly. It does also seem to over-fit to its training data, as seen in section 6.5. Fitting quickly to the training data does result in good early performance.

The mean Training and Test Error Rates of the algorithms are shown in figure 14. It can be seen that the Linear Classifier immediately fits well to its training dataset and is able to classify it correctly. In fact, as the training dataset grows, the Linear Classifier's Training Error Rate increases. This is because as more elements are added to the training set (which is evaluated to get the training error), fewer of them are linearly separable in the input space. Conversely, for DNN LLR and the Control DNN, the high training error rate is indicative of the network having not yet converged to a solution. One way to encourage convergence would be to use a higher learning rate for smaller training set sizes. More advanced approaches along these lines form the basis of adaptive learning rate algorithms[50].

Figure 14: Mean Training Data Error Rate (top) and Test Data Error Rate (bottom) of the algorithms across all target domains and all topologies. Note that this plot is clipped at 4000 target training cases to highlight the areas where the linear classifier outperforms the more powerful models.

APPENDIX E: DBN REUSE - IS IT NECESSARY TO REINITIALIZE THE BOTTOM LAYER?

E.1 Method

Figure 15: Block diagram showing the training process for DBN Reuse (top: greedily layer-wise train on the source unlabelled training dataset, append a softmax output layer, backpropagation train on the target labelled training dataset, and evaluate on the target labelled test dataset) and DBN Reuse with Target Pretraining (bottom: as above, but with an additional greedy layer-wise training step on the target unlabelled training dataset before the output layer is appended).

Reinitializing the bottom layer is all about untying the higher-level structures from the low-level details: the neural network is forced to forget the superficial differences between the source and target domains, and to remember only the abstract features. The results of experiments confirm the necessity of resetting the bottom layer. To confirm this, a method based on directly reusing the DBN without resetting any layers was investigated 12, both with and without the step of pretraining on target data. The training procedure is shown in figure 15. Notice that DBN Reuse is identical to DNN LLR with the resetting of the bottom layer removed. Notice also that DBN Reuse with Target Pretraining is the same as the Control, but with an earlier step of pretraining the DBN on the source dataset; it is also the same as DNN LLR with Target Pretraining with the resetting of the bottom layer removed. Both ultimately prove inferior to the Control and to DNN LLR. Evaluation was carried out as described in appendix B.

12. In fact, every permutation of resetting layers was investigated, from reset [0,0,0,0], which is DBN Reuse, to reset [1,0,0,0], which is DNN LLR, to reset [1,1,1,1], which is to reset everything and train a deep network with backpropagation on target data only. All were found to be inferior to DNN LLR.
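The permutations described in footnote 12 can be expressed as a reset mask over the pretrained layers. The sketch below is illustrative only: it reuses the Gaussian(0, 0.01) initialization quoted in appendix B, and the list-of-arrays representation of the network is an assumption made for the example.

```python
import numpy as np

def apply_reset_mask(layer_weights, layer_biases, reset_mask, std=0.01, rng=np.random):
    """Reinitialize the layers flagged in `reset_mask`, keeping the rest.

    reset_mask [1, 0, 0, 0] is DNN LLR (reset the bottom layer only),
    [0, 0, 0, 0] is DBN Reuse, and [1, 1, 1, 1] discards the pretraining
    entirely, as described in footnote 12.
    """
    for i, reset in enumerate(reset_mask):
        if reset:
            layer_weights[i] = rng.normal(0.0, std, size=layer_weights[i].shape)
            layer_biases[i] = rng.normal(0.0, std, size=layer_biases[i].shape)
    return layer_weights, layer_biases
```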

E.2 Results

A plot of the results is shown in figure 16. DBN Reuse is outperformed by the Control, and thus even more so by DNN LLR. Adding the step of pretraining with the target dataset causes performance to increase, to the extent that for small quantities of target domain data it outperforms the Control. It does not, however, outperform the Linear Classifier Control. The function of DBN Reuse in this form is to get whatever structure it can into the deep network to allow it to function. This is, however, not enough to exceed the performance of a simpler model (details on why the Linear Classifier Control performs well can be found in section D).

Figure 16: Performance of DBN Reuse (i.e. directly reusing the DBN), mean across all domain transitions shown. DNN LLR and the control cases are shown for contrast.

E.3 Conclusion

It can be seen from the results that it is indeed necessary to reinitialize the bottom layer: DNN LLR is a superior method to simply reusing the DBN. However, these results would need to be confirmed for the case of a mixed target and source dataset used in pretraining (as discussed in section 7.1.4).

APPENDIX F: MSDA REINITIALIZATION

F.1 Introduction

F.1.1 mSDA

In section 3.2.3 it is discussed that a Deep Neural Network can be initialized using a Deep Belief Network. Other stacked autoencoder models can be used instead, such as Stacked Denoising Autoencoders (SDAs)[25]. A denoising autoencoder is a shallow (single hidden layer) neural network, trained by inputting a noisy image and asking it to reconstruct the original image[21]. The noise used is normally uniformly distributed deletion noise, where each pixel has a fixed chance of being turned off, i.e. set to zero[21]. These are stacked up in a similar way to the stacking of RBMs for DBN initialization.

A Marginalizing Stacked Denoising Autoencoder (mSDA) is very similar to an SDA (and a DBN). It is a stack of marginalizing denoising autoencoders, with outputs passed through a non-linearity, such as the sigmoid function. A marginalizing denoising autoencoder (mDA) is a particular approximation of a conventional denoising autoencoder with a linear activation function[20]. This approximation allows the ideal weights to be solved for using least squares regression, which can be done with simple linear algebra[49]. This makes training an mDA incredibly fast on modern vectorizing CPUs: around 1/5th of the time taken to train a comparable denoising autoencoder[49]. The mDAs then have the non-linearity applied, such as a sigmoid function, and are stacked up using the same layer-wise approach seen in DBNs and SDAs[49].

F.1.2 mSDA for Initializing a DNN

Little work has been done up until now on using the weights from an mSDA to initialize a DNN; however, in principle it should be almost no different to doing so with an SDA. There is an additional restriction that all layers in an mSDA must have the same width as the input layer. This restriction makes the backpropagation stage much slower.

F.1.3 Experiments were Terminated Early

These experiments were terminated after running for over 1 month, as the results were not sufficiently promising. The results presented here are from these incomplete experiments. While they are incomplete, they highlight interesting differences between DBN-initialized and mSDA-initialized DNNs.

F.1.4 Experimental Procedure

Numerous experiments were carried out, resetting various combinations of layers. Throughout the experiments an mSDA noise probability of 0.5 and an L2 weight decay of 0.00001 were used. In other respects the evaluation was carried out very similarly to that detailed in appendix B.
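For reference, one layer of an mSDA can be trained in closed form. The sketch below follows the least-squares mDA solution described in [20] and [49] as best understood here; the ridge term and the tanh non-linearity are choices made for this example (the experiments above may have used a sigmoid instead), and the function names are illustrative.

```python
import numpy as np

def train_mda(X, p, reg=1e-5):
    """Closed-form training of one marginalizing denoising autoencoder (mDA).

    X is (n_examples, n_features) and p is the per-feature deletion probability.
    Returns W such that the layer output is nonlinearity(X_bias @ W.T), where
    X_bias is X with an appended constant bias feature.
    """
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])          # append a constant bias feature
    S = Xb.T @ Xb                                 # scatter matrix, (d+1, d+1)

    q = np.full(d + 1, 1.0 - p)                   # survival probability of each feature
    q[-1] = 1.0                                   # the bias feature is never corrupted

    Q = S * np.outer(q, q)                        # expected corrupted-input scatter
    np.fill_diagonal(Q, q * np.diag(S))           # diagonal needs only one survival factor
    P = S[:d, :] * q                              # expected clean/corrupted correlation

    # W solves W Q = P in the least-squares sense, with a small ridge term.
    W = np.linalg.solve(Q + reg * np.eye(d + 1), P.T).T
    return W                                      # shape (d, d+1)

def mda_layer_output(X, W):
    """Apply one trained mDA layer; an mSDA stacks these by feeding the output forward."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.tanh(Xb @ W.T)
```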

Figure 17: The mean performance of a number of different topologies of mSDA-initialized DNN. The asterisk-marked layers are reinitialized. (Left) Reinitializing the first layer, and thus similar to DNN LLR. (Right) Reinitializing the high-level layers, and thus the opposite of DNN LLR.

F.1.5 Results

The results shown in figure 17 can be contrasted with those from DBN-based DNNs shown in figure 18. Quite the opposite of the DBN-based DNN, it is seen that resetting the top layer gives the best gain. It can also be noticed that the deeper the mSDA-based DNN, the more harm is caused by reinitializing the bottom layer.

45 Figure 18: For contrast to table 17, the corresponding layer resets sare shown here for DBNs. (left) the DNN LLR. Right: DBN based DNNs with the higher layers reinitialized. F.1.6 msda DNN performs worse than DBN DNN Considering only the control data, we see that the DBN based DNN outperforms the msda DNN. It can be seen that the deeper msda does worse than the shallower ones. This however can be partially attributed to lack of training data to properly train such a deep model. Another issue may have been the noise probability used, being constant no matter the number of layers. Some of my preliminary experiments suggested that it should be scaled with the number of layers. In any case the msda however does perform very well at its designed task, the removal of noise, as can be seen in 19. The input layer is noisy, the output of he first layer clearer, the output

The input layer is noisy, the output of the first layer is clearer, the output of the second layer is almost ideal, the output of the third is a bit abstract, and the fourth layer's output is almost meaningless. By this point it has removed all the noise that is not common to all numbers. This would explain why resetting the higher layers results in a performance jump: if the msda pretraining causes the higher layers to destroy most image features, then resetting these layers will be beneficial, rather than the performance jump being attributable to the transfer of knowledge from another domain. Further investigation into this would be beneficial.

Figure 19: Output from a msda removing noise. Top is the input; each successive row is the output of a successively deeper layer of the msda. The oval shaped blur in the background is the result of the standardization of the inputs.

F.1.7 Recommendations for the study of knowledge storage in msdas, SDAs and DBNs

These results are all about considering the transfer of knowledge, rather than explicitly about discovering where knowledge is stored within the network. They hint that DBNs and msdas store important knowledge in different layers of the network.

This is not surprising given that the denoising autoencoder and the restricted Boltzmann machine are very different algorithms, albeit with the same purpose (to recreate their input). A better study into where knowledge is stored in the various stacked generative models would perform stacked generative pretraining on a single domain, followed by layer reinitialization, followed by backpropagation. This would allow knowledge to be better isolated, and would have benefits for the design of transfer techniques like DNN LLR.

APPENDIX G CROSS RE-REPRESENTATION BASED TECHNIQUES

G.1 Introduction

While this paper has focused on techniques based on the direct transfer of feature detectors from models trained on the source into those for final use on the target, many other techniques exist. One such technique involves using a DBN or similar structure to create a representation, or encoding, of its input data. This encoding is a non-linearly derived set of features. These features can then be used to classify the input using a linear classifier. The feature detectors used to create the encoding can be devised based on training data from a different domain than the final evaluation. This is what was done in the work of [19] and [20] discussed in section 2.4.

In [19] and [20] it was possible to reuse the linear classifier as well as the feature encoder between source and target. In the cases being considered here the output domain is different. Thus a new linear classifier must be trained, learning from a source-feature-based encoding of the target data. This is reasonable to do since linear classifiers can be trained very quickly, requiring much less training data than deeper networks (as found in section E). In the aforementioned works, the final classifier was a support vector machine (SVM). In the research presented in this appendix a neural network based linear classifier was used, very similar to the Linear Classifier Control discussed in section 5.2.2. Training a neural network without any hidden layers, taking as input an encoding based on the top layer of a DBN, is almost identical to using a DBN to initialize a DNN (see section 3.2.3). It is strictly identical to appending the top layer and training with backpropagation, while locking the weights that came from the DBN. This method is used when training a DNN with very few labelled training cases [39].

G.2 Experimental Setup

G.2.1 Experimental Parameters

The parameters for the linear classifier (neural net) and the DBN were as described in Appendix B. The msda L2 weight decay coefficient was 0.00001. The noise probability for the msda was evaluated across the values of 0.5, 0.2, 0.05 and 0.0.

G.2.2 Cross Re-representation Models

The Cross Re-representation models were trained as follows (a minimal sketch of this pipeline is given after the list):
1) The Generative Model (msda/DBN) was trained on the unlabelled source dataset.
2) The target training data inputs were passed through the Generative Model to get representations.
3) A Linear Classifier was trained to map from the representations to their classifications.
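As a concrete illustration of these three steps, the following minimal sketch (Python/NumPy with scikit-learn) assumes a msda is used as the generative model. The helper msda_transform mirrors the msda sketch in Appendix F, and LogisticRegression stands in for the no-hidden-layer neural network classifier described in G.1; these names and the column-major data layout are illustrative assumptions, not the implementation actually used.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def msda_transform(weights, X):
        # Step 2: pass inputs (one column per example) through an already trained
        # stack of mdas (weights as returned by the msda sketch in Appendix F).
        H = X
        for W in weights:
            Hb = np.vstack([H, np.ones((1, H.shape[1]))])
            H = 1.0 / (1.0 + np.exp(-(W @ Hb)))
        return H

    def cross_representation_classifier(source_trained_weights, target_X, target_y):
        # Step 1 happens elsewhere: the weights are fitted on the unlabelled
        # *source* inputs (for the Control Re-representation of G.2.3, they are
        # fitted on the unlabelled target inputs instead).
        reps = msda_transform(source_trained_weights, target_X)
        # Step 3: a linear (no hidden layer) classifier on the re-represented data.
        clf = LogisticRegression(max_iter=1000)
        clf.fit(reps.T, target_y)        # scikit-learn expects one row per example
        return clf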

G.2.3 Control Re-representation Models

The Control Re-representation models were trained as follows:
1) The Generative Model (msda/DBN) was trained on the unlabelled target dataset.
2) The target training data inputs were passed through the Generative Model to get representations.
3) A Linear Classifier was trained to map from the representations to their classifications.

G.3 DBN Re-representation

Results on using a DBN to get a representation that could then be fed to a linear classifier were extremely poor. It may be that the networks were simply not wide enough to capture a sufficient quantity of features for the linear classifier to use. However, networks of this width were used in the lower layers of the DNN Control, and in the DBNs for re-representation. The best results found from a knowledge transfer perspective are shown in Figure 20. As can be seen, the methods only gain over the control for very small quantities of target domain retraining data. Even in those cases they do not perform well enough for the mean to exceed the mean of the Linear Classifier Control. This method does not seem viable for domain transfer, or even for use in normal classification.

Figure 20: The mean performance of the single hidden layer re-representation technique. (left) Absolute mean performance across all transitions. Also shown for comparison is a DNN Control case, with hidden layers sized [100,100,100,400]. (right) Improvement over the Re-representational Control (a single standard deviation is shown).

G.4 msda Re-representation

Results on using msdas for re-representation were also not promising as a knowledge transfer technique. Consistently exceeding the control happened in very few cases. One of the few functional cases was a single hidden layer msda base network (so not truly a deep network by any description; in fact, technically it was just a mda), with noise probability 0.00 or 0.05.

It might be expected that with zero noise probability and a hidden layer the same size as the input layer, the msda would just learn the identity function. This is not the case: if it had, then the Control Re-representation performance would have been the same as that of the Linear Classifier Control, and it was not. As a transfer technique this msda worked well for very small quantities of training data. However, unlike the other models discussed that perform well for small quantities of training data, it outperforms the Linear Classifier Control. The Re-representation Control also outperforms the Linear Classifier Control. Results are shown in Figure 21. The deeper msdas with zero noise performed similarly on an absolute scale, but marginally worse than the single hidden layer case plotted. The deeper msdas did not perform as well for knowledge transfer.
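One plausible reason a zero-noise mda need not act as the identity is that its closed-form solution includes a ridge (weight decay) term and its output is passed through the sigmoid, so the representation handed to the linear classifier is a squashed affine map of the input rather than the input itself. The short check below illustrates this; it is a toy demonstration on random data under the conventions of the earlier sketches, not a result from the experiments.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((10, 500))                       # toy inputs, one column per example
    Xb = np.vstack([X, np.ones((1, 500))])          # with a bias row appended
    lam = 1e-5                                      # small ridge term, as in the mda sketch
    # With p = 0 the closed-form mda reduces to ridge regression of X on [X; 1]
    W = np.linalg.solve(Xb @ Xb.T + lam * np.eye(11), (X @ Xb.T).T).T
    H = 1.0 / (1.0 + np.exp(-(W @ Xb)))             # the representation the classifier sees
    print(np.abs(H - X).max())                      # clearly non-zero: not the identity map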

Figure 21: The mean performance of the single hidden layer re-representation technique. (left) Absolute mean performance across all transitions. Also shown for comparison is a DNN Control case, with hidden layers sized [100,100,100,400]. (right) Improvement over the Re-representational Control (a single standard deviation is shown).

G.5 Conclusion

Representation based techniques are not as promising for the transfer of knowledge with retraining as they are for the direct domain adaptation case discussed in section 2.4. For non-knowledge-transfer purposes, using a single layer msda to get a representation that is fed to a linear classifier is a very promising technique, particularly for small datasets. Further