Deep multi-task learning with evolving weights
ESANN 2016
Soufiane Belharbi, Romain Hérault, Clément Chatelain, Sébastien Adam
soufiane.belharbi@insa-rouen.fr
LITIS lab., DocApp team - INSA de Rouen, France
27 April, 2016
Context
Training deep neural networks

Deep neural networks are interesting models (complex/hierarchical features, complex mappings) that improve performance, but training them is difficult:
- Vanishing gradient
- More parameters
- Need for more data

One solution: the pre-training technique [Y. Bengio et al. 06, G. E. Hinton et al. 06], which can exploit unlabeled data.
Context
Semi-supervised learning

General case:
Data = { labeled data (expensive in money and time, few), unlabeled data (cheap, abundant) }
E.g., medical images.
Semi-supervised learning: exploit the unlabeled data to improve generalization.
Context
Pre-training and semi-supervised learning

The pre-training technique can exploit the unlabeled data. It is a sequential transfer learning performed in 2 steps:
1. Unsupervised task (x, from labeled and unlabeled data)
2. Supervised task ((x, y), from labeled data)
Pre-training technique and semi-supervised learning
Layer-wise pre-training: auto-encoders

[Figure: a deep neural network to train, with inputs x_1..x_5 and outputs ŷ_1, ŷ_2.]
1) Step 1: Unsupervised layer-wise training

Train layer by layer sequentially, using only x (labeled or unlabeled).
[Figures: each hidden layer is trained in turn as an auto-encoder: layer 1 encodes x into h_1 and reconstructs x̂; layer 2 encodes h_1 into h_2 and reconstructs ĥ_1; and so on up the stack.]

At each layer:
- When to stop training?
- What hyper-parameters to use?
- How to make sure that the training improves the supervised task?
2) Step 2: Supervised training

Train the whole network using (x, y), with back-propagation.
[Figure: the pre-trained network, inputs x_1..x_5, outputs ŷ_1, ŷ_2, fine-tuned end to end.]
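To make the two steps concrete, below is a minimal numpy sketch of greedy layer-wise auto-encoder pre-training. The sigmoid/squared-error auto-encoder, plain gradient descent, and all names (`pretrain_layer`, `pretrain_stack`, ...) are illustrative assumptions, not code from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(H, n_hidden, lr=0.1, epochs=50, seed=0):
    """Fit one sigmoid auto-encoder (squared-error reconstruction,
    plain gradient descent) on the representations H; return the encoder."""
    rng = np.random.default_rng(seed)
    n = len(H)
    W = rng.normal(0.0, 0.01, (H.shape[1], n_hidden))  # encoder weights
    V = rng.normal(0.0, 0.01, (n_hidden, H.shape[1]))  # decoder weights
    b, c = np.zeros(n_hidden), np.zeros(H.shape[1])
    for _ in range(epochs):
        h = sigmoid(H @ W + b)                  # encode
        Hr = sigmoid(h @ V + c)                 # reconstruct
        d_out = (Hr - H) * Hr * (1.0 - Hr)      # delta at the reconstruction
        d_hid = (d_out @ V.T) * h * (1.0 - h)   # delta at the hidden layer
        V -= lr * (h.T @ d_out) / n; c -= lr * d_out.mean(axis=0)
        W -= lr * (H.T @ d_hid) / n; b -= lr * d_hid.mean(axis=0)
    return W, b

def pretrain_stack(x, layer_sizes):
    """Step 1: train each layer in turn, on x only (labeled or unlabeled).
    Step 2 (not shown) fine-tunes the whole network on (x, y) by
    back-propagation, starting from these weights."""
    params, H = [], x
    for n_hidden in layer_sizes:
        W, b = pretrain_layer(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)   # feed the encoding to the next layer
    return params
```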
Pre-training technique: Pros and cons

Pros:
- Improves generalization
- Can exploit unlabeled data
- Provides a better initialization than random
- Trains deep networks; circumvents the vanishing gradient problem

Cons:
- Adds more hyper-parameters
- No good stopping criterion during the pre-training phase: a good criterion for the unsupervised task may not be good for the supervised task
Proposed solution

Why is pre-training difficult in practice? It is a sequential transfer learning.
Possible solution: parallel transfer learning.

Why in parallel?
- Interaction between tasks
- Reduces the number of hyper-parameters to tune
- Provides one stopping criterion
Proposed approach
Parallel transfer learning: Tasks combination

Train cost = supervised task + unsupervised (reconstruction) task.
l: labeled samples, u: unlabeled samples, w_sh: shared parameters.

Reconstruction (auto-encoder) task:
J_r(D; w = {w_sh, w_r}) = Σ_{i=1}^{l+u} C_r(R(x_i; w), x_i).

Supervised task:
J_s(D; w = {w_sh, w_s}) = Σ_{i=1}^{l} C_s(M(x_i; w), y_i).

Weighted tasks combination:
J(D; {w_sh, w_s, w_r}) = λ_s J_s(D; {w_sh, w_s}) + λ_r J_r(D; {w_sh, w_r}),
with λ_s, λ_r ∈ [0, 1] importance weights, λ_s + λ_r = 1.
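As a sketch of the weighted combination, assuming a `model` object exposing a reconstruction head `R` and a supervised softmax head `M` that share the hidden layers w_sh (illustrative names, not the paper's code):

```python
import numpy as np

def combined_cost(model, x_all, x_lab, y_lab, lam_s, lam_r):
    """J = lam_s * J_s + lam_r * J_r."""
    # J_r: reconstruction cost over all l + u inputs (squared error here).
    J_r = np.sum((model.R(x_all) - x_all) ** 2)
    # J_s: supervised cost over the l labeled pairs (negative log-likelihood
    # of the softmax output).
    p = model.M(x_lab)
    J_s = -np.sum(np.log(p[np.arange(len(y_lab)), y_lab] + 1e-12))
    return lam_s * J_s + lam_r * J_r
```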
Proposed approach
Tasks combination with evolving weights

Weighted tasks combination:
J(D; {w_sh, w_s, w_r}) = λ_s J_s(D; {w_sh, w_s}) + λ_r J_r(D; {w_sh, w_r}),
with λ_s, λ_r ∈ [0, 1] importance weights, λ_s + λ_r = 1.

Problems:
- How to fix λ_s, λ_r?
- At the end of the training, only J_s should matter.

Tasks combination with evolving weights (our contribution):
J(D; {w_sh, w_s, w_r}) = λ_s(t) J_s(D; {w_sh, w_s}) + λ_r(t) J_r(D; {w_sh, w_r}),
with t the learning epoch and λ_s(t), λ_r(t) ∈ [0, 1] importance weights, λ_s(t) + λ_r(t) = 1.
Proposed approach
Tasks combination with evolving weights

J(D; {w_sh, w_s, w_r}) = λ_s(t) J_s(D; {w_sh, w_s}) + λ_r(t) J_r(D; {w_sh, w_r}).

Exponential schedule:
λ_r(t) = exp(-t/σ), σ: slope; λ_s(t) = 1 - λ_r(t).
[Figure: λ_r(t) decays from 1 toward 0 over the training epochs while λ_s(t) rises symmetrically from 0 toward 1.]
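A one-line sketch of this schedule in Python (σ = 40 is the value used in the experiments below):

```python
import math

def exponential_schedule(t, sigma=40.0):
    """Importance weights at epoch t: lambda_r decays exponentially with
    slope sigma, lambda_s = 1 - lambda_r, so the weights always sum to 1."""
    lam_r = math.exp(-t / sigma)
    return 1.0 - lam_r, lam_r  # (lambda_s, lambda_r)
```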
Proposed approach
Tasks combination with evolving weights: Optimization

Algorithm 1: Training our model for one epoch
1: D is the shuffled training set, B a mini-batch.
2: for B in D do
3:   Make a gradient step toward J_r using B (update {w_sh, w_r})
4:   B_s := the labeled examples of B
5:   Make a gradient step toward J_s using B_s (update {w_sh, w_s})
6: end for

[R. Caruana 97, J. Weston 08, R. Collobert 08, Z. Zhang 15]
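A minimal sketch of one such epoch, assuming hypothetical helpers `model.grad_Jr`, `model.grad_Js` (gradients of the two costs) and `model.step` (an SGD update), plus the `exponential_schedule` above; scaling each step by its λ(t) is equivalent to weighting the costs, since ∇(λJ) = λ∇J:

```python
import random

def train_one_epoch(model, batches, epoch, lr=0.01, sigma=40.0):
    """One epoch of Algorithm 1 with evolving weights."""
    lam_s, lam_r = exponential_schedule(epoch, sigma)
    random.shuffle(batches)                 # D is the shuffled training set
    for x, y, labeled in batches:           # B: one mini-batch
        # Gradient step toward J_r using all of B (updates w_sh, w_r).
        model.step(model.grad_Jr(x), lr * lam_r)
        # Gradient step toward J_s using B_s, the labeled part of B
        # (updates w_sh, w_s).
        if labeled.any():
            model.step(model.grad_Js(x[labeled], y[labeled]), lr * lam_s)
```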
Results
Experimental protocol

Objective: compare training a DNN using different approaches:
- No pre-training (baseline)
- With pre-training (stairs schedule)
- Parallel transfer learning (proposed approach)

Studied evolving-weight schedules (sketched in code below): stairs (pre-training), linear until t_1, linear, and exponential.
[Figure: λ_r and λ_s as functions of the training epoch, for each schedule.]
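The two remaining schedules, in the same hedged sketch style as the exponential one (t_1 = 100 matches the protocol below):

```python
def stairs_schedule(t, t1=100):
    """Stairs: lambda_r = 1 until epoch t1 (classic sequential
    pre-training), then 0 (pure supervised training)."""
    lam_r = 1.0 if t < t1 else 0.0
    return 1.0 - lam_r, lam_r

def linear_schedule(t, t1=100):
    """Linear until t1: lambda_r falls linearly from 1 to 0 over the
    first t1 epochs, then stays at 0."""
    lam_r = max(0.0, 1.0 - t / t1)
    return 1.0 - lam_r, lam_r
```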
Results
Experimental protocol

Task: classification (MNIST).
Number of hidden layers K: 1, 2, 3, 4.
Optimization: 5000 epochs, batch size 600.
Options: no regularization, no adaptive learning rate.
Hyper-parameters of the evolving schedules: t_1 = 100, σ = 40.
Results
Shallow networks (K = 1, l = 1E2)

[Figure: evaluation of the evolving-weight schedules with l = 100 labeled samples, K = 1. MNIST test classification error (%) vs. the size of unlabeled data u (0 to 49900), for baseline, stairs 100, lin 100, lin, and exp 40.]
Results
Shallow networks (K = 1, l = 1E3)

[Figure: evaluation of the evolving-weight schedules with l = 1000 labeled samples, K = 1. MNIST test classification error (%) vs. the size of unlabeled data u (0 to 49900), for baseline, stairs 100, lin 100, lin, and exp 40.]
Results
Deep networks: exponential schedule (l = 1E3)

[Figure: evaluation of the exp 40 evolving-weight schedule with l = 1000 labeled samples. MNIST test classification error (%) vs. the size of unlabeled data u (0 to 49900), for K = 2, 3, 4.]
Conclusion and perspectives
Conclusion

- An alternative method to pre-training: parallel transfer learning with evolving weights.
- Improves generalization easily.
- Reduces the number of hyper-parameters to tune (t_1, σ).
Conclusion and perspectives
Perspectives

- Evolve the importance weights according to the train/validation error.
- Explore other evolving schedules (toward an automatic schedule).
- Adjust the learning rate: Adadelta, Adagrad, RMSProp.
Questions

Thank you for your attention. Questions?