Deep multi-task learning with evolving weights

1 Deep multi-task learning with evolving weights
ESANN 2016
Soufiane Belharbi, Romain Hérault, Clément Chatelain, Sébastien Adam
LITIS lab., DocApp team - INSA de Rouen, France
27 April, 2016

2 Context: Training deep neural networks
Deep neural networks are interesting models: they learn complex, hierarchical features and complex mappings, which improves performance.
Training deep neural networks is difficult: gradients vanish, and more parameters call for more data.
Some solutions: the pre-training technique [Y.Bengio et al. 06, G.E.Hinton et al. 06]; using unlabeled data.

4 Context: Semi-supervised learning
General case: Data = {labeled data, unlabeled data}. Labeled data is expensive (money, time) and scarce; unlabeled data is cheap and abundant. E.g., medical images.
Semi-supervised learning: exploit unlabeled data to improve generalization.

6 Context: Pre-training and semi-supervised learning
The pre-training technique can exploit the unlabeled data. It is a sequential transfer learning performed in 2 steps:
1. Unsupervised task (x: labeled and unlabeled data)
2. Supervised task ((x, y): labeled data)

7 Pre-training technique and semi-supervised learning
Layer-wise pre-training: auto-encoders
[Figure: a DNN to train, with inputs x_1 to x_5 and outputs ŷ_1, ŷ_2.]

8 Pre-training technique and semi-supervised learning
Layer-wise pre-training: auto-encoders
Step 1: Unsupervised layer-wise training. Train layer by layer, sequentially, using only x (labeled or unlabeled).
[Figure: the layers are trained one after another as auto-encoders. The first reconstructs the inputs x_1 to x_5 as x̂_1 to x̂_5 and yields the hidden units h_1,1 to h_1,5; the second reconstructs these as ĥ_1,1 to ĥ_1,5 and yields h_2,1 to h_2,3; the third reconstructs these as ĥ_2,1 to ĥ_2,3 and yields h_3,1 to h_3,3.]
At each layer: When to stop training? Which hyper-parameters to use? How to make sure that the training improves the supervised task?
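
The sequential structure of step 1 can be sketched as a plain loop. This is only an illustration of the control flow, not the authors' code: `pretrain_layerwise`, the `train_autoencoder` callable and the placeholder `truncating_autoencoder` are hypothetical names, and the placeholder does no learning at all (it simply truncates its input so that the sketch runs without any library).

```python
def pretrain_layerwise(inputs, hidden_sizes, train_autoencoder):
    """Greedy layer-wise pre-training (step 1), control flow only.

    train_autoencoder(data, n_hidden) is a hypothetical callable that fits one
    auto-encoder on `data` and returns its encoder as a function.
    """
    encoders = []
    data = inputs  # only x is used; labels are never needed here
    for n_hidden in hidden_sizes:
        encode = train_autoencoder(data, n_hidden)  # train this layer alone
        encoders.append(encode)
        data = [encode(x) for x in data]  # its codes feed the next layer
    return encoders


def truncating_autoencoder(data, n_hidden):
    """Placeholder trainer: its 'encoder' just keeps the first n_hidden values."""
    return lambda x: x[:n_hidden]


# Hidden sizes 5, 3, 3 mirror the unit counts visible in the slide figure.
encoders = pretrain_layerwise([[1.0, 2.0, 3.0, 4.0, 5.0]], [5, 3, 3],
                              truncating_autoencoder)
```

Each layer is fitted in isolation, with its own stopping point and hyper-parameters, which is where the questions raised on this slide come from.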

15 Pre-training technique and semi-supervised learning
Layer-wise pre-training: auto-encoders
Step 2: Supervised training. Train the whole network using (x, y), with back-propagation.
[Figure: the full network, from inputs x_1 to x_5 to outputs ŷ_1, ŷ_2.]

16 Pre-training technique and semi-supervised learning
Pre-training technique: Pros and cons
Pros: improves generalization; can exploit unlabeled data; provides a better initialization than random; makes it possible to train deep networks; circumvents the vanishing gradient problem.
Cons: adds more hyper-parameters; has no good stopping criterion during the pre-training phase, since a criterion that is good for the unsupervised task may not be good for the supervised task.

18 Pre-training technique and semi-supervised learning
Proposed solution
Why is it difficult in practice? Because the transfer learning is sequential.
Possible solution: parallel transfer learning.
Why in parallel? The tasks interact, there are fewer hyper-parameters to tune, and there is a single stopping criterion.

21 Proposed approach
Parallel transfer learning: Tasks combination
Train cost = supervised task + unsupervised (reconstruction) task.
Notation: l labeled samples, u unlabeled samples, w_sh: shared parameters.
Reconstruction (auto-encoder) task:
\[ J_r(D; w' = \{w_{sh}, w_r\}) = \sum_{i=1}^{l+u} C_r(R(x_i; w'), x_i). \]
Supervised task:
\[ J_s(D; w = \{w_{sh}, w_s\}) = \sum_{i=1}^{l} C_s(M(x_i; w), y_i). \]
Weighted tasks combination:
\[ J(D; \{w_{sh}, w_s, w_r\}) = \lambda_s J_s(D; \{w_{sh}, w_s\}) + \lambda_r J_r(D; \{w_{sh}, w_r\}). \]
λ_s, λ_r ∈ [0, 1]: importance weights, with λ_s + λ_r = 1.
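
As a small worked instance of the weighted combination, the sketch below sums per-sample costs into J_s and J_r and mixes them with λ_s and λ_r = 1 - λ_s. The function name `combined_cost` and the numeric costs are illustrative assumptions; the slides do not specify C_s or C_r.

```python
def combined_cost(supervised_costs, reconstruction_costs, lambda_s):
    """Return J = lambda_s * J_s + lambda_r * J_r, with lambda_r = 1 - lambda_s.

    supervised_costs:     per-sample values C_s(...) for the l labeled samples.
    reconstruction_costs: per-sample values C_r(...) for all l + u samples.
    """
    if not 0.0 <= lambda_s <= 1.0:
        raise ValueError("lambda_s must lie in [0, 1]")
    lambda_r = 1.0 - lambda_s
    j_s = sum(supervised_costs)      # sum over i = 1..l
    j_r = sum(reconstruction_costs)  # sum over i = 1..l+u
    return lambda_s * j_s + lambda_r * j_r


# Two labeled samples, three samples in total (made-up cost values).
example = combined_cost([1.0, 2.0], [0.5, 0.5, 1.0], lambda_s=0.25)
```

With λ_s = 0.25 the reconstruction term carries three quarters of the weight, so the example evaluates 0.25 · 3.0 + 0.75 · 2.0.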

25 Proposed approach
Tasks combination with evolving weights
Weighted tasks combination:
\[ J(D; \{w_{sh}, w_s, w_r\}) = \lambda_s J_s(D; \{w_{sh}, w_s\}) + \lambda_r J_r(D; \{w_{sh}, w_r\}). \]
λ_s, λ_r ∈ [0, 1]: importance weights, with λ_s + λ_r = 1.
Problems: How to fix λ_s and λ_r? At the end of the training, only J_s should matter.
Tasks combination with evolving weights (our contribution):
\[ J(D; \{w_{sh}, w_s, w_r\}) = \lambda_s(t) J_s(D; \{w_{sh}, w_s\}) + \lambda_r(t) J_r(D; \{w_{sh}, w_r\}). \]
t: learning epochs; λ_s(t), λ_r(t) ∈ [0, 1]: importance weights, with λ_s(t) + λ_r(t) = 1.

28 Proposed approach
Tasks combination with evolving weights
\[ J(D; \{w_{sh}, w_s, w_r\}) = \lambda_s(t) J_s(D; \{w_{sh}, w_s\}) + \lambda_r(t) J_r(D; \{w_{sh}, w_r\}). \]
Exponential schedule:
\[ \lambda_r(t) = \exp(-t/\sigma), \qquad \lambda_s(t) = 1 - \lambda_r(t), \]
where σ is the slope.
[Figure: importance weights λ_r(t) and λ_s(t), between 0 and 1, against training epochs t from the start of training.]
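
A minimal sketch of the exponential schedule, assuming the decaying form λ_r(t) = exp(-t/σ), which is the reading consistent with λ_r(t) ∈ [0, 1] and with the supervised task taking over at the end of training. The function name `exponential_weights` is an illustrative choice.

```python
import math


def exponential_weights(t, sigma):
    """Return (lambda_s, lambda_r) at epoch t for slope sigma > 0."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    lambda_r = math.exp(-t / sigma)  # reconstruction weight decays from 1 to 0
    lambda_s = 1.0 - lambda_r        # supervised weight takes over
    return lambda_s, lambda_r


# Inspect the weights at a few epochs for sigma = 40.
for epoch in (0, 40, 100, 5000):
    lam_s, lam_r = exponential_weights(epoch, 40.0)
    print(epoch, round(lam_s, 4), round(lam_r, 4))
```

At t = 0 the reconstruction task has all the weight; after a few multiples of σ the cost is almost purely supervised.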

29 Proposed approach
Tasks combination with evolving weights: Optimization
Algorithm 1: Training our model for one epoch
  D is the shuffled training set. B is a mini-batch.
  for B in D do
    Make a gradient step toward J_r using B (update w')
    B_s ← labeled examples of B
    Make a gradient step toward J_s using B_s (update w)
  end for
[R.Caruana 97, J.Weston 08, R.Collobert 08, Z.Zhang 15]
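
The loop of Algorithm 1 can be made concrete on a toy model. Everything below is an assumption made for illustration: a scalar model with hidden code h = w_sh · x, reconstruction w_r · h and prediction w_s · h, squared-error costs, and each gradient step scaled by its importance weight (the slide does not show where λ_s(t) and λ_r(t) enter the update). Unlabeled samples are marked with y = None.

```python
import random


def train_one_epoch(params, data, lambda_s, lambda_r,
                    lr=0.01, batch_size=4, seed=0):
    """One epoch of alternating reconstruction / supervised gradient steps.

    params: dict of scalar weights "w_sh" (shared), "w_r" and "w_s".
    data:   list of (x, y) pairs; y is None for unlabeled samples.
    """
    samples = list(data)
    random.Random(seed).shuffle(samples)  # D is the shuffled training set
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]  # B, a mini-batch

        # Gradient step toward J_r using every x in B (updates w_sh, w_r).
        g_sh = g_r = 0.0
        for x, _ in batch:
            err = params["w_r"] * params["w_sh"] * x - x
            g_r += 2.0 * err * params["w_sh"] * x
            g_sh += 2.0 * err * params["w_r"] * x
        params["w_r"] -= lr * lambda_r * g_r / len(batch)
        params["w_sh"] -= lr * lambda_r * g_sh / len(batch)

        # B_s: the labeled examples of B.
        labeled = [(x, y) for x, y in batch if y is not None]
        if not labeled:
            continue

        # Gradient step toward J_s using B_s (updates w_sh, w_s).
        g_sh = g_s = 0.0
        for x, y in labeled:
            err = params["w_s"] * params["w_sh"] * x - y
            g_s += 2.0 * err * params["w_sh"] * x
            g_sh += 2.0 * err * params["w_s"] * x
        params["w_s"] -= lr * lambda_s * g_s / len(labeled)
        params["w_sh"] -= lr * lambda_s * g_sh / len(labeled)
    return params


toy_data = [(1.0, 2.0), (2.0, None), (3.0, 6.0), (4.0, None)]
```

Calling `train_one_epoch` once per epoch with the current λ_s(t), λ_r(t) gives the parallel scheme: both tasks touch the shared weight w_sh inside the same epoch, instead of one after the other as in pre-training.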

30 Results: Experimental protocol
Objective: compare training a DNN using different approaches: no pre-training (baseline); with pre-training (stairs schedule); parallel transfer learning (proposed approach).
Studied evolving weight schedules: stairs (pre-training), linear until t_1, linear, exponential.
[Figure: one panel per schedule showing the importance weights λ_r and λ_s against training epochs t, from the start of training, with t_1 marked where applicable.]
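
The slide names the stairs and linear schedules without giving formulas, so the sketch below is an assumed reading of those names: stairs keeps λ_r = 1 up to t_1 and then drops to 0 (which mimics pre-training followed by supervised training), linear-until-t_1 decreases λ_r linearly to 0 at t_1, and linear decreases it over the whole run. The function names and the `total_epochs` parameter are hypothetical; in every case λ_s(t) = 1 - λ_r(t).

```python
def stairs_lambda_r(t, t1):
    """Assumed stairs schedule: pure reconstruction until t1, then pure supervised."""
    return 1.0 if t <= t1 else 0.0


def linear_until_lambda_r(t, t1):
    """Assumed 'linear until t1' schedule: linear decay reaching 0 at t1."""
    return max(0.0, 1.0 - t / t1)


def linear_lambda_r(t, total_epochs):
    """Assumed 'linear' schedule: linear decay over the whole training run."""
    return max(0.0, 1.0 - t / total_epochs)


def supervised_weight(lambda_r):
    """lambda_s follows from the constraint lambda_s + lambda_r = 1."""
    return 1.0 - lambda_r


# Compare the three schedules at a few epochs, with t1 = 100 and 5000 epochs.
for epoch in (0, 50, 100, 2500):
    print(epoch,
          stairs_lambda_r(epoch, 100),
          linear_until_lambda_r(epoch, 100),
          linear_lambda_r(epoch, 5000))
```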

31 Results: Experimental protocol
Task: classification (MNIST).
Number of hidden layers K: 1, 2, 3, 4.
Optimization: 5000 epochs, batch size 600. Options: no regularization, no adaptive learning rate.
Hyper-parameters of the evolving schedules: t_1 = 100, σ = 40.

32 Results: Shallow networks (K = 1, l = 1E2)
[Figure: Evaluation of the evolving weight schedules (size of labeled data l = 100), K = 1. Y-axis: classification error on the MNIST test set (%). X-axis: size of unlabeled data (u). Curves: baseline, stairs 100, lin 100, lin, exp.]

33 Results: Shallow networks (K = 1, l = 1E3)
[Figure: Evaluation of the evolving weight schedules (size of labeled data l = 1000), K = 1. Y-axis: classification error on the MNIST test set (%). X-axis: size of unlabeled data (u). Curves: baseline, stairs 100, lin 100, lin, exp.]

34 Results: Deep networks, exponential schedule (l = 1E3)
[Figure: Evaluation of the exp 40 evolving weight schedule (size of labeled data l = 1000). Y-axis: classification error on the MNIST test set (%). X-axis: size of unlabeled data (u). Legend: K = 2, K = 3, K = [truncated].]

35 Conclusion and perspectives
Conclusion: an alternative method to pre-training, namely parallel transfer learning with evolving weights. It improves generalization easily and reduces the number of hyper-parameters (t_1, σ).

36 Conclusion and perspectives
Perspectives: evolve the importance weights according to the train/validation error; explore other evolving schedules (toward an automatic schedule); adjust the learning rate with Adadelta, Adagrad or RMSProp.

37 Questions
Thank you for your attention. Questions?
