Deep Ensemble Learning ABDELHAK LEMKHENTER 07/03/2017
Presentation Outline
- Ensemble Learning
  - Stacking
  - Boosting
- Simple Deep Ensemble Learning: a heterogeneous stack
- More Advanced Deep Ensemble Learning
  - Multi-Resolution Stacking
  - Deep Incremental Boosting
Ensemble Learning
Ensemble learning is the practice of training multiple estimators and combining them into one robust estimator. For ensemble learning to be effective, the set of learners should be as diverse as possible: this allows each learner to capture a different pattern. Diversity can be obtained by:
- using different hyperparameters with the same base learner;
- subsampling the training data (useful when we have too little data, or too much);
- using different algorithms.
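As a concrete illustration of the subsampling point, here is a minimal sketch of building a diverse ensemble by training each learner on a different bootstrap sample and combining them by majority vote; the dataset, base learner, and ensemble size are illustrative assumptions, not from the slides.
```python
# Minimal sketch: diversity via bootstrap subsampling, combined by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
rng = np.random.default_rng(0)

learners = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))        # bootstrap sample -> diversity
    learners.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

votes = np.stack([m.predict(X) for m in learners])    # one row of votes per learner
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
```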
Different ensemble techniques
Ensemble learning includes various techniques. The most commonly used are:
- Stacking
- Boosting
Stacking
In stacking, we train a meta-learner to combine our base learners. The base learners are different machine learning algorithms. [1]
[Diagram: the input feeds Model 1, Model 2, Model 3, ..., Model n; their outputs are combined by a meta-learner to produce the output.]
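A minimal stacking sketch, assuming scikit-learn is available: heterogeneous base learners combined by a logistic-regression meta-learner. The choice of base models and the dataset are illustrative, not taken from the slides.
```python
# Stacking sketch: different algorithms as base learners, a meta-learner on top.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(),   # the meta-learner
)
stack.fit(X, y)
print(stack.score(X, y))
```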
Boosting
In boosting, we iteratively combine a set of weak learners, all produced by the same machine learning algorithm, into a strong learner. A weak learner only needs to be slightly better than random guessing. [2]
[Figure: gradient boosting]
Adaptive Boosting
At each iteration step:
- we train a weak learner using a sampling distribution D_i;
- we update D_i by giving more weight to mislabeled data points.
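A hand-rolled sketch of this reweighting loop, assuming binary labels in {-1, +1} and decision stumps as weak learners; the number of rounds and the stump depth are arbitrary choices.
```python
# AdaBoost-style loop: train a weak learner on D_i, then up-weight mislabeled points.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=20):
    n = len(X)
    D = np.full(n, 1.0 / n)                       # sampling distribution D_i
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)          # weak learner trained on D_i
        pred = stump.predict(X)
        err = np.sum(D * (pred != y))
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        D *= np.exp(-alpha * y * pred)            # mislabeled points gain weight
        D /= D.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas
```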
Deep Learning and Ensemble Learning
The two fields share some similar guidelines (symmetry breaking ~ increasing diversity). Deep neural networks come in many architectures and have many hyperparameters, which makes them good candidates for creating diverse sets of learners.
Ensemble Deep Learning for Speech Recognition [4]
A simple ensemble model obtained by stacking 3 types of neural networks.
Evaluation of the model
Evaluation on the TIMIT phone recognition task:
- Training set: 462 speakers
- Dev set: 50 speakers
- Test set: 24 speakers
Monaural Speech Separation
The task of separating the speech signal of a target speaker from background noise or an interfering speech signal, using data from a single microphone. We will focus on three approaches:
- a masking method using a DNN;
- a mapping method using a DNN;
- Multi-Resolution Stacking.
Masking based DNN
We try to predict the ideal ratio mask (IRM), where each time-frequency (T-F) unit encodes the ratio of the target signal over the mixed signal.
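As a sketch, one common formulation of such a ratio mask, assuming the magnitude spectrograms of the clean target and of the noise/interference are available during training; the exact definition used in [4] may differ.
```python
# Ratio mask sketch: per T-F unit, target energy over mixture energy.
import numpy as np

def ideal_ratio_mask(target_mag, noise_mag, eps=1e-8):
    """target_mag, noise_mag: magnitude spectrograms of the same shape."""
    return target_mag**2 / (target_mag**2 + noise_mag**2 + eps)
```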
Mapping based DNN
In this approach, we try to learn how to directly map the mixed signal to the target signal.
Multi-resolution stacking
[Diagram: Input -> Preprocessing -> Module 1 -> ... -> Module n -> Postprocessing -> Output]
Preprocessing and post-processing
[Diagram: preprocessing maps the mixed signal to the T-F domain via the STFT; postprocessing applies the estimated ratio mask to obtain the target signal in the T-F domain, reuses the phase of the mixed signal, and recovers the target signal via the inverse STFT.]
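A minimal sketch of this pre/post-processing using scipy's STFT; the sampling rate and the assumption that the estimated mask matches the STFT shape are illustrative.
```python
# Mask the mixture magnitude, reuse the mixture phase, invert back to a waveform.
import numpy as np
from scipy.signal import stft, istft

def reconstruct_target(mixture, estimated_mask, fs=16000):
    f, t, Y = stft(mixture, fs=fs)                       # preprocessing: STFT
    target_tf = estimated_mask * np.abs(Y)               # apply the estimated RM
    target_tf = target_tf * np.exp(1j * np.angle(Y))     # phase of the mixed signal
    _, target = istft(target_tf, fs=fs)                  # postprocessing: inverse STFT
    return target
```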
A learning module
[Diagram: the output of the previous module plus the spectra of the mixed signal are expanded at resolutions R1, R2, ..., Rp; each expansion feeds its own DNN (DNN 1, DNN 2, ..., DNN p), producing ratio masks RM 1, RM 2, ..., RM p. The last module only has one DNN.]
Feature expansion
For a given resolution R and for each frame m, we expand the input with a window of size 2*R+1 centered around frame m. This is done for each RM passed down from the previous module and for the magnitude spectra of the STFT of the mixed signal.
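A small numpy sketch of this expansion, assuming the features are stored as a (frames x features) matrix and that edge frames are padded by repetition.
```python
# Feature expansion: each frame becomes the concatenation of the 2*R+1 frames
# in a window centered on it.
import numpy as np

def expand_features(spectra, R):
    """spectra: (n_frames, n_features) -> (n_frames, (2*R + 1) * n_features)."""
    padded = np.pad(spectra, ((R, R), (0, 0)), mode="edge")
    return np.stack([padded[m:m + 2 * R + 1].ravel()
                     for m in range(spectra.shape[0])])
```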
Model evaluation
Training and test sets are generated using the SSC, TIMIT and IEEE-TIMIT datasets. Three different settings are used:
- same target and interfering speakers, with different SNR levels;
- same target and interfering speakers, with a randomly chosen SNR level;
- same target speaker, but using a different interfering speaker.
Results 1/3
[Figure: results on SSC and TIMIT]
Results 2/3
[Figure: results on SSC and TIMIT]
Results 3/3
Deep Incremental Boosting
Deep ensemble learning requires training many neural networks. DIB is a combination of deep learning, transfer learning and ensemble learning, suggested to tackle this issue.
Application of Transfer Learning
DIB
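A hedged, runnable sketch of how the DIB rounds can be organised: each round reuses the previous network's layers (transfer), appends a fresh layer, trains for a few extra epochs on reweighted data, and adds the result to the ensemble. The dense layers, the SAMME-style weight update, layer sizes and epoch counts are all assumptions here; the original work [5] uses convolutional networks.
```python
# DIB round structure (sketch): grow the network each boosting round and reuse
# the previously trained layers instead of training each member from scratch.
import numpy as np
from tensorflow import keras

def build_net(hidden_layers, input_dim, output_layer):
    inp = keras.Input(shape=(input_dim,))
    x = inp
    for layer in hidden_layers:        # reusing layer objects keeps trained weights
        x = layer(x)
    model = keras.Model(inp, output_layer(x))
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

def deep_incremental_boosting(X, y, n_classes, n_rounds=5, epochs=2):
    n = len(X)
    D = np.full(n, 1.0 / n)                             # boosting distribution
    hidden = [keras.layers.Dense(64, activation="relu")]
    out = keras.layers.Dense(n_classes, activation="softmax")
    ensemble = []
    for _ in range(n_rounds):
        net = build_net(hidden, X.shape[1], out)
        net.fit(X, y, sample_weight=D * n, epochs=epochs, verbose=0)
        pred = net.predict(X, verbose=0).argmax(axis=1)
        err = np.sum(D * (pred != y))
        alpha = np.log((1 - err) / max(err, 1e-10)) + np.log(n_classes - 1)
        D *= np.exp(alpha * (pred != y))                # up-weight mislabeled points
        D /= D.sum()
        ensemble.append((net, alpha))
        hidden.append(keras.layers.Dense(64, activation="relu"))   # grow the net
    return ensemble
```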
Benchmark
[Figures: mislabeling ratio and training time]
DIB for Spoken digit recognition
Data set:
- Training set: 10 digits x 10 utterances x 66 speakers (male and female)
- Test set: 10 digits x 10 utterances x 33 speakers (male and female)
Architecture and results
Architecture:
- Conv2D: 64 filters, 2x2
- MaxPooling: 1x2
- Conv2D: 128 filters, 2x2
- MaxPooling: 1x2
- For 0 to 8: Conv2D, 64 filters, 2x2 (the layers added across the DIB rounds)
- Fully connected layer: 128 units
- Softmax output layer
Results (2 epochs per training round):
- One CNN: 0.9323
- DIB: 0.9718
- Single CNN given equivalent training time (40 epochs): 0.9795
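A hedged Keras sketch of the architecture listed above; the input shape, activation functions, padding of the incrementally added layers, and their exact insertion point are assumptions, not specified in the slides.
```python
# Base CNN for the spoken-digit task, with n_added layers standing in for the
# Conv2D layers grown across the DIB rounds ("for 0 to 8" above).
from tensorflow import keras

def spoken_digit_cnn(input_shape, n_added=0, n_classes=10):
    layers = [
        keras.layers.Input(shape=input_shape),
        keras.layers.Conv2D(64, (2, 2), activation="relu"),
        keras.layers.MaxPooling2D(pool_size=(1, 2)),
        keras.layers.Conv2D(128, (2, 2), activation="relu"),
        keras.layers.MaxPooling2D(pool_size=(1, 2)),
    ]
    layers += [keras.layers.Conv2D(64, (2, 2), activation="relu", padding="same")
               for _ in range(n_added)]                  # layers added by DIB
    layers += [keras.layers.Flatten(),
               keras.layers.Dense(128, activation="relu"),
               keras.layers.Dense(n_classes, activation="softmax")]
    return keras.Sequential(layers)
```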
Thank you for your attention
References
[1] D. H. Wolpert. Stacked Generalization. Neural Networks, 5, 241, 1992.
[2] Y. Freund and R. E. Schapire. A Short Introduction to Boosting. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, 1999, pp. 1401-1406.
[3] L. Deng and J. Platt. Ensemble Deep Learning for Speech Recognition. Interspeech, 2014.
[4] X.-L. Zhang and D. Wang. A Deep Ensemble Learning Method for Monaural Speech Separation. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 24(5), 2016, pp. 967-977.
[5] A. Mosca and G. Magoulas. Deep Incremental Boosting. In C. Benzmuller, G. Sutcliffe, and R. Rojas (eds.), GCAI 2016, 2nd Global Conference on Artificial Intelligence, volume 41 of EPiC Series in Computing, pp. 293-302. EasyChair, 2016.