Towards Speaker Adaptive Training of Deep Neural Network Acoustic Models
Yajie Miao, Hao Zhang, Florian Metze
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Outline
1. Motivation
2. Speaker-normalized feature space with i-vectors
3. AdaptNN: bottom adaptation layers; ivecnn: linear feature shift; speaker adaptive training of DNNs
4. Experiments
5. Summary & Future Work
Motivation
- Deep neural networks have become the state of the art for acoustic modeling
- For GMM models, speaker adaptive training (SAT) has been a standard technique for improving WERs
- Various methods [1,2,3,4,5] have been proposed to perform speaker adaptation for DNNs. However, how to do SAT for DNNs remains unclear
- In this work, we aim to achieve complete speaker adaptive training for DNN acoustic models
SAT for HMM/GMM
[Figure: iterative loop of model update in the fMLLR feature space: fMLLR matrix, HMM/GMM model update, fMLLR re-estimation]
- Start with an initial GMM model and estimate fMLLR affine transforms
- Update the model parameters with fMLLR applied, then re-estimate the fMLLR transforms
- Repeat until convergence
We want to do a similar thing for DNNs!
Basic Idea for SAT-DNN
[Figure: original features feed the initial DNN; an i-vector-driven function produces new features]
- Start with an initial DNN model, i.e., the regular DNN we train for hybrid systems
- Learn a function that takes advantage of i-vectors and projects the DNN inputs into a speaker-normalized feature space
- Update the DNN model in the new feature space
Bottom Adaptation Layers (AdaptNN)
[Figure: AdaptNN inserted below the initial DNN]
- Insert a smaller adaptation network between the initial DNN and the inputs
- i-vectors are appended to the outputs of each hidden layer
- By using i-vectors, this network (AdaptNN) transforms the original DNN inputs into a speaker-normalized space
- The output layer of AdaptNN has the same dimension as the original input features
- The output layer adopts the linear activation function, while the other layers use sigmoid
- The parameters of AdaptNN can be estimated by standard error back-propagation while keeping the initial DNN fixed
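The AdaptNN forward pass described above can be sketched as follows. This is a minimal NumPy sketch under our own assumptions about layer sizes (the paper's actual implementation lives in the authors' Kaldi+PDNN code); the only structural constraints taken from the slides are that the i-vector is appended at every layer, hidden layers are sigmoid, and the linear output layer matches the original feature dimension.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptnn_forward(x, ivec, weights, biases):
    """Forward pass of a small AdaptNN.

    x       : original DNN input features, shape (feat_dim,)
    ivec    : speaker i-vector, shape (ivec_dim,); appended to the
              input of every layer, as on the slide
    weights : list of layer matrices; each expects [h, ivec] as input
    biases  : list of layer bias vectors

    Hidden layers use sigmoid; the last (output) layer is linear and
    has the same dimension as x, so its output can replace x as the
    input to the frozen initial DNN.
    """
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ np.concatenate([h, ivec]) + b
        h = z if l == len(weights) - 1 else sigmoid(z)
    return h
```

Because the output has the same shape as the input, the frozen initial DNN can consume it unchanged, which is what makes the back-propagation in the next training step straightforward.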
Linear Feature Shift (ivecnn)
a_t = o_t + f(i_s)
- ivecnn takes the speaker i-vector i_s as input and generates a linear feature shift f(i_s) for each speaker
- This feature shift is added to the original DNN input o_t at each frame t, and the resulting features a_t become more speaker-normalized
- The output layer of ivecnn has the same dimension as the DNN inputs and takes the linear activation function
- The parameters of ivecnn can be estimated by standard error back-propagation
- More flexible: it can be applied both to DNNs and to convolutional neural networks (CNNs) [6]
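The shift a_t = o_t + f(i_s) can be sketched directly: ivecnn maps the speaker i-vector to one offset vector, and that same offset is added to every frame of that speaker. Layer sizes below are our own illustrative choices, not from the paper.

```python
import numpy as np

def ivecnn_shift(ivec, weights, biases):
    """f(i_s): map a speaker i-vector to one feature-space shift.
    Hidden layers use sigmoid; the output layer is linear and matches
    the DNN input dimension."""
    h = ivec
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b
        h = z if l == len(weights) - 1 else 1.0 / (1.0 + np.exp(-z))
    return h

def adapt_features(feats, ivec, weights, biases):
    """a_t = o_t + f(i_s): the same per-speaker shift is added to the
    original input o_t of every frame t.
    feats: (n_frames, feat_dim) original DNN inputs for one speaker."""
    return feats + ivecnn_shift(ivec, weights, biases)
```

Because the shift is purely additive on the input, it composes with any downstream network, which is why the slide notes it also applies to CNNs.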
Procedures of SAT-DNN: Training
Step 1: Train the initial DNN model. This DNN can be trained on SI features (e.g., fbank) or SA features (e.g., fMLLR)
Step 2: Learn the feature function (AdaptNN or ivecnn) while keeping the initial DNN fixed. This step requires the speaker i-vectors as side information for feature transformation
Step 3: Re-finetune the DNN parameters in the new feature space while keeping the feature function fixed. This finally gives us the SAT-DNN
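The freeze-and-update alternation in Steps 2 and 3 amounts to running back-propagation while masking out one parameter group. A minimal sketch of that pattern (the group names and toy parameters are our own illustration, not the paper's implementation):

```python
import numpy as np

def sgd_step(params, grads, frozen, lr=0.1):
    """One SGD update that leaves frozen parameter groups untouched."""
    return {name: p if name in frozen else p - lr * grads[name]
            for name, p in params.items()}

# Toy stand-ins for the two parameter groups and their gradients.
params = {"dnn": np.ones(3), "adaptnn": np.zeros(3)}
grads = {"dnn": np.ones(3), "adaptnn": np.ones(3)}

# Step 2: train the feature function, initial DNN fixed.
params = sgd_step(params, grads, frozen={"dnn"})
# Step 3: re-finetune the DNN, feature function fixed.
params = sgd_step(params, grads, frozen={"adaptnn"})
```

In practice each "step" is many epochs of back-propagation over the training data; the sketch only shows which group moves when.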
Procedures of SAT-DNN: Decoding
Step 1: Given a testing speaker, simply extract the i-vector for adaptation. i-vector extraction is totally unsupervised
Step 2: Feed the speech features and the i-vector into this architecture for decoding. This projects the input features into the speaker-normalized space and adapts the SAT-DNN model automatically to the testing speaker
- Since i-vector extraction is totally unsupervised, there is no initial decoding pass and no fine-tuning on the adaptation data
- Only a single pass of decoding, although we are doing unsupervised adaptation
- Very efficient unsupervised adaptation
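Single-pass adapted decoding is just a composition of the two trained pieces: shift the test speaker's features with the i-vector, then run the SAT-DNN. In the sketch below, `shift_fn` and `dnn_fn` are hypothetical placeholders standing in for the trained ivecnn and SAT-DNN.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_posteriors(feats, ivec, shift_fn, dnn_fn):
    """One decoding pass for a test speaker: no supervision, no extra
    decoding pass, no fine-tuning. shift_fn maps the i-vector to a
    feature shift; dnn_fn maps adapted features to state scores."""
    adapted = feats + shift_fn(ivec)   # project into the normalized space
    return softmax(dnn_fn(adapted))    # per-frame state posteriors
```

The posteriors would then feed the usual hybrid HMM decoder; nothing downstream needs to know adaptation happened.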
Comparison to Related Work
G. Saon, H. Soltau, D. Nahamoo, and M. Picheny. Speaker adaptation of neural network acoustic models using i-vectors. ASRU 2013.
- Concatenates i-vectors with the original features directly and trains the whole network from scratch
- We failed to get obvious gains from this proposal, most likely due to the normalization of the i-vectors. The i-vectors should be normalized very carefully, which is also observed by: A. Senior, I. Lopez-Moreno. Improving DNN speaker independence with i-vector inputs. ICASSP 2014.
- When using our SAT-DNN, there is no need to worry about i-vector normalization: the feature function will do this job!
Experiments: Switchboard
- A 110-hour training setup [7], around 100k utterances
- Kaldi for GMM: mono, delta, lda+mllt, sat
- Kaldi+PDNN: http://www.cs.cmu.edu/~ymiao/kaldipdnn.html
- Two types of DNN inputs: SI filterbanks and SA fMLLRs
- Tested on the SWBD part of Hub5'00
i-Vector Extractor Building
- Open-source ALIZE toolkit [8]
- A 100-dimensional i-vector is extracted for each training and testing speaker
Experiments: Switchboard (WER %, relative improvement over the baseline in parentheses)

Models                    Filterbank     fMLLR
Baseline (initial) DNN    21.4           19.9
SAT-DNN + AdaptNN         19.8 (7.5%)    18.7 (6.0%)
SAT-DNN + ivecnn          19.9 (7.0%)    19.0 (4.8%)
Initial DNN + AdaptNN     20.8 (2.8%)    19.2 (3.5%)
Initial DNN + ivecnn      21.2 (0.9%)    19.7 (1.0%)
Our recent work enlarges the improvement to 11.1% and 6.8% relative on filterbank and fMLLR features, respectively
Experiments: BABEL
- The more challenging BABEL datasets: conversational telephone speech from low-resource languages
- 80 hours of training data for each language: Tagalog (IARPA-babel106-v0.2f) and Turkish (IARPA-babel105b-v0.4); only on the SI filterbank features

Models                    Tagalog        Turkish
Baseline (initial) DNN    49.3           51.3
SAT-DNN + AdaptNN         47.1 (4.5%)    48.6 (5.3%)
SAT-DNN + ivecnn          47.3 (4.1%)    49.3 (3.9%)
Summary & Future Work
Summary
- We can do SAT for DNNs! To achieve this, we propose two feature-learning approaches that produce the speaker-normalized space
- We get nice improvements! Our experiments show that SAT-DNN outperforms DNNs regardless of the feature type of the DNN inputs
- Our code is open source! You can check out the code and run the experiments: http://www.cs.cmu.edu/~ymiao/satdnn.html
Future Work
- Comparison with speaker adaptation methods; perform sequence training [9] over the resulting SAT-DNN
- Extend the SAT framework to other architectures, e.g., to bottleneck feature extraction [10] and convolutional neural networks [6]
References
[1] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. ASRU, pp. 24-29, 2011.
[2] B. Li and K. C. Sim, "Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems," in Proc. Interspeech, pp. 526-529, 2010.
[3] K. Yao, D. Yu, F. Seide, H. Su, L. Deng, and Y. Gong, "Adaptation of context-dependent deep neural networks for automatic speech recognition," in Proc. IEEE Spoken Language Technology Workshop, pp. 366-369, 2012.
[4] S. M. Siniscalchi, J. Li, and C.-H. Lee, "Hermitian based hidden activation functions for adaptation of hybrid HMM/ANN models," in Proc. Interspeech, pp. 526-529, 2012.
[5] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," in Proc. ASRU, pp. 55-59, 2013.
[6] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in Proc. ICASSP, pp. 8614-8618, 2013.
[7] S. P. Rath, D. Povey, K. Vesely, and J. Cernocky, "Improved feature processing for deep neural networks," in Proc. Interspeech, 2013.
[8] J.-F. Bonastre, N. Scheffer, D. Matrouf, C. Fredouille, A. Larcher, A. Preti, G. Pouchoulin, N. Evans, B. Fauve, and J. Mason, "ALIZE/SpkDet: a state-of-the-art open-source software for speaker recognition," in Proc. ISCA/IEEE Speaker Odyssey, 2008.
[9] B. Kingsbury, "Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling," in Proc. ICASSP, pp. 3761-3764, 2009.
[10] J. Gehring, Y. Miao, F. Metze, and A. Waibel, "Extracting deep bottleneck features using stacked auto-encoders," in Proc. ICASSP, 2013.
Thank You
Yajie Miao, Hao Zhang, Florian Metze
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Acknowledgements. This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.