A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation


A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation. SLSP-2016 (Statistical Language and Speech Processing), October 11-12. Natalia Tomashenko 1,2,3 (natalia.tomashenko@univ-lemans.fr), Yuri Khokhlov 3 (khokhlov@speechpro.com), Yannick Estève 1 (yannick.esteve@univ-lemans.fr). 1 University of Le Mans, France; 2 ITMO University, Saint Petersburg, Russia; 3 STC-innovations Ltd, Saint Petersburg, Russia

2 Outline 1. Introduction Speaker adaptation GMM vs DNN acoustic models GMM adaptation DNN adaptation: related work Combining GMM and DNN in speech recognition 2. Proposed approach for speaker adaptation: GMM-derived features 3. System fusion 4. Experiments 5. Conclusions 6. Future work

3 Outline 1. Introduction Speaker adaptation GMM vs DNN acoustic models GMM adaptation DNN adaptation: related work Combining GMM and DNN in speech recognition 2. Proposed approach for speaker adaptation: GMM-derived features 3. System fusion 4. Experiments 5. Conclusions 6. Future work

4 Adaptation: Motivation. Why do we need adaptation? Differences between training and testing conditions may significantly degrade recognition accuracy in speech recognition systems. Sources of speech variability: the speaker (gender, age, emotional state, speaking rate, accent, style) and the environment (channel, background noises, reverberation). Adaptation is an efficient way to reduce the mismatch between the models and the data from a particular speaker or channel.

5 Speaker adaptation: the adaptation of pre-existing models towards the optimal recognition of a new target speaker, using limited adaptation data from that speaker. General speaker-independent (SI) acoustic models are trained on a large corpus of acoustic data from many different speakers; speaker-adapted acoustic models are then obtained from the SI models using the data of a new speaker.

6 Acoustic Models: GMM vs DNN. Gaussian Mixture Models (GMMs): GMM-HMMs have a long history, having been used in speech recognition since the 1980s, and speaker adaptation for them is a well-studied field of research. Deep Neural Networks (DNNs): big advances in speech recognition over the past 3-5 years; DNNs show higher performance than GMMs and are the state of the art in acoustic modelling, but speaker adaptation for them is still a very challenging task.

7 GMM adaptation. Model-based: adapt the parameters of the acoustic models to better match the observed data. Maximum a posteriori (MAP) adaptation of GMM parameters: each Gaussian is updated individually. Maximum likelihood linear regression (MLLR) of Gaussian parameters: all Gaussians of the same regression class share the same transform. Feature-space: transform the features, e.g. feature-space maximum likelihood linear regression (fMLLR).
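The MAP update mentioned on this slide can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the system's actual code: it assumes a diagonal-covariance GMM and adapts only the means, interpolating each prior mean with the adaptation-data statistics via a smoothing parameter tau (the τ that reappears in the results table).

```python
import numpy as np

def map_adapt_means(means, weights, covs, X, tau=5.0):
    """MAP adaptation of GMM means (sketch, diagonal covariances).

    means: (K, d) prior means, covs: (K, d) diagonal variances,
    weights: (K,) mixture weights, X: (T, d) adaptation frames.
    Each Gaussian is updated individually, as on the slide.
    """
    # E-step: posterior (occupancy) of each Gaussian for each frame
    log_post = np.stack([
        -0.5 * np.sum((X - m) ** 2 / c + np.log(2 * np.pi * c), axis=1)
        + np.log(w)
        for m, c, w in zip(means, covs, weights)
    ], axis=1)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # MAP smoothing: mu_hat = (tau * mu + sum_t gamma_t x_t) / (tau + sum_t gamma_t)
    gamma = post.sum(axis=0)        # occupancy count per Gaussian
    first_order = post.T @ X        # per-Gaussian weighted data sum
    return (tau * means + first_order) / (tau + gamma)[:, None]
```

With small tau the adapted mean follows the data; with very large tau it stays at the prior, which is the usual MAP trade-off with limited adaptation data.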

8 DNN adaptation: Related work. Linear transformation: LIN 1, fDLR 2, LHN 1, LON 3, oDLR 4, fMLLR 2. Regularization techniques: L2-prior 5, KL-divergence 6, Conservative Training 7. Model-space adaptation: LHUC 8, (f)MAP linear regression 9, speaker codes 10, hierarchy of output layers 12. Auxiliary features: i-vectors 11. Multi-task learning (MTL). Adaptation based on GMM: fMLLR 2, TVWR 13, GMM-derived features 14. (1 Gemello et al., 2006; 2 Seide et al., 2011; 3 Li et al., 2010; 4 Yao et al., 2012; 5 Liao, 2013; 6 Yu et al., 2013; 7 Albesano, Gemello et al., 2006; 8 Swietojanski et al., 2014; 9 Huang et al., 2014; 10 Xue et al., 2014; 11 Senior et al., 2014; 12 Price et al., 2014; 13 Liu et al., 2014; 14 Tomashenko & Khokhlov, 2014)

9 Combining GMM and DNN in speech recognition. Tandem features 17 (Hermansky et al., 2000). Bottleneck features 18 (Grézl et al., 2007). GMM log-likelihoods as features for an MLP 19 (Pinto & Hermansky, 2008). Log-likelihood combination. ROVER*, lattice-based combination, CNC**. (*ROVER: Recognizer Output Voting Error Reduction; **CNC: Confusion Network Combination)

10 Outline 1. Introduction Speaker adaptation GMM vs DNN acoustic models GMM adaptation DNN adaptation: related work Combining GMM and DNN in speech recognition 2. Proposed approach for speaker adaptation: GMM-derived features 3. System fusion 4. Experiments 5. Conclusions 6. Future work

11 Proposed approach: Motivation. It has been shown that speaker adaptation is more effective for GMM acoustic models than for DNN acoustic models, and many adaptation algorithms that work well for GMM systems cannot be easily applied to DNNs. Neural networks and GMMs may be complementary and benefit from combination. The goal is therefore to take advantage of existing adaptation methods developed for GMMs and apply them to DNNs.

12 Proposed approach: GMM-derived features for DNN. Extract features using a GMM model and feed these GMM-derived (GMMD) features to a DNN. Train the DNN model on the GMM-derived features. Adapt the GMM-derived features using GMM adaptation algorithms.

13 Bottleneck-based GMM-derived features for DNNs. For a given acoustic BN-feature vector O_t, a new GMM-derived feature vector f_t is obtained by calculating the log-likelihoods across all the states of the auxiliary (speaker-independent or adapted) GMM on the given vector.
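The extraction step described above can be sketched as below. For simplicity this assumed illustration uses a single diagonal Gaussian per state of the auxiliary model; a real auxiliary GMM has a mixture per state, and adaptation (e.g. MAP, as on the previous slides) would be applied to it before the features are extracted.

```python
import numpy as np

def gmmd_features(bn_feats, state_means, state_vars):
    """GMM-derived (GMMD) feature sketch: for each BN-feature vector O_t,
    f_t is the vector of log-likelihoods of O_t under every state of the
    auxiliary GMM (one diagonal Gaussian per state in this toy version)."""
    feats = np.empty((bn_feats.shape[0], len(state_means)))
    for s, (mu, var) in enumerate(zip(state_means, state_vars)):
        diff = bn_feats - mu
        feats[:, s] = -0.5 * np.sum(diff ** 2 / var + np.log(2 * np.pi * var), axis=1)
    return feats  # shape (T, n_states): one log-likelihood per state
```

The resulting (T, n_states) matrix is what the DNN is trained on; adapting the auxiliary GMM changes these features and thereby adapts the DNN without touching its weights.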

14 Outline 1. Introduction Speaker adaptation GMM vs DNN acoustic models GMM adaptation DNN adaptation: related work Combining GMM and DNN in speech recognition 2. Proposed approach for speaker adaptation: GMM-derived features 3. System fusion 4. Experiments 5. Conclusions 6. Future work

15 System Fusion: feature-level fusion for the training and decoding stages. Input features 1 + Input features 2 → feature concatenation → DNN → output posteriors → decoder → result.

16 System Fusion: posterior combination. Input features 1 → DNN 1 → output posteriors 1; Input features 2 → DNN 2 → output posteriors 2; posterior combination → decoder → result.
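The posterior-combination step can be sketched as follows. The slide does not specify the combination rule, so both a linear and a log-linear (geometric) interpolation are shown here as assumptions; alpha is the weight of the first (baseline) system, as in the results table later on.

```python
import numpy as np

def combine_posteriors(p1, p2, alpha=0.5, log_domain=False):
    """Frame-level fusion of two DNNs' posteriors (sketch).

    p1, p2: (T, n_states) posterior matrices from the two systems.
    alpha: weight of the first system.
    """
    if log_domain:
        # geometric (log-linear) interpolation
        logp = alpha * np.log(p1) + (1.0 - alpha) * np.log(p2)
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
    else:
        # arithmetic (linear) interpolation
        p = alpha * p1 + (1.0 - alpha) * p2
    return p / p.sum(axis=1, keepdims=True)  # renormalize per frame
```

The combined posteriors are then passed to the decoder exactly as a single system's posteriors would be.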

17 System Fusion: lattice combination. Input features 1 → DNN 1 → output posteriors 1 → decoder → lattices 1; Input features 2 → DNN 2 → output posteriors 2 → decoder → lattices 2; confusion network combination → result.

18 Outline 1. Introduction Speaker adaptation GMM vs DNN acoustic models GMM adaptation DNN adaptation: related work Combining GMM and DNN in speech recognition 2. Proposed approach for speaker adaptation: GMM-derived features 3. System fusion 4. Experiments 5. Conclusions 6. Future work

19 Experiments: Data. TED-LIUM corpus:* 1495 TED talks, 207 hours (141 hours of male and 66 hours of female speech), 1242 speakers, 16 kHz.

Data set      Duration, hours   Speakers   Mean duration per speaker, min
Training      172               1029       10
Development   3.5               14         15
Test 1        3.5               14         15
Test 2        4.9               14         21

LM:** 150K-word vocabulary and a publicly available trigram LM. *A. Rousseau, P. Deléglise, and Y. Estève, "Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks", 2014. **cantab-tedliumpruned.lm3

20 Experiments: Baseline systems. We follow the Kaldi TED-LIUM recipe for training the baseline models: DNN Model #1, a speaker-independent model (RBM pretraining, CE and sMBR training); DNN Model #2, trained with speaker-adaptive training using fMLLR.

21 Experiments: Training models with GMMD features. Two types of integration of GMMD features into the baseline recipe: 1. Adapted features AF 1 (with a monophone auxiliary GMM): train DNN Models #3 and #4. 2. Adapted features AF 2 (with a triphone auxiliary GMM): train DNN Model #5.

22 Results: Adaptation performance for DNNs (WER, %).

#   Adaptation    Features           τ    Dev     Test 1   Test 2
1   No            BN                 -    12.14   10.77    13.75
2   fMLLR         BN                 -    10.64   9.52     12.78
3   MAP           AF 1               2    10.27   9.59     12.94
4   MAP           AF 1 + align. #2   5    10.26   9.40     12.52
5   MAP+fMLLR     AF 2 + align. #2   5    10.42   9.74     13.29

τ is the smoothing parameter in MAP adaptation. Rows 1-2 are the baseline systems, rows 3-5 the GMMD systems; highlighted results on the original slide are better than the speaker-adapted baseline (#2).

23 Results: Adaptation and Fusion (WER, %; relative WER reduction in comparison with the adapted baseline #2 is given in parentheses).

#   System                      Features           α      Dev           Test 1       Test 2
1   No adaptation               BN                 -      12.14*        10.77*       13.75*
2   fMLLR                       BN                 -      10.57         9.46         12.67
4   MAP                         AF 1 + align. #2   -      10.23         9.31         10.46
5   MAP+fMLLR                   AF 2 + align. #2   -      10.37         9.69         13.23
6   Posterior fusion: #2 + #4                      0.45   9.91 (6.2)    9.06 (4.3)   12.04 (5.0)
7   Posterior fusion: #2 + #5                      0.55   9.91 (6.2)    9.10 (3.8)   12.23 (3.5)
8   Lattice fusion: #2 + #4                        0.44   10.06 (4.8)   9.09 (4.0)   12.12 (4.4)
9   Lattice fusion: #2 + #5                        0.50   10.01 (5.3)   9.17 (3.1)   12.25 (3.3)

α is the weight of the baseline model in the fusion. *WER in #1 was calculated from lattices; in the other lines, from the consensus hypothesis. Both types of fusion (posterior level and lattice level) provide additional, comparable improvement; in most cases posterior-level fusion gives slightly better results than lattice-level fusion.
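The relative reductions in parentheses follow directly from the absolute WERs; for instance, for posterior fusion #6 against the adapted baseline #2 on the Dev set:

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction (%) of a system with respect to a baseline."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# Posterior fusion #6 (9.91) vs. adapted baseline #2 (10.57) on Dev:
print(round(relative_wer_reduction(10.57, 9.91), 1))  # 6.2
```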

24 Outline 1. Introduction Speaker adaptation GMM vs DNN acoustic models GMM adaptation DNN adaptation: related work Combining GMM and DNN in speech recognition 2. Proposed approach for speaker adaptation: GMM-derived features 3. System fusion 4. Experiments 5. Conclusions 6. Future work

25 Conclusions. We investigated a new way of combining the GMM and DNN frameworks for speaker adaptation of acoustic models. The main advantage of GMM-derived features is the possibility of performing the adaptation of a DNN-HMM model through the adaptation of the auxiliary GMM. Other methods can be used for the adaptation of the auxiliary GMM instead of MAP or fMLLR; thus, this approach provides a general framework for transferring adaptation algorithms developed for GMMs to DNN adaptation. Experiments demonstrate that, in an unsupervised adaptation mode, the proposed adaptation and fusion techniques provide approximately 11-18% relative WER reduction in comparison with the speaker-independent model, and 3-6% relative WER reduction in comparison with the strong fMLLR-adapted baseline.

26 Outline 1. Introduction Speaker adaptation GMM vs DNN acoustic models GMM adaptation DNN adaptation: related work Combining GMM and DNN in speech recognition 2. Proposed approach for speaker adaptation: GMM-derived features 3. System fusion 4. Experiments 5. Conclusions 6. Future work

27 Future work. Investigate the performance of the proposed method for different types of neural networks (recurrent neural networks (RNNs), long short-term memory (LSTM) networks, etc.) and on other tasks. Better understanding and analysis of GMMD features: how can we improve the performance?

28 Visualization of output vectors using t-SNE.* Visualization of the softmax output vectors of the DNNs (5 speakers, 7 phonemes: \ɛ\ \r\ \ɑ\ \n\ \ʃ\ \t\ \p\): 1. Baseline speaker-independent DNN, trained on BN features. 2. Baseline speaker-adapted DNN, trained on fMLLR-adapted BN features. 3. DNN trained using GMMD features with MAP adaptation. *t-distributed Stochastic Neighbor Embedding: L. van der Maaten & G. Hinton, "Visualizing data using t-SNE", 2008.
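The projection itself can be reproduced with any off-the-shelf t-SNE implementation. A sketch using scikit-learn (assumed available), where softmax_outputs is a placeholder name for the collected frames-by-senones matrix of DNN outputs:

```python
import numpy as np
from sklearn.manifold import TSNE

def project_outputs_2d(softmax_outputs, perplexity=30.0, seed=0):
    """Embed high-dimensional DNN softmax output vectors into 2-D with
    t-SNE for plotting, one point per frame (sketch)."""
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(np.asarray(softmax_outputs))
```

The 2-D points can then be scattered and colored by phoneme (or by speaker) to produce plots like those on this slide; perplexity must be smaller than the number of frames being embedded.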

29 Key References (1). Adaptation of DNN acoustic models:
1. R. Gemello, F. Mana, S. Scanzio, P. Laface, & R. De Mori. Adaptation of hybrid ANN/HMM models using linear hidden transformations and conservative training. 2006.
2. F. Seide, G. Li, X. Chen, & D. Yu. Feature engineering in context-dependent deep neural networks for conversational speech transcription. 2011.
3. B. Li & K. C. Sim. Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems. 2010.
4. K. Yao, D. Yu, F. Seide, H. Su, L. Deng, & Y. Gong. Adaptation of context-dependent deep neural networks for automatic speech recognition. 2012.
5. H. Liao. Speaker adaptation of context dependent deep neural networks. 2013.
6. D. Yu, K. Yao, H. Su, G. Li, & F. Seide. KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition. 2013.
7. D. Albesano, R. Gemello, P. Laface, F. Mana, & S. Scanzio. Adaptation of artificial neural networks avoiding catastrophic forgetting. 2006.
8. P. Swietojanski & S. Renals. Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models. 2014.
9. Z. Huang, J. Li, S. M. Siniscalchi, I.-F. Chen, C. Weng, & C.-H. Lee. Feature space maximum a posteriori linear regression for adaptation of deep neural networks. 2014.
10. S. Xue, O. Abdel-Hamid, H. Jiang, L. Dai, & Q. Liu. Fast adaptation of deep neural network based on discriminant codes for speech recognition. 2014.
11. A. Senior & I. Lopez-Moreno. Improving DNN speaker independence with i-vector inputs. 2014.
12. R. Price, K. Iso, & K. Shinoda. Speaker adaptation of deep neural networks using a hierarchy of output layers. 2014.
13. S. Liu & K. C. Sim. On combining DNN and GMM with unsupervised speaker adaptation for robust automatic speech recognition. 2014.

30 Key References (2). Proposed approach for adaptation:
14. N. Tomashenko & Y. Khokhlov. Speaker adaptation of context dependent deep neural networks based on MAP-adaptation and GMM-derived feature processing. 2014.
15. N. Tomashenko & Y. Khokhlov. GMM-derived features for effective unsupervised adaptation of deep neural network acoustic models. 2015.
16. S. Kundu, K. C. Sim, & M. Gales. Incorporating a generative front-end layer to deep neural network for noise robust automatic speech recognition. 2016.
Combining GMM and DNN:
17. H. Hermansky, D. P. Ellis, & S. Sharma. Tandem connectionist feature extraction for conventional HMM systems. 2000.
18. F. Grézl, M. Karafiát, S. Kontár, & J. Cernocky. Probabilistic and bottle-neck features for LVCSR of meetings. 2007.
19. J. P. Pinto & H. Hermansky. Combining evidence from a generative and a discriminative model in phoneme recognition. 2008.

http://www-lium.univ-lemans.fr http://en.ifmo.ru http://speechpro.com Thank you! Questions?