Language Identification
Pavel Matějka, Lukáš Burget, Petr Schwarz and Jan Černocký
{matejkap, burget, schwarzp, cernocky}@fit.vutbr.cz
Speech@FIT group, Faculty of Information Technology
Brno University of Technology, Czech Republic
Brno University of Technology: Language identification
Plan
- Introduction - why do we need LID?
- Gaussian mixture model approach
  - System description
  - Features for recognition
  - Discriminative training
- Phonotactic approach
  - Basic system description
  - Extension to lattices
  - Language antimodels
- Results on LRE 2003 and 2005
- Conclusions and future work
Why do we need language identification? - I.
1) Route phone calls to human operators:
- emergency (155, 911)
- police (158)
- fire brigade (150)
- call centers
Why do we need language identification? - II.
2) Pre-select a suitable speech recognition system (information systems): a black-box LID approach.
Why do we need language identification? - III.
3) Security applications.
Two main approaches to Language Identification

Acoustic - Gaussian Mixture Models (GMM)
Speech -> Features -> GMM -> Decision
- good for short and long speech segments and for dialect recognition
- relies on the sounds - tends to recognize the speaker's native language

Phonotactic - Phoneme Recognition followed by Language Model (PRLM)
Speech -> Phoneme Recognizer -> Language Model -> Decision
(each language model stores phoneme n-gram counts, e.g. Arabic: "a a a" 8, "a b c" 2, "a b v" 4, ...; English: "a a a" 9, "a b c" 15, "a b v" 23, ...)
- good for longer speech segments
- robust against dialects within one language
- eliminates speech characteristics of the speaker's native language
Acoustic LID
Feature extraction -> one GMM per language (Language 1 ... Language N) -> scores 1 ... N -> score normalization -> decision
Feature extraction
- MFCC (static coefficients including C0)
- RASTA channel normalization
- VTLN speaker adaptation
- MFCCs are augmented with Shifted Delta Cepstra 7-1-3-7 (SDC), representing information about the speech evolution around the current frame (±0.1 sec)
The size of the final feature vector is: 7 MFCC + 7×7 SDC = 56
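The SDC stacking above can be sketched as follows. This is a minimal illustration of the 7-1-3-7 configuration (N=7 cepstra, delta spread d=1, block shift P=3, k=7 blocks); the edge-padding at utterance boundaries is an assumption, as the slides do not specify it:

```python
import numpy as np

def sdc(cep, N=7, d=1, P=3, k=7):
    """Shifted Delta Cepstra, 7-1-3-7 configuration.

    cep: (T, N) array of static MFCCs (C0..C6).
    For each frame t, k delta blocks are stacked, each shifted by P frames:
        delta_i(t) = c(t + i*P + d) - c(t + i*P - d),  i = 0..k-1
    The result is the static vector plus k*N deltas: 7 + 49 = 56 dims.
    """
    T = cep.shape[0]
    # pad by repeating edge frames so every shifted index stays in range
    pad = (k - 1) * P + d
    padded = np.pad(cep, ((d, pad), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        delta = (padded[i * P + 2 * d : i * P + 2 * d + T]
                 - padded[i * P : i * P + T])
        blocks.append(delta)
    return np.hstack([cep] + blocks)  # (T, N + k*N)
```

For a 10-frame utterance of 7-dimensional MFCCs this yields a (10, 56) feature matrix, matching the slide's dimensionality.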
Distribution of Features for Two Languages
Each feature vector can be presented as a point in N-dimensional space
Distribution of Features for Two Languages
What language is spoken in the green utterance? Blue or red?
Distribution of Features for Two Languages
Modeling Distributions Using Mixture of Gaussians
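A minimal sketch of this modeling step: fitting a diagonal-covariance GMM per language with plain EM and comparing data log-likelihoods. This is a toy numpy implementation with assumed initialization details (random data points as means, global variance), not the training recipe of the system in the slides:

```python
import numpy as np

def em_diag_gmm(X, M=2, iters=50, seed=0):
    """ML training of a diagonal-covariance GMM via EM (toy sketch).
    X: (T, D) feature frames. Returns weights c, means mu, variances var."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    mu = X[rng.choice(T, M, replace=False)]          # init means on data points
    var = np.tile(X.var(0), (M, 1)) + 1e-6           # init with global variance
    c = np.full(M, 1.0 / M)
    for _ in range(iters):
        # E-step: responsibilities from per-component diagonal-Gaussian logs
        logp = -0.5 * (np.log(2 * np.pi * var).sum(1)
                       + ((X[:, None, :] - mu) ** 2 / var).sum(2))
        logp += np.log(c)
        logp -= logp.max(1, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(1, keepdims=True)
        # M-step: weighted counts, means and second moments
        Nk = gamma.sum(0) + 1e-10
        c = Nk / T
        mu = gamma.T @ X / Nk[:, None]
        var = np.maximum(gamma.T @ (X ** 2) / Nk[:, None] - mu ** 2, 1e-6)
    return c, mu, var

def loglik(X, c, mu, var):
    """Total log-likelihood of frames X under the GMM (c, mu, var)."""
    logp = -0.5 * (np.log(2 * np.pi * var).sum(1)
                   + ((X[:, None, :] - mu) ** 2 / var).sum(2)) + np.log(c)
    m = logp.max(1, keepdims=True)
    return (m[:, 0] + np.log(np.exp(logp - m).sum(1))).sum()
```

Classification then amounts to scoring an utterance with every language's GMM and picking the highest log-likelihood.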
GMM - Training
Goal: Using training utterances O_r and their transcriptions L_r, find model parameters λ.

Maximum Likelihood (ML) training
- Objective function to maximize is the likelihood of the training data given the transcription:
  F_ML(λ) = Σ_{r=1..R} log p(O_r | M_{L_r}^λ) = Σ_{r=1..R} Σ_{t=1..T_r} log p(o_rt | M_{L_r}^λ)
- Models of different languages are trained independently - parameters are wasted on precisely modeling even those parts of the feature space with no discriminative power
- Assignment of frames to speech segments is NOT important for training

Maximum Mutual Information (MMI) training
- Objective function to maximize is the posterior probability of all training segments being correctly recognized:
  F_MMI(λ) = Σ_{r=1..R} log [ p(O_r | M_{L_r}^λ) / Σ_L p(O_r | M_L^λ) ]
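The two objective functions above can be evaluated directly once per-utterance total log-likelihoods are available; a minimal numeric sketch (the function names and the (R, L) score-matrix layout are illustrative assumptions):

```python
import numpy as np

def f_ml(loglikes, labels):
    """F_ML: sum over utterances of log p(O_r | M_{L_r}).
    loglikes: (R, L) matrix of log p(O_r | M_L); labels: (R,) true indices."""
    return loglikes[np.arange(len(labels)), labels].sum()

def f_mmi(loglikes, labels, logpriors=None):
    """F_MMI: sum over utterances of the log posterior of the correct
    language, log [ p(O_r|M_{L_r}) / sum_L p(O_r|M_L) ] (equal priors
    by default, matching the slide's formula)."""
    if logpriors is None:
        logpriors = np.zeros(loglikes.shape[1])
    num = loglikes[np.arange(len(labels)), labels] + logpriors[labels]
    den = np.logaddexp.reduce(loglikes + logpriors, axis=1)
    return (num - den).sum()
```

Because F_MMI is a sum of log posteriors, it is always non-positive and is driven only by how well the correct model beats the competitors, which is exactly why it concentrates parameters on the decision boundary.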
Highly overlapped distributions
Highly overlapped distributions
- Easily recognizable classes: no need to precisely model the distributions
- Highly overlapped classes with low discriminative power: necessary to precisely model the boundary
Maximum Mutual Information
- Concentrates on precise modeling of the decision boundary
- Optimizes parameters for good recognition of whole segments (not individual frames) - segmentation of speech is important for training
- MMI also learns the (undesirable) language priors from the training data - the amount of data per language needs to be equalized (segment weighting in the re-estimation formulae)
- Other discriminative training techniques were also investigated (MCE and MWE); MMI performs the best
[Burget 2006] L. Burget, P. Matějka, and J. Černocký, "Discriminative training techniques for acoustic language identification," ICASSP 2006, Toulouse, France
Experiments
Task description - NIST 2003:
- conversational telephone speech
- 12 target languages + 1 unknown
- 80 (or more) segments of 3, 10 and 30 seconds per language - together 1280 segments per duration
- development set - the 12-language task from LRE 1996
Languages: Arabic (Egyptian), German, Farsi, French (Canadian), Hindi, Japanese, Korean, English (American), Mandarin, Tamil, Vietnamese, Spanish (Latin American); unknown - Russian
Results on LRE2003, 30 sec condition
(DET curves: miss probability vs. false alarm probability for the fused GMM-MMI 128 + PPRLM system, PPRLM lattice + antimodels, GMM-MMI 128, GMM-ML 2048 and GMM-ML 128)

System | EER [%]
GMM-ML 2048 | 4.8
GMM-MMI 128 | 2.0
Phonotactic - Phoneme Recognition followed by Language Model (PRLM)
Speech -> Feature Extraction -> Phoneme Recognizer -> Language Models (phoneme n-gram counts per language, e.g. Arabic: "a a a" 8, "a b c" 2, ...; English: "a a a" 9, "a b c" 15, ...) -> Decision
Phoneme recognition
- 310 ms long temporal context around the actual frame: LC = left context (past), RC = right context (future)
- Temporal trajectories of mel-filter bank energies are processed by DCT and concatenated over all bands
- 3 neural networks (left-context net, right-context net, merging net) produce phoneme posterior probabilities, followed by a decoder (sil h a l l o sil)
- TIMIT: Phoneme Error Rate = 21.5%, Classification Error = 17.2%
To learn more: P. Schwarz, P. Matějka, and J. Černocký, "Hierarchical structures of neural networks for phoneme recognition," ICASSP 2006, Toulouse, France
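The split-context front end above can be sketched as follows: project the left and right halves of each mel-band trajectory onto a few DCT bases. The exact context length, number of DCT coefficients, and windowing in the slides' system may differ; the values here (15 frames per side ~ 310 ms total at a 10 ms frame rate, 11 bases) are illustrative assumptions:

```python
import numpy as np

def dct_bases(n, k):
    """First k DCT-II basis vectors of length n, as rows of a (k, n) matrix."""
    i = np.arange(n)
    return np.cos(np.pi / n * (i + 0.5)[None, :] * np.arange(k)[:, None])

def split_context_features(melbank, ctx=15, k=11):
    """For each frame, take a (2*ctx+1)-frame trajectory of every mel band,
    split it into left and right halves (sharing the center frame), and
    project each half onto k DCT bases. Output: (T, 2*k*B)."""
    T, B = melbank.shape
    padded = np.pad(melbank, ((ctx, ctx), (0, 0)), mode="edge")
    bases = dct_bases(ctx + 1, k)              # (k, ctx+1)
    feats = np.empty((T, 2 * k * B))
    for t in range(T):
        win = padded[t : t + 2 * ctx + 1]      # (2*ctx+1, B) trajectory
        left = bases @ win[: ctx + 1]          # (k, B) left-context features
        right = bases @ win[ctx:]              # (k, B) right-context features
        feats[t] = np.concatenate([left.ravel(), right.ravel()])
    return feats
```

In the full system, the left and right halves would feed two separate neural networks whose outputs a third network merges into phoneme posteriors.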
Phoneme recognition for LID
- The quality of PRLM and PPRLM heavily depends on the amount of training data
- Initial work was done with OGI Stories, but there was not enough data
- Question: will LID work if we use tokenizers from languages for which we have enough well-transcribed data?
- Answer: Yes! - using Hungarian, Czech and Russian from the SpeechDat-E database:
  - we know this data well
  - 10× more data than OGI Stories
  - none of these is a target language of any NIST evaluation...
  - ... but it works: see P. Matějka, P. Schwarz, J. Černocký and P. Chytil, "Phonotactic Language Identification using High Quality Phoneme Recognition," Eurospeech 2005, Lisbon, Portugal
Target model
- Tri-gram counts from the best path (e.g. sil h e l l o sil)
- Tri-gram counts from the lattice: each count is weighted by the posterior probability of the path on which it lies
- Back-off 3-gram LM with Witten-Bell discounting
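As a rough sketch of the language-model side, here is a trigram phoneme LM with Witten-Bell smoothing over plain phoneme strings. Note the assumptions: this is the *interpolated* Witten-Bell variant with an add-one unigram floor, whereas the slides use a back-off 3-gram with Witten-Bell discounting, and it counts best-path strings rather than posterior-weighted lattice counts; all names are illustrative:

```python
import math
from collections import Counter, defaultdict

class WittenBellTrigram:
    """Interpolated Witten-Bell trigram over phoneme strings (toy sketch)."""

    def __init__(self, phone_strings):
        self.c3, self.c2, self.c1 = Counter(), Counter(), Counter()
        for s in phone_strings:
            toks = ["<s>", "<s>"] + s.split() + ["</s>"]
            for i, w in enumerate(toks):
                self.c1[w] += 1
                if i >= 1:
                    self.c2[(toks[i - 1], w)] += 1
                if i >= 2:
                    self.c3[(toks[i - 2], toks[i - 1], w)] += 1
        # per-history totals and numbers of distinct continuations
        self.h3, self.t3 = defaultdict(int), defaultdict(set)
        for (a, b, w), n in self.c3.items():
            self.h3[(a, b)] += n
            self.t3[(a, b)].add(w)
        self.h2, self.t2 = defaultdict(int), defaultdict(set)
        for (a, w), n in self.c2.items():
            self.h2[a] += n
            self.t2[a].add(w)
        self.tot = sum(self.c1.values())
        self.vocab = len(self.c1) + 1  # one extra slot for unseen symbols

    @staticmethod
    def _wb(c_hw, c_h, t_h, lower):
        # Witten-Bell: reserve t_h / (c_h + t_h) of the mass for lower order
        if c_h == 0:
            return lower
        return (c_hw + t_h * lower) / (c_h + t_h)

    def prob(self, a, b, w):
        p1 = (self.c1[w] + 1) / (self.tot + self.vocab)  # add-one unigram
        p2 = self._wb(self.c2[(b, w)], self.h2[b], len(self.t2[b]), p1)
        return self._wb(self.c3[(a, b, w)], self.h3[(a, b)],
                        len(self.t3[(a, b)]), p2)

    def logprob(self, s):
        toks = ["<s>", "<s>"] + s.split() + ["</s>"]
        return sum(math.log(self.prob(*toks[i - 2:i + 1]))
                   for i in range(2, len(toks)))
```

At test time, one such LM per language scores the recognized phoneme string, and the language with the highest log-probability wins.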
Statistical modeling in example - Language model

Tri-gram | Arabic | English | ...
a a a | 8 | 1 | 4
a a b | 14 | 2 | 5
a a c | 1 | 25 | 10
... | ... | ... | ...
Antimodel I.
Modeling the space where target models make mistakes, inspired by the LVCSR work of ICSI/SRI: A. Stolcke, et al.: "The SRI March 2000 Hub-5 conversational speech transcription system," in Proc. NIST Speech Transcription Workshop, 2000.
- Recognize all training data and obtain the posteriors:
  P(O_r|L) = L(O_r | LM+_L) / Σ_{L'} L(O_r | LM+_{L'})
- Misrecognized blue segments become training data for the red antimodel (and vice versa)
- A separate LM is trained on all languages except the target one, weighting the counts by the posterior probability of wrongly recognizing the segment as the target language
Antimodel II.
Obtaining the final score:
log S(O_r|L) = log L(O_r | LM+_L) - k · log L(O_r | LM-_L)
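The antimodel weighting and scoring can be sketched numerically; this assumes per-utterance LM log-likelihoods are already computed, and all function names are illustrative:

```python
import numpy as np

def antimodel_weights(loglikes, labels, target):
    """Posterior weights for training the antimodel of `target` (Antimodel I):
    each non-target utterance contributes its n-gram counts weighted by
    P(target | O_r), the posterior of misrecognizing it as the target.
    loglikes: (R, L) matrix of log L(O_r | LM+_L); labels: true indices."""
    logpost = loglikes - np.logaddexp.reduce(loglikes, axis=1, keepdims=True)
    w = np.exp(logpost[:, target])
    w[labels == target] = 0.0  # the target's own data is excluded
    return w

def antimodel_score(loglik_target, loglik_anti, k=1.0):
    """Final score (Antimodel II):
    log S(O_r|L) = log L(O_r|LM+_L) - k * log L(O_r|LM-_L)."""
    return loglik_target - k * loglik_anti
```

Intuitively, the antimodel subtracts the part of the score that confusable non-target languages would also earn, sharpening the target/impostor separation.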
Results on LRE2003, 30 sec condition
(DET curves: miss probability vs. false alarm probability for the fused GMM-MMI 128 + PPRLM system, PPRLM lattice + antimodels, GMM-MMI 128, PRLM string, PRLM lattice, PRLM lattice + antimodels, GMM-ML 2048 and GMM-ML 128)

System | EER [%]
PRLM string | 3.1
PRLM lattice | 2.3
PRLM lattice+anti.m. | 1.8
PPRLM lattice+anti.m. | 1.4
GMM-MMI 128 | 2.0
Fusion | 0.8
NIST 2005 Language Recognition Evaluation
- conversational telephone speech
- 7 target languages + 1 unknown
- 360 (or more) segments of 3, 10 and 30 seconds per language - together 3662 segments per duration = more than 30 hours of speech
- dialect recognition - two dialects each of English and Mandarin
- development set - the 12-language task from LRE 2003
Languages: English (American), English (Indian), Hindi, Mandarin (Mainland), Mandarin (Taiwan), Japanese, Korean, Spanish (Mexican), Tamil; unknown - German
Results on LRE2005

System | EER [%], 30 sec | 10 sec | 3 sec
PRLM string | 6.8 | 13.9 | 24.5
PRLM lattice | 5.7 | 10.7 | 21.2
PRLM lattice+anti.m. | 5.3 | 10.7 | 21.4
GMM-MMI 256 | 4.6 | 8.6 | 17.2
Fusion | 2.9 | 6.4 | 14.1
Conclusion
GMM
- discriminative training (MMI) provides substantial improvements over conventional ML training
- at the same time, it allows a significant reduction of the number of parameters
PRLM
- PRLM improved by training and testing on lattices
- good results obtained with antimodels
Future plans
GMM
- HLDA experiments [Burget 2006]
PRLM (and phoneme recognizer)
- channel and speaker adaptation in the NN-based phoneme recognizer
- improve language modeling; use SVM classification, binary trees
General
- better combination of the separate systems
- channel normalization
[Burget 2006] L. Burget, P. Matějka, and J. Černocký, "Discriminative training techniques for acoustic language identification," ICASSP 2006, Toulouse, France
END: Thank you for your attention
More information:
- Eurospeech 2005, Lisbon, Portugal: Phonotactic Language Identification using High Quality Phoneme Recognition
- ICASSP 2006, Toulouse, France: Use of Anti-Models to Further Improve State-of-the-art PRLM LID System
- ICASSP 2006, Toulouse, France: Discriminative Training Techniques for Acoustic Language Identification
- ODYSSEY 2006, San Juan, Puerto Rico: Brno University of Technology System for NIST 2005 Language Recognition Eval.
What to do next? WORK HARDER & WORK CAREFULLY
Related work
Phoneme Recognizer
- developed primarily as a part of an indexation and search engine using keyword spotting (sponsored by the European AMI Project)
- available at http://www.fit.vutbr.cz/speech/sw/phnrec.html
GMM
- trained with our speech toolkit (STK); the toolkit is HTK-compatible and supports many features such as discriminative training (MMI, MPE), training from lattices, linear transforms (MLLT, LDA, HLDA), a keyword spotting tool, etc.
- STK was used, for example, to train the AMI LVCSR for meeting transcription submitted to the RT-05 NIST evaluation
- available at http://www.fit.vutbr.cz/speech/sw/stk.html
The LID system is partially sponsored by the Czech Ministry of Defense
National Institute of Standards and Technology
- US government agency
- coordinates benchmark tests within the research and development community
- active fields in speech processing: Language Recognition, Speaker Recognition, LVCSR - Large Vocabulary Continuous Speech Recognition, ...
System Evaluation
1. Correctness
2. DET curves - used as the NIST evaluation metric: probabilities of false alarms and misses are evaluated as a function of the detection threshold
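The equal error rate reported throughout these slides is the DET-curve operating point where the miss rate equals the false-alarm rate; a small sketch of computing it from raw detection scores by sweeping the threshold (function names are illustrative):

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate from detection scores: sweep the threshold and
    return the point where miss probability ~ false-alarm probability."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)[::-1]   # accept everything above the threshold
    labels = labels[order]
    fa = np.cumsum(1 - labels) / max(len(nontarget_scores), 1)  # false alarms
    miss = 1 - np.cumsum(labels) / max(len(target_scores), 1)   # misses
    idx = np.argmin(np.abs(miss - fa))
    return (miss[idx] + fa[idx]) / 2
```

Plotting `miss` against `fa` on normal-deviate axes gives the DET curves shown in the results slides; the EER is where the curve crosses the diagonal.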
Classification - Two Class Problem
The problem: Is P(L_1|O) > P(L_2|O)?
Using Bayes' theorem: Is p(O|L_1)P(L_1)/p(O) > p(O|L_2)P(L_2)/p(O)?
Assuming equal priors: Is p(O|L_1) > p(O|L_2)?
where P(L|O) is the probability of language L given the observation sequence (utterance) O, P(L) is the prior probability of language L, and
p(O|L) = Π_{t=1..T} p(o_t|L) ≈ Π_{t=1..T} p(o_t | M_L^λ)
where the probability density function p(o_t|L) (the distribution of features for language L) is approximated by a Gaussian mixture model M_L^λ with parameters λ = {μ_i, σ²_i, c_i}
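The two-class decision rule above can be sketched directly in the log domain, where the product over frames becomes a sum (the function name and argument layout are illustrative):

```python
import math

def decide(frame_logdens_1, frame_logdens_2,
           log_prior_1=0.0, log_prior_2=0.0):
    """Two-class Bayes decision under the frame-independence assumption:
    compare sum_t log p(o_t|L) + log P(L) for the two languages and pick
    the larger; with the default equal (log-)priors this reduces to
    comparing the likelihoods, as on the slide."""
    s1 = sum(frame_logdens_1) + log_prior_1
    s2 = sum(frame_logdens_2) + log_prior_2
    return 1 if s1 >= s2 else 2
```

Working with log-densities also avoids the numerical underflow that the raw product of many per-frame densities would cause.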