Language Identification


Language Identification
Pavel Matějka, Lukáš Burget, Petr Schwarz and Jan Černocký
{matejkap, burget, schwarzp, cernocky}@fit.vutbr.cz
Speech@FIT group, Faculty of Information Technology, Brno University of Technology, Czech Republic
Brno University of Technology: Language identification 1/31

Outline
- Introduction: why do we need LID?
- Gaussian mixture model approach: system description, features for recognition, discriminative training
- Phonotactic approach: basic system description, extension to lattices, language antimodels
- Results on LRE 2003 and 2005
- Conclusions and future work

Why do we need language identification?
1) Routing phone calls to human operators: emergency lines (155, 911), call centers, police (158), fire brigade (150).
2) Pre-selecting a suitable speech recognition system (information systems): a black-box LID module in front of the recognizers.
3) Security applications.

Two main approaches to Language Identification
Acoustic: Gaussian Mixture Models (GMM). Speech → features → per-language GMMs → decision. Good for both short and long speech segments and for dialect recognition; because it relies on the sounds themselves, it tends to recognize the speaker's native language.
Phonotactic: Phoneme Recognition followed by Language Model (PRLM). Speech → phoneme recognizer → per-language n-gram language models (e.g. the counts of "a a a", "a b c", ... differ between Arabic and English) → decision. Good for longer speech segments; robust against dialects within one language, and it suppresses the speech characteristics of the speaker's native language.

Acoustic LID
Feature extraction → one GMM per language (Language 1 ... Language N) → scores 1 ... N → score normalization → decision.
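The scoring stage of this pipeline can be sketched in a few lines. This is a minimal illustration with diagonal-covariance GMMs in numpy, not the actual BUT system; the function names `gmm_loglik` and `identify` are invented here:

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of frames X (T, D) under a
    diagonal-covariance GMM with M components: weights (M,), means (M, D),
    variances (M, D)."""
    T, D = X.shape
    # per-component log normalizer of the diagonal Gaussian
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    diff = X[:, None, :] - means[None, :, :]                     # (T, M, D)
    exponent = -0.5 * np.sum(diff**2 / variances[None], axis=2)  # (T, M)
    log_comp = np.log(weights)[None, :] + log_norm[None, :] + exponent
    # stabilized log-sum-exp over components, then average over frames
    m = log_comp.max(axis=1, keepdims=True)
    frame_ll = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return float(frame_ll.mean())

def identify(X, language_gmms):
    """Score an utterance against every language GMM and pick the best."""
    scores = {lang: gmm_loglik(X, *params) for lang, params in language_gmms.items()}
    return max(scores, key=scores.get), scores
```

In a full system the raw scores would then pass through score normalization before the decision, as in the diagram above.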

Feature extraction
- MFCC (static coefficients including C0)
- RASTA channel normalization
- VTLN speaker adaptation
- The MFCCs are augmented with Shifted Delta Cepstra 7-1-3-7 (SDC), representing information about the evolution of speech around the current frame (±0.1 s).
- The size of the final feature vector is 7 MFCC + 7×7 SDC = 56.
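The SDC computation can be sketched as follows, using the usual N-d-P-k convention (N cepstral coefficients, delta window d, shift P, k stacked blocks), so 7-1-3-7 stacks seven delta vectors per frame. The edge-clamping of out-of-range frames is an assumption of this sketch:

```python
import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    """Shifted Delta Cepstra (N-d-P-k). cepstra: (T, N) array of MFCCs.
    For each frame t, stack k delta vectors taken at shifts 0, P, ..., (k-1)P,
    where each delta is c[t+iP+d] - c[t+iP-d]. Frame indices outside the
    utterance are clamped to the edges."""
    T, N = cepstra.shape

    def frame(i):  # edge-clamped frame lookup
        return cepstra[min(max(i, 0), T - 1)]

    feats = np.empty((T, k * N))
    for t in range(T):
        deltas = [frame(t + i * P + d) - frame(t + i * P - d) for i in range(k)]
        feats[t] = np.concatenate(deltas)
    return feats
```

Concatenating the 7 static MFCCs with the 7×7 SDC block per frame gives the 56-dimensional vector quoted above.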

Distribution of Features for Two Languages
Each feature vector can be presented as a point in N-dimensional space. What language is spoken in the green utterance: blue or red? (scatter-plot slides)

Modeling Distributions Using a Mixture of Gaussians
(illustration: the feature distribution of each language is approximated by a GMM)

GMM Training
Goal: using training utterances O_r and their language labels L_r, find model parameters λ.

Maximum Likelihood (ML) training. The objective function to maximize is the likelihood of the training data given the labels:

F_ML(λ) = Σ_{r=1}^{R} log p(O_r | M^λ_{L_r}) = Σ_{r=1}^{R} Σ_{t=1}^{T_r} log p(o_{rt} | M^λ_{L_r})

Models of different languages are trained independently, wasting parameters on precisely modeling even those parts of the feature space with no discriminative power. The assignment of frames to speech segments is NOT important for training.

Maximum Mutual Information (MMI) training. The objective function to maximize is the posterior probability of all training segments being correctly recognized:

F_MMI(λ) = Σ_{r=1}^{R} log [ p(O_r | M^λ_{L_r}) / Σ_L p(O_r | M^λ_L) ]
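The MMI objective is easy to evaluate once per-language log-likelihoods are available. A small sketch (the helper name `mmi_objective` is invented, and equal language priors are assumed):

```python
import numpy as np

def mmi_objective(loglik, labels):
    """F_MMI = sum_r [ log p(O_r | M_{L_r}) - log sum_L p(O_r | M_L) ].
    loglik: (R, N) per-utterance log-likelihoods under each of the N language
    models; labels: (R,) indices of the true language of each utterance.
    The value is the total log posterior of the correct labels, so it is
    always <= 0 and approaches 0 as every segment is recognized with
    certainty."""
    loglik = np.asarray(loglik, dtype=float)
    num = loglik[np.arange(len(labels)), labels]
    m = loglik.max(axis=1, keepdims=True)  # stabilized log-sum-exp
    denom = m[:, 0] + np.log(np.exp(loglik - m).sum(axis=1))
    return float(np.sum(num - denom))
```

Unlike F_ML, this objective only grows when the correct model beats the competing models, which is why MMI concentrates its parameters on the decision boundary.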

Highly overlapping distributions
- Well-separated classes are easily recognizable; there is no need to model their distributions precisely.
- Highly overlapping classes have low discriminative power; there it is necessary to model the decision boundary precisely.

Maximum Mutual Information
- Concentrates on precise modeling of the decision boundary.
- Optimizes parameters for good recognition of whole segments (not individual frames), so the segmentation of speech is important for training.
- MMI also learns the (undesirable) language priors from the training data, so the amount of data per language must be equalized (segment weighting in the re-estimation formulae).
- Other discriminative training techniques were also investigated (MCE and MWE); MMI performs best [Burget 2006].

[Burget 2006] L. Burget, P. Matějka, and J. Černocký, "Discriminative training techniques for acoustic language identification," ICASSP 2006, Toulouse, France.

Experiments
Task description: NIST 2003, conversational telephone speech
- 12 target languages + 1 unknown
- 80 (or more) segments per language for each duration of 3, 10 and 30 seconds; 1280 segments per duration in total
- Development set: the 12-language task from LRE 1996
- Languages: Arabic (Egyptian), English (American), Farsi, French (Canadian), German, Hindi, Japanese, Korean, Mandarin, Spanish (Latin American), Tamil, Vietnamese; unknown: Russian

Results on LRE 2003, 30 s condition (DET plot of miss vs. false-alarm probability, with curves for GMM-ML 128, GMM-ML 2048, GMM-MMI 128, PPRLM lattice + antimodels, and the fusion of GMM-MMI 128 + PPRLM):

System         EER [%]
GMM-ML 2048    4.8
GMM-MMI 128    2.0

Phonotactic: Phoneme Recognition followed by Language Model (PRLM)
Speech → feature extraction → phoneme recognizer → per-language n-gram language models → decision.

Phoneme recognition
- Mel filter-bank energies are collected in a time buffer covering a 310 ms temporal context around the current frame: LC = left context (past), RC = right context (future).
- The temporal trajectory of each mel band is processed by a DCT, and the coefficients are concatenated from all bands.
- Three neural networks (LC, RC, and a merger) produce phoneme posterior probabilities, followed by a decoder (e.g. sil h a l l o sil).
- TIMIT: Phoneme Error Rate = 21.5%, Classification Error = 17.2%.
- To learn more: P. Schwarz, P. Matějka, and J. Černocký, "Hierarchical structures of neural networks for phoneme recognition," ICASSP 2006, Toulouse, France.
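The split left/right temporal-context front end might be sketched as below. This is a rough reconstruction from the slide, not the published recognizer: the exact context length, the number of DCT coefficients per band, and the edge handling are all assumptions of this sketch:

```python
import numpy as np

def split_context_features(melbank, ctx=15, n_dct=11):
    """For each frame, take the past ctx+1 frames (left context, LC) and the
    future ctx+1 frames (right context, RC) of every mel-band trajectory,
    compress each trajectory with a DCT-II, and concatenate across bands.
    ctx=15 at a 10 ms frame shift covers roughly the 310 ms context of the
    slide. melbank: (T, B) mel filter-bank energies."""
    T, B = melbank.shape
    n = ctx + 1
    # DCT-II basis for trajectories of length n, keeping n_dct coefficients
    basis = np.cos(np.pi / n * (np.arange(n)[:, None] + 0.5) * np.arange(n_dct)[None, :])
    padded = np.pad(melbank, ((ctx, ctx), (0, 0)), mode="edge")
    lc = np.empty((T, B * n_dct))
    rc = np.empty((T, B * n_dct))
    for t in range(T):
        left = padded[t:t + n]               # original frames t-ctx .. t
        right = padded[t + ctx:t + ctx + n]  # original frames t .. t+ctx
        lc[t] = (left.T @ basis).ravel()
        rc[t] = (right.T @ basis).ravel()
    return lc, rc  # inputs to the LC and RC networks, later merged
```

The two feature streams would feed the LC and RC neural networks, whose outputs a third network merges into phoneme posteriors.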

Phoneme recognition for LID
- The quality of PRLM and PPRLM depends heavily on the amount of training data.
- Initial work was done with OGI Stories, but there is not enough data.
- Question: will LID work if we use tokenizers from languages for which we have enough well-transcribed data?
- Answer: yes, using Hungarian, Czech and Russian from the SpeechDat-E database. We know this data well, it is 10× larger than OGI Stories, and none of these languages is a target language of any NIST evaluation... but it works: see P. Matějka, P. Schwarz, J. Černocký and P. Chytil, "Phonotactic Language Identification using High Quality Phoneme Recognition," Eurospeech 2005, Lisbon, Portugal.

Target model
- Trigram counts are taken either from the best path (e.g. sil h e l l o sil) or from the whole lattice.
- When counting from a lattice, each count is weighted by the posterior probability of the path on which it lies.
- A back-off 3-gram LM with Witten-Bell discounting is trained for each language.
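A phonotactic target model of this kind can be sketched as a trigram counter with Witten-Bell smoothing. The sketch below uses the interpolated form of Witten-Bell (a close cousin of the back-off form on the slide) and invented class/method names; the `weight` argument stands in for the lattice posterior weighting described above:

```python
from collections import defaultdict

class WittenBellTrigramLM:
    """Phonotactic trigram LM with interpolated Witten-Bell smoothing."""

    def __init__(self):
        self.tri = defaultdict(lambda: defaultdict(float))
        self.bi = defaultdict(lambda: defaultdict(float))
        self.uni = defaultdict(float)
        self.total = 0.0

    def add(self, phones, weight=1.0):
        """Accumulate n-gram counts from one phone string; `weight` would be
        the posterior probability of the lattice path the counts come from."""
        for i, p in enumerate(phones):
            self.uni[p] += weight
            self.total += weight
            if i >= 1:
                self.bi[phones[i - 1]][p] += weight
            if i >= 2:
                self.tri[(phones[i - 2], phones[i - 1])][p] += weight

    def _wb(self, counts, w, backoff):
        c = sum(counts.values())
        t = len(counts)  # number of distinct continuations of this history
        if c == 0:
            return backoff
        lam = c / (c + t)  # Witten-Bell: held-out mass grows with novel types
        return lam * counts.get(w, 0.0) / c + (1 - lam) * backoff

    def prob(self, w, h1, h2):
        """P(w | h1 h2), interpolating trigram -> bigram -> unigram."""
        p_uni = self.uni.get(w, 0.0) / self.total if self.total else 0.0
        p_bi = self._wb(self.bi[h2], w, p_uni)
        return self._wb(self.tri[(h1, h2)], w, p_bi)
```

For recognition, each language's LM scores the decoded phone sequence and the language with the highest likelihood wins.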

Statistical modeling by example: the language model
Per-language trigram count tables; e.g. the counts of "a a a", "a a b", "a a c", ... differ between Arabic, English, and the other languages.

Antimodel I.
- Models the region of the space where the target models make mistakes; inspired by the LVCSR work of ICSI/SRI: A. Stolcke et al., "The SRI March 2000 Hub-5 conversational speech transcription system," Proc. NIST Speech Transcription Workshop, 2000.
- All training data is recognized and posteriors are obtained:

P(O_r | L) = L(O_r | LM+_L) / Σ_L' L(O_r | LM+_L')

- For each target language, a separate LM (the antimodel) is trained on all other languages, with the counts weighted by the posterior probability of the segment being wrongly recognized as the target language (misrecognized blue segments become training data for the red antimodel, and vice versa).

Antimodel II.
The final score combines the target model and its antimodel:

log S(O_r | L) = log L(O_r | LM+_L) - k * log L(O_r | LM-_L)

Results on LRE 2003, 30 s condition (DET plot; curves for the systems below plus GMM-ML 128 and GMM-ML 2048):

System                       EER [%]
PRLM string                  3.1
PRLM lattice                 2.3
PRLM lattice + antimodels    1.8
PPRLM + antimodels           1.4
GMM-MMI 128                  2.0
Fusion GMM-MMI 128 + PPRLM   0.8

NIST 2005 Language Recognition Evaluation
- Conversational telephone speech; 7 target languages + 1 unknown.
- 360 (or more) segments per language for each duration of 3, 10 and 30 seconds; 3662 segments per duration in total, i.e. more than 30 hours of speech.
- Dialect recognition: two dialects each of English and Mandarin.
- Development set: the 12-language task from LRE 2003.
- Targets: English (American), English (Indian), Hindi, Mandarin (Mainland), Mandarin (Taiwan), Japanese, Korean, Spanish (Mexican), Tamil; unknown: German.

Results on LRE 2005 (EER [%] by test duration):

System                      30 s   10 s   3 s
PRLM string                  6.8   13.9   24.5
PRLM lattice                 5.7   10.7   21.2
PRLM lattice + antimodels    5.3   10.7   21.4
GMM-MMI 256                  4.6    8.6   17.2
Fusion                       2.9    6.4   14.1

Conclusions
GMM:
- Discriminative training (MMI) provides substantial improvements with respect to conventional ML training, and at the same time it allows a significant reduction in the number of parameters.
PRLM:
- Improved by training and testing on lattices.
- Good results obtained with antimodels.

Future plans
GMM:
- HLDA experiments [Burget 2006]
PRLM (and phoneme recognizer):
- Channel and speaker adaptation in the NN-based phoneme recognizer
- Improved language modeling; SVM classification; binary trees
General:
- Better combination of the separate systems
- Channel normalization

[Burget 2006] L. Burget, P. Matějka, and J. Černocký, "Discriminative training techniques for acoustic language identification," ICASSP 2006, Toulouse, France.

END: Thank you for your attention.
More information:
- Eurospeech 2005, Lisbon, Portugal: "Phonotactic Language Identification using High Quality Phoneme Recognition"
- ICASSP 2006, Toulouse, France: "Use of Anti-Models to Further Improve State-of-the-art PRLM LID System"
- ICASSP 2006, Toulouse, France: "Discriminative Training Techniques for Acoustic Language Identification"
- Odyssey 2006, San Juan, Puerto Rico: "Brno University of Technology System for NIST 2005 Language Recognition Evaluation"

What to do next? WORK HARDER & WORK CAREFULLY

Related work
- The phoneme recognizer is being developed primarily as part of an indexing and search engine using keyword spotting (sponsored by the European AMI project). It is available at http://www.fit.vutbr.cz/speech/sw/phnrec.html
- The GMMs are trained with our speech toolkit (STK). The toolkit is HTK-compatible and supports many useful features: discriminative training (MMI, MPE), training from lattices, linear transforms (MLLT, LDA, HLDA), a keyword-spotting tool, etc. STK was used, for example, to train the AMI LVCSR system for meeting transcription submitted to the RT-05 NIST evaluation. It is available at http://www.fit.vutbr.cz/speech/sw/stk.html
- The LID system is partially sponsored by the Czech Ministry of Defense.

National Institute of Standards and Technology
- A US government agency coordinating benchmark tests within the research and development community.
- Active fields in speech processing: language recognition, speaker recognition, LVCSR (large-vocabulary continuous speech recognition), ...

System evaluation
1. Correctness.
2. DET curves, the NIST evaluation metric: the probabilities of false alarms and misses are evaluated as a function of the detection threshold.
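The DET curve and its summary EER can be computed directly from detection scores. A minimal sketch (function names invented; scores are assumed to be higher for target trials):

```python
import numpy as np

def det_points(target_scores, nontarget_scores):
    """Sweep the decision threshold over all observed scores and return the
    miss probability (target scored below threshold) and false-alarm
    probability (non-target scored at or above threshold) at each point."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    return thresholds, p_miss, p_fa

def eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where P_miss and P_fa cross."""
    _, p_miss, p_fa = det_points(target_scores, nontarget_scores)
    i = int(np.argmin(np.abs(p_miss - p_fa)))
    return (p_miss[i] + p_fa[i]) / 2
```

Plotting p_miss against p_fa on normal-deviate axes gives the DET curves shown in the results slides; the EER is the single number quoted in the tables.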

Classification: the two-class problem
The problem: is P(L_1 | O) > P(L_2 | O)?

Using Bayes' theorem: is p(O | L_1) P(L_1) / p(O) > p(O | L_2) P(L_2) / p(O)?

Assuming equal priors: is p(O | L_1) > p(O | L_2)?

Here P(L | O) is the probability of language L given the observation sequence (utterance) O, P(L) is the prior probability of language L, and

p(O | L) = Π_{t=1}^{T} p(o_t | L) ≈ Π_{t=1}^{T} p(o_t | M^λ_L),

where the probability density p(o_t | L) (the distribution of features for language L) is approximated by a Gaussian mixture model M^λ_L with parameters λ = {μ_i, σ²_i, c_i}.