Sphinx Benchmark Report

Long Qin
Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Overview
- Evaluate general training and testing schemes: LDA-MLLT, VTLN, MMI, SAT, MLLR, CMLLR
- Use the default setup and existing tools: SphinxTrain-.8, Sphinx3
- Focus on WER (see the formula below); running time was not measured, and since the experiments were performed on different server machines it is not easy to directly compare the xRT
- Test on different data
  - Easy task (WSJ) vs. broadcast news
  - English vs. Mandarin
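All comparisons below are in terms of word error rate. As a reminder (not spelled out in the slides), WER counts the substitutions S, deletions D, and insertions I in the best alignment of the hypothesis against the N-word reference:

```latex
\mathrm{WER} = \frac{S + D + I}{N} \times 100\%
```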

Outline
- The baseline training scheme
- LDA-MLLT
- VTLN
- MMI
- SAT
- CMLLR
- MLLR
- Experiments
- Discussion

Baseline Training Scheme
Pipeline: Feature Extraction -> CI Model -> CD Model
- Feature extraction: 13-MFCC with delta and delta-delta (a sketch of the delta computation follows below)
- CI model: monophone model, 3-state HMM, 1-Gaussian or GMM observation distribution
- CD model: triphone model, 3-state HMM, GMM observation distribution
  - Decision tree clustering with auto-generated questions
  - A few thousand states
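As a rough illustration of the front end, here is a minimal numpy sketch of the delta and delta-delta computation over 13-dimensional MFCC frames. The regression window of 2 is a typical default, assumed here rather than taken from the report.

```python
import numpy as np

def deltas(feats: np.ndarray, window: int = 2) -> np.ndarray:
    """Regression-based deltas; feats has shape (num_frames, num_coeffs)."""
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = np.zeros_like(feats)
    for k in range(1, window + 1):
        out += k * (padded[window + k:len(feats) + window + k]
                    - padded[window - k:len(feats) + window - k])
    return out / denom

mfcc = np.random.randn(100, 13)        # placeholder 13-MFCC frames
d1 = deltas(mfcc)                      # delta
d2 = deltas(d1)                        # delta-delta
features = np.hstack([mfcc, d1, d2])   # 39-dim feature vectors
```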

Force Alignment
Pipeline: Feature Extraction -> CI Model -> Force Alignment -> CI Model -> CD Model
- Force alignment: find the best alignment between the speech and the corresponding HMMs (see the toy sketch below)
- Goals
  - Possibly remove utterances with transcription errors or low-quality recordings
  - Find appropriate pronunciations for words with multiple pronunciations
- Settings
  - $CFG_FORCEDALIGN = 'yes';
  - $CFG_FORCE_ALIGN_BEAM = 1e-6;
  - $CFG_FALIGN_CI_MGAU = 'yes' / 'no';
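A toy sketch of the core of alignment: Viterbi search over the left-to-right state chain built from the transcript, finding the best monotonic frame-to-state assignment. Real Sphinx alignment additionally applies beam pruning ($CFG_FORCE_ALIGN_BEAM) and uses real acoustic scores rather than the random placeholders used here.

```python
import numpy as np

def force_align(loglik):
    """loglik[t, s]: log-likelihood of frame t under chain state s."""
    T, S = loglik.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = loglik[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]                           # self-loop
            advance = score[t - 1, s - 1] if s > 0 else -np.inf
            back[t, s] = s if stay >= advance else s - 1
            score[t, s] = max(stay, advance) + loglik[t, s]
    path = [S - 1]                                           # must end in final state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

alignment = force_align(np.random.randn(50, 12))  # 50 frames, 12 HMM states
```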

LDA-MLLT
Pipeline: Feature Extraction -> CI Model -> LDA-MLLT -> CI Model -> CD Model
- LDA (linear discriminant analysis)
  - Find a linear transform of the feature vectors such that class separation is maximized (a minimal estimation sketch follows this list)
  - Reduces the feature dimension
- MLLT (maximum likelihood linear transform)
  - Minimize the loss of likelihood between the full- and diagonal-covariance models
  - Applied together with LDA
- In Sphinx
  - Each Gaussian is considered as one class (easier to implement); one could also define each state or phone as a class
- Settings
  - $CFG_LDA_MLLT = 'yes';
  - $CFG_LDA_DIMENSION = 29;
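A minimal sketch of LDA estimation from labeled frames (the MLLT stage, which trains a diagonalizing transform on top of the LDA output, is omitted). The class labels here are generic placeholders; in SphinxTrain each Gaussian plays that role.

```python
import numpy as np

def lda_transform(feats: np.ndarray, labels: np.ndarray, dim: int) -> np.ndarray:
    """Return a (dim, d) projection maximizing between/within-class scatter."""
    mean = feats.mean(axis=0)
    d = feats.shape[1]
    Sw = np.zeros((d, d))                    # within-class scatter
    Sb = np.zeros((d, d))                    # between-class scatter
    for c in np.unique(labels):
        x = feats[labels == c]
        mu = x.mean(axis=0)
        Sw += (x - mu).T @ (x - mu)
        diff = (mu - mean)[:, None]
        Sb += len(x) * (diff @ diff.T)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)           # strongest directions first
    return vecs[:, order[:dim]].real.T

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 39))           # placeholder 39-dim features
labels = rng.integers(0, 10, size=500)       # placeholder class labels
A = lda_transform(feats, labels, 29)         # cf. $CFG_LDA_DIMENSION = 29
reduced = feats @ A.T                        # (500, 29) projected features
```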

VTLN
Pipeline: Feature Extraction -> CI Model -> VTLN Train -> CI Model -> CD Model -> VTLN Decode
- VTLN (vocal tract length normalization)
  - Formant frequencies are considered to have a linear relationship with the vocal tract length
  - Adjust each speaker's vocal tract length to an average length by warping their spectra
  - The warping factor: [formula not recoverable]
- In Sphinx
  - The warping factor is estimated for each utterance by exhaustive search (see the sketch after this list); one could also estimate a single warping factor per speaker
  - The warping factor should be estimated in both training and decoding
- Settings
  - $CFG_VTLN = 'yes';
  - $CFG_VTLN_START = .7;
  - $CFG_VTLN_END = 1.4;
  - $CFG_VTLN_STEP = .;
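A sketch of the per-utterance exhaustive search. `extract_mfcc` and `loglik` are hypothetical stand-ins for warped feature extraction and for scoring the features against the utterance's HMMs; the grid bounds follow the settings above, while the 0.05 step is an assumed value.

```python
import numpy as np

def best_warp(utt, model, extract_mfcc, loglik, start=0.7, end=1.4, step=0.05):
    """Return the warping factor giving the highest alignment likelihood."""
    best_alpha, best_score = 1.0, -np.inf
    for alpha in np.arange(start, end + step / 2, step):
        feats = extract_mfcc(utt, warp=alpha)   # hypothetical warped front end
        score = loglik(model, feats)            # hypothetical alignment score
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```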

MMI
Pipeline: Feature Extraction -> CI Model -> CD Model -> MMI
- MMI (maximum mutual information)
  - A discriminative training algorithm
  - Maximize the posterior probability of the true hypothesis (see the objective below)
  - Training is time-consuming
- Settings
  - $CFG_MMIE_MAX_ITERATIONS = 4;
  - $CFG_MMIE_CONSTE = "3.";
  - $CFG_LANGUAGEWEIGHT = "11."; (the same as the language weight used in decoding)
  - $CFG_LANGUAGEMODEL = "LMFILE"; (a unigram or bigram LM)
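For context, the standard MMI objective maximizes the scaled posterior of each reference transcription W_u given its acoustics O_u; the denominator runs over competing hypotheses, in practice a lattice generated with the unigram or bigram LM above, and kappa is an acoustic scale tied to the language weight:

```latex
F_{\mathrm{MMI}}(\lambda) = \sum_{u} \log
  \frac{p_{\lambda}(O_u \mid W_u)^{\kappa}\, P(W_u)}
       {\sum_{W} p_{\lambda}(O_u \mid W)^{\kappa}\, P(W)}
```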

CMLLR
Pipeline: Feature Extraction -> CI Model -> CD Model -> CMLLR
- CMLLR (constrained maximum likelihood linear regression)
  - A speaker adaptation algorithm that modifies a speaker-independent system towards a new speaker using limited data
  - Uses the same transform for both the means and the variances, and therefore usually requires less data than MLLR
  - Can be formulated as a linear transform of the input features (see below)
- In Sphinx
  - Uses a single global transform to adapt the input features for each speaker
  - When accumulating counts, run bw with -fullvar yes, -2passvar no, and -cmllrdump yes
- Settings
  - $CFG_DEC_DICTIONARY = 'DECODING_DICTIONARY';
  - $CFG_DEC_LM = 'DECODING_LANGUAGE_MODEL';
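The equivalence behind this (standard CMLLR algebra, not spelled out in the slides): tying the mean and covariance transforms turns the model-space adaptation into a per-speaker feature-space transform, up to a Jacobian term:

```latex
\hat{\mu} = A\mu + b,\quad \hat{\Sigma} = A\Sigma A^{\top}
\;\Longrightarrow\;
\mathcal{N}(x;\,\hat{\mu},\hat{\Sigma})
  = |\det A|^{-1}\,\mathcal{N}\big(A^{-1}(x - b);\,\mu,\Sigma\big)
```

So decoding with the adapted model is the same as decoding the transformed features A^{-1}(x - b) with the original model, which is why Sphinx can apply one global feature transform per speaker.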

SAT
Pipeline: Feature Extraction -> CI Model -> CD Model -> SAT
- SAT (speaker adaptive training)
  - Trains a better speaker-independent system
  - Apply CMLLR transforms to the training features
  - Re-estimate the CMLLR transforms every iteration (see the sketch after this list)
- In Sphinx
  - SAT is applied after training a fairly good ML/MMI model
  - Need to split the training control and reference files into smaller files for each speaker (make_speaker_lists.py)
- Settings
  - $CFG_SAT_DIR = '$CFG_BASE_DIR/sat';
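A sketch of the SAT loop, alternating per-speaker CMLLR estimation with model re-estimation on the normalized features. `estimate_cmllr` and `baum_welch` are stand-ins for the corresponding SphinxTrain steps, not literal APIs, and the iteration count is illustrative.

```python
def sat_train(model, corpus_by_speaker, estimate_cmllr, baum_welch, iters=4):
    """Speaker adaptive training: interleave CMLLR and model updates."""
    transforms = {}
    for _ in range(iters):
        # Re-estimate one global CMLLR transform per speaker under the current model
        for spk, utts in corpus_by_speaker.items():
            transforms[spk] = estimate_cmllr(model, utts)
        # Re-estimate the model on the speaker-normalized (transformed) features
        model = baum_welch(model, corpus_by_speaker, transforms)
    return model, transforms
```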

MLLR
Pipeline: Feature Extraction -> CI Model -> CD Model -> MLLR
- MLLR (maximum likelihood linear regression)
  - Another speaker adaptation algorithm
  - Adjust the means and/or covariances to maximize the likelihood of the adaptation data
- In Sphinx
  - Adapts the means by default; could also adapt the covariances
  - Uses a single global transform for all models; could have multiple transforms for different classes of models
- Procedure (applied during decoding; a sketch of the full two-pass procedure follows below)
  - Get hypotheses for the test data from the first decoding pass
  - Use those hypotheses and the test data to estimate the transforms and update the model parameters
  - During the bw run, -2passvar must be set to no
  - Decode again using the adapted model
  - The same procedure applies when using CMLLR/VTLN in decoding
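A sketch of that two-pass procedure. `decode` and `estimate_mllr` are hypothetical stand-ins for the Sphinx decoder and the bw/mllr_solve steps, not literal APIs; MLLR adapts the Gaussian means as mu' = A*mu + b.

```python
def adapt_and_decode(model, test_utts, decode, estimate_mllr):
    """Two-pass decoding with a single global MLLR mean transform."""
    first_pass = {u.id: decode(model, u) for u in test_utts}  # first-pass hypotheses
    A, b = estimate_mllr(model, test_utts, first_pass)        # fit transform on hypotheses
    adapted_means = [A @ mu + b for mu in model.means]        # mu' = A*mu + b
    adapted = model.replace(means=adapted_means)              # hypothetical model update
    return {u.id: decode(adapted, u) for u in test_utts}      # second pass
```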

Overall System Framework
Training: Feature Extraction -> CI Model (with Force Alignment, LDA-MLLT, and VTLN Train) -> CD Model -> MMI -> SAT
Decoding: VTLN Test -> CMLLR -> MLLR

Data
| Corpus      | Training         | Testing                                            | LM                            |
|-------------|------------------|----------------------------------------------------|-------------------------------|
| WSJ         | 1-hour           | Nov. 92 k and 2k Dev/Eval                          | standard trigram              |
| WSJ+1       | 82-hour          | Nov. 92 k and 2k Dev/Eval                          | standard trigram              |
| BN          | 138-hour HUB4-96 | Dev/Eval (with data in all different environments) | trigram from BN 92-97 LM data |
| Mandarin BN | 128-hour         | RT4-                                               | trigram from Chinese Gigaword |

Baseline Settings
- Force alignment
  - Could use a multiple-Gaussian CI model: a little better, but more computation
- Linguistic questions
  - Used if available; otherwise use auto-generated questions
- Decoding
  - lw = 11., beam = 1e-1, wbeam = 1e-8, wip = .2
- Mixtures and tied states
  - WSJ: 16 mixtures, 2 tied states
  - WSJ+1: 32 mixtures, 4 tied states
  - BN: 32 mixtures, tied states
  - Mandarin: 32 mixtures, 4 tied states

Baseline Results
| Data     | Dev WER (%)          | Eval WER (%)        |
|----------|----------------------|---------------------|
| WSJ      | 7.62 (k), 12.84 (2k) | . (k), 9.8 (2k)     |
| WSJ+1    | 6.8 (k), 11.69 (2k)  | 4.18 (k), 7.78 (2k) |
| BN       | 32.98                | 32.8                |
| Mandarin | -----                | 2.3                 |

LDA-MLLT Results
[Figure: WER bar charts, Baseline vs. LDA-MLLT. WSJ: ~19% and 13% relative improvement; WSJ+1: ~4%; BN and Mandarin: small or slightly negative changes.]
Comment: LDA-MLLT may work better on simple tasks with high-quality data, but others (Joao Miranda) have tried it on noisy data, where it also helped a lot. It works on telephone conversation tasks too.

VTLN Results
[Figure: WER bar charts, Baseline vs. VTLN. WSJ: ~6% and 1% relative improvement (VTLN train & test); WSJ+1: ~4% and 1%; BN: ~3% (VTLN test only); Mandarin: ~3%.]
For BN and Mandarin, VTLN is only applied during decoding, as performance was found to be worse when applying VTLN in both training and decoding.
Note: the red numbers in the graphs are the relative improvements over the baseline; to keep the graphs from having too many bars, the WSJ k/2k results are averages of the Dev and Eval results.

MMI Results
[Figure: WER bar charts, Baseline vs. MMIE. WSJ: ~1% and 3% relative improvement; WSJ+1: ~6% and 4%; BN and Mandarin: ~3%.]
Comment: the results are not as good as those from the lattice pruning experiments, where smaller lattices were used; smaller beam widths when generating lattices, such as $beam = $wbeam = 1e-7, should be better and faster. Also try a bigram instead of a unigram when generating lattices.

MLLR Results
[Figure: WER bar charts, Baseline vs. MLLR. WSJ: ~2% and 18% relative improvement; WSJ+1: ~1% and 11%; BN: ~3% and 2%; Mandarin: ~6%.]
Comment: works pretty well, especially when the first-pass hypotheses are accurate; one could use the second-pass hypotheses to train a better transform, and iterate to get the best numbers.

CMLLR Results
[Figure: WER bar charts, Baseline vs. CMLLR. WSJ: ~21% and 21% relative improvement; WSJ+1: ~17% and 13%; BN: ~6% and 6%; Mandarin: ~7%.]
Comment: similar performance to MLLR, slightly better on BN.

SAT Results
[Figure: WER bar charts on WSJ and WSJ+1, Baseline vs. CMLLR vs. SAT; the relative improvement of SAT+CMLLR over the baseline is ~29% and 28% on WSJ and ~22% and 23% on WSJ+1.]
Comment: SAT + CMLLR decoding is very effective, usually giving a 1% improvement over CMLLR decoding alone. When estimating the CMLLR transform, it's better to start from a very good hypothesis, such as the CMLLR+MLLR decoding result.

VTLN + MLLR Results
[Figure: WER bar charts, Baseline vs. VTLN vs. MLLR vs. VTLN+MLLR; the relative improvement of VTLN+MLLR over the baseline is ~19% and 2% on WSJ, ~17% and 1% on WSJ+1, ~4% and 2% on BN, and ~7% on Mandarin.]
Comment: the improvements are additive, but the extra gain is quite small compared to performing MLLR only.

CMLLR + MLLR Results
[Figure: WER bar charts, Baseline vs. CMLLR vs. MLLR vs. CMLLR+MLLR; the relative improvement of CMLLR+MLLR over the baseline is ~27% and 27% on WSJ, ~24% and 18% on WSJ+1, ~7% and 7% on BN, and ~1% on Mandarin.]
Comment: CMLLR+MLLR further improves the WER!

LDA-MLLT + MMI Results
[Figure: WER bar charts, Baseline vs. LDA-MLLT vs. MMIE vs. LDA-MLLT+MMIE; the relative improvement of LDA-MLLT+MMI over the baseline is ~21% and 1% on WSJ, ~11% and 7% on WSJ+1, and ~3% on BN.]
Comment: MMIE gives a solid improvement over LDA-MLLT (compare the 2nd bar and the 4th bar).

Summary
- LDA-MLLT: works pretty well on simple tasks with clean speech; unclear on hard tasks with noisy speech, needs more investigation
- VTLN: produces some improvement
- MMIE: produces OK/good improvement, but requires a large amount of computation
- CMLLR: works pretty well, especially when the first-pass hypotheses are very accurate
- MLLR: works similarly to CMLLR
- SAT: produces a solid improvement

Still Missing
- Better discriminative training techniques: boosted MMI
- Deep Neural Networks
  - Bottleneck features (easier to adapt)
  - Hybrid models (more improvement)