Purely sequence-trained neural networks for ASR based on lattice-free MMI


Dan Povey, Vijay Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, Sanjeev Khudanpur

Why should you care about this?
- It gives better WERs than the conventional way of training models.
- It's a lot faster to train.
- It's a lot faster to decode.
- We're modifying most of the recipes in Kaldi to use this.
- It doesn't always give WER improvements on small data (e.g. < 50 hours).

Connection with CTC
- A simplification of an extension of CTC: context-dependent CTC.
- Commonalities with CTC:
  - The objective function is the posterior of the correct transcript of the utterance.
  - Output is computed at a 33 Hz frame rate (also see Lower Frame Rate Neural Network AMs, Pundak & Sainath, Interspeech 2016).

[1] A. Graves et al., "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in ICML 2006.
[2] A. Senior et al., "Acoustic Modelling with CD-CTC-SMBR LSTM RNNs," in ASRU 2015.
[3] H. Sak et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," in Interspeech 2015.

Practical sequence discriminative training : Lattice generation
- Seed model and lattice quality affect final recognition accuracy [1]:

  Seed Model       | Lattice-generation model
                   | CE1           | CE2
  GMM              | 15.8% (-2%)   | -
  DNN CE1 (16.2%)  | 14.1% (-13%)  | 13.7% (-15%)
  DNN CE2 (15.6%)  | -             | 13.5% (-17%)

- Cross-entropy pre-training is important!

[1] D. Yu (2014), Automatic Speech Recognition: A Deep Learning Approach. Springer.
[2] K. Vesely et al., "Sequence-discriminative training of deep neural networks," Interspeech 2013.
[3] H. Su et al., "Error back propagation for sequence training of context-dependent deep networks for conversational speech transcription," ICASSP 2013.

Practical sequence discriminative training : What about lattice-free MMI?
- Full forward-backward on the denominator is slow.
- "The use of the accelerated likelihood evaluation, tight pruning beams, and a small decoding graph made lattice-free MMI possible" [1].
- Beam search on GPU is hard: there is a loss of efficiency if different cores are taking different code paths or accessing different data.

[1] S. F. Chen et al., "Advances in speech transcription at IBM under the DARPA EARS program," IEEE TASLP 2006.
[2] W. Xiong et al., "The Microsoft 2016 Conversational Speech Recognition System," arxiv.org, September 2016.

Proposed approach
- Full forward-backward of the denominator on the GPU.
- Break up utterances into fixed-size chunks.
- Keep the denominator graph small enough that we can keep the forward scores (α) on the GPU for a minibatch of utterances (e.g. 128).
- Use a phone LM.
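The key point is that with a small enough denominator graph, the α recursion is one dense matrix multiply per frame, which batches naturally on a GPU. A minimal NumPy sketch of the idea (all names and shapes here are ours, not Kaldi's; the real implementation keeps the α values for the whole minibatch resident on the GPU and uses per-frame rescaling rather than log-space arithmetic):

```python
import numpy as np

def batched_forward(log_probs, trans, init):
    """Batched HMM forward (alpha) recursion over a dense denominator graph.

    log_probs: (B, T, S) per-frame pseudo-likelihoods for B chunks
    trans:     (S, S) transition matrix of the phone-LM denominator graph
    init:      (S,) initial state distribution
    Returns per-chunk log-likelihoods, shape (B,).
    """
    B, T, S = log_probs.shape
    alpha = init[None, :] * np.exp(log_probs[:, 0, :])       # (B, S)
    scale = alpha.sum(axis=1, keepdims=True)                  # rescale to avoid underflow
    alpha /= scale
    log_like = np.log(scale[:, 0])
    for t in range(1, T):
        # one dense matrix multiply per frame -- GPU friendly
        alpha = (alpha @ trans) * np.exp(log_probs[:, t, :])
        scale = alpha.sum(axis=1, keepdims=True)
        alpha /= scale
        log_like += np.log(scale[:, 0])
    return log_like                                           # sum of log scales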

Fixed chunk sizes
- Use 1-second chunks.
- There are slight overlaps or gaps where we break up utterances this way.
- Append successive utterances in data preparation.
- How do we break up the transcripts? 1-second chunks may not coincide with word boundaries.
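A sketch of one way to place fixed-size chunks over an utterance, matching the "slight overlaps or gaps" above (illustrative only; `chunk_starts` and its even-spacing rule are our own, not the Kaldi recipe):

```python
def chunk_starts(num_frames, chunk_size=150):
    """Start indices of fixed-size chunks covering an utterance.

    All chunks are exactly chunk_size frames; to cover num_frames we space
    the starts as evenly as possible, which yields small overlaps (or gaps,
    if rounded the other way) at the seams.
    """
    if num_frames <= chunk_size:
        return [0]  # the real setup appends short utterances together instead
    n = -(-num_frames // chunk_size)        # ceil: number of chunks
    span = num_frames - chunk_size          # range available for start indices
    return [round(i * span / (n - 1)) for i in range(n)]

# e.g. chunk_starts(250) -> [0, 100]: two 150-frame chunks overlapping by 50.
```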

Numerator Graph
- Generate numerator lattices using the transcript.
- Convert to phone graphs.
- Numerator FSA: each state has a frame-index.
- Convert to frame-by-frame masks, with a user-specifiable tolerance (default: 50 ms) [1].
- Split into fixed-size chunks.

[1] A. Senior et al., "Acoustic Modelling with CD-CTC-SMBR LSTM RNNs," in ASRU 2015.
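To make the tolerance concrete, here is a hypothetical sketch of turning a frame-level alignment into a per-frame mask, widening each phone's region by the tolerance (the function and representation are ours; the actual supervision is an FSA, and the frame unit depends on the frame rate at which the mask is applied):

```python
import numpy as np

def numerator_mask(alignment, num_pdfs, tol=5):
    """Per-frame mask of allowed pdf-ids, from a frame-level alignment.

    alignment: list of (pdf_id, begin_frame, end_frame) from the numerator
               lattice, end exclusive.
    tol:       tolerance in frames; 5 frames at a 10 ms shift would match
               the paper's 50 ms default.
    Returns a bool array (T, num_pdfs): mask[t, p] is True iff pdf p may
    occur at frame t.
    """
    T = max(e for _, _, e in alignment)
    mask = np.zeros((T, num_pdfs), dtype=bool)
    for pdf, b, e in alignment:
        mask[max(0, b - tol):min(T, e + tol), pdf] = True
    return mask
```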

Denominator Graph
- A decoding graph with a phone-level LM (P) and no lexicon: HCLG → HCP.
- Construct P to minimize HCP:
  - A 4-gram phone-level LM, estimated from phone alignments of the data.
  - Prune 4-gram states with low counts; limit back-off to the unpruned 3-gram LM.
  - Tri-phones not seen in training cannot be generated.
- Denominator forward-backward took less than 20% of training time.
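A toy sketch of the pruning rule for the denominator phone LM (the count threshold and data structures are illustrative, not the paper's exact criterion; the point is that 4-gram states may be pruned away but the 3-gram level is never pruned, so an unseen tri-phone context can never be generated):

```python
from collections import Counter

def select_ngrams(phone_seqs, min_count=2):
    """Choose which 4-gram histories to keep in the denominator phone LM.

    Low-count 4-grams are pruned and back off to the (unpruned) 3-gram.
    phone_seqs: iterable of phone-id lists from training alignments.
    """
    four, three = Counter(), Counter()
    for seq in phone_seqs:
        s = ['<s>', '<s>', '<s>'] + list(seq) + ['</s>']
        for i in range(3, len(s)):
            four[tuple(s[i - 3:i + 1])] += 1
            three[tuple(s[i - 2:i + 1])] += 1
    kept_4grams = {g for g, c in four.items() if c >= min_count}
    return kept_4grams, set(three)   # all 3-grams kept: back-off stops here
```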

Frame-rate : Topology
- We use a topology that can be traversed in 1 state, and a 30 ms frame shift.
- We experimented with different topologies that can be traversed in 1 state; we found that the 30 ms frame shift was optimal for the 1-state topology.
- The chosen topology can generate "a", "ab", "abb", ... (with transition probabilities of 0.5 in the slide's diagram).
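Read off the slide's diagram, the chosen topology can be written as a tiny two-arc FSA: one label on the frame where the phone is entered, a different label on self-loop frames. A small sketch (the encoding is ours; the 0.5 values are the transition probabilities shown on the slide):

```python
# State 0 emits label 'a' and moves to state 1; state 1 either leaves the
# phone (prob 0.5) or self-loops emitting a distinct label 'b' (prob 0.5).
TOPOLOGY = {
    0: [('a', 1, 1.0)],             # (emitted label, next state, prob)
    1: [(None, 'final', 0.5),       # leave the phone
        ('b', 1, 0.5)],             # self-loop with a distinct label
}

def label_prob(labels):
    """Probability the topology above generates a given label sequence.

    Closed form from the arcs: P("a") = 0.5, P("ab") = 0.25, P("abb") = 0.125, ...
    """
    if not labels or labels[0] != 'a' or any(l != 'b' for l in labels[1:]):
        return 0.0
    return 0.5 ** len(labels)
```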

Frame-rate : Speed
- The LF-MMI computation takes less than 20% of the training time.
- Speeding up the network computation for RNNs: use larger delays.
  [Diagram: timeline of conventional vs. efficient LF-MMI network computation]

Frame-rate : Speed
- The LF-MMI computation takes less than 20% of the training time.
- Speeding up the network computation for TDNNs: use sub-sampling and strides.
  [Diagram: timeline of conventional vs. efficient LF-MMI network computation]
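A sketch of the TDNN trick: because each layer splices a few fixed time-offsets of the layer below, the layers above the input can be evaluated at a stride (e.g. one output per 3 input frames) without changing the architecture. Function name and shapes are ours:

```python
import numpy as np

def tdnn_layer(x, w, offsets=(-1, 0, 1), stride=1):
    """One spliced TDNN layer with optional frame sub-sampling.

    x: (T, D) input sequence; w: (len(offsets)*D, D_out) affine weights.
    Frames are spliced at the given time offsets, then transformed; with
    stride=3 only one output per 30 ms is computed, which is where most of
    the network-level speedup comes from.
    """
    T, D = x.shape
    lo, hi = -min(offsets), max(offsets)        # valid output range
    ts = range(lo, T - hi, stride)
    spliced = np.stack(
        [np.concatenate([x[t + o] for o in offsets]) for t in ts])
    return np.maximum(spliced @ w, 0.0)         # ReLU nonlinearity
```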

Speed
- LF-MMI is substantially faster than conventional systems:
  - Frame subsampling with efficient network architectures.
  - Smaller networks, to avoid over-fitting.
- LF-MMI is even faster than cross-entropy pre-training; it eliminates stages in training:
  - No cross-entropy pre-training.
  - No denominator lattice generation.
- 5x faster training.
- Decodes are 2-3 times faster than with conventional models.

Regularization
- Models trained with LF-MMI are highly prone to overfitting.
- Three regularization methods:
  - Cross-entropy
  - Output l2 norm
  - Leaky HMM

Regularization : Cross-entropy
- A separate cross-entropy branch of the network, discarded after training.
  [Diagram: shared layers 1..N-2, then two branches of (Layer N-1, Final Affine): one feeding the LF-MMI output, one feeding a Softmax cross-entropy output]
- Supervision is derived from the numerator lattices.
- Scaled by a constant of 0.1 in most cases; larger constants (0.25) for tasks with small amounts of data (AMI, TED-LIUM and Babel).
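In training-code terms this is a multi-task objective: the sketch below combines the LF-MMI objective with the scaled cross-entropy of the second output head (names and shapes are ours; a sketch of the combination, not Kaldi's implementation):

```python
import numpy as np

def xent_regularized_objective(mmi_objf, xent_logprobs, num_posteriors,
                               xent_scale=0.1):
    """LF-MMI objective plus the cross-entropy regularizer.

    mmi_objf:       scalar LF-MMI objective for the minibatch
    xent_logprobs:  (T, P) log-softmax outputs of the cross-entropy head
    num_posteriors: (T, P) soft targets from the numerator forward-backward
    xent_scale:     0.1 in most setups; ~0.25 for small-data tasks
    """
    xent = np.sum(num_posteriors * xent_logprobs) / len(xent_logprobs)
    return mmi_objf + xent_scale * xent
```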

Regularization : Output l2 norm
- An l2-norm penalty on the outputs of the network (same two-branch architecture as on the previous slide).
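The l2 term is simply a penalty on the LF-MMI head's raw outputs; a sketch (the scale value is illustrative, not the paper's setting):

```python
import numpy as np

def l2_regularized_objective(mmi_objf, chain_outputs, l2_scale=5e-4):
    """Subtract a penalty on the squared outputs of the LF-MMI head.

    Keeping the raw pre-objective outputs small limits how sharply the
    model can overfit the sequence objective. l2_scale is a hypothetical
    value for illustration.
    """
    return mmi_objf - l2_scale * np.mean(chain_outputs ** 2)
```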

Regularization : Leaky HMM
- Allows gradual forgetting of context.
- An ε transition from each state in the HMM to every other state; only one ε transition is allowed per frame.
- Equivalent to stopping and restarting the HMM after each frame.
- Leaky-HMM coefficient of 0.1.
- This is a modification of the denominator graph.
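One plausible way to write the leaky-HMM forward step: after each frame's normal transition, a fraction of the total probability mass is redistributed over states according to the initial distribution, which is what the "one ε transition per frame" picture amounts to. A sketch (our formulation; Kaldi's exact arithmetic differs in details):

```python
import numpy as np

def leaky_forward_step(alpha, trans, frame_probs, init, leaky=0.1):
    """One forward step of the denominator recursion with leaky HMM.

    alpha:       (S,) forward scores from the previous frame
    trans:       (S, S) denominator transition matrix
    frame_probs: (S,) pseudo-likelihoods for the current frame
    init:        (S,) initial state distribution
    """
    alpha = alpha @ trans
    alpha = alpha + leaky * init * alpha.sum()   # the epsilon 'leak'
    return alpha * frame_probs
```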

Results : Regularization

  Regularization function                    | WER (%)
  Cross-entropy | Output l2 norm | Leaky HMM | Total | SWBD
  N             | N              | N         | 16.8  | 11.1
  Y             | N              | N         | 15.9  | 10.5
  N             | Y              | N         | 15.9  | 10.4
  N             | N              | Y         | 16.4  | 10.9
  Y             | Y              | N         | 15.7  | 10.3
  Y             | N              | Y         | 15.7  | 10.3
  N             | Y              | Y         | 15.8  | 10.4
  Y             | Y              | Y         | 15.6  | 10.4

  SWBD-300 Hr task : TDNN acoustic models : HUB'00 eval set

Transcript Quality
- LF-MMI is sensitive to transcript quality, identified from analysis of poor performance on the AMI and TED-LIUM tasks.
- Data clean-up using the lattice-oracle WER [1].
  [Plot: data retained (%) vs. lattice-oracle WER threshold (%)]

[1] V. Peddinti et al., "Far-field ASR without parallel data," Interspeech 2016.
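The clean-up itself is a simple threshold on a per-utterance score; a sketch assuming lattice-oracle WERs have already been computed per utterance (hypothetical helper, not the Kaldi script):

```python
def filter_utterances(utts, wer_threshold=45.0):
    """Keep utterances whose lattice-oracle WER is below a threshold.

    utts: iterable of (utt_id, oracle_wer_percent) pairs. The oracle WER is
    the best WER achievable by any path in a decoding lattice for the
    utterance -- a cheap proxy for transcript quality. The 40-50% thresholds
    on the next slide retain most of the data while dropping the worst
    transcripts.
    """
    return [utt_id for utt_id, wer in utts if wer <= wer_threshold]
```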

Results : Transcript Quality
AMI LVCSR tasks : impact of data filtering

  LVCSR Task | Cross-entropy | Lattice-free MMI | LF-MMI + Data Filtering
             | Dev   | Eval  | Dev   | Eval     | Dev   | Eval
  SDM        | 45.8  | 50.3  | 43.2  | 47.3     | 42.8  | 46.1
  MDM        | 41.0  | 44.7  | 40.5  | 43.2     | 38.5  | 41.5
  IHM        | 24.4  | 25.1  | 22.6  | 22.5     | 22.4  | 22.4

[1] V. Peddinti et al., "Far-field ASR without parallel data," Interspeech 2016.

Results : Transcript Quality
AMI-SDM : impact of data filtering on cross-entropy and LF-MMI

  WER Threshold (%) | Cross-entropy | Lattice-free MMI
                    | Dev   | Eval  | Dev   | Eval
  40                | 45.4  | 50.3  | 43.1  | 46.9
  45                | 45.5  | 50.1  | 42.8  | 46.1
  50                | 45.5  | 50.1  | 42.8  | 46.6
  --                | 45.8  | 50.3  | 43.2  | 47.3

[1] V. Peddinti et al., "Far-field ASR without parallel data," Interspeech 2016.

Comparison of LF-MMI and CE+sMBR

  Objective Function | Model (Size)    | WER (%)
                     |                 | SWBD  | Total
  CE                 | TDNN-A (16.6 M) | 12.5  | 18.2
  CE → sMBR          | TDNN-A (16.6 M) | 11.4  | 16.9
  LF-MMI             | TDNN-A (9.8 M)  | 10.7  | 16.1
  LF-MMI             | TDNN-B (9.9 M)  | 10.4  | 15.6
  LF-MMI             | TDNN-C (11.2 M) | 10.2  | 15.5
  LF-MMI → sMBR      | TDNN-C (11.2 M) | 10.0  | 15.1

  SWBD-300 Hr task : TDNN acoustic models : HUB'00 eval set

LF-MMI with different DNNs

  Model | Objective Function | WER (%)
        |                    | SWBD  | Total | Rel. Δ (%)
  TDNN  | CE                 | 12.5  | 18.2  |
  TDNN  | LF-MMI             | 10.2  | 15.5  | 15
  LSTM  | CE                 | 11.6  | 16.5  |
  LSTM  | LF-MMI             | 10.3  | 15.6  | 5
  BLSTM | CE                 | 10.3  | 14.9  |
  BLSTM | LF-MMI             | 9.6   | 14.5  | 3

  SWBD-300 Hr task : HUB'00 eval set

LF-MMI with different DNNs

  Model | Objective Function | WER (%) Total | Rel. Δ (%)
  TDNN  | CE                 | 31.0          |
  TDNN  | LF-MMI             | 27.8          | 10
  BLSTM | CE                 | 29.4          |
  BLSTM | LF-MMI             | 25.6          | 13

  Training data : 1800 hr Fisher data × 3-fold data augmentation = 5400 hr
  ASpIRE dev set : 5 hours

LF-MMI in various LVCSR tasks : Standard ASR

  Data Set             | Size     | CE    | CE → sMBR | LF-MMI | Rel. Δ
  AMI-IHM              | 80 hrs   | 25.1% | 23.8%     | 22.4%  | 6%
  AMI-SDM              | 80 hrs   | 50.9% | 48.9%     | 46.1%  | 6%
  TED-LIUM*            | 118 hrs  | 12.1% | 11.3%     | 11.2%  | 0%
  Switchboard          | 300 hrs  | 18.2% | 16.9%     | 15.5%  | 8%
  LibriSpeech          | 960 hrs  | 4.97% | 4.56%     | 4.28%  | 6%
  Fisher + Switchboard | 2100 hrs | 15.4% | 14.5%     | 13.3%  | 8%

  TDNN acoustic models; similar architecture across LVCSR tasks.
  *New TED-LIUM recipe using release 2 data: CE (10.8) to LF-MMI (9.3).

Performance of lattice-free MMI

  System                | AM dataset | LM dataset | Hub5 '00       | RT03S
                        |            |            | SWB    | CHM   | FSH    | SWB
  Mohamed et al. [1]    | F+S        | F+S        | 10.6%  | -     | 13.2%  | 18.9%
  Mohamed et al. [1]    | F+S        | F+S+O      | 9.9%   | -     | 12.3%  | 17.8%
  Mohamed et al. [1]    | F+S+O      | F+S+O      | 9.2%   | -     | 11.5%  | 16.7%
  Saon et al. [2]       | F+S+C      | F+S+O      | 8.0%*  | 14.1% | -      | -
  TDNN + LF-MMI         | S          | F+S        | 10.2%  | 20.5% | 14.2%  | 23.5%
  TDNN + LF-MMI → sMBR  | S          | F+S        | 10.0%  | 20.1% | 13.8%  | 22.1%
  BLSTM + LF-MMI → sMBR | S          | F+S        | 9.6%   | 19.3% | 13.2%  | 20.8%
  TDNN + LF-MMI         | F+S        | F+S        | 9.2%   | 17.3% | 9.8%   | 14.8%
  BLSTM + LF-MMI        | F+S        | F+S        | 8.8%   | 15.3% | 9.8%   | 13.4%

  F: Fisher corpus (1800 hrs); S: Switchboard corpus (300 hrs); C: Callhome corpus (14 hrs); O: Other corpora.

[1] A. R. Mohamed, F. Seide, D. Yu, J. Droppo, A. Stolcke, G. Zweig and G. Penn, "Deep bi-directional recurrent networks over spectral windows," in Proceedings of ASRU, 2015.
[2] G. Saon, H.-K. J. Kuo, S. Rennie, and M. Picheny, "The IBM 2015 English Conversational Telephone Speech Recognition System," 2015. Available: http://arxiv.org/abs/1505.05899
*Better results reported in Saon et al., "The IBM 2016 English Conversational Telephone Speech Recognition System," this conference.

Latest Changes : Left bi-phone
- All of the results shown in this paper are with triphone models; typically the number of leaves is about 10% to 20% fewer than in the conventional DNN system (we found this worked best).
- Since the paper was published, we've found that left bi-phone works slightly better with this type of model. It's also faster, of course.

Latest Changes : Better data cleanup
- Since publishing this paper, we've come up with a more fine-grained data cleaning method: bad parts of utterances are thrown away, and good parts are kept.
- This is a completely separate process from LF-MMI training... but LF-MMI is particularly sensitive to its effect.

Conclusion
- Applied ideas from recent CTC efforts to MMI: reduced output rate, and tolerance in the numerator.
- Using denominator-lattice-free MMI and a reduced frame rate:
  - Up to 5x reduction in total training time (no CE pre-training, no denominator lattice generation).
  - 8% rel. improvement over CE+sMBR; 11.5% rel. improvement over CE.
  - Consistent gains across several datasets (80-2100 hrs).