Purely sequence-trained neural networks for ASR based on lattice-free MMI
Dan Povey, Vijay Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, Sanjeev Khudanpur
Why should you care about this?
- It gives better WERs than the conventional way of training models.
- It's a lot faster to train.
- It's a lot faster to decode.
- We're modifying most of the recipes in Kaldi to use this.
- Doesn't always give WER improvements on small data (e.g. < 50 hours).
Connection with CTC
- Can be seen as a simplification of an extension of CTC (context-dependent CTC).
- Commonalities with CTC:
  - Objective function is the posterior of the correct transcript of the utterance.
  - Output is computed at a 33 Hz frame rate.
- Also see Lower Frame Rate Neural Network AMs, Pundak & Sainath, Interspeech 2016.
[1] A. Graves et al., "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks", ICML 2006.
[2] A. Senior et al., "Acoustic modelling with CD-CTC-sMBR LSTM RNNs", ASRU 2015.
[3] H. Sak et al., "Fast and accurate recurrent neural network acoustic models for speech recognition", Interspeech 2015.
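For reference, the MMI objective being maximized can be written as follows (this is the standard form; the notation is mine, not from the slides): the numerator graph encodes the transcript of utterance u, and the denominator graph encodes all word sequences under the language model,

\[
\mathcal{F}_{\mathrm{MMI}} \;=\; \sum_{u} \log \frac{p\bigl(\mathbf{X}_u \mid \mathcal{G}^{\mathrm{num}}_u\bigr)}{p\bigl(\mathbf{X}_u \mid \mathcal{G}^{\mathrm{den}}\bigr)},
\]

i.e. the log-posterior of the correct transcript, computed by forward-backward over the numerator and denominator graphs.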
Practical sequence discriminative training: lattice generation
- Seed model and lattice quality affect final recognition accuracy [1].

| Seed model (CE WER) | Lattices from CE1 | Lattices from CE2 |
|---|---|---|
| GMM | 15.8% (-2%) | - |
| DNN CE1 (16.2%) | 14.1% (-13%) | 13.7% (-15%) |
| DNN CE2 (15.6%) | - | 13.5% (-17%) |

- Cross-entropy pre-training is important!
[1] D. Yu, Automatic Speech Recognition: A Deep Learning Approach, Springer, 2014.
[2] K. Vesely et al., "Sequence-discriminative training of deep neural networks", Interspeech 2013.
[3] H. Su et al., "Error back propagation for sequence training of context-dependent deep networks for conversational speech transcription", ICASSP 2013.
Practical sequence discriminative training: what about lattice-free MMI?
- Full forward-backward on the denominator is slow.
- "The use of the accelerated likelihood evaluation, tight pruning beams, and a small decoding graph made lattice-free MMI possible" [1].
- Beam search on GPU is hard: loss of efficiency if different cores are taking different code paths or accessing different data.
[1] S. F. Chen et al., "Advances in speech transcription at IBM under the DARPA EARS program", IEEE TASLP 2006.
[2] W. Xiong et al., "The Microsoft 2016 Conversational Speech Recognition System", arXiv, September 2016.
Proposed approach
- Full forward-backward of the denominator on GPU.
- Break up utterances into fixed-size chunks.
- Keep the denominator graph small enough so we can keep the forward scores (α) on the GPU for a minibatch of utterances (e.g. 128).
- Use a phone LM.
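To make the "alphas on the GPU for a whole minibatch" point concrete, here is a minimal NumPy sketch of a batched forward recursion over one shared denominator graph. The function and variable names, the per-frame renormalization, and the omission of final-state probabilities are my own simplifications, not the Kaldi implementation:

```python
import numpy as np

def denominator_forward(log_obs, arcs, init_logprob):
    """Batched forward (alpha) recursion over a single shared denominator graph.

    log_obs:      (N, T, P) per-frame pdf log-likelihoods for N chunks.
    arcs:         list of (src_state, dst_state, pdf_id, log_prob) tuples.
    init_logprob: (S,) initial log-probabilities over the S graph states.
    Returns the log-likelihood of each chunk under the denominator graph
    (final-state probabilities are omitted for brevity)."""
    N, T, _ = log_obs.shape
    alpha = np.tile(init_logprob, (N, 1))        # same graph for every chunk
    total = np.zeros(N)
    for t in range(T):
        new_alpha = np.full_like(alpha, -np.inf)
        for src, dst, pdf, lp in arcs:
            score = alpha[:, src] + lp + log_obs[:, t, pdf]
            new_alpha[:, dst] = np.logaddexp(new_alpha[:, dst], score)
        # renormalize per chunk so the alphas stay in range; the log of the
        # normalizer accumulates into the total log-likelihood
        norm = np.logaddexp.reduce(new_alpha, axis=1)
        total += norm
        alpha = new_alpha - norm[:, None]
    return total
```

On a GPU the inner loop over arcs is parallelized over states and over the N chunks in the minibatch, which is why the graph has to be small enough for all N alpha vectors to stay resident.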
Fixed chunk sizes
- Use 1-second chunks.
- Slight overlaps or gaps where we break up utterances this way.
- Append successive utterances in data preparation.
- How do we break up the transcripts? 1-second chunks may not coincide with word boundaries.
Numerator graph
- Generate numerator lattices using the transcript.
- Convert to phone graphs.
- Numerator FSA: each state has a frame index; convert to frame-by-frame masks.
- User-specifiable tolerance on phone positions (default: 50 ms).
- Split into fixed-size chunks.
[1] A. Senior et al., "Acoustic modelling with CD-CTC-sMBR LSTM RNNs", ASRU 2015.
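A hedged sketch of the frame-by-frame mask with tolerance (the function and variable names are illustrative, and the real supervision is an FSA rather than a Python structure): each phone from the aligned lattice is allowed to occur within the tolerance of its aligned span, which is what lets the fixed 1-second chunks cut through the transcript without exact boundaries.

```python
def numerator_mask(phone_alignment, num_frames, tolerance_frames):
    """phone_alignment: list of (phone, start_frame, end_frame) spans at the
    reduced 33 Hz frame rate.  Returns allowed[t] = set of phones that the
    numerator graph permits at frame t: the aligned span widened by the
    tolerance (the slides' default tolerance is 50 ms)."""
    allowed = [set() for _ in range(num_frames)]
    for phone, start, end in phone_alignment:
        lo = max(0, start - tolerance_frames)
        hi = min(num_frames, end + tolerance_frames)
        for t in range(lo, hi):
            allowed[t].add(phone)
    return allowed
```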
Denominator graph
- A decoding graph with a phone-level LM (P) and no lexicon: HCLG → HCP.
- Construct P to minimize the size of HCP:
  - 4-gram phone-level LM, estimated from phone alignments of the training data.
  - Prune 4-gram states with low counts.
  - Limit back-off to the unpruned 3-gram LM, so triphones not seen in training cannot be generated.
- Denominator forward-backward took less than 20% of training time.
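A rough sketch of the pruning idea (the counting scheme and `min_count` threshold are illustrative; the real recipe estimates a smoothed 4-gram LM rather than raw counts): 4-gram histories with enough data keep a dedicated LM state, everything else backs off to the 3-gram state, and the 3-gram level itself is never pruned.

```python
from collections import Counter

def choose_lm_states(phone_sequences, min_count):
    """phone_sequences: phone alignments of the training data, one phone list
    per utterance.  Returns the 4-gram histories that keep their own LM state
    and the full set of 3-gram histories (kept unpruned, so back-off does not
    introduce phone contexts that were never seen in training)."""
    four_hist, three_hist = Counter(), Counter()
    for seq in phone_sequences:
        padded = ["<s>", "<s>", "<s>"] + list(seq) + ["</s>"]
        for i in range(3, len(padded)):
            four_hist[tuple(padded[i - 3:i])] += 1
            three_hist[tuple(padded[i - 2:i])] += 1
    kept_fourgram = {h for h, c in four_hist.items() if c >= min_count}
    return kept_fourgram, set(three_hist)
```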
Frame rate: topology
- We use a topology that can be traversed in 1 state (i.e. in one frame), and a 30 ms frame shift.
- We experimented with different topologies that can be traversed in 1 state.
- We did find that the 30 ms frame shift was optimal for the 1-state topology.
- Chosen topology (diagram on slide; transition probabilities 0.5): can generate "a", "ab", "abb", ...
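One consistent reading of the topology diagram (my interpretation of the 0.5 transition probabilities, so treat it as a sketch): the phone emits label "a" on entry, then either exits (probability 0.5) or takes a self-loop emitting "b" (probability 0.5), giving "a", "ab", "abb", ... with geometrically decreasing probability.

```python
def topology_sequences(max_extra_frames):
    """Enumerate the label sequences the 1-state topology can generate,
    with their probabilities under the assumed 0.5 self-loop / 0.5 exit."""
    sequences = []
    labels, prob = "a", 0.5          # emit "a", then exit with prob 0.5
    for _ in range(max_extra_frames + 1):
        sequences.append((labels, prob))
        labels += "b"                # take the self-loop one more time
        prob *= 0.5
    return sequences

print(topology_sequences(3))
# [('a', 0.5), ('ab', 0.25), ('abb', 0.125), ('abbb', 0.0625)]
```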
Frame rate: speed
- LF-MMI takes less than 20% of the training time.
- Speeding up the network computation: RNNs: use larger delays.
- (Diagram: conventional vs. efficient LF-MMI frame processing over time.)
Frame rate: speed
- LF-MMI takes less than 20% of the training time.
- Speeding up the network computation: TDNNs: use sub-sampling and strides.
- (Diagram: conventional vs. efficient LF-MMI frame processing over time.)
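A small illustration of the sub-sampling idea (the splice offsets below are hypothetical, not the exact architecture from the paper): when outputs are only needed every third frame, walking the splice offsets backwards shows which activations each layer actually has to compute, which is where the strided evaluation saves work.

```python
def frames_needed_per_layer(output_frames, splice_offsets):
    """splice_offsets: per-layer temporal offsets, ordered bottom to top.
    Returns, from the network input up to the output, the frame indices each
    level must compute to produce the requested output frames."""
    needed = [set(output_frames)]
    for offsets in reversed(splice_offsets):
        needed.append({t + o for t in needed[-1] for o in offsets})
    return [sorted(s) for s in reversed(needed)]

# outputs every 3rd frame; a dense lower layer and a strided upper layer
for level, frames in enumerate(frames_needed_per_layer({0, 3, 6},
                                                        [[-1, 0, 1], [-3, 0, 3]])):
    print(level, frames)   # lower levels are dense, higher levels sparse
```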
Speed
- LF-MMI is substantially faster than conventional systems:
  - Frame sub-sampling with efficient network architectures.
  - Smaller networks to avoid over-fitting.
- LF-MMI is even faster than cross-entropy pre-training.
- Eliminates stages in training: no cross-entropy pre-training, no denominator lattice generation.
- 5x faster training.
- Decodes are 2-3 times faster than conventional models.
Regularization
- Models trained with LF-MMI are highly prone to overfitting.
- Three regularization methods:
  - Cross-entropy
  - Output l2 norm
  - Leaky HMM
Regularization: cross-entropy
- Supervision is derived from the numerator lattices.
- Scaled by a constant, 0.1 in most cases.
- Larger constant (0.25) for tasks with a small amount of data: AMI, TED-LIUM and Babel.
- The cross-entropy branch of the network is discarded after training.
- (Diagram: shared layers 1 .. N-1 feed two final affine layers, one for the LF-MMI output and one for a softmax cross-entropy output.)
Regularization: output l2 norm
- Penalize the l2 norm of the network outputs.
- (Diagram: same two-branch network as the cross-entropy slide.)
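Putting the two output-side regularizers together, here is a hedged sketch of the per-minibatch objective. The 0.1 cross-entropy scale is from the slides; the l2 scale and all names are illustrative placeholders, not the Kaldi configuration:

```python
def chain_objective(lfmmi_logprob, xent_logprob, lfmmi_output,
                    xent_scale=0.1, l2_scale=5e-4):
    """lfmmi_logprob: LF-MMI objective (numerator minus denominator log-prob).
    xent_logprob:  frame-level cross-entropy objective from the second branch,
                   with targets derived from the numerator lattices.
    lfmmi_output:  raw outputs of the LF-MMI branch (NumPy array, frames x pdfs).
    Returns the quantity to be maximized."""
    l2_penalty = l2_scale * (lfmmi_output ** 2).sum()
    return lfmmi_logprob + xent_scale * xent_logprob - l2_penalty
```

At test time only the LF-MMI branch is kept; the cross-entropy branch exists purely as a training-time regularizer.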
Regularization: leaky HMM
- Allow gradual forgetting of context.
- An ε transition from each state in the HMM to every other state.
- Only one ε transition allowed per frame.
- Equivalent to stopping and restarting the HMM after each frame.
- Leaky-HMM coefficient of 0.1.
- A modification of the denominator graph.
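In terms of the forward recursion, one way to write the leak (my notation, a sketch rather than the exact implementation): after processing frame t, each alpha is augmented by a small amount of "restarted" probability mass,

\[
\alpha'_t(s) \;=\; \alpha_t(s) \;+\; \lambda \, \pi(s) \sum_{s'} \alpha_t(s'),
\]

where \(\lambda = 0.1\) is the leaky-HMM coefficient and \(\pi(s)\) the initial-state distribution; \(\alpha'_t\) then feeds the usual transition update for frame t+1, with a matching correction in the backward pass.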
Results: regularization

| Cross-entropy | Output l2 norm | Leaky HMM | WER (%) Total | WER (%) SWBD |
|---|---|---|---|---|
| N | N | N | 16.8 | 11.1 |
| Y | N | N | 15.9 | 10.5 |
| N | Y | N | 15.9 | 10.4 |
| N | N | Y | 16.4 | 10.9 |
| Y | Y | N | 15.7 | 10.3 |
| Y | N | Y | 15.7 | 10.3 |
| N | Y | Y | 15.8 | 10.4 |
| Y | Y | Y | 15.6 | 10.4 |

SWBD-300 hr task : TDNN acoustic models : Hub5 '00 eval set
Transcript quality
- LF-MMI is sensitive to transcript quality.
- Identified from analysis of poor performance on the AMI and TED-LIUM tasks.
- Data clean-up using lattice-oracle WER [1].
- (Plot: percentage of data retained vs. lattice-oracle WER threshold.)
[1] Peddinti et al., "Far-field ASR without parallel data", Interspeech 2016.
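A minimal sketch of the clean-up criterion (names and the interface are illustrative, not the exact recipe): decode the training data, compute each utterance's lattice-oracle WER against its transcript, and keep only utterances below a threshold.

```python
def filter_by_oracle_wer(oracle_wer, threshold_percent):
    """oracle_wer: dict mapping utterance id -> lattice-oracle WER in percent.
    Returns the utterance ids to keep and the fraction of data retained."""
    kept = [utt for utt, wer in oracle_wer.items() if wer <= threshold_percent]
    return kept, len(kept) / max(1, len(oracle_wer))
```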
Results: transcript quality
AMI LVCSR tasks: impact of data filtering (WER %)

| LVCSR Task | CE Dev | CE Eval | LF-MMI Dev | LF-MMI Eval | LF-MMI + filtering Dev | LF-MMI + filtering Eval |
|---|---|---|---|---|---|---|
| SDM | 45.8 | 50.3 | 43.2 | 47.3 | 42.8 | 46.1 |
| MDM | 41.0 | 44.7 | 40.5 | 43.2 | 38.5 | 41.5 |
| IHM | 24.4 | 25.1 | 22.6 | 22.5 | 22.4 | 22.4 |

[1] Peddinti et al., "Far-field ASR without parallel data", Interspeech 2016.
Results: transcript quality
AMI-SDM: impact of data filtering on cross-entropy and LF-MMI (WER %)

| WER threshold (%) | CE Dev | CE Eval | LF-MMI Dev | LF-MMI Eval |
|---|---|---|---|---|
| 40 | 45.4 | 50.3 | 43.1 | 46.9 |
| 45 | 45.5 | 50.1 | 42.8 | 46.1 |
| 50 | 45.5 | 50.1 | 42.8 | 46.6 |
| no filtering | 45.8 | 50.3 | 43.2 | 47.3 |

[1] Peddinti et al., "Far-field ASR without parallel data", Interspeech 2016.
Comparison of LF-MMI and CE→sMBR

| Objective function | Model (size) | WER (%) SWBD | WER (%) Total |
|---|---|---|---|
| CE | TDNN-A (16.6M) | 12.5 | 18.2 |
| CE → sMBR | TDNN-A (16.6M) | 11.4 | 16.9 |
| LF-MMI | TDNN-A (9.8M) | 10.7 | 16.1 |
| LF-MMI | TDNN-B (9.9M) | 10.4 | 15.6 |
| LF-MMI | TDNN-C (11.2M) | 10.2 | 15.5 |
| LF-MMI → sMBR | TDNN-C (11.2M) | 10.0 | 15.1 |

SWBD-300 hr task : TDNN acoustic models : Hub5 '00 eval set
LF-MMI with different DNNs

| Model | Objective function | WER (%) SWBD | WER (%) Total | Rel. Δ (%) |
|---|---|---|---|---|
| TDNN | CE | 12.5 | 18.2 | |
| TDNN | LF-MMI | 10.2 | 15.5 | 15 |
| LSTM | CE | 11.6 | 16.5 | |
| LSTM | LF-MMI | 10.3 | 15.6 | 5 |
| BLSTM | CE | 10.3 | 14.9 | |
| BLSTM | LF-MMI | 9.6 | 14.5 | 3 |

SWBD-300 hr task : Hub5 '00 eval set
LF-MMI with different DNNs

| Model | Objective function | WER (%) Total | Rel. Δ (%) |
|---|---|---|---|
| TDNN | CE | 31.0 | |
| TDNN | LF-MMI | 27.8 | 10 |
| BLSTM | CE | 29.4 | |
| BLSTM | LF-MMI | 25.6 | 13 |

Training data: 1800 hr Fisher data x 3-fold data augmentation = 5400 hr
ASpIRE dev set: 5 hours
LF-MMI in various LVCSR tasks (standard ASR)

| Data set | Size | CE | CE → sMBR | LF-MMI | Rel. Δ |
|---|---|---|---|---|---|
| AMI-IHM | 80 hrs | 25.1% | 23.8% | 22.4% | 6% |
| AMI-SDM | 80 hrs | 50.9% | 48.9% | 46.1% | 6% |
| TED-LIUM* | 118 hrs | 12.1% | 11.3% | 11.2% | 0% |
| Switchboard | 300 hrs | 18.2% | 16.9% | 15.5% | 8% |
| LibriSpeech | 960 hrs | 4.97% | 4.56% | 4.28% | 6% |
| Fisher + Switchboard | 2100 hrs | 15.4% | 14.5% | 13.3% | 8% |

TDNN acoustic models; similar architecture across LVCSR tasks.
*New TED-LIUM recipe using release 2 data: CE (10.8) to LF-MMI (9.3).
Performance of lattice-free MMI

| System | AM dataset | LM dataset | Hub5 '00 SWB | Hub5 '00 CHM | RT03S FSH | RT03S SWB |
|---|---|---|---|---|---|---|
| Mohamed et al. [1] | F+S | F+S | 10.6% | - | 13.2% | 18.9% |
| Mohamed et al. [1] | F+S | F+S+O | 9.9% | - | 12.3% | 17.8% |
| Mohamed et al. [1] | F+S+O | F+S+O | 9.2% | - | 11.5% | 16.7% |
| Saon et al. [2] | F+S+C | F+S+O | 8.0%* | 14.1% | - | - |
| TDNN + LF-MMI | S | F+S | 10.2% | 20.5% | 14.2% | 23.5% |
| TDNN + LF-MMI → sMBR | S | F+S | 10.0% | 20.1% | 13.8% | 22.1% |
| BLSTM + LF-MMI → sMBR | S | F+S | 9.6% | 19.3% | 13.2% | 20.8% |
| TDNN + LF-MMI | F+S | F+S | 9.2% | 17.3% | 9.8% | 14.8% |
| BLSTM + LF-MMI | F+S | F+S | 8.8% | 15.3% | 9.8% | 13.4% |

F: Fisher corpus (1800 hrs); S: Switchboard corpus (300 hrs); C: Callhome corpus (14 hrs); O: other corpora.
*Better results reported in Saon et al., "The IBM 2016 English Conversational Telephone Speech Recognition System", this conference.
[1] A. R. Mohamed, F. Seide, D. Yu, J. Droppo, A. Stolcke, G. Zweig and G. Penn, "Deep bi-directional recurrent networks over spectral windows", ASRU 2015.
[2] G. Saon, H. K. J. Kuo, S. Rennie, and M. Picheny, "The IBM 2015 English Conversational Telephone Speech Recognition System", 2015. Available: http://arxiv.org/abs/1505.05899
Latest changes: left biphone
- All the results shown in this paper are with triphone models.
- Typically the number of leaves is about 10% to 20% fewer than the conventional DNN system (we found this worked the best).
- Since the paper was published, we've found that left biphone works slightly better with this type of model.
- It's also faster, of course.
Latest changes: better data cleanup
- Since publishing this paper, we've come up with a more fine-grained data cleaning method.
- Bad parts of utterances are thrown away, and good parts kept.
- This is a completely separate process from LF-MMI training... but LF-MMI is particularly sensitive to its effect.
Conclusion
- Applied ideas from recent CTC efforts to MMI: reduced output rate and tolerance in the numerator.
- Using denominator-lattice-free MMI and a reduced frame rate:
  - Up to 5x reduction in total training time (no CE pre-training, no denominator lattice generation).
  - 8% rel. improvement over CE+sMBR.
  - 11.5% rel. improvement over CE.
- Consistent gains across several datasets (80-2100 hrs).