Discriminative Training for Segmental Minimum Bayes Risk Decoding Vlasios Doumpiotis, Stavros Tsakalidis, Bill Byrne Center for Language and Speech Processing Department of Electrical and Computer Engineering The Johns Hopkins University
Segmental Minimum Bayes Risk Decoding (SMBR)

- Lattices are segmented into sequences of separate decision problems involving small sets of confusable words
- Separate sets of acoustic models, specialized to discriminate between the competing words in these classes, are applied in subsequent SMBR decoding passes
- Results in a refined search space that allows the use of specialized discriminative models
- Improvement in performance over MMI
Review of MAP Decoding vs Minimum Bayes-Risk Decoders

- MAP decoding: given an utterance A, produce the sentence hypothesis

    Ŵ = argmax_W P(W | A)

- MAP is the optimum decoding criterion when performance is measured under the Sentence Error Rate criterion. For other criteria, such as Word Error Rate, other decoding schemes may be better.
- Minimum Bayes-Risk decoders attempt to find the sentence hypothesis with the least expected error under a given task-specific loss function. If L(W, W') is the loss function between word strings W and W', the MBR recognizer seeks the optimal hypothesis as

    Ŵ = argmin_{W'} Σ_W L(W, W') P(W | A)

- If L(W, W') is the 0/1 loss function, MBR reduces to MAP.
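The MBR rule above can be sketched over an n-best list standing in for the lattice. Everything below is illustrative (hypotheses, posteriors, and the use of word-level edit distance as the loss are made up for this example, not taken from the paper):

```python
# Toy MBR decoding over an n-best list, using word-level edit distance
# as the loss L(W, W'). The hypotheses and posteriors are hypothetical.

def levenshtein(a, b):
    """Word-level edit distance between two word tuples."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def mbr_decode(nbest):
    """nbest: list of (word_tuple, posterior). Returns the hypothesis
    with minimum expected loss (Bayes risk) under the posterior."""
    def risk(w_prime):
        return sum(p * levenshtein(w, w_prime) for w, p in nbest)
    return min((w for w, _ in nbest), key=risk)

nbest = [(("B", "NINE", "A"), 0.40),
         (("V", "NINE", "A"), 0.35),
         (("V", "NINE", "EIGHT"), 0.25)]
# MAP would pick ("B","NINE","A"); MBR prefers ("V","NINE","A") because
# it is close to both competing V-hypotheses and so has lower expected loss.
print(mbr_decode(nbest))
```

This shows the MAP/MBR divergence the slide describes: the 0.40 hypothesis wins under sentence-level MAP, but a different string minimizes expected Word Error Rate.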
Segmental Minimum Bayes Risk Decoding

- Addresses the MBR search problem over very large lattices.
- Each word string in the lattice is segmented into N substrings: W = W_1 ... W_N
- This effectively segments the lattice as well: W = W_1 ... W_N
- Given a specific lattice segmentation, the MBR hypothesis can then be obtained through a sequence of independent MBR decision rules:

    Ŵ_i = argmin_{W' ∈ W_i} Σ_{W ∈ W_i} L(W, W') P_i(W | A)

Lattice Segmentation and Pinching
- Every path in the lattice is aligned to the MAP hypothesis
- Low- and high-confidence regions are identified
- High-confidence regions: retain only the MAP hypothesis
- The word order of the original lattice is preserved
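The segment-level decision rule can be sketched on a pinched lattice. The data structure and posteriors below are hypothetical stand-ins: each segment set is a small dictionary of candidate words with posteriors P_i(W | A), and high-confidence segments contain only the MAP word:

```python
# Sketch of the segmental MBR decision rule on a pinched lattice.
# With a 0/1 loss inside each segment, the MBR rule reduces to picking
# the highest-posterior word independently in each segment set.

def smbr_decode(segments):
    """segments: list of {word: posterior} dicts, one per segment set."""
    return [max(seg, key=seg.get) for seg in segments]

pinched = [
    {"NINE": 1.0},               # high confidence: MAP word only
    {"A": 0.55, "EIGHT": 0.45},  # low confidence: confusable pair
    {"B": 0.48, "V": 0.52},
]
print(smbr_decode(pinched))  # → ['NINE', 'A', 'V']
```

Because the segments are decided independently, the global search problem over the full lattice decomposes into a handful of tiny, tractable decisions.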
Lattice Cutting and Pinching

[Figure: example alphadigit lattice before and after cutting and pinching against the MAP hypothesis. High-confidence words (e.g. NINE, A) are retained alone; low-confidence regions keep small sets of confusable alternatives (e.g. A vs. OH, 8 vs. B, V vs. 5) with their path counts.]
Objectives

1. Identify potential errors in the MAP hypothesis
2. Derive a new search space for subsequent decoding passes

- Models will be trained to fix the errors in the MAP hypothesis
- Regions of low confidence: the search space contains portions of the MAP hypothesis plus alternatives.
- Regions of high confidence: the search space is restricted to the MAP hypothesis.
- Because the structure of the original lattice is retained, we can perform acoustic rescoring over this pinched lattice
Minimum Error Estimation for SMBR

- Suppose we have a labeled training set (A, W).
- A reasonable approach to estimation for an MBR decoder is to minimize the expected loss of the reference:

    min_θ Σ_{W'} L(W, W') P(W' | A; θ)

- Note that if L is the 0/1 loss function, MMI results:

    max_θ P(W | A; θ)

- How does this change for SMBR? If we assume that each segment set contains one-word strings and the loss function is binary, then we can treat the estimation problem for each segment set separately:

    max_{θ_i} P_i(W_i | A; θ_i)

- The problem simplifies to separate MMI estimation procedures for the small-vocabulary ASR problems identified in the segmented lattices
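The slide's reduction from minimum-risk estimation to MMI can be checked numerically: with a 0/1 loss, the expected loss of the reference is 1 − P(W | A; θ), so minimizing risk over θ is the same as maximizing the reference posterior. The two toy posterior distributions below stand in for two parameter settings and are invented for illustration:

```python
# Numerical check that, under a 0/1 loss, minimum-risk estimation
# coincides with MMI (maximizing the reference posterior).

def expected_01_loss(ref, posterior):
    """Sum over hypotheses W' of L(ref, W') P(W'|A) with L = 0/1 loss;
    this equals 1 - posterior[ref] when the posterior sums to 1."""
    return sum(p for w, p in posterior.items() if w != ref)

ref = "V"
theta_a = {"V": 0.6, "B": 0.3, "D": 0.1}    # weaker model
theta_b = {"V": 0.8, "B": 0.15, "D": 0.05}  # stronger model

# The model with lower expected 0/1 loss is exactly the one that
# assigns more posterior mass to the reference word.
print(expected_01_loss(ref, theta_a))
print(expected_01_loss(ref, theta_b))
```

For the one-word segment sets produced by pinching, this is why SMBR training turns into a bank of small MMI problems, one per confusable set.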
Iterative SMBR Estimation and Decoding

Our goal is to develop a joint estimation and decoding procedure that improves over MMI.

1. Generate lattices, initially with MMI acoustic models
2. Segment and pinch lattices
3. Identify errors
4. Train sets of models to resolve the errors
5. Rescore the pinched lattices using the models tuned to fix the errors in each segment set
6. Repeat...

We need to establish that:
- Lattice cutting finds segment sets similar to the dominant confusion pairs observed in decoding.
- The segment sets identified in the test set are also found consistently in the training set.

Put differently, does the decoder behave the same on the training set as on the test set?
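The six steps above can be laid out as a runnable skeleton. Every stage below is a stub on toy data (the function bodies, the example pair ("B", "V"), and the utterance names are all placeholders), so only the control flow reflects the slide, not the paper's actual implementation:

```python
# Skeleton of the iterative SMBR estimation/decoding loop.
# All stage bodies are hypothetical stubs; only the loop structure
# mirrors steps 1-6 on the slide.

def generate_lattices(models, utterances):
    # Step 1: decode each utterance into a lattice with the current models.
    return [("lattice", u, models) for u in utterances]

def segment_and_pinch(lattices):
    # Steps 2-3: cut and pinch lattices, identifying confusable segment
    # sets (here a single invented pair).
    return [("B", "V")]

def train_segment_models(segment_sets):
    # Step 4: train one discriminative model per confusable set.
    return {s: ("model", s) for s in segment_sets}

def rescore(lattices, segment_models):
    # Step 5: acoustic rescoring of the pinched lattices with the
    # segment-specific models.
    return ["new MAP hypothesis"] * len(lattices)

models, utterances = "MMI models", ["utt1", "utt2"]
for iteration in range(3):  # Step 6: repeat
    lattices = generate_lattices(models, utterances)
    segment_sets = segment_and_pinch(lattices)
    segment_models = train_segment_models(segment_sets)
    hypotheses = rescore(lattices, segment_models)
```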
Dominant Confusion Sets in MMI Decoding

HTK baseline: whole-word models, MFCCs, 12-mixture Gaussian HMMs, AT&T FSM decoder
46,730 training utterances; 3,112 test utterances

Ten Most Frequent ASR Word Errors:

    F+S    58   60
    V+Z    54   42
    M+N    45   35
    P+T    32   44
    B+V    40   29
    8+H    17   34
    A+8    10   40
    L+OH   12   33
    B+D    16   23
    C+V    16   17

Ten Most Frequent Confusion Sets Found by Lattice Cutting:

    Test   Count   Training   Count
    F+S     1089   F+S        15197
    P+T      843   P+T        10744
    8+H      784   8+H        10370
    M+N      772   M+N        10242
    V+Z      557   V+Z         8068
    B+D      389   B+D         5996
    L+OH     343   L+OH        5108
    B+V      314   B+V         4963
    A+K      292   5+I         4413
    5+I      289   J+K         3653

Hypothesized errors found via unsupervised lattice cutting agree with actual errors
Discriminative Training on OGI AlphaDigits

[Figure: WER (%) vs. training iteration. MMI: 10.7, 9.98, 9.36, 9.07, 9.03, 9.27. MRT: 8.47, 8.17, 7.92, 7.86.]

Observations
- Initial ML performance of 10.7% WER is reduced to 9.07% with MMI.
- MinRisk training: a further 1% WER reduction beyond the best MMI performance.
- Overall WER decreases as MMI training progresses...
MMI Improvement Is Not Uniform Over All Error Types

[Figure: per-pair error counts over MMI iterations 1-3 (MMI-1, MMI-2, MMI-3) for F->S, S->F, V->Z, Z->V, M->N, N->M, P->T, T->P, B->V, V->B, 8->H, H->8, A->8, 8->A, L->OH, OH->L, B->D, D->B, C->V, V->C]

Overall reduction in WER is at the expense of specific errors
Minimum Risk Training

[Figure: per-pair error counts over MRT iterations 1-3 (MRT-1, MRT-2, MRT-3) for F->S, S->F, V->Z, Z->V, M->N, N->M, P->T, T->P, B->V, V->B, 8->H, H->8, A->8, 8->A, L->OH, OH->L, B->D, D->B, C->V, V->C]

Overall error rate is not reduced at the expense of individual hypotheses
Conclusions

- SMBR: a divide-and-conquer approach to ASR
- Unsupervised approach to identify and eliminate recognition errors
  - SMBR is used to identify regions that are likely to contain errors
  - rescore with models trained for each type of error
- SMBR yields further improvements over MMI
- Arguably, discriminative training is improved by introducing a training criterion based on a good approximation to the Word Error Rate rather than the Sentence Error Rate