Discriminative Training for Segmental Minimum Bayes Risk Decoding


Discriminative Training for Segmental Minimum Bayes Risk Decoding
Vlasios Doumpiotis, Stavros Tsakalidis, Bill Byrne
Center for Language and Speech Processing
Department of Electrical and Computer Engineering
The Johns Hopkins University

Segmental Minimum Bayes Risk (SMBR) Decoding
- Lattices are segmented into sequences of separate decision problems involving small sets of confusable words.
- Separate sets of acoustic models, specialized to discriminate between the competing words in these classes, are applied in subsequent SMBR decoding passes.
- This results in a refined search space that allows the use of specialized discriminative models.
- Improvement in performance over MMI.

Review of MAP Decoding vs. Minimum Bayes-Risk Decoders
- MAP decoding: given an utterance A, produce the sentence hypothesis
  Ŵ = argmax_W P(W | A)
- MAP is the optimum decoding criterion when performance is measured under the Sentence Error Rate criterion. For other criteria, such as Word Error Rate, other decoding schemes may be better.
- Minimum Bayes-Risk decoders attempt to find the sentence hypothesis with the least expected error under a given task-specific loss function. If L(W, W') is the loss function between word strings W and W', the MBR recognizer seeks the optimal hypothesis as (see the sketch below)
  Ŵ = argmin_{W'} Σ_W L(W, W') P(W | A)
- If L(W, W') is the 0/1 loss function, MAP results.
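To make the decision rule concrete, here is a minimal sketch (not from the paper) of MBR decoding over an N-best list used as a stand-in for the lattice, with word-level edit distance as the loss L. The hypotheses and posteriors are invented toy values.

    # Minimal sketch (assumptions: an N-best list approximates the lattice,
    # and posteriors are already normalized over the list).
    from typing import List, Tuple

    def levenshtein(a: List[str], b: List[str]) -> int:
        """Word-level edit distance: the loss that approximates Word Error Rate."""
        d = list(range(len(b) + 1))
        for i, wa in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, wb in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (wa != wb))
        return d[len(b)]

    def mbr_decode(nbest: List[Tuple[List[str], float]]) -> List[str]:
        """Pick W' minimizing the expected loss sum_W L(W, W') P(W|A)."""
        return min(
            (hyp for hyp, _ in nbest),
            key=lambda w_prime: sum(p * levenshtein(w, w_prime) for w, p in nbest),
        )

    nbest = [("OH NINE A".split(), 0.5),
             ("OH NINE J".split(), 0.3),
             ("OH FIVE A".split(), 0.2)]
    print(mbr_decode(nbest))  # MAP and MBR agree here; under WER loss they can differ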

Segmental Minimum Bayes Risk Decoding
- Addresses the MBR search problem over very large lattices.
- Each word string in the lattice is segmented into N substrings: W = W_1 ... W_N. This effectively segments the lattice as well: 𝒲 = 𝒲_1 ... 𝒲_N.
- Given a specific lattice segmentation, the MBR hypothesis can then be obtained through a sequence of independent MBR decision rules (sketched below):
  Ŵ_i = argmin_{W' ∈ 𝒲_i} Σ_{W ∈ 𝒲_i} L(W, W') P_i(W | A)

Lattice Segmentation and Pinching
- Every path in the lattice is aligned to the MAP hypothesis.
- Low and high confidence regions are identified.
- High confidence regions: retain only the MAP hypothesis.
- The word order of the original lattice is preserved.
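A minimal sketch (assumed data layout, not the paper's code) of how the global MBR decision factors into independent per-segment decisions once the lattice is cut: each segment set holds competing words with posteriors, and with a binary loss inside a segment the MBR word is simply the highest-posterior word.

    # Minimal sketch: per-segment MBR decisions over cut-lattice segment sets.
    # The segments and posteriors below are invented illustrations.
    segments = [
        [("OH", 0.9)],                      # high confidence: only the MAP word survives
        [("NINE", 0.55), ("FIVE", 0.45)],   # low confidence: confusable pair
        [("A", 0.6), ("J", 0.25), ("K", 0.15)],
    ]

    def segment_mbr(segment):
        """With 0/1 loss inside a segment set, the MBR word is the max-posterior word."""
        return max(segment, key=lambda wp: wp[1])[0]

    print(" ".join(segment_mbr(s) for s in segments))  # OH NINE A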

Lattice Cutting and Pinching
[Figure: an example alphadigit lattice (words such as NINE, OH, A, J, B, D, V, 4, 8) before and after cutting and pinching; competing words in the resulting low-confidence segment sets carry counts, e.g. A:17 vs. OH J:17 and A:7 vs. V:5.]

Objectives
1. Identify potential errors in the MAP hypothesis.
2. Derive a new search space for subsequent decoding passes. Models will be trained to fix the errors in the MAP hypothesis.
- Regions of low confidence: the search space contains portions of the MAP hypothesis plus alternatives.
- Regions of high confidence: the search space is restricted to the MAP hypothesis.
Because the structure of the original lattice is retained, we can perform acoustic rescoring over this pinched lattice (a pinching sketch follows below).
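A minimal sketch of the pinching step under the same assumed data layout as above; the 0.8 confidence threshold is an invented illustration, not a value from the paper.

    # Minimal sketch (assumed layout and threshold): keep only the MAP word in
    # high-confidence segments, all competing words in low-confidence ones.
    # The original word order of the lattice is preserved.
    def pinch(segments, threshold=0.8):
        pinched = []
        for seg in segments:
            map_entry = max(seg, key=lambda wp: wp[1])  # the MAP word in this segment
            pinched.append([map_entry] if map_entry[1] >= threshold else list(seg))
        return pinched

    segments = [[("OH", 0.9)], [("NINE", 0.55), ("FIVE", 0.45)], [("A", 0.6), ("J", 0.4)]]
    print(pinch(segments))  # first segment pinched to OH; the others keep alternatives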

Minimum Error Estimation for SMBR
- Suppose we have a labeled training set (A, W). A reasonable approach to estimation for an MBR decoder is to minimize the expected loss:
  min_θ Σ_{W'} L(W, W') P(W' | A; θ)
- Note that if L is the 0/1 loss function, MMI results:
  max_θ P(W | A; θ)
- How does this change for SMBR? If we assume that each segment set contains single words and the loss function is binary, then we can treat the estimation problem for each segment set separately:
  min_{θ_i} Σ_{W' ∈ 𝒲_i} L(W_i, W') P_i(W' | A; θ_i)
- The problem simplifies to separate MMI estimation procedures for the small-vocabulary ASR problems identified in the segmented lattices (a per-segment sketch follows below).
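To illustrate the per-segment criterion, here is a toy sketch (invented scores and a deliberately simplified posterior, not the paper's trainer) of the MMI objective for one segment set: the log posterior of the reference word against its competitors within the set.

    import math

    # Toy sketch: per-segment MMI objective for one segment set, e.g. the
    # confusable pair {F, S}.  score(w) combines acoustic and language model
    # log-probabilities; the objective is the reference word's log posterior.
    def per_segment_mmi(acoustic_loglik, lm_logprob, ref, candidates):
        score = {w: acoustic_loglik[w] + lm_logprob[w] for w in candidates}
        log_denom = math.log(sum(math.exp(s) for s in score.values()))
        return score[ref] - log_denom   # maximize this w.r.t. the segment's models

    # Invented numbers for illustration only.
    print(per_segment_mmi({"F": -10.0, "S": -10.5},
                          {"F": math.log(0.5), "S": math.log(0.5)},
                          ref="F", candidates=["F", "S"]))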

Iterative SMBR Estimation and Decoding
Our goal is to develop a joint estimation and decoding procedure that improves over MMI.
1. Generate lattices, initially with MMI acoustic models.
2. Segment and pinch the lattices.
3. Identify errors.
4. Train sets of models to resolve the errors.
5. Rescore the pinched lattices using the models tuned to fix the errors in each segment set.
6. Repeat...
We need to establish that:
- Lattice cutting finds segment sets similar to the dominant confusion pairs observed in decoding.
- The segment sets identified in the test set are also found consistently in the training set. Put differently, does the decoder behave the same on the training set as on the test set?

Dominant Confusion Sets in MMI Decoding
- HTK baseline: whole-word models, MFCCs, 12-mixture Gaussian HMMs, AT&T FSM decoder.
- 46,730 training utterances, 3,112 test utterances.

Ten Most Frequent ASR Word Errors:
  F+S    58  60
  V+Z    54  42
  M+N    45  35
  P+T    32  44
  B+V    40  29
  8+H    17  34
  A+8    10  40
  L+OH   12  33
  B+D    16  23
  C+V    16  17

Ten Most Frequent Confusion Sets Found by Lattice Cutting:
  Test    Count    Training    Count
  F+S     1089     F+S         15197
  P+T      843     P+T         10744
  8+H      784     8+H         10370
  M+N      772     M+N         10242
  V+Z      557     V+Z          8068
  B+D      389     B+D          5996
  L+OH     343     L+OH         5108
  B+V      314     B+V          4963
  A+K      292     5+I          4413
  5+I      289     J+K          3653

Errors hypothesized via unsupervised lattice cutting agree with the actual errors (a counting sketch follows below).
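A minimal sketch (assumed input format) of how the actual-error tallies can be computed: count substitution pairs from reference/hypothesis word alignments and keep the most frequent unordered pairs.

    from collections import Counter

    # Minimal sketch (assumed alignment format): tally substitution pairs from
    # (reference word, hypothesized word) alignments to find dominant confusions.
    def confusion_counts(aligned_pairs, top_n=10):
        counts = Counter()
        for ref, hyp in aligned_pairs:
            if ref != hyp:
                counts["+".join(sorted((ref, hyp)))] += 1  # unordered pair, e.g. "F+S"
        return counts.most_common(top_n)

    # Invented toy alignments for illustration.
    alignments = [("F", "S"), ("S", "F"), ("M", "N"), ("F", "S"), ("8", "H")]
    print(confusion_counts(alignments))  # [('F+S', 3), ('M+N', 1), ('8+H', 1)]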

Discriminative Training on OGI AlphaDigits
[Figure: WER (%) vs. training iteration. MMI: 10.7, 9.98, 9.36, 9.07, 9.03, 9.27; MRT: 8.47, 8.17, 7.92, 7.86.]
Observations
- Initial ML performance of 10.7% WER is reduced to 9.07% with MMI.
- Minimum risk training gives a further 1% absolute WER reduction beyond the best MMI performance.
- Overall WER decreases as MMI training progresses...

MMI Improvement Is Not Uniform Over All Error Types
[Figure: error counts by confusion direction (F→S, S→F, V→Z, Z→V, M→N, N→M, P→T, T→P, B→V, V→B, 8→H, H→8, A→8, 8→A, L→OH, OH→L, B→D, D→B, C→V, V→C) over MMI iterations 1-3.]
The overall reduction in WER comes at the expense of specific errors.

Minimum Risk Training
[Figure: error counts by confusion direction (same pairs as above) over MRT iterations 1-3.]
The overall error rate is not reduced at the expense of individual hypotheses.

Conclusions
- SMBR: a divide-and-conquer approach to ASR.
- An unsupervised approach to identify and eliminate recognition errors: SMBR is used to identify regions that are likely to contain errors, which are then rescored with models trained for each type of error.
- SMBR yields further improvements over MMI.
- Arguably, discriminative training is improved by introducing a training criterion based on a good approximation to the Word Error Rate rather than the Sentence Error Rate.