Minimum Bayes-Risk Techniques for Automatic Speech Recognition and Machine Translation


October 23, 2003
Shankar Kumar
Advisor: Prof. Bill Byrne
ECE Committee: Prof. Gert Cauwenberghs and Prof. Pablo Iglesias
Center for Language and Speech Processing and Department of Electrical and Computer Engineering, The Johns Hopkins University

Motivation

- Automatic Speech Recognition (ASR) and Machine Translation (MT) are finding many applications. Examples: information retrieval from text and speech archives, devices for speech-to-speech translation, etc.
- Usefulness is measured by task-specific error metrics.
- Maximum Likelihood techniques are used in estimation and classification in current ASR/MT systems, but they do not take into account task-specific evaluation measures.

Minimum Bayes-Risk Classification
- Building automatic systems tuned for specific tasks via task-specific loss functions.
- Formulated here in two different areas: automatic speech recognition and machine translation.

Outline

- Automatic Speech Recognition
  - Minimum Bayes-Risk Classifiers
  - Segmental Minimum Bayes-Risk Classification
  - Risk-Based Lattice Segmentation
- Statistical Machine Translation
  - A Statistical Translation Model
  - Minimum Bayes-Risk Classifiers for Word Alignment of Bilingual Texts
  - Minimum Bayes-Risk Classifiers for Machine Translation
- Conclusions and Future Work

Loss Functions in Automatic Speech Recognition

[Figure: a statistical classifier maps the speech signal into a huge hypothesis space of word strings, e.g. YOU TALKED ABOUT VOLCANOS, HUGH TALKED ABOUT VOLCANOS, YOU WHAT ABOVE VOLCANOS, IT'S ALL ABOUT VOLCANOS, YOU TALKED ABOVE VOLCANOS]

- String edit distance (Word Error Rate):
  Reference:  HUGH TALKED ABOUT VOLCANOS
  Hypothesis: YOU TALKED ABOUT VOLCANOS
  Loss = 1/4 (25%)
- The loss function is specific to the application of the ASR system. For the same reference/hypothesis pair, Loss(Truth, Hyp) depends on the unit of evaluation:

  Evaluation unit   Loss(Truth, Hyp)
  Sentences         1/1
  Words             1/4
  Keywords          1/2
  Understanding     large loss
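The word-error-rate loss above is just length-normalized string edit distance. As a concrete illustration (a minimal sketch, not part of the original slides), a word-level Levenshtein computation reproduces the 1/4 figure:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (word level)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n]

ref = "HUGH TALKED ABOUT VOLCANOS".split()
hyp = "YOU TALKED ABOUT VOLCANOS".split()
print(edit_distance(ref, hyp) / len(ref))  # 0.25, the 1/4 WER from the slide
```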

Minimum Bayes-Risk (MBR) Speech Recognizer

- Evaluate the expected loss of each hypothesis:
  $E(W') = \sum_{W \in \mathcal{W}} L(W, W')\, P(W \mid A)$
- Select the hypothesis with least expected loss:
  $\delta_{\mathrm{MBR}}(A) = \operatorname{argmin}_{W' \in \mathcal{W}} \sum_{W \in \mathcal{W}} L(W, W')\, P(W \mid A)$
- Relation to Maximum A-Posteriori (MAP) classifiers: consider the sentence-error loss function
  $L(W, W') = 1$ if $W \neq W'$, and $0$ otherwise.
  Then $\delta_{\mathrm{MBR}}(A)$ reduces to the MAP classifier
  $\hat{W} = \operatorname{argmax}_{W \in \mathcal{W}} P(W \mid A)$
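A minimal sketch of the MBR rule, approximating the sum over the hypothesis space with an N-best list of (hypothesis, posterior) pairs; the hypotheses and posteriors below are invented for illustration. With the 0/1 sentence-error loss it selects the MAP hypothesis, exactly as argued above:

```python
def mbr_decode(nbest, loss):
    """nbest: list of (hypothesis, posterior) pairs approximating P(W|A).
    Returns the hypothesis with minimum expected loss."""
    def expected_loss(w_prime):
        return sum(p * loss(w, w_prime) for w, p in nbest)
    return min((h for h, _ in nbest), key=expected_loss)

nbest = [("HUGH TALKED ABOUT VOLCANOS".split(), 0.40),
         ("YOU TALKED ABOUT VOLCANOS".split(), 0.35),
         ("YOU WHAT ABOVE VOLCANOS".split(), 0.25)]

# Sentence-error (0/1) loss: MBR reduces to MAP.
zero_one = lambda w, w2: float(w != w2)
print(mbr_decode(nbest, zero_one))  # the 0.40 hypothesis

# Word-error loss: plug in edit_distance from the previous sketch.
# print(mbr_decode(nbest, edit_distance))
```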

Algorithmic Implementations of MBR Speech Recognizers

- Loss function of interest: string edit distance (Word Error Rate)
- [Figure: word lattice with weighted arcs over hypotheses such as HELLO/WELL/O, HOW/NOW, ARE, YOU/ALL, WELL, TODAY/TO DAY, </s>]
- Lattices are a compact representation of the most likely word strings generated by a speech recognizer
- MBR procedures compute
  $\hat{W} = \operatorname{argmin}_{W' \in \mathcal{W}} \sum_{W \in \mathcal{W}} L(W, W')\, P(W \mid A)$
- Lattice rescoring via A* search (Goel and Byrne, CSL '00)

Segmental Minimum Bayes-Risk Lattice Segmentation

- A* search is expensive over large lattices; pruning the lattices leads to search errors
- Can we simplify the MBR decoder?
- Suppose we can segment the word lattice: [Figure: the example lattice cut into three sub-lattices]
- Induced loss function: $L_I(W, W') = L(W_1, W'_1) + L(W_2, W'_2) + L(W_3, W'_3)$
- The MBR decoder can then be decomposed into a sequence of segmental MBR decoders:
  $\hat{W} = \Big( \operatorname{argmin}_{W'_1 \in \mathcal{W}_1} \sum_{W_1 \in \mathcal{W}_1} L(W_1, W'_1)\, P_1(W_1 \mid A) \Big) \cdot \Big( \operatorname{argmin}_{W'_2 \in \mathcal{W}_2} \sum_{W_2 \in \mathcal{W}_2} L(W_2, W'_2)\, P_2(W_2 \mid A) \Big) \cdot \Big( \operatorname{argmin}_{W'_3 \in \mathcal{W}_3} \sum_{W_3 \in \mathcal{W}_3} L(W_3, W'_3)\, P_3(W_3 \mid A) \Big)$
  where $\cdot$ denotes concatenation of the segment hypotheses
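A toy sketch of this decomposition, assuming the lattice has already been cut into segment sets with per-segment posteriors (the hard part, addressed on the next slides); the segment N-best lists and the crude word-mismatch loss below are invented for illustration:

```python
def segmental_mbr(segments, loss):
    """segments: list of N-best lists, one per sub-lattice; each N-best
    list holds (substring, posterior) pairs. Runs MBR per segment and
    concatenates the winners."""
    out = []
    for nbest in segments:
        def expected_loss(w_prime, nbest=nbest):
            return sum(p * loss(w, w_prime) for w, p in nbest)
        out.extend(min((h for h, _ in nbest), key=expected_loss))
    return out

segments = [
    [(["HOW"], 0.6), (["HELLO", "HOW"], 0.4)],
    [(["ARE", "YOU"], 0.9), (["ALL"], 0.1)],
    [(["TODAY"], 0.7), (["TO", "DAY"], 0.3)],
]
# Crude stand-in for edit distance: positionwise mismatches + length gap.
word_loss = lambda a, b: sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
print(" ".join(segmental_mbr(segments, word_loss)))  # HOW ARE YOU TODAY
```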

Trade-offs in Segmental MBR Lattice Segmentation

- MBR decoding on the entire lattice involves search errors
- Segmentation breaks a single search problem into many simpler search problems
- An ideal segmentation leaves the loss between any two word strings unaffected by cutting
- Any segmentation restricts string alignments and therefore introduces errors in approximating the loss function between strings:
  $L(W, W') \leq \sum_{i=1}^{N} L(W_i, W'_i)$
- Segmentation therefore involves a trade-off between search errors and errors in approximating the loss function
- The ideal segmentation criterion is not achievable. Segmentation rule used instead: $L(\hat{W}, W') = \sum_{i=1}^{K} L(\hat{W}_i, W'_i)$, i.e. the induced loss is exact relative to the MAP hypothesis $\hat{W}$

Aligning a Lattice against a Word String

- Motivation: if we can align each word string in the lattice against $\hat{W} = w_1^K$, we can segment the lattice into $K$ segments
- Substrings in the $i$-th set $\mathcal{W}_i$ will align with the $i$-th word $w_i$
- We have developed an efficient (almost exact) procedure using Weighted Finite State Transducers to generate the simultaneous string alignment of every string in the lattice with respect to the MAP hypothesis; the alignment is encoded as an acceptor $\hat{A}$
- The alignment information in $\hat{A}$ is used to segment the lattice into $K$ sub-lattices
- [Figures: the example lattice before alignment, and after alignment with each word tagged by its position in the MAP hypothesis and a substitution/insertion marker, e.g. HOW.2, TO.INS.6, WELL.INS.1]

Periodic Risk-Based Lattice Cutting (PLC)

- Segment the lattice into $K$ segments relative to the alignment against $\hat{W} = w_1^K$
- Property: optimal with respect to the best path only, i.e. $L(W, \hat{W}) = L_I(W, \hat{W})$ for $W \in \mathcal{W}$
- Segmenting the lattice along fewer cuts gives better approximations to the loss function
- Solution: segment the lattice into fewer than $K$ segments by choosing cuts at equal periods
- [Figures: the aligned lattice cut at every word boundary, and the same lattice cut only at periodic boundaries]
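To illustrate only the period parameter (not the risk-based cutting procedure itself), a toy sketch that groups per-word segment sets into period-sized blocks; taking cross-products of invented N-best entries stands in for concatenating sub-lattices, and multiplying posteriors assumes independence across words, which real lattice cutting does not require:

```python
from itertools import product

def merge_segments(per_word, period):
    """Group per-word segment sets into blocks of `period` words.
    Each merged block's N-best list is the cross-product of its parts."""
    merged = []
    for k in range(0, len(per_word), period):
        block, nbest = per_word[k:k + period], []
        for combo in product(*block):
            words = [w for part, _ in combo for w in part]
            prob = 1.0
            for _, p in combo:
                prob *= p  # independence assumption, for illustration only
            nbest.append((words, prob))
        merged.append(nbest)
    return merged

per_word = [[(["HOW"], 0.6), (["O"], 0.4)],
            [(["ARE"], 1.0)],
            [(["YOU"], 0.8), (["ALL"], 0.2)]]
print(len(merge_segments(per_word, 2)))  # 2 blocks instead of 3 segments
```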

Recognition Performance of MBR Classifiers

- Task: SWITCHBOARD large-vocabulary ASR (JHU 2001 evaluation system)
- Test sets: SWB1 (1831 utterances) and SWB2 (1755 utterances)
- MBR decoding strategy: A* search on lattices

  Decoder (segmentation strategy)                            SWB2 WER (%)   SWB1 WER (%)
  MAP (baseline)                                             41.1           26.0
  MBR, no cutting (period ∞; search errors, exact loss)      40.4           25.5
  MBR, PLC period 6 (intermediate)                           40.0           25.4
  MBR, PLC period 1 (no search errors, poor loss approx.)    41.0           25.9

- Segmental MBR decoding performs better than MAP decoding or MBR decoding on unsegmented lattices
- The segmental MBR decoder performs better under PLC-6 than under PLC-1

Outline

- Automatic Speech Recognition
  - Minimum Bayes-Risk Classifiers
  - Segmental Minimum Bayes-Risk Classification
  - Risk-Based Lattice Segmentation
- Statistical Machine Translation
  - A Statistical Translation Model
  - Minimum Bayes-Risk Classifiers for Word Alignment of Bilingual Texts
  - Minimum Bayes-Risk Classifiers for Machine Translation
- Conclusions and Future Work

Introduction to Statistical Machine Translation

- Statistical Machine Translation: map a string of words in a source language (e.g. French) to a string of words in a target language (e.g. English) via statistical approaches
- [Figure: "les enfants ont besoin de jouets et de loisirs" fed to a statistical classifier, which scores a huge hypothesis space: "children need toys and leisure time", "the children who need toys and leisure time", "those children need toys in leisure time", "the children need toys and leisures", ...]
- Two sub-tasks of Machine Translation:
  - word-to-word alignment of bilingual texts
  - translation of sentences from the source language to the target language

Alignment Template Translation Model

- The Alignment Template Translation Model (ATTM) (Och, Tillmann and Ney '99) has emerged as a promising model for Statistical Machine Translation
- What are alignment templates? An alignment template $z = (E_1^M, F_0^N, A)$ specifies word alignments between the word sequences $E_1^M$ and $F_0^N$ through a 0/1-valued matrix $A$
- Alignment templates map short word sequences in the source language to short word sequences in the target language
- [Figure: a template aligning the French phrase "une inflation galopante" (plus NULL) to the English phrase "run away inflation"]
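As a data-structure illustration, a template $z = (E_1^M, F_0^N, A)$ can be held as two phrases plus a 0/1 matrix; the class below and its matrix entries are illustrative guesses based on the figure, not the talk's implementation:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AlignmentTemplate:
    """z = (E_1^M, F_0^N, A): target and source word sequences plus a
    0/1 alignment matrix (illustrative names, not the talk's code)."""
    target: List[str]        # E_1^M
    source: List[str]        # F_0^N, with source[0] = "NULL"
    links: List[List[int]]   # A[m][n] = 1 iff target word m aligns to source word n

z = AlignmentTemplate(
    target=["run", "away", "inflation"],
    source=["NULL", "une", "inflation", "galopante"],
    links=[[0, 0, 0, 1],    # run       <- galopante (guessed links)
           [0, 0, 0, 1],    # away      <- galopante
           [0, 0, 1, 0]],   # inflation <- inflation
)
```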

Alignment Template Translation Model Architecture

Pipeline from source sentence to target sentence, with one component model per stage:

1. Source language sentence: "En aucune façon Monsieur le Président"
2. Source Segmentation Model: EN_AUCUNE_FAÇON | MONSIEUR_LE_PRÉSIDENT
3. Phrase Permutation Model: MONSIEUR_LE_PRÉSIDENT | EN_AUCUNE_FAÇON
4. Template Sequence Model: MONSIEUR_LE_PRÉSIDENT -> MR._SPEAKER, EN_AUCUNE_FAÇON -> IN_NO_WAY
5. Phrasal Translation Model: "Mr. speaker in no way" (target language sentence)

Weighted Finite State Transducer Translation Model

- Reformulate the ATTM so that bitext word alignment and translation can be implemented using Weighted Finite State Transducer (WFST) operations
- Modular implementation: statistical models are trained for each model component and implemented as WFSTs
- The WFST implementation makes it unnecessary to develop a specialized decoder; the decoder can also generate translation lattices and N-best lists
- The WFST architecture provides support for generating bitext word alignments and alignment lattices. This novel approach allows the development of parameter re-estimation procedures
- Good performance in the NIST 2003 Chinese-English and Hindi-English MT evaluations
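As a sketch of the core operation such a decoder builds on, here is a toy, epsilon-free weighted composition in the tropical semiring (path weights add); real systems use full WFST toolkits, and the two tiny transducers below (phrase-to-template and template-to-phrase, with invented labels and weights) are only illustrative:

```python
def compose(A, B):
    """A, B: dicts with 'arcs' = list of (src, inp, out, weight, dst),
    'start', and 'finals' (a set). Returns their composition: states are
    pairs, A's output labels must match B's input labels, weights add."""
    arcs, stack, seen = [], [(A["start"], B["start"])], set()
    while stack:
        qa, qb = stack.pop()
        if (qa, qb) in seen:
            continue
        seen.add((qa, qb))
        for (sa, ia, oa, wa, da) in A["arcs"]:
            if sa != qa:
                continue
            for (sb, ib, ob, wb, db) in B["arcs"]:
                if sb == qb and ib == oa:  # match output of A with input of B
                    arcs.append(((qa, qb), ia, ob, wa + wb, (da, db)))
                    stack.append((da, db))
    finals = {(fa, fb) for fa in A["finals"] for fb in B["finals"]} & seen
    return {"arcs": arcs, "start": (A["start"], B["start"]), "finals": finals}

# Toy stages: French phrase -> template id, template id -> English phrase.
seg = {"arcs": [(0, "une_inflation_galopante", "Z7", 0.2, 1)],
       "start": 0, "finals": {1}}
trans = {"arcs": [(0, "Z7", "runaway_inflation", 0.5, 1)],
         "start": 0, "finals": {1}}
print(compose(seg, trans)["arcs"])  # one arc mapping French phrase to English phrase
```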

Outline

- Automatic Speech Recognition
  - Minimum Bayes-Risk Classifiers
  - Segmental Minimum Bayes-Risk Classification
  - Risk-Based Lattice Segmentation
- Statistical Machine Translation
  - A Statistical Translation Model
  - Minimum Bayes-Risk Classifiers for Word Alignment of Bilingual Texts
  - Minimum Bayes-Risk Classifiers for Machine Translation
- Conclusions and Future Work

Word-to-Word Bitext Alignment

- [Figure: two competing alignments for the English-French sentence pair "Mr. Speaker, my question is directed to the Minister of Transport" / "monsieur le Orateur, ma question se adresse à le ministre chargé de les transports", including links to NULL]
- Basic terminology:
  - $(e_0^l, f_1^m)$: an English-French sentence pair, where $e_0$ is the NULL word
  - Alignment links: $b = (i, j)$ means $f_i$ is linked to $e_j$
  - An alignment is defined by a link set $B = \{b_1, b_2, \ldots, b_m\}$; some links are NULL links
- Given a candidate alignment $B$ and the reference alignment $B'$, $L(B, B')$ is the loss function that measures $B$ with respect to $B'$

MBR Word Alignments of Bilingual Texts

- Word-to-word alignments of bilingual texts are important components of an MT system: alignment templates are constructed from word alignments, and better alignments lead to better templates and therefore better translation performance
- Alignment loss functions measure alignment quality; different loss functions capture different features of alignments
- Loss functions can use information from word-to-word links, parse trees and POS tags; this information is ignored by most current translation models
- Minimum Bayes-Risk (MBR) alignments under each loss function: performance gains by tuning the alignment to the evaluation criterion

Loss Functions for Bitext Word Alignment

- Alignment Error measures the number of non-NULL alignment links by which the candidate alignment differs from the reference alignment. Derived from Alignment Error Rate (Och and Ney '00):
  $L_{\mathrm{AE}}(B, B') = |B| + |B'| - 2\,|B \cap B'|$
- Generalized Alignment Error: an extension of the Alignment Error loss function that incorporates linguistic features:
  $L_{\mathrm{GAE}}(B, B') = 2 \sum_{b \in B} \sum_{b' \in B'} \delta_i(i')\, d_{ijj'}$, where $b = (i, j)$ and $b' = (i', j')$
- The word-to-word distance measure $d_{ijj'} = D((j, e_j), (j', e_{j'}); f_i)$ can be constructed using information from parse trees or Part-of-Speech (POS) tags; $L_{\mathrm{GAE}}$ can be almost reduced to $L_{\mathrm{AE}}$
- Example using POS tags: $d_{ijj'} = 0$ if $\mathrm{POS}(e_j) = \mathrm{POS}(e_{j'})$, and $1$ otherwise
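A small sketch of both losses over link sets, with invented links and POS tags; the distance function here implements the slide's POS example:

```python
def alignment_error(B, B_ref):
    """L_AE(B, B') = |B| + |B'| - 2|B n B'| over link sets of (i, j) pairs."""
    B, B_ref = set(B), set(B_ref)
    return len(B) + len(B_ref) - 2 * len(B & B_ref)

def generalized_alignment_error(B, B_ref, d):
    """L_GAE: 2 * sum of d(i, j, j') over link pairs that share the same
    source word i; d encodes a POS or parse-tree distance on (e_j, e_j')."""
    return 2 * sum(d(i, j, j2)
                   for (i, j) in B for (i2, j2) in B_ref if i == i2)

# Hypothetical POS tags for English positions (illustrative only):
pos = {1: "PRP", 2: "VBP", 3: "IN", 4: "DT", 5: "NN"}
d_pos = lambda i, j, j2: 0 if pos.get(j) == pos.get(j2) else 1

B_hyp = [(1, 1), (2, 2), (3, 4)]
B_ref = [(1, 1), (2, 2), (3, 5)]
print(alignment_error(B_hyp, B_ref))                      # 2
print(generalized_alignment_error(B_hyp, B_ref, d_pos))   # 2
```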

Examples of Word Alignment Loss Functions

- [Figure: parse tree for "i disagree with the argument advanced by the minister ." and candidate vs. reference alignments to "je ne partage pas le avis de le ministre ."; the alignments differ in whether "disagree" or "advanced" is linked]
- Word-to-word distances: d(disagree, advanced; TREE) = 5; d(disagree, advanced; POS) = 1
- Alignment Error = 10 + 10 - 2*9 = 2
- Generalized Alignment Error (POS) = 2*1 = 2
- Generalized Alignment Error (TREE) = 2*5 = 10

Minimum Bayes-Risk Decoding for Automatic Word Alignment

- Introduce a statistical model over alignments of a sentence pair $(e, f)$: $P(B \mid f, e)$
- MBR decoder: $\hat{B} = \operatorname{argmin}_{B' \in \mathcal{B}} \sum_{B \in \mathcal{B}} L(B, B')\, P(B \mid f, e)$
- $\mathcal{B}$ is the set of all alignments of $(e, f)$; it is approximated by the alignment lattice, the set of the most likely word alignments
- We have derived closed-form expressions for the MBR decoder under two classes of alignment loss functions, allowing an exact and efficient implementation of the lattice search
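One way a closed form can fall out is easy to see for the AE loss: the expected loss of a candidate $B'$ is $E|B| + |B'| - 2\sum_{b' \in B'} P(b' \in B)$, so each link contributes $1 - 2P(b' \in B)$ and should be included iff its posterior exceeds 1/2. The sketch below assumes an unconstrained hypothesis space over candidate links (the talk's exact search is constrained to the alignment lattice) and estimates link posteriors from an invented N-best list:

```python
from collections import defaultdict

def mbr_alignment(alignment_nbest):
    """alignment_nbest: list of (link_set, posterior) pairs for one
    sentence pair. Under L_AE the expected loss is linear in the
    candidate's links, so a link enters the MBR alignment iff its
    posterior exceeds 1/2 (valid only for unconstrained candidates)."""
    post = defaultdict(float)
    for links, p in alignment_nbest:
        for b in links:
            post[b] += p
    return {b for b, p in post.items() if p > 0.5}

nbest = [({(1, 1), (2, 2), (3, 4)}, 0.5),
         ({(1, 1), (2, 2), (3, 5)}, 0.3),
         ({(1, 1), (2, 3)}, 0.2)]
print(sorted(mbr_alignment(nbest)))  # [(1, 1), (2, 2)]
```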

Minimum Bayes-Risk Alignment Experiments

- Experiment setup:
  - Training data: 50,000 sentence pairs from the French-English Hansards
  - Test data: 207 unseen sentence pairs from the Hansards
  - Evaluation: error rates with respect to human word alignments

  Decoder          AER (%)   GAER-TREE (%)   GAER-POS (%)
  ML               18.13     29.39           51.36
  MBR, AE          14.87     19.81           36.42
  MBR, GAE-TREE    23.26     14.45           26.76
  MBR, GAE-POS     28.60     15.70           26.28

- The MBR decoder tuned for a loss function performs best under the corresponding error rate

Outline

- Automatic Speech Recognition
  - Minimum Bayes-Risk Classifiers
  - Segmental Minimum Bayes-Risk Classification
  - Risk-Based Lattice Segmentation
- Statistical Machine Translation
  - A Statistical Translation Model
  - Minimum Bayes-Risk Classifiers for Word Alignment of Bilingual Texts
  - Minimum Bayes-Risk Classifiers for Machine Translation
- Conclusions and Future Work

Loss Functions for Machine Translation

- Automatic evaluation of Machine Translation is a hard problem!
- BLEU (Papineni et al. 2001) is an automatic MT metric shown to correlate well with human judgements of translation quality
- Other metrics: Word Error Rate (WER), the string edit distance between the reference and the hypothesis, and Position-independent word Error Rate (PER), the minimum string edit distance between the reference sentence and any permutation of the hypothesis sentence
- Example:
  Reference:  mr. speaker, in absolutely no way.
  Hypothesis: in absolutely no way, mr. chairman.

  Sub-string matches(Truth, Hyp): 1-word 7/8, 2-word 3/7, 3-word 2/6, 4-word 1/5
  BLEU = (7/8 * 3/7 * 2/6 * 1/5)^{1/4} = 0.3976 (39.76%)
  WER = 6/8 = 75.0%; PER = 1/8 = 12.5%
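The BLEU arithmetic above can be checked directly. A minimal single-reference sketch, assuming punctuation is split off as separate tokens (which is what yields the 8-token strings) and omitting the brevity penalty (it is 1 here since the lengths match):

```python
from collections import Counter

def ngram_precisions(ref, hyp, max_n=4):
    """Clipped n-gram match counts and totals, as in single-reference BLEU."""
    precs = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        precs.append((match, sum(hyp_ngrams.values())))
    return precs

ref = "mr. speaker , in absolutely no way .".split()
hyp = "in absolutely no way , mr. chairman .".split()
precs = ngram_precisions(ref, hyp)
print(precs)  # [(7, 8), (3, 7), (2, 6), (1, 5)], matching the slide

bleu = 1.0
for m, t in precs:
    bleu *= m / t
bleu **= 1 / 4
print(round(bleu, 4))  # 0.3976
```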

Minimum Bayes-Risk Machine Translation

- Given a loss function, we can build Minimum Bayes-Risk classifiers to optimize performance under that loss function
- Setup:
  - a baseline translation model giving probabilities over translations: $P(E \mid F)$
  - a set $\mathcal{E}$ of N-best translations of $F$
  - a loss function $L(E, E')$ that measures the quality of a candidate translation $E'$ relative to a reference translation $E$
- MBR decoder: $\hat{E} = \operatorname{argmin}_{E' \in \mathcal{E}} \sum_{E \in \mathcal{E}} L(E, E')\, P(E \mid F)$
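Combining the two previous sketches: MBR decoding over an invented N-best list with a BLEU-based loss $L(E, E') = 1 - \mathrm{BLEU}(E, E')$, reusing ngram_precisions from the sketch above; the smoothing-free fallback for zero n-gram matches is a simplification:

```python
def sentence_bleu(ref, hyp, max_n=4):
    p = 1.0
    for m, t in ngram_precisions(ref, hyp, max_n):
        if m == 0 or t == 0:
            return 0.0  # crude fallback for short or disjoint strings
        p *= m / t
    return p ** (1 / max_n)

def mbr_translate(nbest):
    """nbest: list of (translation, posterior P(E|F)) pairs; minimizes
    the expected 1 - BLEU loss over the list."""
    def expected_loss(e_prime):
        return sum(p * (1.0 - sentence_bleu(e, e_prime)) for e, p in nbest)
    return min((e for e, _ in nbest), key=expected_loss)

nbest = [("children need toys and leisure time".split(), 0.5),
         ("the children need toys and leisures".split(), 0.3),
         ("those children need toys in leisure time".split(), 0.2)]
print(" ".join(mbr_translate(nbest)))
```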

Performance of MBR Decoders for Machine Translation

- Experimental setup: WS'03 CLSP summer workshop
- Test set: NIST Chinese-English MT task (2002), 878 sentences, 1000-best lists

  Decoder          BLEU (%)   mWER (%)   mPER (%)
  MAP (baseline)   31.6       62.4       39.3
  MBR, PER         31.7       62.2       38.5
  MBR, WER         31.8       61.8       38.8
  MBR, BLEU        31.9       62.5       39.2

- MBR decoding allows the translation process to be tuned for specific loss functions

Conclusions: Minimum Bayes-Risk Techniques

- A unified classification framework for two different tasks in speech and language processing
- The techniques are general and can be applied to a variety of scenarios
- They require the design of loss functions that measure task-dependent error rates
- They can optimize performance under task-dependent metrics

Conclusions: Segmental Minimum Bayes-Risk Lattice Segmentation

- Segmental MBR classification and lattice cutting decompose a large utterance-level MBR recognizer into a sequence of simpler sub-utterance-level MBR recognizers
- Risk-based lattice segmentation is a robust and stable technique
- It is the basis for novel discriminative training procedures in ASR (Doumpiotis, Tsakalidis and Byrne '03)
- It is the basis for novel classification schemes using Support Vector Machines for ASR (Venkataramani, Chakrabartty and Byrne '03)
- Future work: investigate applications within the MALACH ASR project

Conclusions: Machine Translation

- The Weighted Finite State Transducer Alignment Template Translation Model:
  - a powerful modeling framework for Machine Translation
  - a novel approach to generating word alignments and alignment lattices under this model
- MBR classifiers for bitext word alignment and translation:
  - alignment and translation can be tuned under specific loss functions
  - syntactic features from English parsers and Part-of-Speech taggers can be integrated into a statistical MT system via appropriate definitions of loss functions

Proposed Research

- Refinements to the Alignment Template Translation Model:
  - iterative parameter re-estimation via Expectation Maximization procedures; the model is currently initialized from bitext word alignments
  - alignment lattices: posterior distributions over hidden variables; improvements expected in alignment and translation performance
  - reformulation as a source-channel model
  - new strategies for template selection
- MBR classifiers for bitext word alignment and translation:
  - loss functions based on detailed models of translation
  - extend the search space to translation lattices

Thank you!

References

- V. Goel and W. Byrne. 2000. Minimum Bayes-Risk Decoding for Automatic Speech Recognition. Computer Speech and Language.
- S. Kumar and W. Byrne. 2002. Risk-Based Lattice Cutting for Segmental Minimum Bayes-Risk Decoding. In Proceedings of the International Conference on Spoken Language Processing, Denver, CO.
- V. Goel, S. Kumar and W. Byrne. 2003. Segmental Minimum Bayes-Risk Decoding for Automatic Speech Recognition. IEEE Transactions on Speech and Audio Processing, to appear.
- S. Kumar and W. Byrne. 2002. Minimum Bayes-Risk Word Alignments of Bilingual Texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA.
- S. Kumar and W. Byrne. 2003. A Weighted Finite State Transducer Implementation of the Alignment Template Model for Statistical Machine Translation. In Proceedings of the Conference on Human Language Technology, Edmonton, AB, Canada.