Improved Word Alignments for Statistical Machine Translation

Improved Word Alignments for Statistical Machine Translation
Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart
Universität Heidelberg, 2008.01.17

Outline
- Intro to statistical machine translation (SMT)
  - How to build an SMT system
  - SMT terminology
  - What are word alignments?
- Improving word alignments for SMT
  - Evaluating quality
  - New model
  - New training algorithm

How to Build an SMT System
- Start with a large parallel corpus
  - Consists of document pairs (a document and its translation)
- Sentence alignment: in each document pair, automatically find those sentences which are translations of one another
  - Results in sentence pairs (a sentence and its translation)
- Word alignment: in each sentence pair, automatically annotate those words which are translations of one another
  - Results in word-aligned sentence pairs

How to Build an SMT System
- Construct a function g which, given a sentence in the source language and a hypothesized translation into the target language, assigns a goodness score
  - g(die Waschmaschine läuft, the washing machine is running) = high number
  - g(die Waschmaschine läuft, the car drove) = low number

How to Build an SMT System
- Implement a search algorithm which, given a source language sentence, finds the target language sentence that maximizes g
- Problem: exhaustively searching this space is intractable
- Need an auxiliary function h that returns an approximate goodness score for only a part of the target sentence
- Using h, gradually build the target sentence from left to right (a sketch follows below)
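
To make the left-to-right construction concrete, here is a minimal, hypothetical sketch of greedy search with a partial-hypothesis score h. The toy translation table, bigram scores, and sentences are invented for illustration; this is not the decoder described in the talk, which also handles reordering and wider search.

```python
# Minimal sketch (not the talk's actual decoder): greedily extend a target
# hypothesis left to right, scoring partial hypotheses with an approximate h.
# The toy translation table and bigram scores are invented for illustration.

import math

TRANSLATIONS = {           # t(target | source), toy values
    "die": {"the": 0.7},
    "waschmaschine": {"washing machine": 0.6, "washer": 0.3},
    "läuft": {"is running": 0.5, "runs": 0.4},
}

BIGRAM = {                 # toy target language model scores
    ("<s>", "the"): 0.5, ("the", "washing machine"): 0.4,
    ("the", "washer"): 0.2, ("washing machine", "is running"): 0.5,
    ("washing machine", "runs"): 0.2, ("washer", "is running"): 0.3,
    ("washer", "runs"): 0.3,
}

def h(prev, candidate, t_prob):
    """Approximate goodness of extending the partial hypothesis by one step."""
    lm = BIGRAM.get((prev, candidate), 0.01)
    return math.log(t_prob) + math.log(lm)

def greedy_decode(source_words):
    target, prev = [], "<s>"
    for src in source_words:
        options = TRANSLATIONS.get(src, {src: 1.0})   # pass unknowns through
        best = max(options, key=lambda cand: h(prev, cand, options[cand]))
        target.append(best)
        prev = best
    return " ".join(target)

print(greedy_decode(["die", "waschmaschine", "läuft"]))
# -> "the washing machine is running"
```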

Using the SMT System
- To translate a new, unseen sentence with our SMT system, call the search algorithm
  - It returns its determination of the best target language sentence
- To see whether your SMT system works well, do this for a large number of unseen sentences and evaluate the results

SMT Models
- We wish to build a machine translation system which, given a Foreign sentence f, produces its English translation e
- We build a model of P(e | f), the probability of the sentence e given the sentence f
- To translate a Foreign text f, choose the English text e which maximizes P(e | f)

Noisy Channel: Decomposing P(e | f)
- argmax_e P(e | f) = argmax_e P(f | e) P(e)
- P(e) is referred to as the language model
  - P(e) can be modeled using standard models (N-grams, etc.)
  - Parameters of P(e) can be estimated using large amounts of monolingual text (English)
- P(f | e) is referred to as the translation model
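
Spelling out the decomposition: this is just Bayes' rule applied to the slide's formula, with P(f) dropped because it is constant with respect to e.

```latex
\hat{e} \;=\; \operatorname*{arg\,max}_{e} P(e \mid f)
       \;=\; \operatorname*{arg\,max}_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
       \;=\; \operatorname*{arg\,max}_{e} P(f \mid e)\, P(e)
```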

SMT Terminology
- Parameterized model: the form of the function g which is used to determine the goodness of a translation
- g(die Waschmaschine läuft, the washing machine is running) = P(e | f)
- P(the washing machine is running | die Waschmaschine läuft) =
  n(1 | die) t(the | die)
  n(2 | Waschmaschine) t(washing | Waschmaschine) t(machine | Waschmaschine)
  n(2 | läuft) t(is | läuft) t(running | läuft)
  l(the | START) l(washing | the) l(machine | washing) l(is | machine) l(running | is)

SMT Terminology
- Parameters: lookup tables used in the function g
- P(the washing machine is running | die Waschmaschine läuft) =
  n(1 | die) t(the | die)
  n(2 | Waschmaschine) t(washing | Waschmaschine) t(machine | Waschmaschine)
  n(2 | läuft) t(is | läuft) t(running | läuft)
  l(the | START) l(washing | the) l(machine | washing) l(is | machine) l(running | is)
  = 0.1 x 0.1 x 0.5 x 0.8 x 0.7 x 0.1 x 0.1 x 0.1 x 0.0000001

SMT Terminology
- Parameters: lookup tables used in the function g
- P(the washing machine is running | die Waschmaschine läuft) =
  0.1 x 0.1 x 0.5 x 0.8 x 0.7 x 0.1 x 0.1 x 0.1 x 0.0000001
- Change "washing machine" to "car": the Waschmaschine terms become n(1 | Waschmaschine) t(car | Waschmaschine), e.g.
  0.1 x 0.1 x 0.0001 x 0.1 x 0.1 x 0.1 x ... (the l(...) terms are also different)
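
A minimal sketch of the "parameters are lookup tables" idea: the model score is simply a product of looked-up values. The tables below reuse the slide's illustrative numbers where given and invent the rest; they are not trained estimates.

```python
# Sketch only: score a translation as a product of looked-up parameters.
# Parameter values are toy numbers for illustration, not trained estimates.

from functools import reduce

n = {("1", "die"): 0.1, ("2", "Waschmaschine"): 0.5, ("2", "läuft"): 0.1}   # fertility n(k | src)
t = {("the", "die"): 0.1, ("washing", "Waschmaschine"): 0.8,
     ("machine", "Waschmaschine"): 0.7, ("is", "läuft"): 0.1,
     ("running", "läuft"): 0.1}                                             # translation t(tgt | src)
l = {("the", "START"): 0.4, ("washing", "the"): 0.3, ("machine", "washing"): 0.6,
     ("is", "machine"): 0.2, ("running", "is"): 0.3}                        # target bigram l(w | prev), invented

factors = [
    n[("1", "die")], t[("the", "die")],
    n[("2", "Waschmaschine")], t[("washing", "Waschmaschine")], t[("machine", "Waschmaschine")],
    n[("2", "läuft")], t[("is", "läuft")], t[("running", "läuft")],
    l[("the", "START")], l[("washing", "the")], l[("machine", "washing")],
    l[("is", "machine")], l[("running", "is")],
]

score = reduce(lambda a, b: a * b, factors, 1.0)
print(f"P(the washing machine is running | die Waschmaschine läuft) ≈ {score:.3g}")
```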

SMT Terminology
- Training: automatically building the lookup tables used in g, using parallel sentences
- One way to determine t(the | die):
  - Generate a word alignment for each sentence pair
  - Look through the word-aligned sentence pairs
  - Count the number of times die is translated as the
  - Divide by the number of times die is translated
  - If this is 10% of the time, we set t(the | die) = 0.1
  (a counting sketch follows below)
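
A sketch of the counting step just described, assuming the word-aligned sentence pairs are available as lists of (source index, target index) links. The tiny two-sentence "corpus" is invented for illustration.

```python
# Sketch: estimate t(target | source) by relative frequency over word-aligned
# sentence pairs. The two-sentence corpus below is invented for illustration.

from collections import Counter, defaultdict

# Each item: (source words, target words, alignment links as (src_idx, tgt_idx))
aligned_corpus = [
    (["die", "katze"], ["the", "cat"], [(0, 0), (1, 1)]),
    (["die", "waschmaschine", "läuft"],
     ["the", "washing", "machine", "is", "running"],
     [(0, 0), (1, 1), (1, 2), (2, 3), (2, 4)]),
]

pair_counts = defaultdict(Counter)   # pair_counts[src][tgt] = link count
for src_words, tgt_words, links in aligned_corpus:
    for i, j in links:
        pair_counts[src_words[i]][tgt_words[j]] += 1

def t(tgt, src):
    counts = pair_counts[src]
    return counts[tgt] / sum(counts.values()) if counts else 0.0

print(t("the", "die"))                  # 2 of 2 links -> 1.0 in this toy corpus
print(t("washing", "waschmaschine"))    # 1 of 2 links -> 0.5
```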

Evaluation
- Evaluation metric: a method for assigning a numeric score to a set of hypothesized translations
- Automatic evaluation metrics often rely on comparison with previously completed human translations
- BLEU compares the 1-, 2-, 3-, and 4-gram overlap with one to four human translations
- BLEU penalizes generating long strings
- BLEU works well for comparing two similar MT systems
(a simplified BLEU sketch follows below)
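
For illustration only, a much-simplified sentence-level BLEU sketch: clipped 1–4-gram precision against one or more references plus the brevity penalty. Real BLEU is computed over a whole test set and with smoothing details this sketch omits.

```python
# Simplified BLEU sketch (corpus-level aggregation and smoothing omitted).
# Computes clipped 1..4-gram precision against reference translations,
# then applies the brevity penalty for overly short hypotheses.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis, references, max_n=4):
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp, n)
        if not hyp_ngrams:
            return 0.0
        # Clip each hypothesis n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for ng, c in ngrams(ref, n).items():
                max_ref[ng] = max(max_ref[ng], c)
        clipped = sum(min(c, max_ref[ng]) for ng, c in hyp_ngrams.items())
        prec = clipped / sum(hyp_ngrams.values())
        if prec == 0.0:
            return 0.0
        log_prec_sum += math.log(prec)
    closest_ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    brevity = min(1.0, math.exp(1 - closest_ref_len / len(hyp)))
    return brevity * math.exp(log_prec_sum / max_n)

print(simple_bleu("the washing machine is running",
                  ["the washing machine is running", "the washer runs"]))
# -> 1.0 (exact match with one of the references)
```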

SMT Last Words
- Translating is usually referred to as decoding (Warren Weaver)
- SMT was invented by ASR (Automatic Speech Recognition) researchers
- In ASR: P(e) is the language model, P(f | e) is the acoustic model
- However, SMT must also deal with word reordering!

Word Alignments
- Recall that we build translation models from word-aligned parallel sentences
- The statistics involved in state-of-the-art SMT translation models are simple: just count translations in the word-aligned parallel sentences
- But what is a word alignment, and how do we obtain it?

- Word alignment is the annotation of minimal translational correspondences
- Annotated in the context in which they occur
- Not idealized translations!
(figure: solid blue lines annotated by a bilingual expert)

Word Alignments
- Mathematically, P(f | e) = Σ_a P(f, a | e)
- An alignment a represents one way f could be generated from e
- But for the models discussed today, we approximate: P(f | e) ≈ max_a P(f, a | e)

- Automatic word alignments are typically generated using a model called IBM Model 4
- No linguistic knowledge
- No correct answers are supplied to the system: unsupervised learning
(figure: red dashed line = automatically generated hypothesis)

Overview: Improving Word Alignment
- Solving problems with:
  - Measuring word alignment quality
  - Modeling word alignments
  - The knowledge-free training process

How to Measure Alignment Quality?
- If we want to compare word alignment algorithms, we can generate a word alignment with each algorithm
- Then build an SMT system from each alignment
- Compare performance of the SMT systems using BLEU
- But this is slow: building SMT systems can take days of computation
- Question: can we have an automatic metric like BLEU, but for alignment?
- Answer: there are several metrics already defined; they involve comparison with gold standard alignments

Problem: Existing Metrics Do Not Track Translation Quality
- Dozens of papers at ACL, NAACL, HLT, COLING, WPT03, WPT05, etc., report word alignment quality increases using various metrics
- Contradiction: few of these report translation results
- Those that do report inconclusive gains
- This is because the two commonly used metrics, Alignment Error Rate (AER) and balanced F-Measure, do not correlate with MT performance!
- We will show that these metrics have low correlation with BLEU

Measuring Precision and Recall
- Start by fully linking the hypothesized alignments

Measuring Precision and Recall
- Precision is the percentage of links in the hypothesis that are correct
  - If we hypothesize no links, we have 100% precision
- Recall is the percentage of correct links we hypothesized
  - If we hypothesize all possible links, we have 100% recall
- We will test metrics which formally define and combine these in different ways

Evaluating Alignment Error Rate
- Does the widely used Alignment Error Rate (AER) metric correlate with BLEU?
- Use our baseline unsupervised alignment system in combination with three symmetrization heuristics (union, refined, intersection)
  - One of these is usually used to build MT systems
  - The effect is having three very different alignment systems

Alignment Error Rate (AER)
(figure: gold and hypothesized alignment grids over words e1–e4 and f1–f5)
- Precision(A, P) = |P ∩ A| / |A| = 3/4   ((e3, f4) is wrong)
- Recall(A, S) = |S ∩ A| / |S| = 2/3   ((e2, f3) is not in the hypothesis)
- AER(A, P, S) = 1 − (|P ∩ A| + |S ∩ A|) / (|A| + |S|) = 2/7
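
A small sketch of these definitions over link sets, where A is the hypothesized alignment, S the sure gold links, and P ⊇ S the possible gold links. The example sets are invented (the slide's actual grids are not reproduced here) but are chosen so the outputs match the worked values above.

```python
# Sketch: alignment precision, recall, and AER over sets of (e, f) links.
# A = hypothesized links, S = sure gold links, P = possible gold links (S ⊆ P).
# The example sets below are invented; they merely reproduce the values 3/4, 2/3, 2/7.

def precision(A, P):
    return len(A & P) / len(A)

def recall(A, S):
    return len(A & S) / len(S)

def aer(A, P, S):
    return 1.0 - (len(A & P) + len(A & S)) / (len(A) + len(S))

S = {("e1", "f1"), ("e2", "f3"), ("e4", "f5")}
P = S | {("e3", "f5")}
A = {("e1", "f1"), ("e3", "f4"), ("e3", "f5"), ("e4", "f5")}

print(precision(A, P))   # 0.75
print(recall(A, S))      # 0.666...
print(aer(A, P, S))      # 0.2857... = 2/7
```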

Experiment
- Desideratum: keep everything constant in a set of SMT systems except the word-level alignments
- Alignments should be realistic
- Experiment: take a parallel corpus of 8M words of Foreign-English, word-align it, build an SMT system, report AER and BLEU
  - For better alignments: train on 16M, 32M, 64M words (but use only the 8M words for MT building)
  - For worse alignments: train on 1/2, 1/4, 1/8 of the 8M-word training corpus
- If AER is a good indicator of MT performance, 1 − AER and BLEU should correlate no matter how the alignments are built (union, intersection, refined)
  - Low 1 − AER scores should correspond to low BLEU scores
  - High 1 − AER scores should correspond to high BLEU scores

AER is not a good indicator of MT performance (figure)

- AER is wrongly derived from F-Measure (this can be shown analytically)
- For details, see the squib in Computational Linguistics (Sept 2007)
- Important: AER incorrectly favors sparse alignments (many unlinked words)

F-alpha Score
- We will try a different evaluation metric called the F-alpha score
- The alpha refers to a parameter tuned to favor either precision or recall

F-alpha Score
(figure: gold and hypothesized alignment grids over words e1–e4 and f1–f5)
- Precision(A, S) = |S ∩ A| / |A| = 3/4   ((e3, f4) is wrong)
- Recall(A, S) = |S ∩ A| / |S| = 3/5   ((e2, f3) and (e3, f5) are not in the hypothesis)
- F(A, S, α) = 1 / (α / Precision(A, S) + (1 − α) / Recall(A, S))
- Called the F-alpha score to differentiate it from the ambiguous term F-Measure
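
The F-alpha combination itself, as a tiny function that could reuse the precision/recall helpers from the sketch above; α close to 1 emphasizes precision, α close to 0 emphasizes recall.

```python
# Sketch: F-alpha score, a weighted harmonic mean of precision and recall.
# alpha near 1 emphasizes precision; alpha near 0 emphasizes recall.

def f_alpha(prec, rec, alpha):
    if prec == 0.0 or rec == 0.0:
        return 0.0
    return 1.0 / (alpha / prec + (1.0 - alpha) / rec)

print(f_alpha(0.75, 0.60, alpha=0.4))   # the slide's example values P = 3/4, R = 3/5
```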

The F-alpha score is a good indicator of MT performance (figure, α = 0.4)

- We have a way to rapidly measure alignment quality for SMT
- We will now look at alignment modeling

Problem: Existing Models Have the Wrong Structure
- Existing generative models make false assumptions about alignment structure
- Proposed discriminative models either:
  - depend on generative models for their best results, or
  - make false assumptions about structure themselves

1-to-N Assumption
- Multi-word cepts (words in one language translated as a unit) are only allowed on the target side; the source side is limited to single-word cepts
- Forced to create M-to-N alignments using heuristics, e.g. union

LEAF Generative Story (all)

LEAF Generative Story (0)

LEAF Generative Story (1)
- Explicitly model three word types:
  - Head word: provides most of the conditioning for translation; a robust representation of multi-word cepts (for this task); this is to semantics as "syntactic head word" is to syntax
  - Non-head word: attached to a head word
  - Deleted source words and spurious target words

LEAF Generative Story (2)
- Stochastically attach the non-head words to a head word (using distance and the non-head word class)

LEAF Generative Story (3)
- Generate exactly one target head word from each source head word

LEAF Generative Story (4)
- Decide how big the target cepts will be (using the source head word and whether the source cept is only one word)

LEAF Generative Story (5)
- Decide the number of spurious words (using the number of non-spurious words)

LEAF Generative Story (6)
- Generate the spurious words

LEAF Generative Story (7)
- Generate the target non-head words in each cept, conditioned on the source head word and the target head word class

LEAF Generative Story (8)
- For each cept, place the target head word and then the non-head words (relative distortion model)

LEAF Generative Story (9)
- Place the spurious words

LEAF
- LEAF can score the same structure in both directions
- The full formula for one direction is shown on the slide (please do not try to read it)

Comparing LEAF with Model 4
- Model 4 does not allow source cepts to be more than one word
  - This requires us to use heuristics to account for multiple-word constructions
- LEAF allows multi-word source cepts
- LEAF is able to use the head-word relationship to better model both the source cept and the target cept

Unsupervised Training with EM
- Expectation Maximization (EM)
  - Unsupervised learning
  - Maximizes the likelihood of the training data
  - Likelihood is (informally) the probability the model assigns to the training data
- E-step: predict according to the current parameters
- M-step: re-estimate the parameters from the predictions
- Amazing but true: if we iterate E and M steps, we increase the likelihood!
(a small EM sketch follows below)
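
To make the E/M alternation concrete, here is a minimal IBM Model 1 style EM sketch for a translation table t(f | e). It illustrates the general recipe (expected counts in the E-step, renormalization in the M-step), not the LEAF model or its search; the two-sentence corpus is invented.

```python
# Minimal IBM Model 1 style EM sketch: E-step computes expected link counts
# under the current parameters, M-step renormalizes them. This illustrates the
# EM recipe, not the LEAF model. The toy corpus is invented for illustration.

from collections import defaultdict
from itertools import product

corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}
t = {(f, e): 1.0 / len(f_vocab) for f, e in product(f_vocab, e_vocab)}  # uniform init

for iteration in range(10):
    counts = defaultdict(float)   # expected count of (f, e) links
    totals = defaultdict(float)   # expected count of e being linked to anything
    for fs, es in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                frac = t[(f, e)] / norm          # E-step: fractional link count
                counts[(f, e)] += frac
                totals[e] += frac
    for (f, e), c in counts.items():             # M-step: renormalize
        t[(f, e)] = c / totals[e]

print(round(t[("das", "the")], 3))   # converges toward 1.0 on this toy corpus
print(round(t[("haus", "the")], 3))  # converges toward 0.0
```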

The EM Algorithm
(diagram: bootstrap Viterbi alignments give the initial parameters; the E-step produces Viterbi alignments from the current translation model, the M-step produces refined parameters from them, and the cycle repeats)

Want to learn more about EM?
- See the K. Knight 1999 word alignment tutorial
- Available from www.statmt.org

M-Step
- M-step: re-estimate parameters
  - Count events in the Viterbi alignments
  - Simple smoothing: add a small fractional constant
  - Normalize to sum to 1
- Bootstrap (initial M-step)
- See the EMNLP 2007 paper for details

E-Step
- E-step: search for Viterbi alignments
- Solved using a local hillclimbing search
- Given a starting alignment, we can permute the alignment by making small changes, such as swapping the incoming links for two words
- Algorithm (a Python sketch follows below):
  Begin: given the starting alignment A, make a list of possible small changes (e.g. every possible swap of the incoming links for two words)
  for each possible small change:
      create a new alignment A2 by copying A and applying the small change
      if score(A2) > score(best) then best = A2
  end for
  choose the best alignment as the new starting point and go to Begin
- See the ACL 2006 paper for an improved local hillclimbing search
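
A generic sketch of the local hillclimbing idea, over a simplified 1-to-1 alignment representation with swap moves only. The scoring function and starting alignment are stand-ins for illustration, not LEAF's actual neighborhood or probability model.

```python
# Sketch of local hillclimbing over alignments: repeatedly apply the best
# single "small change" (here: swapping the links of two target words) until
# no change improves the score. The score function is a stand-in, not LEAF.

def neighbors(alignment):
    """All alignments reachable by swapping the incoming links of two words."""
    for i in range(len(alignment)):
        for j in range(i + 1, len(alignment)):
            swapped = list(alignment)
            swapped[i], swapped[j] = swapped[j], swapped[i]
            yield tuple(swapped)

def hillclimb(start, score):
    best = start
    while True:
        candidate = max(neighbors(best), key=score, default=best)
        if score(candidate) <= score(best):
            return best                 # local maximum reached
        best = candidate

# Toy example: alignment[i] = source position linked to target position i;
# this stand-in "model" simply prefers the identity alignment.
toy_score = lambda a: -sum(abs(src - tgt) for tgt, src in enumerate(a))
print(hillclimb((2, 0, 1, 3), toy_score))   # -> (0, 1, 2, 3)
```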

Discussion
- LEAF has powerful features, but requires approximate search
- Correct structure: M-to-N discontiguous
  - The first general-purpose statistical word alignment model of this structure!
- The head word assumption allows the use of multi-word cepts
  - Gives the power of phrase-based models, but decisions robustly decompose over words

The story so far
- We know that better alignments (as measured using the F-alpha score) lead to better MT
- We have defined LEAF, a generative model of M-to-N discontiguous alignments
- LEAF can be trained using approximate EM
- What about integrating new knowledge?
  - Light supervision (the correct alignments for a few sentence pairs)
  - Linguistic knowledge?

Existing Approaches Cannot Utilize New Knowledge
- Existing unsupervised alignment techniques cannot use manually annotated data
  - This could be useful for light supervision
- It is difficult to add new knowledge sources to generative models
  - Doing so requires completely re-engineering the generative story for each new knowledge source

Semi-Supervised Training Overview
- First, decompose the steps of the LEAF generative story into sub-models of a (log-)linear model
- This allows us to tune a vector λ which has a scalar for each sub-model controlling its contribution
  - The idea is that we might trust, for instance, the translation distribution (one sub-model) more than the cept-size distribution (another sub-model)
- This also allows us to integrate new sub-models unrelated to LEAF and adjust their weights with respect to the other sub-models
(a log-linear scoring sketch follows below)
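
A sketch of the log-linear combination: each sub-model contributes a log-score, weighted by its λ. The two sub-model functions and the weights below are placeholders for illustration, not LEAF's actual sub-models.

```python
# Sketch: log-linear combination of alignment sub-model scores. Each sub-model
# returns a log-probability-like score for (f, e, alignment); the lambda
# weights control how much each is trusted. The sub-models are placeholders.

def translation_submodel(f, e, alignment):
    return -1.2 * len(alignment)                              # placeholder log-score

def cept_size_submodel(f, e, alignment):
    return -0.5 * len({i for i, _ in alignment})              # placeholder log-score

SUBMODELS = [translation_submodel, cept_size_submodel]

def loglinear_score(f, e, alignment, lambdas):
    return sum(lam * sub(f, e, alignment)
               for lam, sub in zip(lambdas, SUBMODELS))

lambdas = [1.0, 0.3]   # in the talk, tuned on a small gold-aligned development set
a = [(0, 0), (1, 1), (1, 2)]
print(loglinear_score(["die", "waschmaschine"], ["the", "washing", "machine"], a, lambdas))
```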

Semi-Supervised Training Overview
- Then, define a semi-supervised algorithm which alternates increasing likelihood with decreasing error
  - Increasing likelihood is similar to EM
  - Discriminatively bias EM to converge to a local maximum of likelihood which corresponds to better alignments
  - Better = higher F-alpha score on a small gold standard corpus

The EMD Algorithm
- Initialize:
  - Perform an initial M-step: estimate sub-model parameters from the HMM Viterbi alignments (bootstrap)
  - Perform an initial D-step: find λ values which maximize the F-alpha score on the small gold standard word-aligned development corpus
- Repeat:
  - E-step: find Viterbi alignments using the sub-models weighted by λ
  - M-step: re-estimate sub-model parameters from the new Viterbi alignments
  - D-step: find λ values that maximize the F-alpha score on the small gold standard word-aligned development corpus
(a skeleton of this loop follows below)
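
A skeleton of the EMD loop above. The three helper functions are stubs standing in for the slide's steps (parameter estimation, λ tuning against F-alpha, Viterbi alignment) so the skeleton runs; they are named placeholders, not a real implementation.

```python
# Skeleton of the EMD loop described above. The three helpers are stubs
# standing in for the slide's steps, not real implementations.

def estimate_submodels(alignments):
    return {"stub_params": len(alignments)}            # M-step placeholder

def tune_lambdas_for_f_alpha(params, gold_dev, alpha):
    return [1.0, 1.0]                                   # D-step placeholder

def viterbi_align(corpus, params, lambdas):
    return [[(i, i) for i in range(len(f))] for f, e in corpus]   # E-step placeholder

def emd(bootstrap_alignments, corpus, gold_dev, alpha=0.4, iterations=5):
    params = estimate_submodels(bootstrap_alignments)             # initial M-step
    lambdas = tune_lambdas_for_f_alpha(params, gold_dev, alpha)   # initial D-step
    for _ in range(iterations):
        alignments = viterbi_align(corpus, params, lambdas)       # E-step
        params = estimate_submodels(alignments)                   # M-step
        lambdas = tune_lambdas_for_f_alpha(params, gold_dev, alpha)  # D-step
    return params, lambdas

corpus = [(["das", "haus"], ["the", "house"])]
print(emd(bootstrap_alignments=[[(0, 0), (1, 1)]], corpus=corpus, gold_dev=[]))
```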

The EMD Algorithm
(diagram: bootstrap gives initial sub-model parameters and a tuned lambda vector; the E-step produces Viterbi alignments from the current translation model, the M-step re-estimates the sub-model parameters from them, and the D-step re-tunes the lambda vector)

Previous Work: Semi-Supervised
- The usual formulation of semi-supervised learning: using unlabeled data to help supervised learning
  - Build a supervised system using labeled data
  - Predict on the unlabeled data
  - Iterate (estimating from both the labeled data and the predictions on the unlabeled data)
- We do not have enough gold standard word alignments to estimate parameters directly!
- EMD allows us to train a small number of important parameters discriminatively and the rest using likelihood maximization, and allows the two to interact

Story so far
- We've now presented a new metric, a new model, and a new semi-supervised training algorithm
- We've reformulated LEAF as a log-linear model and added additional sub-models
- We will train this model using the semi-supervised EMD training algorithm to maximize the F-alpha score
- How well does this work?

Experiments
- French/English: LDC Hansard (67M English words)
  - 110 gold standard aligned sentences
  - MT: Alignment Templates, phrase-based
- Arabic/English: NIST 2006 task (168M English words)
  - 1000 gold standard aligned sentences
  - MT: Hiero, hierarchical phrases

Results (French/English and Arabic/English)

System                                French/English                       Arabic/English
                                      F-Measure (α = 0.4)  BLEU (1 ref)    F-Measure (α = 0.1)  BLEU (4 refs)
IBM Model 4 (GIZA++) and heuristics   73.5                 30.63           75.8                 51.55
EMD (ACL 2006 model) and heuristics   74.1                 31.40           79.1                 52.89
LEAF + EMD                            76.3                 31.86           84.5                 54.34

Contributions
- Found a metric for measuring alignment quality which correlates with MT quality
- Designed LEAF, the first generative model of M-to-N discontiguous alignments
- Developed a semi-supervised training algorithm, the EMD algorithm
- Obtained large gains of 1.2 BLEU and 2.8 BLEU points for the French/English and Arabic/English tasks

Much of the presented work was joint work with Daniel Marcu, ISI (Univ. of Southern California).
Thank You! Dankeschön!