Improved Word Alignments for Statistical Machine Translation

Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart
2008.01.17, Universität Heidelberg

Outline
- Intro to statistical machine translation (SMT)
  - How to build an SMT system
  - SMT terminology
  - What are word alignments?
- Improving word alignments for SMT
  - Evaluating quality
  - New model
  - New training algorithm

How to Build an SMT System
- Start with a large parallel corpus
  - Consists of document pairs (a document and its translation)
- Sentence alignment: in each document pair, automatically find those sentences which are translations of one another
  - Results in sentence pairs (a sentence and its translation)
- Word alignment: in each sentence pair, automatically annotate those words which are translations of one another
  - Results in word-aligned sentence pairs

How to Build an SMT System
- Construct a function g which, given a sentence in the source language and a hypothesized translation into the target language, assigns a goodness score
  g(die Waschmaschine läuft, the washing machine is running) = high number
  g(die Waschmaschine läuft, the car drove) = low number

How to Build an SMT System
- Implement a search algorithm which, given a source language sentence, finds the target language sentence which maximizes g
- Problem: exhaustively searching this space is intractable
  - Need an auxiliary function h that returns an approximate goodness score for only a part of the target sentence
  - Using h, gradually build the target sentence from left to right

Using the SMT System
- To use our SMT system to translate a new, unseen sentence, call the search algorithm
  - It returns its determination of the best target language sentence
- To see if your SMT system works well, do this for a large number of unseen sentences and evaluate the results

SMT Models
- We wish to build a machine translation system which, given a Foreign sentence f, produces its English translation e
  - We build a model of P(e | f), the probability of the sentence e given the sentence f
  - To translate a Foreign text f, choose the English text e which maximizes P(e | f)

Noisy Channel: Decomposing P(e | f)
$$\arg\max_e P(e \mid f) = \arg\max_e P(f \mid e)\, P(e)$$
- P(e) is referred to as the language model
  - P(e) can be modeled using standard models (N-grams, etc.)
  - Parameters of P(e) can be estimated using large amounts of monolingual text (English)
- P(f | e) is referred to as the translation model
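The decomposition can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the two lookup tables and all their probabilities are invented, the candidate list is given in advance rather than searched for, and the names `TRANSLATION_MODEL`, `LANGUAGE_MODEL`, `noisy_channel_score` and `decode` are hypothetical.

```python
import math

# Hypothetical toy probabilities, invented purely for illustration.
TRANSLATION_MODEL = {  # P(f | e)
    ("die Waschmaschine läuft", "the washing machine is running"): 0.02,
    ("die Waschmaschine läuft", "the car drove"): 0.00001,
}
LANGUAGE_MODEL = {  # P(e)
    "the washing machine is running": 0.0001,
    "the car drove": 0.0005,
}


def noisy_channel_score(f, e):
    """Return log P(f | e) + log P(e); events missing from the tables score -inf."""
    p_f_given_e = TRANSLATION_MODEL.get((f, e), 0.0)
    p_e = LANGUAGE_MODEL.get(e, 0.0)
    if p_f_given_e == 0.0 or p_e == 0.0:
        return float("-inf")
    return math.log(p_f_given_e) + math.log(p_e)


def decode(f, candidates):
    """Pick the candidate e that maximizes P(f | e) * P(e)."""
    return max(candidates, key=lambda e: noisy_channel_score(f, e))


best = decode("die Waschmaschine läuft", list(LANGUAGE_MODEL))
print(best)
```

Note that the language model alone prefers "the car drove" in this toy table; it is the translation model factor that tips the product toward the correct sentence.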

SMT Terminology
- Parameterized model: the form of the function g which is used to determine the goodness of a translation

g(die Waschmaschine läuft, the washing machine is running) = P(e | f)

P(the washing machine is running | die Waschmaschine läuft) =
  n(1 | die) t(the | die)
  n(2 | Waschmaschine) t(washing | Waschmaschine) t(machine | Waschmaschine)
  n(2 | läuft) t(is | läuft) t(running | läuft)
  l(the | START) l(washing | the) l(machine | washing) l(is | machine) l(running | is)

SMT Terminology
- Parameters: lookup tables used in the function g

P(the washing machine is running | die Waschmaschine läuft) =
  n(1 | die) t(the | die)                                                       0.1 x 0.1
  n(2 | Waschmaschine) t(washing | Waschmaschine) t(machine | Waschmaschine)    x 0.5 x 0.8 x 0.7
  n(2 | läuft) t(is | läuft) t(running | läuft)                                 x 0.1 x 0.1 x 0.1
  l(the | START) l(washing | the) l(machine | washing) l(is | machine) l(running | is)    x 0.0000001

SMT Terminology
- Parameters: lookup tables used in the function g
- Change "washing machine" to "car":
  n(1 | die) t(the | die)                          0.1 x 0.1
  n(1 | Waschmaschine) t(car | Waschmaschine)      x 0.1 x 0.0001
  n(2 | läuft) t(is | läuft) t(running | läuft)    x 0.1 x 0.1 x 0.1
  language model factors                           x (also different)
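Multiplying out the slide's example values shows how strongly the lookup tables separate the two hypotheses. The sketch below assumes that the values pair up with the n and t factors in the order listed, and it compares only the translation-model factors, because the slide gives no language-model value for the "car" variant. The variable names are hypothetical.

```python
import math

# Values as listed on the slide, in order. Pairing each value with one
# n(.) or t(.) factor is an assumption based on that order.
washing_machine_factors = [0.1, 0.1, 0.5, 0.8, 0.7, 0.1, 0.1, 0.1]
language_model_factor = 0.0000001

# "car" variant: n(1 | Waschmaschine) and t(car | Waschmaschine) replace
# the three Waschmaschine factors of the original hypothesis.
car_factors = [0.1, 0.1, 0.1, 0.0001, 0.1, 0.1, 0.1]

tm_washing = math.prod(washing_machine_factors)
tm_car = math.prod(car_factors)
p_washing = tm_washing * language_model_factor

print(f"translation-model score, washing machine: {tm_washing:.3g}")
print(f"translation-model score, car:             {tm_car:.3g}")
print(f"full score, washing machine:              {p_washing:.3g}")
```

Under these assumptions the "washing machine" hypothesis scores several orders of magnitude higher than the "car" hypothesis before the language model is even consulted.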

SMT Terminology
- Training: automatically building the lookup tables used in g, using parallel sentences
- One way to determine t(the | die):
  - Generate a word alignment for each sentence pair
  - Look through the word-aligned sentence pairs
  - Count the number of times "die" is translated as "the"
  - Divide by the number of times "die" is translated
  - If this is 10% of the time, we set t(the | die) = 0.1
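The counting recipe above can be sketched as relative-frequency estimation over word-aligned sentence pairs. The three-sentence corpus, its alignments, and the function name `estimate_t` are invented for illustration; a real system would count over millions of sentence pairs.

```python
from collections import Counter, defaultdict


def estimate_t(aligned_pairs):
    """Estimate t(target | source) by relative frequency.

    aligned_pairs: iterable of (source_words, target_words, links), where
    links is a list of (source_index, target_index) pairs.
    """
    counts = defaultdict(Counter)
    for source, target, links in aligned_pairs:
        for i, j in links:
            counts[source[i]][target[j]] += 1
    table = {}
    for source_word, target_counts in counts.items():
        total = sum(target_counts.values())
        table[source_word] = {w: n / total for w, n in target_counts.items()}
    return table


# Hypothetical toy corpus with hand-written alignments.
corpus = [
    (["die", "Waschmaschine"], ["the", "washing", "machine"], [(0, 0), (1, 1), (1, 2)]),
    (["die", "Frau"], ["the", "woman"], [(0, 0), (1, 1)]),
    (["die", "Autos"], ["those", "cars"], [(0, 0), (1, 1)]),
]
t = estimate_t(corpus)
print(t["die"])
```

In this toy corpus "die" is linked three times, twice to "the", so the estimate for t(the | die) is 2/3.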

Evaluation
- Evaluation metric: method for assigning a numeric score to a set of hypothesized translations
- Automatic evaluation metrics often rely on comparison with previously completed human translations
- BLEU compares the 1,2,3,4-gram overlap with one to four human translations
  - BLEU penalizes generating overly short strings (the brevity penalty)
  - BLEU works well for comparing two similar MT systems

SMT Last Words
- Translating is usually referred to as decoding (Warren Weaver)
- SMT was invented by ASR (Automatic Speech Recognition) researchers. In ASR:
  - P(e) = language model
  - P(f | e) = acoustic model
- However, SMT must deal with word reordering!

Word Alignments
- Recall that we build translation models from word-aligned parallel sentences
  - The statistics involved in state-of-the-art SMT translation models are simple
  - Just count translations in the word-aligned parallel sentences
- But what is a word alignment, and how do we obtain it?

- Word alignment is annotation of minimal translational correspondences
- Annotated in the context in which they occur
- Not idealized translations!
(Figure: solid blue lines annotated by a bilingual expert)

Word Alignments
- Mathematically, P(f | e) is a sum over alignments:
  $$P(f \mid e) = \sum_a P(f, a \mid e)$$
- An alignment represents one way f could be generated from e
- But for the models discussed today, we approximate!
  $$P(f \mid e) \approx \max_a P(f, a \mid e)$$

- Automatic word alignments are typically generated using a model called IBM Model 4
- No linguistic knowledge
- No correct answers are supplied to the system: unsupervised learning
(Figure: red dashed line = automatically generated hypothesis)

Overview: Improving Word Alignment
- Solving problems with:
  - Measuring word alignment quality
  - Modeling word alignments
  - Knowledge-free training process

How to measure alignment quality?
- If we want to compare word alignment algorithms, we can generate a word alignment with each algorithm
  - Then build an SMT system from each alignment
  - Compare performance of the SMT systems using BLEU
- But this is slow: building SMT systems can take days of computation
  - Question: can we have an automatic metric like BLEU, but for alignment?
  - Answer: several metrics are already defined; they involve comparison with gold standard alignments

Problem: Existing Metrics Do Not Track Translation Quality
- Dozens of papers at ACL, NAACL, HLT, COLING, WPT03, WPT05, etc., report word alignment quality increases using various metrics
- Contradiction: few of these report translation results
- Those that do report inconclusive gains
- This is because the two commonly used metrics, Alignment Error Rate (AER) and balanced F-Measure, do not correlate with MT performance!
- We will show that these metrics have low correlation with BLEU

Measuring Precision and Recall
- Start by fully linking hypothesized alignments

Measuring Precision and Recall
- Precision is the percentage of links in the hypothesis that are correct
  - If we hypothesize there are no links, we have 100% precision
- Recall is the percentage of correct links we hypothesized
  - If we hypothesize all possible links, we have 100% recall
- We will test metrics which formally define and combine these in different ways

Evaluating Alignment Error Rate
- Does the widely used Alignment Error Rate (AER) metric correlate with BLEU?
- Use our baseline unsupervised alignment system in combination with three symmetrization heuristics (union, refined, intersection)
  - One of these is usually used to build MT systems
  - The effect is having three very different alignment systems

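Alignment Error Rate (AER)
(Figure: gold and hypothesis alignment grids over f1 to f5 and e1 to e4. A is the set of hypothesized links, S the set of sure gold links, P the set of possible gold links.)

$$\mathrm{Precision}(A, P) = \frac{|P \cap A|}{|A|} = \frac{3}{4} \qquad \text{((e3,f4) is wrong)}$$

$$\mathrm{Recall}(A, S) = \frac{|S \cap A|}{|S|} = \frac{2}{3} \qquad \text{((e2,f3) is not in the hypothesis)}$$

$$\mathrm{AER}(A, P, S) = 1 - \frac{|P \cap A| + |S \cap A|}{|S| + |A|} = \frac{2}{7}$$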
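The three quantities on this slide can be checked with set arithmetic. The link sets below are hypothetical: they are chosen only so that the counts match the slide (four hypothesized links, three sure links, (e3,f4) wrong, (e2,f3) missing), and they do not claim to reproduce the figure.

```python
from fractions import Fraction


def precision(A, P):
    """Fraction of hypothesized links that are possible gold links."""
    return Fraction(len(A & P), len(A))


def recall(A, S):
    """Fraction of sure gold links that were hypothesized."""
    return Fraction(len(A & S), len(S))


def aer(A, P, S):
    """Alignment Error Rate: 1 - (|P n A| + |S n A|) / (|S| + |A|)."""
    return 1 - Fraction(len(A & P) + len(A & S), len(A) + len(S))


# Hypothetical link sets that reproduce the counts on the slide.
S = {("e1", "f1"), ("e2", "f3"), ("e4", "f5")}   # sure gold links
P = S | {("e3", "f2")}                           # possible gold links (superset of S)
A = {("e1", "f1"), ("e3", "f2"), ("e3", "f4"), ("e4", "f5")}  # hypothesis

print(precision(A, P), recall(A, S), aer(A, P, S))
```

Using `Fraction` keeps the results exact, so they can be compared directly with the 3/4, 2/3 and 2/7 on the slide.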

Experiment
- Desideratum:
  - Keep everything constant in a set of SMT systems except the word-level alignments
  - Alignments should be realistic
- Experiment:
  - Take a parallel corpus of 8M words of Foreign-English. Word-align it. Build an SMT system. Report AER and BLEU.
  - For better alignments: train on 16M, 32M, 64M words (but use only the 8M words for MT building).
  - For worse alignments: train on 2 × 1/2, 4 × 1/4, 8 × 1/8 of the 8M word training corpus.
- If AER is a good indicator of MT performance, 1 − AER and BLEU should correlate no matter how the alignments are built (union, intersection, refined)
  - Low 1 − AER scores should correspond to low BLEU scores
  - High 1 − AER scores should correspond to high BLEU scores

AER is not a good indicator of MT performance

- AER is wrongly derived from F-Measure (can be shown analytically)
  - For details see the Squib in Comp. Ling. (Sept 2007)
- Important: AER incorrectly favors sparse alignments (many unlinked words)

Fα-score
- We will try a different evaluation metric called Fα-score
- The alpha refers to a parameter tuned to favor either precision or recall

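Fα-score
(Figure: gold and hypothesis alignment grids over f1 to f5 and e1 to e4.)

$$\mathrm{Precision}(A, S) = \frac{|S \cap A|}{|A|} = \frac{3}{4} \qquad \text{((e3,f4) is wrong)}$$

$$\mathrm{Recall}(A, S) = \frac{|S \cap A|}{|S|} = \frac{3}{5} \qquad \text{((e2,f3) and (e3,f5) are not in the hypothesis)}$$

$$F(A, S, \alpha) = \frac{1}{\dfrac{\alpha}{\mathrm{Precision}(A, S)} + \dfrac{1 - \alpha}{\mathrm{Recall}(A, S)}}$$

- Called Fα-score to differentiate it from the ambiguous term F-Measure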
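A short sketch makes the role of α concrete. The link sets are hypothetical and chosen only to match the counts on the slide (precision 3/4, recall 3/5); the function name `f_alpha` and the zero-overlap convention are assumptions of this sketch.

```python
from fractions import Fraction


def f_alpha(A, S, alpha):
    """F(A, S, alpha) = 1 / (alpha / Precision + (1 - alpha) / Recall)."""
    overlap = len(A & S)
    if overlap == 0:
        return Fraction(0)  # convention assumed here: no correct links scores 0
    p = Fraction(overlap, len(A))
    r = Fraction(overlap, len(S))
    return 1 / (alpha / p + (1 - alpha) / r)


# Hypothetical link sets that reproduce the counts on the slide.
S = {("e1", "f1"), ("e2", "f2"), ("e2", "f3"), ("e3", "f5"), ("e4", "f5")}  # gold
A = {("e1", "f1"), ("e2", "f2"), ("e3", "f4"), ("e4", "f5")}                # hypothesis

balanced = f_alpha(A, S, Fraction(1, 2))      # ordinary harmonic mean
recall_biased = f_alpha(A, S, Fraction(2, 5)) # alpha = 0.4
print(balanced, recall_biased)
```

With α = 1/2 the score is the usual harmonic mean of precision and recall; values of α below 1/2 give recall more weight, which lowers the score here because recall is the weaker of the two.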

Fα-score is a good indicator of MT performance (α = 0.4)

- We have a way to rapidly measure alignment quality for SMT
- We will now look at alignment modeling

Problem: Existing Models Have the Wrong Structure
- Existing generative models make false assumptions about alignment structure
- Proposed discriminative models either:
  - Depend on generative models for their best results
  - Or make false assumptions about structure themselves

1-to-N Assumption
- Multi-word cepts (words in one language translated as a unit) are only allowed on the target side. The source side is limited to single-word cepts.
- Forced to create M-to-N alignments using heuristics, e.g. union

LEAF Generative Story (all)

LEAF Generative Story (0)

LEAF Generative Story (1)
- Explicitly model three word types:
  - Head word: provides most of the conditioning for translation
    - Robust representation of multi-word cepts (for this task)
    - This is to semantics as "syntactic head word" is to syntax
  - Non-head word: attached to a head word
  - Deleted source words and spurious target words

LEAF Generative Story (2)
- Stochastically attach the non-head words to a head word (using distance and the non-head word class)

LEAF Generative Story (3)
- Generate exactly one target head word from each source head word

LEAF Generative Story (4)
- Decide how big the target cepts will be (using the source head word and whether the source cept is only one word)

LEAF Generative Story (5)
- Decide the number of spurious words (using the number of non-spurious words)

LEAF Generative Story (6)
- Generate the spurious words

LEAF Generative Story (7)
- Generate the target non-head words in each cept, conditioned on the source head word and the target head word class

LEAF Generative Story (8)
- For each cept, place the target head word and then the non-head words (relative distortion model)

LEAF Generative Story (9)
- Place the spurious words

LEAF
- Can score the same structure in both directions
- Math in one direction (please do not try to read): (equation shown on the slide, not recoverable here)

Comparing LEAF with Model 4
- Model 4 does not allow source cepts to be more than one word
  - This requires us to use heuristics to account for multiple-word constructions
- LEAF allows multiple-word source cepts
- LEAF is able to use the head-word relationship to better model both the source cept and the target cept

Unsupervised Training with EM
- Expectation Maximization (EM)
  - Unsupervised learning
  - Maximize the likelihood of the training data
    - Likelihood is (informally) the probability the model assigns to the training data
  - E-Step: predict according to current parameters
  - M-Step: reestimate parameters from predictions
  - Amazing but true: if we iterate E and M steps, we increase likelihood!
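The claim that iterating E and M steps does not decrease likelihood can be observed on a toy problem. The sketch below swaps in IBM Model 1 because it fits in a few lines; it is neither Model 4 nor LEAF, it omits the NULL word, and it uses full expected counts rather than Viterbi alignments. The three-sentence corpus and all function names are invented for illustration.

```python
import math
from collections import defaultdict


def log_likelihood(corpus, t):
    """Log P(f | e) summed over the corpus under IBM Model 1 (up to a constant)."""
    total = 0.0
    for e_words, f_words in corpus:
        for f in f_words:
            total += math.log(sum(t[(f, e)] for e in e_words) / len(e_words))
    return total


def em_model1(corpus, iterations=5):
    """Run EM for IBM Model 1; return t(f | e) and the likelihood after each step."""
    f_vocab = {f for _, f_words in corpus for f in f_words}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initial parameters
    history = [log_likelihood(corpus, t)]
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected link counts under the current parameters.
        for e_words, f_words in corpus:
            for f in f_words:
                norm = sum(t[(f, e)] for e in e_words)
                for e in e_words:
                    fraction = t[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        # M-step: re-estimate the parameters from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
        history.append(log_likelihood(corpus, t))
    return t, history


# Hypothetical toy corpus of (English words, Foreign words) pairs.
corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
t, history = em_model1(corpus)
print(history)
```

The printed likelihoods form a non-decreasing sequence, and t(das | the) pulls ahead of t(Haus | the) even though no alignments were supplied.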

The EM Algorithm
(Diagram of the training loop, with the labels: Bootstrap, Initial parameters, E-Step, Viterbi alignments, M-Step, Refined parameters, Translation Model)

Want to learn more about EM?
- See the K. Knight 1999 word alignment tutorial
  - Available from www.statmt.org

M-Step
- M-Step: reestimate parameters
  - Count events in the Viterbi alignments
  - Simple smoothing: add a small fractional constant
  - Normalize to sum to 1
- Bootstrap (initial M-step)
- See EMNLP 2007 paper for details

E-Step
- E-Step: search for Viterbi alignments
- Solved using local hillclimbing search
- Given a starting alignment, we can permute it by making small changes, such as swapping the incoming links for two words
- Algorithm:
    Begin:
      Given starting alignment A, make a list of possible small changes
        (e.g. list every possible swap of the incoming links for two words)
      best = A
      for each possible small change
        Create new alignment A2 by copying A and applying the small change
        if score(A2) > score(best) then best = A2
      end for
      If best is A, stop; otherwise choose best as the new starting alignment A and goto Begin
- See ACL 2006 paper for improved local hillclimbing search
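The hillclimbing loop can be sketched as follows. This is a minimal illustration, not the search from the cited papers: an alignment is assumed to be a tuple whose entry j is the source position linked to target word j, the only move is the swap described above, and `toy_score` is a made-up stand-in for the model score.

```python
from itertools import combinations


def swap_neighbors(alignment):
    """Yield every alignment obtained by swapping the links of two target words."""
    for j, k in combinations(range(len(alignment)), 2):
        if alignment[j] != alignment[k]:
            neighbor = list(alignment)
            neighbor[j], neighbor[k] = neighbor[k], neighbor[j]
            yield tuple(neighbor)


def hillclimb(start, score, neighbors=swap_neighbors):
    """Greedy local search: move to the best neighbor until none improves the score."""
    current = tuple(start)
    while True:
        best = max(neighbors(current), key=score, default=current)
        if score(best) <= score(current):
            return current  # local maximum reached
        current = best


def toy_score(alignment):
    """Hypothetical score: number of target words linked to the same position."""
    return sum(1 for j, i in enumerate(alignment) if i == j)


result = hillclimb((2, 0, 1), toy_score)
print(result)
```

The search stops at a local maximum, which is why the quality of the starting alignment matters in practice.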

Discussion
- LEAF has powerful features
  - But requires approximate search
- Correct structure: M-to-N discontiguous
  - First general purpose statistical word alignment model of this structure!
- Head word assumption allows use of multi-word cepts
  - Gives the power of phrase-based models, but decisions robustly decompose over words

The story so far
- We know that better alignments (as measured using the Fα-score) lead to better MT
- We have defined LEAF, a generative model which models M-to-N discontiguous alignments
- LEAF can be trained using approximate EM
- What about integrating new knowledge?
  - Light supervision (the correct alignments for a few sentence pairs)
  - Linguistic knowledge?

Existing Approaches Cannot Utilize New Knowledge
- Existing unsupervised alignment techniques cannot use manually annotated data
  - Could be useful for light supervision
- It is difficult to add new knowledge sources to generative models
  - Requires completely reengineering the generative story for each new knowledge source

Semi-Supervised Training Overview
- First decompose the steps of the LEAF generative story into sub-models of a (log-)linear model
  - Allows us to tune a vector λ which has a scalar for each sub-model controlling its contribution
  - The idea here is that we might trust, for instance, the translation distribution (one sub-model) more than the number-of-words-in-a-cept distribution (another sub-model)
  - This allows us to integrate new sub-models unrelated to LEAF and adjust their weights with respect to the other sub-models
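The effect of the λ vector can be illustrated with a small log-linear scorer. Everything here is hypothetical: the sub-model names, the probabilities and the weights are invented, and real LEAF has many more sub-models.

```python
import math


def loglinear_score(submodel_probs, lambdas):
    """Return sum_i lambda_i * log p_i for one candidate alignment."""
    return sum(lambdas[name] * math.log(p) for name, p in submodel_probs.items())


# Hypothetical sub-model probabilities for two competing alignments.
candidate_a = {"translation": 0.5, "cept_size": 0.01}
candidate_b = {"translation": 0.1, "cept_size": 0.2}

equal_weights = {"translation": 1.0, "cept_size": 1.0}
trust_translation = {"translation": 1.0, "cept_size": 0.1}

for name, lambdas in [("equal", equal_weights), ("tuned", trust_translation)]:
    score_a = loglinear_score(candidate_a, lambdas)
    score_b = loglinear_score(candidate_b, lambdas)
    print(name, "prefers", "A" if score_a > score_b else "B")
```

With equal weights the score is just the log of the generative product and candidate B wins; down-weighting the cept-size sub-model flips the decision to candidate A, which is the kind of adjustment tuning λ makes possible.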

Semi-Supervised Training Overview
- Define a semi-supervised algorithm which alternates increasing likelihood with decreasing error
  - Increasing likelihood is similar to EM
  - Discriminatively bias EM to converge to a local maximum of likelihood which corresponds to better alignments
    - Better = higher Fα-score on a small gold standard corpus

The EMD Algorithm
- Initialize:
  - Perform initial M-step: estimate sub-model parameters from the HMM Viterbi alignments (bootstrap)
  - Perform initial D-step: find λ values which maximize Fα-score on the small gold standard word-aligned development corpus
- Repeat:
  - E-step: find Viterbi alignments using sub-models weighted by λ
  - M-step: re-estimate sub-model parameters from the new Viterbi alignments
  - D-step: find λ values that maximize Fα-score on the small gold standard word-aligned development corpus
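The control flow of the steps above can be written as a short skeleton. This sketch shows only the ordering of the steps: the three steps are passed in as callables with assumed, hypothetical signatures, the iteration count is arbitrary, and the stub functions merely record the order in which they are called.

```python
def emd(bootstrap_alignments, e_step, m_step, d_step, iterations=3):
    """Skeleton of the EMD loop; the steps are caller-supplied callables."""
    params = m_step(bootstrap_alignments)  # initial M-step (bootstrap)
    lambdas = d_step(params)               # initial D-step
    for _ in range(iterations):
        alignments = e_step(params, lambdas)  # Viterbi alignments under weighted sub-models
        params = m_step(alignments)           # re-estimate sub-model parameters
        lambdas = d_step(params)              # re-tune lambda on the gold development set
    return params, lambdas


# Stub steps (hypothetical) that only trace the call order.
trace = []


def stub_e(params, lambdas):
    trace.append("E")
    return "viterbi alignments"


def stub_m(alignments):
    trace.append("M")
    return "sub-model parameters"


def stub_d(params):
    trace.append("D")
    return "lambda vector"


emd("HMM Viterbi alignments", stub_e, stub_m, stub_d, iterations=2)
print(" ".join(trace))
```

Passing the steps in as functions keeps the skeleton independent of any particular alignment model or tuning method.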

The EMD Algorithm
(Diagram of the training loop, with the labels: Bootstrap, Initial sub-model parameters, Tuned lambda vector, E-Step, Viterbi alignments, M-Step, Sub-model parameters, D-Step, Translation Model)

Previous Work: Semi-Supervised
- Usual formulation of semi-supervised learning: using unlabeled data to help supervised learning
  - Build a supervised system using labeled data
  - Predict on unlabeled data
  - Iterate (estimating from both labeled data and predictions on unlabeled data)
- We do not have enough gold standard word alignments to estimate parameters directly!
- EMD allows us to train a small number of important parameters discriminatively and the rest using likelihood maximization, and allows interaction between the two

Story so far
- We've now presented a new metric, a new model, and a new semi-supervised training algorithm
- We've reformulated LEAF as a log-linear model and added additional sub-models
- We will train this model using the semi-supervised EMD training algorithm to maximize Fα-score
- How well does this work?

Experiments
- French/English
  - LDC Hansard (67 M English words)
  - 110 gold standard aligned sentences
  - MT: Alignment Templates, phrase-based
- Arabic/English
  - NIST 2006 task (168 M English words)
  - 1000 gold standard aligned sentences
  - MT: Hiero, hierarchical phrases

Results

System | French/English F-Measure (α = 0.4) | French/English BLEU (1 ref) | Arabic/English F-Measure (α = 0.1) | Arabic/English BLEU (4 refs)
IBM Model 4 (GIZA++) and heuristics | 73.5 | 30.63 | 75.8 | 51.55
EMD (ACL 2006 model) and heuristics | 74.1 | 31.40 | 79.1 | 52.89
LEAF+EMD | 76.3 | 31.86 | 84.5 | 54.34

Contributions
- Found a metric for measuring alignment quality which correlates with MT quality
- Designed LEAF, the first generative model of M-to-N discontiguous alignments
- Developed a semi-supervised training algorithm, the EMD algorithm
- Obtained large gains of 1.2 BLEU and 2.8 BLEU points for the French/English and Arabic/English tasks

Much of the presented work was joint work with Daniel Marcu, ISI (Univ. Southern California)

Thank You! Dankeschön!