Maxim Khalilov, TAUS Labs, Amsterdam
Marta R. Costa-jussà, Barcelona Media, Barcelona
RuSSIR 2012, August 5-10, 2012

Outline
- Factored translation models
- N-gram-based translation models
- Syntax-based translation systems

Factored translation models

Factored translation models are an extension to phrase-based models where every word is substituted by a vector of factors:

(word) -> (word, lemma, POS, morphology, ...)

The translation is now a combination of pure translation (T) and generation (G) steps:

lemma_f      --T-->  lemma_e
POS_f        --T-->  POS_e
morphology_f --T-->  morphology_e
word_f       --G-->  word_e
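The factor vector (word, lemma, POS, morphology) can be pictured as a tiny data structure; a minimal sketch, with illustrative names only (not from any toolkit):

```python
# Sketch of a factored word representation: each surface word is replaced
# by a vector of factors (surface form, lemma, POS tag, morphology).
from collections import namedtuple

FactoredWord = namedtuple("FactoredWord", ["surface", "lemma", "pos", "morph"])

def factorize(surface, lemma, pos, morph):
    """Bundle the annotation layers of one token into a factor vector."""
    return FactoredWord(surface, lemma, pos, morph)

word = factorize("houses", "house", "NN", "plural")
# A translation step maps source factors (e.g. lemma and POS) to target
# factors; a generation step then produces the target surface form from them.
```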
Factored translation models

What differs in factored translation models (compared to standard phrase-based models):
- The parallel corpus must be annotated beforehand.
- Extra language models for every factor can also be used.
- Translation steps are accomplished in a similar way.
- Generation steps imply training only on the target side of the corpus.
- Models corresponding to the different factors and components are combined in a log-linear fashion.

Results:

English-German:
Model | BLEU
best published result | 18.15%
baseline (surface) | 18.04%
surface + POS | 18.15%
surface + POS + morph | 18.22%

English-Spanish:
Model | BLEU
baseline (surface) | 23.41%
surface + morph | 24.66%
surface + POS + morph | 24.25%

English-Czech:
Model | BLEU
baseline (surface) | 25.82%
surface + all morph | 27.04%
surface + case/number/gender | 27.45%
surface + CNG/verb/prepositions | 27.62%

Factored representation: input and output words carry parallel factor layers (word, lemma, POS, morphology, word class); the factored model performs transfer between factor layers and generation of the surface form.
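The "annotated beforehand" requirement can be illustrated with the factor-per-token corpus format used by factored systems (Moses-style, factors joined with "|"); the annotation values below are toy examples, not the output of a real tagger:

```python
# Sketch: turn per-token (surface, lemma, POS) annotations into one
# factor-separated corpus line of the kind factored training consumes.
def annotate(tokens):
    """tokens: list of (surface, lemma, pos) triples -> one corpus line."""
    return " ".join("|".join(t) for t in tokens)

line = annotate([("the", "the", "DT"), ("houses", "house", "NNS")])
# -> "the|the|DT houses|house|NNS"
```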
N-gram-based translation models

Log-linear combination of feature functions:

\hat{t}_1^I = \arg\max_{t_1^I} \left\{ \sum_{m=1}^{M} \lambda_m h_m(s_1^J, t_1^I) \right\}

Feature functions:
- bilingual N-gram translation language model
- target language model
- word bonus model
- source-to-target lexicon model (IBM-1 probabilities)
- target-to-source lexicon model (IBM-1 probabilities)
- target POS language model

Bilingual N-gram model:

h_{BM}(s_1^J, t_1^I) = \log \prod_{i=1}^{K} p\big((s,t)_i \mid (s,t)_{i-N+1}, \ldots, (s,t)_{i-1}\big)

where K is the number of tuples in the sentence pair. Given a word alignment, tuples T_i = (s,t)_i are those bilingual units:
- having minimal length
- describing a monotonic segmentation of each sentence pair

Tuples are extracted from the word alignment:
- A unique, monotonic segmentation of each sentence pair is produced.
- No word in a tuple is aligned to words outside of it.
- No smaller tuples can be extracted without violating the previous constraints.

The constraints define a unique possible segmentation (except: NULL-source tuples).
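The extraction constraints can be sketched as a closure over alignment links: starting from a single word pair, grow the span until no link crosses its boundary. A simplified illustration, not the actual MARIE/Ncode extraction code; it assumes every word ends up inside some tuple and does not handle NULL-source tuples:

```python
# Sketch: extract minimal, monotonic bilingual tuples from a word alignment.
# links: set of (src_idx, tgt_idx) pairs.
def extract_tuples(src, tgt, links):
    tuples = []
    i = j = 0
    while i < len(src) and j < len(tgt):
        i_end, j_end = i + 1, j + 1
        changed = True
        while changed:                      # grow span until closed under links
            changed = False
            for a, b in links:
                if i <= a < i_end and b >= j_end:
                    j_end, changed = b + 1, True
                if j <= b < j_end and a >= i_end:
                    i_end, changed = a + 1, True
        tuples.append((tuple(src[i:i_end]), tuple(tgt[j:j_end])))
        i, j = i_end, j_end
    return tuples
```

Crossing links force the two words into one tuple, which is exactly the "no smaller tuples" condition.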
N-gram-based translation models

Feature functions:

- target language model:
  h_{TM}(s,t) = h_{TM}(t) = \log \prod_{k=1}^{K} p(w_k \mid w_{k-N+1}, \ldots, w_{k-1})

- word bonus model:
  h_{WB}(s,t) = h_{WB}(t) = K

- source-to-target lexicon model:
  h_{LE}(s,t) = \log \frac{1}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} p_{IBM1}(t_j^n \mid s_i^n)

- target-to-source lexicon model (analogous)

Decoding:
- freely available MARIE decoder [Crego et al., 2005] (beam search with hypothesis recombination, threshold and histogram pruning)
- no rescoring module (1-best output used)
- monotone and reordered search

Feature function weights optimization: Downhill Simplex Method.

José B. Mariño, Rafael Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A. R. Fonollosa, Marta R. Costa-jussà. N-gram-based Machine Translation. Computational Linguistics, 2006.

Results, WMT 09 (large corpus, Es<->En):

Spanish-to-English:
System | BLEU | Constrained | Rank
GOOGLE | 0.29 | NO | 0.70
UEDIN | 0.26 | YES | 0.56
UPC-TALP | 0.26 (2-3) | YES | 0.59 (2)
NICT | 0.22 | YES | 0.37
RBMT | 0.20 | NO | 0.55
SAAR | 0.20 | NO | 0.51

English-to-Spanish:
System | BLEU | Constrained | Rank
GOOGLE | 0.28 | NO | 0.65
NUS | 0.25 | YES | 0.59
UEDIN | 0.25 | YES | 0.66
UPC-TALP | 0.25 (2-4) | YES | 0.58 (5)
RBMT | 0.22 | NO | 0.64
RWTH | 0.22 | YES | 0.51
SAAR | 0.20 | NO | 0.48

Results, IWSLT 08 (small amount of data, Zh->Es directly and via an English pivot):

Zh2Es:
System | Clean BLEU | ASR BLEU | Rank
TCH | 34.57 | 30.52 | 47.73
FBK | 29.60 | 24.24 | 33.42
DCU | 27.10 | 23.89 | 28.99
TTK | 26.62 | 24.40 | 28.99
NICT | 26.41 | 23.31 | 29.79
PT | 25.72 | 20.10 | 19.77
UPC-TALP | 25.65 (7) | 22.14 (6) | 26.42 (6)
GREYC | 15.80 | 15.05 | 15.46

Zh2(En)2Es:
System | Clean BLEU | ASR BLEU | Rank
TCH | 40.42 | 35.43 | 49.32
FBK | 39.41 | 32.51 | 39.90
UPC-TALP | 38.09 (3) | 32.51 (3-4) | 39.01 (3)
NICT | 37.11 | 32.81 | 30.88
DCU | 32.42 | 28.47 | 31.72
TTK | 31.88 | 28.15 | 34.16
GREYC | 19.70 | 18.91 | 15.46
QMUL | 2.87 | 11.59 | 17.72
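The lexicon feature can be sketched as follows: for each target word of a tuple, average the IBM-1 lexical probabilities over the tuple's source words (plus a NULL word), then take the log of the product. `p` is a toy probability table with made-up values, and the function is an illustration, not the MARIE implementation:

```python
import math

# Sketch of the IBM-1 style source-to-target lexicon feature over one tuple.
# p: dict mapping (target_word, source_word) -> lexical probability.
def ibm1_score(src, tgt, p):
    srcN = ["NULL"] + list(src)          # IBM-1 includes a NULL source word
    score = 0.0
    for t in tgt:
        # average p(t | s) over all source words, smoothed to avoid log(0)
        score += math.log(sum(p.get((t, s), 1e-10) for s in srcN) / len(srcN))
    return score
```

The target-to-source feature is the same computation with the roles of the two sides swapped.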
A Motivating Example (from Chiang 2007)

A Chinese sentence:
Aozhou shi yu Beihan you bangjiao de shaoshu guojia zhiyi
Australia is with North-Korea have dipl.-rels. that few countries one-of

The English translation:
Australia is one of the few countries that have diplomatic relations with North Korea

Why the difference in word order?
- shaoshu guojia zhiyi <-> one of the few countries
- yu Beihan you bangjiao de <-> that have diplomatic relations with North Korea
A Solution: Hierarchical Phrases

Output from a phrase-based system:
[Aozhou] [shi]_1 [yu Beihan]_2 [you] [bangjiao] [de shaoshu guojia zhiyi]
[Australia] [has] [dipl. rels.] [with North Korea]_2 [is]_1 [one of the few countries]

Hierarchical phrases needed for this example:
⟨yu X_1 you X_2, have X_2 with X_1⟩
⟨X_1 de X_2, the X_2 that X_1⟩
⟨X_1 zhiyi, one of X_1⟩

(We'll see how to formalize this next.)
Examples of s-CFG Rules (from Chiang 2007)

X -> ⟨yu X_1 you X_2, have X_2 with X_1⟩
X -> ⟨X_1 de X_2, the X_2 that X_1⟩
X -> ⟨X_1 zhiyi, one of X_1⟩

Note: these rules make use of a single non-terminal, X. We use subscripts such as 1, 2 to specify which non-terminals correspond to each other.

An invalid s-CFG rule:
VP -> ⟨PP_1 you NP_2, have NP_2 NP_1⟩
This rule is invalid because a PP corresponds to an NP. Non-terminals that correspond to each other must be the same.

Another valid s-CFG rule:
VP -> ⟨PP_1 you NP_2, have NP_2 PP_1⟩
In this case three non-terminals (NP, PP, and VP) are used. The above rule is perfectly valid in an s-CFG. However, Chiang's grammar only makes use of two non-terminals: X and S.

Intuition Behind Translation with an s-CFG

First step: we can read off a CFG for Chinese from the s-CFG, and parse the Chinese with this CFG. For example,
X -> ⟨yu X_1 you X_2, have X_2 with X_1⟩ implies the Chinese-only context-free rule X -> yu X you X, and
X -> ⟨bangjiao, diplomatic relations⟩ implies the Chinese-only context-free rule X -> bangjiao.
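The "read off a Chinese CFG" step can be sketched directly. The rule encoding below (token lists with "X1"/"X2" markers for co-indexed non-terminals) is an illustrative representation, not Chiang's implementation:

```python
# Sketch: an s-CFG rule stores paired right-hand sides with co-indexed
# non-terminals; projecting the source side yields a Chinese-only CFG.
RULES = [
    ("X", ["yu", "X1", "you", "X2"], ["have", "X2", "with", "X1"]),
    ("X", ["X1", "de", "X2"],        ["the", "X2", "that", "X1"]),
    ("X", ["X1", "zhiyi"],           ["one", "of", "X1"]),
    ("X", ["bangjiao"],              ["diplomatic", "relations"]),
]

def source_cfg(rules):
    """Read off the source-side (Chinese) CFG by dropping co-indexation."""
    strip = lambda s: "X" if s.startswith("X") and s[1:].isdigit() else s
    return [(lhs, [strip(sym) for sym in rhs_src]) for lhs, rhs_src, _ in rules]
```

Parsing the Chinese input with this projected grammar is what makes decoding a parsing problem.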
Intuition Behind Translation with an s-CFG

The resulting CFG for Chinese:
X -> yu X you X
X -> X de X
X -> X zhiyi
X -> Aozhou
X -> Beihan
X -> shi
X -> bangjiao
X -> shaoshu guojia

A parse tree for our example (figure): shi dominates Aozhou and the constituent [[yu Beihan you bangjiao] de [shaoshu guojia]] zhiyi.

Second step: we use the synchronous rules to map the Chinese parse tree to an English parse tree.

Start bottom-up. For example, ⟨Aozhou, Australia⟩ gives the same tree with Aozhou replaced by Australia.
The tree after all the lowest-level rules are applied; the leaves now read:
Australia is [[yu [N. Korea] you [dipl. rels.]] de [few countries]] zhiyi

Next, apply higher-level rules. For example, use ⟨yu X_1 you X_2, have X_2 with X_1⟩ to get:
Australia is [[have dipl. rels. with N. Korea] de [few countries]] zhiyi

Use ⟨X_1 de X_2, the X_2 that X_1⟩ to get:
Australia is [the [few countries] that [have dipl. rels. with N. Korea]] zhiyi

Use ⟨X_1 zhiyi, one of X_1⟩ to get:
Australia is one of [the few countries that have dipl. rels. with N. Korea]
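The bottom-up mapping above can be sketched as recursive rule application over a derivation tree. The rule names and the top-level rule ⟨X_1 shi X_2, X_1 is X_2⟩ are assumptions made for this toy example; only the other rules appear on the slides:

```python
# Sketch: apply synchronous rules over a derivation to emit the English side.
# Each rule stores (source RHS, target RHS); "X1"/"X2" mark linked slots.
RULES = {
    "R_top":      (["X1", "shi", "X2"], ["X1", "is", "X2"]),   # assumed rule
    "R_zhiyi":    (["X1", "zhiyi"], ["one", "of", "X1"]),
    "R_de":       (["X1", "de", "X2"], ["the", "X2", "that", "X1"]),
    "R_yu":       (["yu", "X1", "you", "X2"], ["have", "X2", "with", "X1"]),
    "R_aozhou":   (["Aozhou"], ["Australia"]),
    "R_beihan":   (["Beihan"], ["North", "Korea"]),
    "R_bangjiao": (["bangjiao"], ["diplomatic", "relations"]),
    "R_shaoshu":  (["shaoshu", "guojia"], ["few", "countries"]),
}

def realize(node):
    """node = (rule_name, [children for X1, X2, ...]); returns English tokens."""
    rule, children = node
    _, tgt = RULES[rule]
    out = []
    for sym in tgt:
        if sym.startswith("X") and sym[1:].isdigit():
            out.extend(realize(children[int(sym[1:]) - 1]))   # recurse into slot
        else:
            out.append(sym)
    return out

tree = ("R_top", [
    ("R_aozhou", []),
    ("R_zhiyi", [("R_de", [
        ("R_yu", [("R_beihan", []), ("R_bangjiao", [])]),
        ("R_shaoshu", []),
    ])]),
])
```

Note how the crossed indices in R_yu and R_de produce the large-scale reordering that plain phrases could not.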
What is missing here, but can be found in the paper:
- Derivations in an s-CFG
- Learning an s-CFG grammar
- Probability calculation

But, here are some results:

Results from Chiang 2007:

System | MT03 | MT04 | MT05
ATS (baseline) | 30.84 | 31.74 | 30.50
hierarchical model | 33.72 | 34.57 | 31.79

Results are for translation from Chinese to English. MT03, MT04, and MT05 are three different test sets. All scores are BLEU scores.

Syntax-based MT

Two camps:
- Syntax will improve translation (K. Knight)
- Simpler data-driven models will always win (F. Och)

Joshua: open-source toolkit for parsing-based MT.
Syntax-based MT

(Figure, Fox (2002): English parse tree for "There will be more divisiveness than positive effects" aligned with the French "Elle aura de les effets plus destructifs que positifs"; gloss: "It will have effects more destructive than positive".)

Phrases are not coherent in bitexts.

WHY????

(Figure, Koehn et al. (2003): BLEU, roughly 18.0 to 28.0, as a function of training-corpus size from 10k to 320k sentence pairs, for IBM Model 4, PBMT, and PBMT with syntactic phrases.)

WHAT SHOULD WE DO????
Syntax-based MT

Syntactic translation models incorporate syntax into the source and/or target languages.

(Figure: the translation pyramid, from foreign words up through foreign syntax and semantics to an interlingua, and down through English semantics and syntax to English words.)

Syntactic phrase-based models based on tree transducers:
- Tree-to-string: build mappings from target parse trees to source strings.
- String-to-tree: build mappings from target strings to source parse trees.
- Tree-to-tree: mappings from parse trees to parse trees.

Advantages of syntax-based translation:
- Reordering for syntactic reasons (e.g., move the German object to the end of the sentence)
- Better explanation for function words (e.g., prepositions, determiners)
- Conditioning on syntactically related words (the translation of a verb may depend on its subject or object)
- Use of syntactic language models

Example of a string-to-tree translation system: use of English syntax trees [Yamada and Knight, 2001]
- exploits rich resources on the English side, obtained with a statistical parser [Collins, 1997]
- flattened tree to allow more reorderings
- works well with a syntactic language model

The model transforms the English parse of "he adores listening to music" into Japanese in steps (figure):
1. Reorder the children of each node: "he adores [listening to music]" becomes "he [music to listening] adores"
2. Insert Japanese function words: ha, wo, no, ga, desu
3. Translate the leaves: kare, ongaku, wo, kiku, no, ga, daisuki, desu
4. Take the leaves: "Kare ha ongaku wo kiku no ga daisuki desu"
Syntax-based MT

Reordering probabilities:

Original Order | Reordering | p(reorder | original)
PRP VB1 VB2 | PRP VB1 VB2 | 0.074
PRP VB1 VB2 | PRP VB2 VB1 | 0.723
PRP VB1 VB2 | VB1 PRP VB2 | 0.061
PRP VB1 VB2 | VB1 VB2 PRP | 0.037
PRP VB1 VB2 | VB2 PRP VB1 | 0.083
PRP VB1 VB2 | VB2 VB1 PRP | 0.021
VB TO | VB TO | 0.107
VB TO | TO VB | 0.893
TO NN | TO NN | 0.251
TO NN | NN TO | 0.749

Decoding as parsing: chart parsing over "kare ha ongaku wo kiku no ga daisuki desu"
- Pick Japanese words and translate them into tree stumps, e.g. PRP(he), NN(music), TO(to)
- Adding some more entries...
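The reordering table above translates directly into a lookup. The probabilities are copied from the slide; `best_reorder` is an illustrative helper for picking the most probable child permutation, not part of Yamada and Knight's actual decoder:

```python
# Sketch of the r-table: each permutation of a node's children gets a
# probability; the channel model reorders children according to it.
R_TABLE = {
    ("PRP", "VB1", "VB2"): {
        ("PRP", "VB1", "VB2"): 0.074, ("PRP", "VB2", "VB1"): 0.723,
        ("VB1", "PRP", "VB2"): 0.061, ("VB1", "VB2", "PRP"): 0.037,
        ("VB2", "PRP", "VB1"): 0.083, ("VB2", "VB1", "PRP"): 0.021,
    },
    ("VB", "TO"): {("VB", "TO"): 0.107, ("TO", "VB"): 0.893},
    ("TO", "NN"): {("TO", "NN"): 0.251, ("NN", "TO"): 0.749},
}

def best_reorder(children):
    """Most probable permutation of a node's children under the r-table."""
    table = R_TABLE[tuple(children)]
    return max(table, key=table.get)
```

For the example sentence this prefers PRP VB2 VB1 ("he [listening to music] adores"), matching the SOV order of Japanese.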
Decoding as parsing (continued):
- Combine entries: TO(to) and NN(music) combine into a PP; the PP and VB(listening) combine into a VB2; VB1(adores) is added
- Decoding is finished when all foreign words are covered.
Syntax-based MT

How realistic is this model? Do English trees match foreign strings?
- Crossings between French and English [Fox, 2002]: 0.29-6.27 per sentence, depending on how it is measured
- Can be reduced by: flattening the tree, as done by [Yamada and Knight, 2001]; detecting phrasal translation; special treatment for a small number of constructions
- Most coherence is found between dependency structures

Other syntax-based systems:

U. Alberta (Microsoft): treelet translation
- Translating from English
- Using a dependency parser in English
- Project the dependency tree onto the foreign language for training
- Map parts of the dependency tree ("treelets") into the foreign language

Reranking phrase-based MT output with syntactic features
- Create n-best lists with phrase-based MT
- POS-tag and parse the candidate translations
- Rerank with syntactic features

Syntax-aided phrase-based MT (Koehn, 2005)
- Stick with phrase-based systems
- Special treatment for specific syntactic problems (NP treatment, clause restructuring)

ISI: extended work of Yamada and Knight
- More complex rules
- Performance approaching phrase-based

Prague: translation via dependency structures
- Parallel Czech-English treebank
- Tecto-grammatical translation model

So, syntax: does it help?
- Not yet: the best systems are still phrase-based and treat words as tokens
- Well, maybe: work on reordering German, and automatically trained tree transfer systems, are promising
- Why not yet? If we use real syntax, we need good parsers: are they good enough? Syntactic annotations add a level of complexity: they are difficult to handle and slow to train and decode. Few researchers are good at statistical modeling and also understand syntactic theories.
Next session

Practical workshop: let's get our hands dirty!