FACTORED TRANSLATION MODELS
Raj Dabre, Raksha Sharma, Avishek Dan
Purpose of the Talk
To give the motivation for Factor-Based Machine Translation (FBMT)
To cover the basic concepts of FBMT
To survey the factors that can help in translation
To illustrate the process of FBMT
Flow of the Presentation
Motivation
Introduction to FBMT
Decomposing the FBMT Process: Lemma Translation, Morphology Translation, Generation
Statistical Model Training
Combining Components
Decoding
Experiments and Analysis
Conclusion
Motivation
Consider the example: यह मेरी गाड़ी है {yaha meri gaadi hai} {This is my car}
गाड़ी + Plural: ये मेरी गाड़ियाँ हैं {ye meri gaadiyan hain} {These are my cars}
Utilize factors to overcome data sparsity
FBMT in Vauquois Triangle
Another motivating example, from Japanese:
わたしのなまえはラジです {Watashi no namae wa Raji desu} {My name is Raj}
It is difficult to know the word mappings. Suppose POS tags are given:
わたしの (PRON) なまえは (NN) ラジ (NNP) です (VM/VCOP)
My (PRON) name (NN) is (VM/VCOP) Raj (NNP)
The mappings are now easier to deduce. Factors reduce uncertainty.
Introduction to FBMT
Definition: FBMT is an extension of phrase-based statistical machine translation models that integrates additional annotation at the word level. Annotations can be linguistic markup or automatically generated word classes.
Factors to Exploit
Surface form
Lemma
Part of speech
Morphological features: gender, number, and case
Automatic word classes
Shallow syntactic tags
Dedicated factors to ensure agreement
Example: गाड़ियाँ from (ये मेरी गाड़ियाँ हैं)
Surface form: गाड़ियाँ
Lemma: गाड़ी
Part of speech: NN
Morphological features: Gender: feminine; Number: plural; Case: accusative
Shallow syntactic tag: NP
Decomposition of FBMT
Decomposition of FBMT
For translating "cars" to गाड़ियाँ:
Translate input lemmas into output lemmas: car → गाड़ी
Translate morphological and POS factors: Noun → Noun; Plural → Plural; Neuter → Feminine
Generate surface forms given the lemma and linguistic factors: गाड़ी + Noun + Plural + Feminine = गाड़ियाँ
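The three-step decomposition above can be sketched as a chain of table lookups. This is a minimal illustration with toy, hand-written tables (romanized Hindi), not the actual probabilistic model:

```python
# Step 1: lemma translation (source lemma -> target lemma)
lemma_table = {"car": "gaadi"}

# Step 2: factor translation (source POS/morph -> target POS/morph)
factor_table = {("NN", "plural", "neuter"): ("NN", "plural", "feminine")}

# Step 3: generation (target lemma + target factors -> surface form)
generation_table = {("gaadi", ("NN", "plural", "feminine")): "gaadiyan"}

def translate_factored(src_lemma, src_factors):
    """Translate one word by composing the three factored mappings."""
    tgt_lemma = lemma_table[src_lemma]
    tgt_factors = factor_table[src_factors]
    return generation_table[(tgt_lemma, tgt_factors)]

print(translate_factored("car", ("NN", "plural", "neuter")))  # gaadiyan
```

In the real model each table holds probability distributions rather than single entries, and the decoder scores all combinations.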
Statistical Model - Training
Automatically annotate the parallel corpus with additional factors (POS, morphology)
Word alignment using GIZA++
Can specify the alignment basis: POS to POS, theta roles to theta roles, etc.
Can use any combination of factors
Three types of tables are generated:
Lemma translation (source lemma to target lemma)
Morphology translation (source morphology to target morphology)
Word generation (target lemma + target morphology to target word form)
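A sketch of how the three tables are collected from a word-aligned, factored corpus. The two-entry "corpus" and its alignments are toy stand-ins (romanized Hindi); in practice the alignments come from GIZA++ and the counts are normalized into probabilities:

```python
from collections import Counter

# Each entry pairs an aligned (source, target) token; a token carries
# the factors (surface, lemma, pos, morph). Annotation and alignment
# are assumed to have been produced already.
aligned = [
    (("cars", "car", "NN", "pl"), ("gaadiyan", "gaadi", "NN", "pl")),
    (("car", "car", "NN", "sg"), ("gaadi", "gaadi", "NN", "sg")),
]

lemma_counts, morph_counts, gen_counts = Counter(), Counter(), Counter()
for (s_surf, s_lem, s_pos, s_morph), (t_surf, t_lem, t_pos, t_morph) in aligned:
    lemma_counts[(s_lem, t_lem)] += 1                        # lemma table
    morph_counts[((s_pos, s_morph), (t_pos, t_morph))] += 1  # morphology table
    gen_counts[((t_lem, t_pos, t_morph), t_surf)] += 1       # generation table

print(lemma_counts[("car", "gaadi")])  # 2
```

Note that the generation table is monolingual: it needs only the target side, so it can also be estimated from additional target-language text.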
Annotating the Corpus
Use POS taggers, shallow syntactic parsers, UNL and dependency parsers for generating factors.
Example: These are my cars → ये मेरी गाड़ियाँ हैं
These → this, DET, subj
are → be, VM/VCOP, present
my → me, PRON, possessor
cars → car, NN, neuter, plural, object
ये → यह, DET, subj
मेरी → मेरा, PRON, possessor
गाड़ियाँ → गाड़ी, NN, feminine, plural, object
हैं → होना, VM/VCOP, present
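Moses, the toolkit this model is implemented in, represents a factored corpus by attaching the factors to each surface token with `|` separators. A small sketch producing that format from the annotations above (the tag and lemma values are illustrative):

```python
def to_factored(tokens):
    """Render (surface, lemma, pos, morph) tuples as Moses factored tokens."""
    return " ".join("|".join(t) for t in tokens)

line = to_factored([
    ("these", "this", "DET", "subj"),
    ("are", "be", "VM", "pres"),
    ("my", "me", "PRON", "poss"),
    ("cars", "car", "NN", "pl"),
])
print(line)  # these|this|DET|subj are|be|VM|pres my|me|PRON|poss cars|car|NN|pl
```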
Alignments of Phrases
यह ↔ This | मेरी ↔ My | गाड़ी ↔ Car | है ↔ Is
Alignments of Factors
DET-Subj ↔ DET-Subj
PRON-Poss ↔ PRON-Poss
NN-Neu ↔ NN-Fem
VM/VCOP-Pres ↔ VM/VCOP-Pres
Translation Tables

Lemma Translation Table:
Sr. | English Phrase | Hindi Phrase
1 | This | यह
2 | My car | मेरी गाड़ी
3 | Is | है

Factor Translation Table:
Sr. | English Factors | Hindi Factors
1 | DET-subj | DET-subj
2 | PRON-possessor, NN-neuter, plural, object | PRON-possessor, NN-feminine, plural, object
3 | VM/VCOP-present | VM/VCOP-present
Generation Table (Target Language)
Sr. | Lemma | Factors | Surface word
1 | यह | DET+subj | ये
2 | मेरा | PRON+possessor | मेरी
3 | गाड़ी | NN+feminine, plural, object | गाड़ियाँ
4 | होना | VM/VCOP+present | हैं
A Tougher Example
I am going home → मैं घर जा रहा हूँ {Main ghar jaa raha hoon}
Here "home" is neuter, singular, and is mapped to घर, which is masculine, singular: a non-trivial mapping because of the difference in gender.
"am going" is mapped to जा रहा हूँ: non-trivial, since a two-word phrase maps to a three-word phrase.
"am going" has factors [(is VM present) (go VAUX gerund, continuous)]
जा रहा हूँ has factors [(जाना VM) (रहना VAUX continuous) (होना VAUX present)]
Here, extracting the factor mappings is also non-trivial.
Difficulty is greater when small phrases map to big phrases, and when translating between morphologically rich and morphologically poor languages.
Alignments of Phrases (lemmas)
मैं घर जाना रहना होना
I Is Go Home
Alignments of Factors
English: PRON | VM,Pres | VAUX,Cont | NN,Neuter
Hindi: PRON | NN,Masculine | VM | VAUX,Cont | VAUX,Pres
Components of FBMT
Combining the Components
Decoding
A beam-search decoding algorithm is used.
Start with an empty hypothesis.
Generate and add hypotheses until the full sentence is covered.
The highest-scoring complete hypothesis is the best translation.
Per-phrase translation options are limited to 50 to address the combinatorial explosion.
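The decoding steps above can be sketched as a toy beam search. This is a deliberately simplified monotone decoder with hand-written scores; it is not the Moses implementation (which also handles reordering and multiple phrase segmentations):

```python
import heapq

MAX_OPTIONS = 50  # cap on translation options per phrase
BEAM = 10         # number of partial hypotheses kept per step

def decode(src_phrases, options):
    """options: phrase -> list of (target, log_prob) candidates."""
    beam = [(0.0, [])]  # (score, partial translation), empty hypothesis
    for phrase in src_phrases:
        cands = options[phrase][:MAX_OPTIONS]  # limit options per phrase
        # expand every hypothesis with every candidate, then prune
        expanded = [(score + lp, out + [tgt])
                    for score, out in beam for tgt, lp in cands]
        beam = heapq.nlargest(BEAM, expanded)
    best_score, best = max(beam)  # highest-scoring complete hypothesis
    return " ".join(best)

opts = {"this": [("yaha", -0.1)], "my car": [("meri gaadi", -0.2)],
        "is": [("hai", -0.1)]}
print(decode(["this", "my car", "is"], opts))  # yaha meri gaadi hai
```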
EXPERIMENTS AND RESULTS
Corpus
English-German: Training: Europarl corpus; Training: News Commentary corpus; Test: WMT 2006 test set
English-Spanish: Training: Europarl corpus
English-Czech: WSJ corpus
Syntactic Enrichment
Implementation is part of Moses.
Factors: surface word (3-gram), POS (7-gram), morphological, shallow syntactic
The higher-order sequence model obtained supports syntactic coherence of the output.
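Why a 7-gram model is feasible over POS tags but not over words: the tag vocabulary is tiny, so long tag histories are still well observed. A toy bigram version of such a sequence model (assumed corpus and tags are illustrative; the real system uses higher orders and smoothing):

```python
from collections import Counter
import math

# Count tag n-grams from a tiny toy tag corpus.
tag_corpus = ["DET NN VM", "DET NN NN VM", "PRON VM NN"]
bigrams, unigrams = Counter(), Counter()
for sent in tag_corpus:
    tags = ["<s>"] + sent.split()
    unigrams.update(tags)
    bigrams.update(zip(tags, tags[1:]))

def score(tags):
    """Log-probability of a tag sequence under the bigram model."""
    tags = ["<s>"] + tags
    return sum(math.log(bigrams[b] / unigrams[b[0]])
               for b in zip(tags, tags[1:]))

# A common tag pattern scores higher than a rare one, rewarding
# syntactically coherent output during decoding.
print(score(["DET", "NN", "VM"]) > score(["PRON", "VM", "NN"]))  # True
```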
Syntactic Enrichment Results
Morphological Analysis and Generation
Translate the word's lemma and its morphology separately.
A pure lemma/morphology model yields poor results.
Evidence-based choice of model.
21% of unknown word forms were translated.
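The unknown-word gain comes from backing off to the lemma route: a surface form never seen in training can still be translated if its lemma and morphology are known. A toy sketch of that backoff (romanized Hindi, hand-written tables, not the actual model combination):

```python
surface_table = {"car": "gaadi"}            # surface-to-surface phrase table
lemma_table = {"car": "gaadi"}              # lemma translation table
generation = {("gaadi", "pl"): "gaadiyan",  # target lemma+morph -> surface
              ("gaadi", "sg"): "gaadi"}

def translate(word, lemma, morph):
    """Use the surface table when possible; otherwise go via the lemma."""
    if word in surface_table:
        return surface_table[word]
    # Back off: translate the lemma, then regenerate the surface form
    # from the target morphology.
    return generation[(lemma_table[lemma], morph)]

print(translate("cars", "car", "pl"))  # gaadiyan (unseen surface form)
```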
Conclusion
Incorporating linguistic tools in the translation model improves translation accuracy.
Linguistic tools ensure grammatical coherence.
Separate translation of lemma and morphology leads to better handling of OOV words.
Complex factor models lead to a larger search space and increased computation time.
References
Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
Ananthakrishnan Ramanathan, Hansraj Choudhary, Avishek Ghosh, and Pushpak Bhattacharyya. 2009. Case markers and morphology: Addressing the crux of the fluency problem in English-Hindi SMT. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 800-808, Suntec, Singapore. Association for Computational Linguistics.