FACTORED TRANSLATION MODELS
Raj Dabre, Raksha Sharma, Avishek Dan
Purpose of the Talk
To give the motivation for Factor-Based Machine Translation (FBMT)
To cover the basic concepts of FBMT
To survey the factors that can help in translation
To illustrate the process of FBMT
Flow of the Presentation
Motivation
Introduction to FBMT
Decomposing the FBMT Process: Lemma Translation, Morphology Translation, Generation
Statistical Model Training
Combining Components
Decoding
Experiments and Analysis
Conclusion
Motivation
Consider the example: यह मेरी गाड़ी है {yaha meri gaadi hai} {This is my car}
गाड़ी + Plural: ये मेरी गाड़ियाँ हैं {ye meri gaadiyan hain} {These are my cars}
Utilize factors to overcome data sparsity
FBMT in Vauquois Triangle
Another motivating example, from Japanese:
わたしのなまえはラジです {Watashi no namae wa Raji desu} {My name is Raj}
It is difficult to know the word mappings. Suppose POS tags are given:
わたしの (PRON) なまえは (NN) ラジ (NNP) です (VM/VCOP)
My (PRON) name (NN) is (VM/VCOP) Raj (NNP)
The mappings are now easier to deduce. Factors reduce uncertainty.
Introduction to FBMT
Definition: FBMT is an extension of phrase-based statistical machine translation models that integrates additional annotation at the word level. Annotations can be linguistic markup or automatically generated word classes.
Factors to Exploit
Surface form
Lemma
Part of speech
Morphological features: gender, number, and case
Automatic word classes
Shallow syntactic tags
Dedicated factors to ensure agreement
Example: गाड़ियाँ from (ये मेरी गाड़ियाँ हैं)
Surface form: गाड़ियाँ
Lemma: गाड़ी
Part of speech: NN
Morphological features: Gender: feminine; Number: plural; Case: accusative
Shallow syntactic tag: NP
Decomposition of FBMT
Decomposition of FBMT
For translating "cars" to गाड़ियाँ:
Translate input lemmas into output lemmas: car → गाड़ी
Translate morphological and POS factors: Noun → Noun; Plural → Plural; Neuter → Feminine
Generate surface forms given the lemma and linguistic factors: गाड़ी + Noun + Plural + Feminine = गाड़ियाँ
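The three-step decomposition above can be sketched as a chain of table lookups. This is a minimal illustration with toy, hand-written tables (romanized Hindi), not the actual probabilistic model:

```python
# Step 1: lemma translation (source lemma -> target lemma)
lemma_table = {"car": "gaadi"}

# Step 2: factor translation (source POS/morph -> target POS/morph)
factor_table = {("NN", "plural", "neuter"): ("NN", "plural", "feminine")}

# Step 3: generation (target lemma + target factors -> surface form)
generation_table = {("gaadi", ("NN", "plural", "feminine")): "gaadiyan"}

def translate_factored(src_lemma, src_factors):
    """Translate one word by composing the three factored mappings."""
    tgt_lemma = lemma_table[src_lemma]
    tgt_factors = factor_table[src_factors]
    return generation_table[(tgt_lemma, tgt_factors)]

print(translate_factored("car", ("NN", "plural", "neuter")))  # gaadiyan
```

In the real model each table holds probability distributions rather than single entries, and the decoder scores all combinations.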
Statistical Model - Training
Automatically annotate the parallel corpus with additional factors (POS, morphology)
Word alignment using GIZA++
Can specify the alignment basis: POS to POS, theta roles to theta roles, etc.
Can use any combination of factors
Three types of tables are generated:
Lemma translation (source lemma to target lemma)
Morphology translation (source morphology to target morphology)
Word generation (target lemma + target morphology to target word form)
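A sketch of how the three tables are collected from a word-aligned, factored corpus. The two-entry "corpus" and its alignments are toy stand-ins (romanized Hindi); in practice the alignments come from GIZA++ and the counts are normalized into probabilities:

```python
from collections import Counter

# Each entry pairs an aligned (source, target) token; a token carries
# the factors (surface, lemma, pos, morph). Annotation and alignment
# are assumed to have been produced already.
aligned = [
    (("cars", "car", "NN", "pl"), ("gaadiyan", "gaadi", "NN", "pl")),
    (("car", "car", "NN", "sg"), ("gaadi", "gaadi", "NN", "sg")),
]

lemma_counts, morph_counts, gen_counts = Counter(), Counter(), Counter()
for (s_surf, s_lem, s_pos, s_morph), (t_surf, t_lem, t_pos, t_morph) in aligned:
    lemma_counts[(s_lem, t_lem)] += 1                        # lemma table
    morph_counts[((s_pos, s_morph), (t_pos, t_morph))] += 1  # morphology table
    gen_counts[((t_lem, t_pos, t_morph), t_surf)] += 1       # generation table

print(lemma_counts[("car", "gaadi")])  # 2
```

Note that the generation table is monolingual: it needs only the target side, so it can also be estimated from additional target-language text.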
Annotating the Corpus
Use POS taggers, shallow syntactic parsers, UNL and dependency parsers for generating factors.
Example: These are my cars → ये मेरी गाड़ियाँ हैं
These → this, DET, subj
are → be, VM/VCOP, present
my → me, PRON, possessor
cars → car, NN, neuter, plural, object
ये → यह, DET, subj
मेरी → मेरा, PRON, possessor
गाड़ियाँ → गाड़ी, NN, feminine, plural, object
हैं → होना, VM/VCOP, present
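Moses, the toolkit this model is implemented in, represents a factored corpus by attaching the factors to each surface token with `|` separators. A small sketch producing that format from the annotations above (the tag and lemma values are illustrative):

```python
def to_factored(tokens):
    """Render (surface, lemma, pos, morph) tuples as Moses factored tokens."""
    return " ".join("|".join(t) for t in tokens)

line = to_factored([
    ("these", "this", "DET", "subj"),
    ("are", "be", "VM", "pres"),
    ("my", "me", "PRON", "poss"),
    ("cars", "car", "NN", "pl"),
])
print(line)  # these|this|DET|subj are|be|VM|pres my|me|PRON|poss cars|car|NN|pl
```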
Alignments of Phrases
यह ↔ This | मेरी ↔ My | गाड़ी ↔ Car | है ↔ Is
Alignments of Factors
DET-Subj ↔ DET-Subj
PRON-Poss ↔ PRON-Poss
NN-Neu ↔ NN-Fem
VM/VCOP-Pres ↔ VM/VCOP-Pres
Translation Tables

Lemma Translation Table:
Sr. | English Phrase | Hindi Phrase
1 | This | यह
2 | My car | मेरी गाड़ी
3 | Is | है

Factor Translation Table:
Sr. | English Factors | Hindi Factors
1 | DET-subj | DET-subj
2 | PRON-possessor, NN-neuter, plural, object | PRON-possessor, NN-feminine, plural, object
3 | VM/VCOP-present | VM/VCOP-present
Generation Table (Target Language)
Sr. | Lemma | Factors | Surface word
1 | यह | DET+subj | ये
2 | मेरा | PRON+possessor | मेरी
3 | गाड़ी | NN+feminine, plural, object | गाड़ियाँ
4 | होना | VM/VCOP+present | हैं
A Tougher Example
I am going home → मैं घर जा रहा हूँ {Main ghar jaa raha hoon}
Here "home" is neuter, singular, and is mapped to घर, which is masculine, singular: a non-trivial mapping because of the difference in gender.
"am going" is mapped to जा रहा हूँ: non-trivial, since a two-word phrase maps to a three-word phrase.
"am going" has factors [(is VM present) (go VAUX gerund, continuous)]
जा रहा हूँ has factors [(जाना VM) (रहना VAUX continuous) (होना VAUX present)]
Here, extracting the factor mappings is also non-trivial.
Difficulty is greater when small phrases map to big phrases, and when translating between morphologically rich and morphologically poor languages.
Alignments of Phrases (lemmas)
मैं घर जाना रहना होना
I Is Go Home
Alignments of Factors
English: PRON | VM,Pres | VAUX,Cont | NN,Neuter
Hindi: PRON | NN,Masculine | VM | VAUX,Cont | VAUX,Pres
Components of FBMT
Combining the Components
Decoding
A beam-search decoding algorithm is used.
Start with an empty hypothesis.
Generate and add hypotheses until the full sentence is covered.
The highest-scoring complete hypothesis is the best translation.
Per-phrase translation options are limited to 50 to address the combinatorial explosion.
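The decoding steps above can be sketched as a toy beam search. This is a deliberately simplified monotone decoder with hand-written scores; it is not the Moses implementation (which also handles reordering and multiple phrase segmentations):

```python
import heapq

MAX_OPTIONS = 50  # cap on translation options per phrase
BEAM = 10         # number of partial hypotheses kept per step

def decode(src_phrases, options):
    """options: phrase -> list of (target, log_prob) candidates."""
    beam = [(0.0, [])]  # (score, partial translation), empty hypothesis
    for phrase in src_phrases:
        cands = options[phrase][:MAX_OPTIONS]  # limit options per phrase
        # expand every hypothesis with every candidate, then prune
        expanded = [(score + lp, out + [tgt])
                    for score, out in beam for tgt, lp in cands]
        beam = heapq.nlargest(BEAM, expanded)
    best_score, best = max(beam)  # highest-scoring complete hypothesis
    return " ".join(best)

opts = {"this": [("yaha", -0.1)], "my car": [("meri gaadi", -0.2)],
        "is": [("hai", -0.1)]}
print(decode(["this", "my car", "is"], opts))  # yaha meri gaadi hai
```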
EXPERIMENTS AND RESULTS
Corpus
English-German: Training: Europarl corpus; Training: News Commentary corpus; Test: WMT 2006 test set
English-Spanish: Training: Europarl corpus
English-Czech: WSJ corpus
Syntactic Enrichment
Implementation is part of Moses.
Factors: surface word (3-gram), POS (7-gram), morphological, shallow syntactic
The higher-order sequence model obtained supports syntactic coherence of the output.
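Why a 7-gram model is feasible over POS tags but not over words: the tag vocabulary is tiny, so long tag histories are still well observed. A toy bigram version of such a sequence model (assumed corpus and tags are illustrative; the real system uses higher orders and smoothing):

```python
from collections import Counter
import math

# Count tag n-grams from a tiny toy tag corpus.
tag_corpus = ["DET NN VM", "DET NN NN VM", "PRON VM NN"]
bigrams, unigrams = Counter(), Counter()
for sent in tag_corpus:
    tags = ["<s>"] + sent.split()
    unigrams.update(tags)
    bigrams.update(zip(tags, tags[1:]))

def score(tags):
    """Log-probability of a tag sequence under the bigram model."""
    tags = ["<s>"] + tags
    return sum(math.log(bigrams[b] / unigrams[b[0]])
               for b in zip(tags, tags[1:]))

# A common tag pattern scores higher than a rare one, rewarding
# syntactically coherent output during decoding.
print(score(["DET", "NN", "VM"]) > score(["PRON", "VM", "NN"]))  # True
```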
Syntactic Enrichment Results
Morphological Analysis and Generation
Translate the word's lemma and its morphology separately.
A pure lemma/morphology model yields poor results.
Evidence-based choice of model.
21% of unknown word forms were translated.
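The unknown-word gain comes from backing off to the lemma route: a surface form never seen in training can still be translated if its lemma and morphology are known. A toy sketch of that backoff (romanized Hindi, hand-written tables, not the actual model combination):

```python
surface_table = {"car": "gaadi"}            # surface-to-surface phrase table
lemma_table = {"car": "gaadi"}              # lemma translation table
generation = {("gaadi", "pl"): "gaadiyan",  # target lemma+morph -> surface
              ("gaadi", "sg"): "gaadi"}

def translate(word, lemma, morph):
    """Use the surface table when possible; otherwise go via the lemma."""
    if word in surface_table:
        return surface_table[word]
    # Back off: translate the lemma, then regenerate the surface form
    # from the target morphology.
    return generation[(lemma_table[lemma], morph)]

print(translate("cars", "car", "pl"))  # gaadiyan (unseen surface form)
```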
Conclusion
Incorporating linguistic tools in the translation model improves translation accuracy.
Linguistic tools ensure grammatical coherence.
Separate translation of lemma and morphology leads to better handling of OOV words.
Complex factor models lead to a larger search space and increased computation time.
References
Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
Ananthakrishnan Ramanathan, Hansraj Choudhary, Avishek Ghosh, and Pushpak Bhattacharyya. 2009. Case markers and morphology: Addressing the crux of the fluency problem in English-Hindi SMT. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 800-808, Suntec, Singapore. Association for Computational Linguistics.