Factored SMT Models. Q.Q June 3, 2014

1 Factored SMT Models Q.Q June 3, 2014

2 Standard phrase-based models Limitations of phrase-based models: No explicit use of linguistic information

3 Word = Token
  Different forms of a word are treated independently of each other.
  Unknown words cannot be translated, a problem that is especially severe in morphologically rich languages.
  Example: eat, eating, ate, eaten

4 Integration of linguistic information into the translation model
  Draw on richer statistics
  Overcome data sparseness problems
  Direct modeling of linguistic aspects
  Reordering in translation result

5 Word = Vector
  Input         Output
  Word          Word
  Lemma         Lemma
  POS           POS
  Morphology    Morphology
  Word class    Word class
  ...

6 Factored translation model
  Input         Output
  Word          Word
  Lemma         Lemma
  POS           POS
  Morphology    Morphology
  Word class    Word class

7 Decomposition
  Translate input lemma to output lemma
  Translate morphological and POS factors
  Generate surface forms given the lemma and linguistic factors

8 neue häuser werden gebaut -> new houses are built
  Surface form    häuser
  Lemma           haus
  POS             NN
  Count           plural
  Case            nominative
  Gender          neutral
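The factor vector for "häuser" can be sketched as a small data structure. This is only an illustration: the class name `FactoredWord` and its field names are assumptions, and the `|` separator follows the convention Moses uses for factored corpora.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactoredWord:
    """A word represented as a vector of factors (hypothetical helper class)."""
    surface: str
    lemma: str
    pos: str
    morphology: str

    def to_token(self) -> str:
        # Join the factors with '|' to form one token of a factored corpus.
        return "|".join((self.surface, self.lemma, self.pos, self.morphology))


# Factors of "häuser" as listed on the slide.
haeuser = FactoredWord("häuser", "haus", "NN", "plural-nominative-neutral")
print(haeuser.to_token())
```

Keeping the factors in named fields makes it easy to select a subset of them (for example lemma only) as the input of a single mapping step.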

9 neue häuser werden gebaut -> new houses are built
  Input phrase expansion
  Translate input lemma to output lemma:
    haus -> house, home, building, shell
  Translate morphological and POS factors:
    NN|plural-nominative-neutral -> NN|plural, NN|singular
  Generate surface forms given the lemma and linguistic factors:
    house|NN|plural -> houses
    house|NN|singular -> house
    home|NN|plural -> homes

10 neue häuser werden gebaut -> new houses are built
  häuser|haus|NN|plural-nominative-neutral
  List of translation options
  Translate input lemma to output lemma:
    { ?|house|?|?, ?|home|?|?, ?|building|?|?, ?|shell|?|? }
  Translate morphological and POS factors:
    { ?|house|NN|plural, ?|home|NN|plural, ?|building|NN|plural, ?|shell|NN|plural, ?|house|NN|singular, ... }
  Generate surface forms given the lemma and linguistic factors:
    { houses|house|NN|plural, homes|home|NN|plural, buildings|building|NN|plural, shells|shell|NN|plural, house|house|NN|singular, ... }
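The three mapping steps can be chained to build the list of translation options. The sketch below is a toy version: the table names and the function `expand` are hypothetical, the tables contain only the entries shown on the slides, and scores are left out.

```python
# Toy mapping tables holding only the slide's example entries (no probabilities).
LEMMA_TABLE = {"haus": ["house", "home", "building", "shell"]}
MORPH_TABLE = {
    ("NN", "plural-nominative-neutral"): [("NN", "plural"), ("NN", "singular")],
}
GENERATION_TABLE = {
    ("house", "NN", "plural"): ["houses"],
    ("home", "NN", "plural"): ["homes"],
    ("building", "NN", "plural"): ["buildings"],
    ("shell", "NN", "plural"): ["shells"],
    ("house", "NN", "singular"): ["house"],
}


def expand(lemma, pos, morph):
    """Apply the three mapping steps in sequence and collect full options."""
    options = []
    # Step 1: translate the input lemma to output lemmas.
    for out_lemma in LEMMA_TABLE.get(lemma, []):
        # Step 2: translate the morphological and POS factors.
        for out_pos, out_morph in MORPH_TABLE.get((pos, morph), []):
            # Step 3: generate surface forms from lemma and linguistic factors.
            key = (out_lemma, out_pos, out_morph)
            for surface in GENERATION_TABLE.get(key, []):
                options.append((surface, out_lemma, out_pos, out_morph))
    return options


options = expand("haus", "NN", "plural-nominative-neutral")
for option in options:
    print("|".join(option))
```

Combinations for which the generation table has no entry (for example home|NN|singular in this toy table) simply produce no option.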

11 Synchronous factored models Translation steps: on the phrase level Generation steps: on the word level

12 Training
  Prepare the training data (run automatic tools on the corpus to add the factor annotations)
  Establish word alignment (symmetrized GIZA++ alignments)
  Each mapping step forms a component of the overall model
  Extract phrase pairs that are consistent with the word alignment
  Estimate scoring functions (conditional phrase translation probabilities or lexical translation probabilities)

13 Word alignment

14 Extract phrase natürlich hat john # naturally john has

15 Extract phrase for other factors ADV V NNP # ADV NNP V
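Phrase extraction for the surface and POS examples above can be sketched with one routine that reuses the same word alignment for every factor. This is a simplified version of consistent phrase-pair extraction (it does not extend phrases over unaligned words); the function name and the alignment links are assumptions chosen to match the example.

```python
def extract_phrases(src, tgt, alignment, max_len=3):
    """Return phrase pairs consistent with the alignment (list of (s, t) links)."""
    pairs = []
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to the chosen source span.
            linked = [t for s, t in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start >= max_len:
                continue
            # Consistency: no target word inside the span may link outside it.
            if any(t_start <= t <= t_end and not s_start <= s <= s_end
                   for s, t in alignment):
                continue
            pairs.append((" ".join(src[s_start:s_end + 1]),
                          " ".join(tgt[t_start:t_end + 1])))
    return pairs


# Assumed links: natürlich-naturally, hat-has, john-john.
alignment = [(0, 0), (1, 2), (2, 1)]
surface_pairs = extract_phrases(["natürlich", "hat", "john"],
                                ["naturally", "john", "has"], alignment)
pos_pairs = extract_phrases(["ADV", "V", "NNP"],
                            ["ADV", "NNP", "V"], alignment)
print(surface_pairs)
print(pos_pairs)
```

Because only the token sequences change between the two calls, the POS phrase table comes out parallel to the surface phrase table.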

16 Training the generation model
  On the output side only:
    No word alignment
    Additional monolingual data may be used
    Learn on a word-for-word basis

17 Map factor(s) to factor(s)
  Example: word -> POS and POS -> word
  The/DET big/ADJ tree/NN
  Count collection:
    count( the, DET )++
    count( big, ADJ )++
    count( tree, NN )++
  Probability distributions (maximum likelihood estimates):
    p( the | DET ) and p( DET | the )
    p( big | ADJ ) and p( ADJ | big )
    p( tree | NN ) and p( NN | tree )
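The count collection and the maximum likelihood estimates can be sketched in a few lines. The function name `train_generation` is an assumption, and the second tagged sentence in the toy corpus is invented so that the estimates are not all equal to 1.

```python
from collections import Counter


def train_generation(tagged_words):
    """Estimate p(word | POS) and p(POS | word) by relative frequency."""
    pair_counts = Counter(tagged_words)
    word_counts = Counter(word for word, _ in tagged_words)
    tag_counts = Counter(tag for _, tag in tagged_words)
    p_word_given_tag = {(w, t): c / tag_counts[t]
                        for (w, t), c in pair_counts.items()}
    p_tag_given_word = {(w, t): c / word_counts[w]
                        for (w, t), c in pair_counts.items()}
    return p_word_given_tag, p_tag_given_word


corpus = [
    ("the", "DET"), ("big", "ADJ"), ("tree", "NN"),    # sentence from the slide
    ("the", "DET"), ("small", "ADJ"), ("tree", "NN"),  # invented extra sentence
]
p_word_given_tag, p_tag_given_word = train_generation(corpus)
print(p_word_given_tag[("big", "ADJ")], p_tag_given_word[("big", "ADJ")])
```

Since no word alignment is involved, the same routine could be run over additional monolingual output-side data.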

18 Combination of components Language model Reordering model Translation steps Generation steps
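In phrase-based SMT such components are commonly combined as weighted feature functions in a log-linear model. The sketch below assumes that setup; the feature values and weights are made-up numbers, and `log_linear_score` is a hypothetical helper.

```python
import math


def log_linear_score(features, weights):
    """Weighted sum of log feature values: sum_i weight_i * log(h_i)."""
    return sum(weights[name] * math.log(value)
               for name, value in features.items())


# Made-up weights and component scores for two competing hypotheses.
weights = {"language_model": 1.0, "reordering": 0.5,
           "translation": 1.0, "generation": 1.0}
hyp_a = {"language_model": 0.02, "reordering": 0.5,
         "translation": 0.3, "generation": 0.9}
hyp_b = {"language_model": 0.01, "reordering": 0.5,
         "translation": 0.3, "generation": 0.9}

best = max([("a", hyp_a), ("b", hyp_b)],
           key=lambda hyp: log_linear_score(hyp[1], weights))
print(best[0])
```

With this formulation every translation step and every generation step contributes its own weighted term alongside the language model and the reordering model.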

20 Efficient decoding
  Mapping steps add complexity
  A single translation table is replaced by multiple tables

21 Pre-computation
  Prior to the heuristic beam search, the expansions of the mapping steps can be pre-computed and stored as translation options.
  All possible translation options are computed before decoding.
  No change to the fundamental search algorithm

22 Beam search
  Start with an empty hypothesis
  Create new hypotheses by using all applicable translation options
  Generate further hypotheses in the same manner
  Continue until the full input sentence is covered
  Highest-scoring complete hypothesis = best translation according to the model
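The search loop can be sketched as follows. This is a deliberately reduced version: it decodes monotonically (left to right, no reordering), has no language model and no hypothesis recombination, and the translation options with their log scores are invented for the example.

```python
def beam_search(n_words, options, beam_size=10):
    """Monotone beam search over pre-computed translation options.

    options: list of (start, end, output, log_score), with end exclusive.
    Returns the best (log_score, outputs) covering all n_words, or None.
    """
    # stacks[k] holds hypotheses that cover the first k input words.
    stacks = [[] for _ in range(n_words + 1)]
    stacks[0].append((0.0, ()))  # the empty hypothesis
    for covered in range(n_words):
        # Keep only the best hypotheses of this stack (the beam).
        survivors = sorted(stacks[covered], key=lambda h: h[0],
                           reverse=True)[:beam_size]
        for score, output in survivors:
            for start, end, words, option_score in options:
                if start != covered:
                    continue  # option does not continue this hypothesis
                stacks[end].append((score + option_score, output + (words,)))
    if not stacks[n_words]:
        return None
    return max(stacks[n_words], key=lambda h: h[0])


# Invented options for "neue häuser werden gebaut".
options = [
    (0, 1, "new", -0.1),
    (1, 2, "houses", -0.2),
    (1, 2, "homes", -0.9),
    (2, 4, "are built", -0.3),
    (2, 3, "are", -0.4),
    (3, 4, "built", -0.4),
]
best = beam_search(4, options)
print(" ".join(best[1]))
```

Because the options are computed before the search starts, the loop itself does not need to know whether they came from a single phrase table or from a chain of mapping steps.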

23 Problem
  One or more mapping steps can vastly increase the number of expansions, producing too many translation options to handle.

24 Current solution Early pruning of expansions Limitation on the number of translation options per input phrase (max: 50)
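The limit on translation options per input phrase can be sketched as a simple top-k selection. The function name and the scored candidates are hypothetical; only the limit of 50 comes from the slide.

```python
import heapq

MAX_OPTIONS_PER_PHRASE = 50  # limit quoted on the slide


def prune_options(options, limit=MAX_OPTIONS_PER_PHRASE):
    """Keep the `limit` highest-scoring (translation, log_score) options."""
    return heapq.nlargest(limit, options, key=lambda option: option[1])


# 200 made-up candidates with steadily decreasing log scores.
candidates = [(f"option_{i}", -0.1 * i) for i in range(200)]
kept = prune_options(candidates)
print(len(kept), kept[0][0], kept[-1][0])
```

Applying the same cut after each mapping step, rather than only at the end, is one way to realise the early pruning of expansions.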

25 Experiments and results Moses system

26 Syntactically enriched output
  Input    Output
  Word     Word (tri-gram)
           POS (7-gram)

27 Syntactically enriched output
  English - German, Europarl, 30 million words, 2006
  Model                    BLEU
  best published result    18.15%
  baseline (surface)       18.04%
  surface + POS            18.15%
  surface + POS + morph    18.22%

28 Morphological analysis and generation
  Input         Output
  Word          Word
  Lemma         Lemma
  POS           POS
  Morphology    Morphology

29 Morphological analysis and generation
  German - English, News Commentary data, 1 million words, 2007
  Model                          BLEU
  baseline (surface)             18.19%
  + POS LM                       19.05%
  pure lemma / morph model       14.46%
  backoff lemma / morph model    19.47%

30 Use of automatic word classes
  Input    Output
  Word     Word (tri-gram)
           Word class (7-gram)

31 Use of automatic word classes
  English - Chinese, IWSLT, sentences, 2006
  Model                   BLEU
  baseline (surface)      19.54%
  surface + word class    21.10%

32 Integrated recasing
  Input          Output
  Lower-cased    Lower-cased
                 Mixed-cased

33 Integrated recasing
  Chinese - English, IWSLT, sentences, 2006
  Model                                    BLEU
  standard two-pass: SMT + recase          20.65%
  integrated factored model (optimized)    21.08%

