Theoretical and Methodological Issues in MT (TMI), Skövde, Sweden, Sep. 7-9, 2007. Statistical MT from TMI-1988 to TMI-2007: What has happened?


Statistical MT from TMI-1988 to TMI-2007: What has happened?

Hermann Ney, E. Matusov, A. Mauser, D. Vilar, R. Zens
Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, D-52056 Aachen, Germany
H. Ney © RWTH Aachen, 9-Sep-2007

Contents
1 History
2 EU Project TC-Star (2004-2007)
3 Statistical MT
  3.1 Training
  3.2 Phrase Extraction
  3.3 Phrase Models and Log-Linear Scoring
  3.4 Generation
4 Recent Extensions
  4.1 System Combination
  4.2 Gappy Phrases
  4.3 Statistical MT With No/Scarce Resources

1 History

Statistics and NLP: Myths and Dogmas

The use of statistics has been controversial in NLP. Chomsky 1969: "... the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." This view was considered to be true by most experts in NLP and AI.

History: Statistical Translation

A short (and simplified) history:
- 1949 Shannon/Weaver: statistical (= information-theoretic) approach
- 1950-1970 empirical/statistical approaches to NLP ("empiricism")
- 1969 Chomsky: ban on statistics in NLP
- 1970-? hype of AI and rule-based approaches
- 1988 TMI: Brown presents IBM's statistical approach
- 1988-1995 statistical translation at IBM Research:
  - corpus: Canadian Hansards: English/French parliamentary debates
  - DARPA evaluation in 1994: comparable to conventional approaches (Systran)
- 1992 TMI: "Empiricist vs. Rationalist Methods in MT": controversial panel discussion

After IBM: 1995-...

Limited domain: speech translation: travelling, appointment scheduling, ...
- projects: Verbmobil (German); EU projects: Eutrans, PF-Star

Unlimited domain:
- DARPA TIDES 2001-04: written text (newswire): Arabic/Chinese to English
- EU TC-Star 2004-07: speech-to-speech translation
- DARPA GALE 2005-07+: Arabic/Chinese to English, speech and text; ASR, MT and information extraction; measure: HTER (= human translation error rate)

Verbmobil 1993-2000

German national project:
- general effort in 1993-2000: about 100 scientists per year
- statistical MT in 1996-2000: 5 scientists per year
- task: input: SPOKEN language for a restricted domain: appointment scheduling, travelling, tourism information, ...
- vocabulary size: about 10 000 words (= full forms)
- competing approaches and systems

End-to-end evaluation in June 2000 (U Hamburg): human evaluation (blind): is the sentence approximately correct, yes/no? Overall result: statistical MT highly competitive. Similar results for the European projects Eutrans (1998-2000) and PF-Star (2001-2004).

Translation Method  | Error [%]
--------------------|----------
Semantic Transfer   | 62
Dialog Act Based    | 60
Example Based       | 51
Statistical         | 29

Ingredients of the statistical approach:
- Bayes decision rule: minimizes the decision errors; consistent and holistic criterion
- probabilistic dependencies: toolbox of statistics; problem-specific models (in lieu of "big tables")
- learning from examples: statistical estimation and machine learning; suitable training criteria

Approach: statistical MT = structural (linguistic?) modelling + statistical decision/estimation

Analogy: ASR and Statistical MT

Klatt in 1980 about the principles of DRAGON and HARPY (1976); p. 261/2 in Lea, W. (1980): Trends in Speech Recognition:

"... the application of simple structured models to speech recognition. It might seem to someone versed in the intricacies of phonology and the acoustic-phonetic characteristics of speech that a search of a graph of expected acoustic segments is a naive and foolish technique to use to decode a sentence. In fact such a graph and search strategy (and probably a number of other simple models) can be constructed and made to work very well indeed if the proper acoustic-phonetic details are embodied in the structure."

My adaptation to statistical MT:

"... the application of simple structured models to machine translation. It might seem to someone versed in the intricacies of morphology and the syntactic-semantic characteristics of language that a search of a graph of expected sentence fragments is a naive and foolish technique to use to translate a sentence. In fact such a graph and search strategy (and probably a number of other simple models) can be constructed and made to work very well indeed if the proper syntactic-semantic details are embodied in the structure."

2 EU Project TC-Star (2004-2007)

March 2007: state of the art for speech/language translation. Domain: speeches given in the European Parliament. Work on a real-life task:
- unlimited domain
- large vocabulary
- speech input: cope with disfluencies, handle recognition errors, sentence segmentation
- reasonable performance

Speech-to-Speech Translation

speech in source language
  → ASR: automatic speech recognition
text in source language
  → SLT: spoken language translation
text in target language
  → TTS: text-to-speech synthesis
speech in target language

Characteristic features of TC-Star:
- full chain of core technologies: ASR, SLT (= MT), TTS, and their interactions
- unlimited domain and real-life task: primary domain: speeches in the European Parliament
- periodic evaluations of all core technologies

TC-Star: Approaches to MT (IBM, IRST, LIMSI, RWTH, UKA, UPC)
- phrase-based approaches and extensions: extraction of phrase pairs, weighted FSTs, ...
- estimation of phrase table probabilities
- improved alignment methods
- log-linear combination of models (scoring of competing hypotheses)
- use of morphosyntax (verb forms, number, noun/adjective, ...)
- language modelling (neural net, sentence level, ...)
- word and phrase re-ordering (local re-ordering, shallow parsing, MaxEnt for phrases)
- generation (search): efficiency is crucial

- system combination for MT: generate improved output from several MT engines; problem: word re-ordering
- interface ASR-MT: effect of word recognition errors; pass on ambiguities of ASR; sentence segmentation

More details: webpage + papers.

[Diagram: three input conditions for translation. Speech in the source language is turned into text by automatic speech recognition (ASR input), by human transcription (verbatim input), or by additional text editing (text input); each version is passed to spoken language translation, yielding three translation results.]

Evaluation 2007: Spanish-English

Three types of input to translation:
- ASR: (erroneous) recognizer output
- verbatim: correct transcription
- text: final text edition (after removing effects of spoken language: false starts, hesitations, ...)

Best results (system combination) of the 2007 evaluation:

Input            | BLEU [%] | PER [%] | WER [%]
-----------------|----------|---------|--------
ASR (WER = 5.9%) | 44.8     | 30.4    | 43.1
Verbatim         | 53.5     | 25.8    | 35.5
Text             | 53.6     | 26.7    | 37.2
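The PER column above ignores word order entirely. As an illustration only, here is a minimal sketch of one common bag-of-words formulation of the position-independent error rate; the exact definition used in the evaluation may differ in detail.

```python
from collections import Counter

def per(reference, hypothesis):
    """Position-independent error rate (simplified bag-of-words variant):
    word order is ignored; only the multisets of words are compared."""
    ref, hyp = reference.split(), hypothesis.split()
    # number of words matched irrespective of position
    matches = sum((Counter(ref) & Counter(hyp)).values())
    errors = max(len(ref), len(hyp)) - matches
    return errors / len(ref)
```

With this definition, a hypothesis that is a permutation of the reference has PER = 0 even though its WER may be large, which is why PER tracks ASR errors more directly than WER.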

[Figure: English-Spanish 2007, human vs. automatic evaluation. Scatter plot of BLEU(sub) (25-50) against the mean of adequacy and fluency (2.6-3.6) for the systems IBM, IRST, LIMSI, RWTH, UKA, UPC, UDS, ROVER, Reverso, and Systran under the FTE, verbatim, and ASR conditions.]

English-Spanish: Human vs. Automatic Evaluation

Observations:
- good performance: BLEU close to 50%, PER close to 30%
- fairly good correlation between adequacy/fluency (human) and BLEU (automatic)
- degradation from text to verbatim: none or small; from verbatim to ASR: PER corresponds to the ASR errors

Today's Statistical MT

Four key components in building today's MT systems:
- training: word alignment and probabilistic lexicon of (source, target) word pairs
- phrase extraction: find (source, target) fragments (= "phrases") in the bilingual training corpus
- log-linear model: combine various types of dependencies between F and E
- generation (search, decoding): generate the most likely (= plausible) target sentence

ASR: some similar components (not all!)

3 Statistical MT

Starting point: probabilistic models in the Bayes decision rule:

    F → Ê(F) = argmax_E { p(E|F) } = argmax_E { p(E) · p(F|E) }

3.1 Training

The distributions p(E) and p(F|E) are unknown and must be learned. They are complex: distributions over strings of symbols; using them directly is not possible (sparse-data problem)! Therefore: introduce (simple) structures by decomposition into smaller units that are easier to learn and hopefully capture some true dependencies in the data. Example: ALIGNMENTS of words and positions: bilingual correspondences between words (rather than sentences); this counteracts sparse data and supports generalization capabilities.
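The decision rule above can be made concrete with a toy sketch: given a language model p(E) and a translation model p(F|E), pick the candidate E maximizing their product (in log space). All probabilities and candidates below are invented purely for illustration.

```python
import math

# Toy language model p(E) and translation model p(F|E); invented numbers.
lm = {"the house": 0.6, "house the": 0.1}
tm = {("das Haus", "the house"): 0.5,
      ("das Haus", "house the"): 0.5}

def decode(f, candidates):
    """Bayes decision rule: E_hat = argmax_E p(E) * p(F|E),
    computed in log space over an explicit candidate set."""
    return max(candidates,
               key=lambda e: math.log(lm[e]) + math.log(tm[(f, e)]))
```

Here the translation model cannot distinguish the two word orders, so the language model breaks the tie, which is exactly the division of labor the decomposition is meant to achieve.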

Example of Alignment (Canadian Hansards)

Source: En vertu de les nouvelles propositions, quel est le cout prevu de administration et de perception de les droits?
Target: What is the anticipated cost of administering and collecting fees under the new proposal?

[Figure: word alignment matrix between the two sentences.]

Standard procedure: sequence of IBM-1, ..., IBM-5 and HMM models (conferences before 2000; Comp. Ling. 2003+2004):
- EM algorithm (and its approximations)
- implementation in GIZA++

Remarks on training:
- based on single-word lexica p(f|e) and p(e|f); no context dependency
- simplifications: only IBM-1 and HMM
- alternative concept for alignment (and generation): ITG approach [Wu, ACL 1995/6]
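The first step of that sequence, EM training of the IBM-1 lexicon p(f|e), fits in a few lines. This is a minimal textbook sketch, not the GIZA++ implementation; the NULL word and the uniform initialization follow the standard model.

```python
from collections import defaultdict

def ibm1(bitext, iterations=10):
    """EM training of IBM Model 1 lexical probabilities t(f|e).
    `bitext` is a list of (source_words, target_words) pairs; a NULL
    target word lets source words align to nothing."""
    f_vocab = {f for fs, _ in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))      # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                   # expected counts c(f, e)
        total = defaultdict(float)                   # expected counts c(e)
        for fs, es in bitext:
            es = ["NULL"] + es
            for f in fs:
                z = sum(t[(f, e)] for e in es)       # E-step normalization
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():              # M-step
            t[(f, e)] = c / total[e]
    return t
```

Even on two sentence pairs, EM resolves the ambiguity: the word co-occurring with "the" in both pairs ends up with the higher lexical probability.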

HMM: Recognition vs. Translation

speech recognition:
    Pr(x_1^T | T, w) = sum over s_1^T of prod_t [ p(s_t | s_{t-1}, S_w, w) · p(x_t | s_t, w) ]

text translation:
    Pr(f_1^J | J, e_1^I) = sum over a_1^J of prod_j [ p(a_j | a_{j-1}, I) · p(f_j | e_{a_j}) ]

speech recognition                        | text translation
------------------------------------------|------------------------------------------
time t = 1, ..., T                        | source positions j = 1, ..., J
observations x_1^T with acoustic vectors x_t | observations f_1^J with source words f_j
states s = 1, ..., S_w of word w          | target positions i = 1, ..., I with target words e_1^I
path: t → s = s_t (always monotonous)     | alignment: j → i = a_j (partially monotonous)
transition prob. p(s_t | s_{t-1}, S_w, w) | alignment prob. p(a_j | a_{j-1}, I)
emission prob. p(x_t | s_t, w)            | lexicon prob. p(f_j | e_{a_j})

3.2 Phrase Extraction

Segmentation into two-dimensional blocks. Blocks have to be consistent with the word alignment: words within the phrase cannot be aligned to words outside the phrase; unaligned words are attached to adjacent phrases.

Purpose: decomposition of a sentence pair (F, E) into phrase pairs (f~_k, e~_k), k = 1, ..., K:

    p(E|F) = p(e~_1^K | f~_1^K) = prod_k p(e~_k | f~_k)

(after suitable re-ordering at the phrase level)
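The consistency condition can be sketched directly: enumerate source spans, project each onto the target side, and keep the block only if no alignment link leaves it. This sketch omits the attachment of unaligned words to adjacent phrases mentioned above; spans are (inclusive) index pairs and `alignment` is a set of (j, i) links.

```python
def extract_phrases(f_len, e_len, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment:
    no word inside the block may be aligned to a word outside it."""
    phrases = set()
    for j1 in range(f_len):
        for j2 in range(j1, min(j1 + max_len, f_len)):
            # target positions linked to the source span [j1, j2]
            links = [i for (j, i) in alignment if j1 <= j <= j2]
            if not links:
                continue
            i1, i2 = min(links), max(links)
            # consistency: no link from inside [i1, i2] leaves [j1, j2]
            if all(j1 <= j <= j2 for (j, i) in alignment if i1 <= i <= i2):
                phrases.add(((j1, j2), (i1, i2)))
    return phrases
```

For a two-word sentence pair aligned diagonally, this yields the two single-word blocks plus the full-sentence block, and rejects any block that cuts an alignment link.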

Phrase Extraction: Example

[Figure: alignment matrices for the sentence pair "wenn ich eine Uhrzeit vorschlagen darf" / "if I may suggest a time of day ?", showing possible (alignment-consistent) and impossible phrase pairs.]

Example: Alignments for Phrase Extraction

Source sentence (gloss notation): I VERY HAPPY WITH YOU AT TOGETHER .
Target sentence: I enjoyed my stay with you .

[Figure: Viterbi alignment matrix for F → E.]

Example: Alignments for Phrase Extraction

[Figure: alignment matrices for the same sentence pair: Viterbi F → E, Viterbi E → F, union, intersection, and refined combination.]

Alignments for Phrase Extraction

Most alignment models are asymmetric: F → E and E → F will give different results. In practice, both directions are combined using a simple heuristic:
- intersection: only use alignments where both directions agree
- union: use all alignments from both directions
- refined: start from the intersection and include adjacent alignments from each direction

Effect on the number of extracted phrases and on translation quality (IWSLT 2005):

heuristic    | # phrases | BLEU [%] | TER [%] | WER [%] | PER [%]
-------------|-----------|----------|---------|---------|--------
union        |   489 035 | 49.5     | 36.4    | 38.9    | 29.2
refined      | 1 055 455 | 54.1     | 34.9    | 36.8    | 28.9
intersection | 3 582 891 | 56.0     | 34.3    | 35.7    | 29.2
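The three heuristics above operate on nothing more than two sets of (j, i) links. A minimal sketch, with the "refined" growth step simplified to repeatedly adding union links adjacent (horizontally or vertically) to an already accepted link:

```python
def symmetrize(f2e, e2f, method="intersection"):
    """Combine two directional word alignments, each a set of (j, i) links."""
    inter, union = f2e & e2f, f2e | e2f
    if method == "intersection":
        return inter
    if method == "union":
        return union
    # "refined": grow the intersection with adjacent union links
    refined = set(inter)
    added = True
    while added:
        added = False
        for (j, i) in union - refined:
            if any((j + dj, i + di) in refined
                   for dj, di in ((1, 0), (-1, 0), (0, 1), (0, -1))):
                refined.add((j, i))
                added = True
    return refined
```

The table above shows why the choice matters: the sparser the combined alignment, the fewer constraints on extraction and the more phrase pairs survive, which here helped BLEU.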

3.3 Phrase Models and Log-Linear Scoring

Combination of various types of dependencies using the log-linear framework (maximum entropy):

    p(E|F) = exp[ sum_m lambda_m h_m(E, F) ] / sum_{E~} exp[ sum_m lambda_m h_m(E~, F) ]

with models (feature functions) h_m(E, F), m = 1, ..., M.

Bayes decision rule:

    F → Ê(F) = argmax_E { p(E|F) }
             = argmax_E { sum_m lambda_m h_m(E, F) }
             = argmax_E { exp[ sum_m lambda_m h_m(E, F) ] }

Consequences:
- do not worry about normalization
- include additional feature functions by checking BLEU ("trial and error")
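The "do not worry about normalization" point can be made concrete: since the denominator is the same for every candidate E, the argmax only needs the unnormalized weighted feature sum. A sketch with invented feature names and weights:

```python
def loglinear_score(feats, weights):
    """Unnormalized log-linear score sum_m lambda_m * h_m(E, F).
    The maximum-entropy denominator is constant over E, so it is
    never computed when searching for the argmax."""
    return sum(weights[m] * h for m, h in feats.items())

def best_hypothesis(hyps, weights):
    """Pick the hypothesis with the highest weighted feature sum.
    `hyps` maps each candidate to its feature values h_m(E, F)."""
    return max(hyps, key=lambda e: loglinear_score(hyps[e], weights))
```

Adding a new feature function is then just a new key in the feature dictionaries plus a weight, which matches the "trial and error" workflow described above.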

[Diagram: system architecture. Source-language text → preprocessing → global search Ê = argmax_E { p(E|F) } = argmax_E { sum_m lambda_m h_m(E, F) } → postprocessing → target-language text; the global search draws on language models, phrase models, word models, and reordering models.]

Phrase Model Scoring

Most models h_m(E, F) are based on a segmentation into two-dimensional blocks k = 1, ..., K. Five baseline models:
- phrase lexicon in both directions: p(f~_k | e~_k) and p(e~_k | f~_k); estimation: relative frequencies
- single-word lexicon in both directions: p(f_j | e~_k) and p(e_i | f~_k); model: IBM-1 across the phrase; estimation: relative frequencies
- monolingual (fourgram) LM

7 free parameters: 5 exponents + phrase/word penalty.
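The relative-frequency estimation of the two phrase lexica is simple counting over the extracted phrase pairs. A minimal sketch (the input is a list of (source_phrase, target_phrase) pairs, one entry per extracted occurrence):

```python
from collections import Counter

def phrase_table(phrase_pairs):
    """Estimate the two directional phrase lexica as relative
    frequencies: p(f~|e~) = N(f~, e~)/N(e~) and p(e~|f~) = N(f~, e~)/N(f~)."""
    pair = Counter(phrase_pairs)
    src = Counter(f for f, e in phrase_pairs)       # N(f~)
    tgt = Counter(e for f, e in phrase_pairs)       # N(e~)
    p_f_given_e = {(f, e): c / tgt[e] for (f, e), c in pair.items()}
    p_e_given_f = {(f, e): c / src[f] for (f, e), c in pair.items()}
    return p_f_given_e, p_e_given_f
```

These two tables, together with the IBM-1 single-word scores and the LM, supply the baseline feature functions whose exponents are the free parameters mentioned above.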

History:
- Och et al., EMNLP 1999: alignment templates (with alignment information) and comparison with the single-word based approach
- Zens et al. 2002: German Conference on AI, Springer 2002; phrase models used by many groups (Och ISI / Koehn / ...)

Later extensions, mainly for rescoring N-best lists:
- phrase count model
- IBM-1 p(f_j | e_1^I)
- deletion model
- word n-gram posteriors
- sentence length posterior

Experimental Results: Chinese-English NIST

                                                        BLEU [%]
Search        Model                                    Dev    Test
monotone      4-gram LM + phrase model p(f~|e~)        31.9   29.5
              + word penalty                           32.0   30.7
              + inverse phrase model p(e~|f~)          33.4   31.4
              + phrase penalty                         34.0   31.6
              + inverse word model p(e|f) (noisy-or)   35.4   33.8
non-monotone  + distance-based reordering              37.6   35.6
              + phrase orientation model               38.8   37.3
              + 6-gram LM (instead of 4-gram)          39.2   37.8

Dev: NIST 02 eval set; Test: combined NIST 03-NIST 05 eval sets.

Re-ordering Models

Soft constraints ("scores"):
- distance-based reordering model
- phrase orientation model

Hard constraints (to reduce search complexity):
- level of source words: local re-ordering; IBM (forward) constraints; IBM backward constraints
- level of source phrases: IBM constraints (e.g. #skip = 2)
- side track: ITG constraints

Phrase Orientation Model

[Figure: left vs. right phrase orientation, shown as phrase blocks over source positions j and target positions i.]

Re-ordering Constraints

Dependence on specific language pairs: German-English, Spanish-English, French-English, Japanese-English (BTEC), Chinese-English, Arabic-English.

3.4 Generation

Constraints: no empty phrases, no gaps, and no overlaps. Operations with interdependencies:
- find segment boundaries
- allow re-ordering in the target language
- find the most plausible sentence

Similar to memory-based and example-based translation. Search strategies: (Tillmann et al.: Coling 2000, Comp. Ling. 2003; Ueffing et al., EMNLP 2002)

Travelling Salesman Problem: Redrawn Network (J = 6)

[Figure: search network over subsets of visited positions for J = 6, illustrating the TSP-style state space of the coverage-based search.]

Reordering: IBM Constraints

[Figure: coverage of source positions 1, ..., J with covered positions, uncovered positions, and uncovered positions available for extension; IBM constraints with #skip = 3.] Result: limited reordering lattice.

DP-based Algorithm for Statistical MT

Extensions: phrases rather than words; rest-cost estimate for uncovered positions.

    input: source language string f_1 ... f_j ... f_J
    for each cardinality c = 1, 2, ..., J do
        for each set C ⊆ {1, ..., J} of covered positions with |C| = c do
            for each target suffix string e~ do
                evaluate score Q(C, e~) := ...
        apply beam pruning
    traceback: recover optimal word sequence
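The loop structure above can be sketched as a toy decoder: hypotheses are keyed by (coverage set, partial translation), grouped by cardinality, and pruned per cardinality before expansion. This is a deliberately stripped-down model: no language model, no reordering cost, no rest-cost estimate; the score is just the sum of phrase log-probabilities, and `phrases` maps a source span (j1, j2) to (target_string, log_prob) options.

```python
import heapq
from collections import defaultdict

def beam_decode(src_len, phrases, beam=10):
    """Toy coverage-based DP beam search over phrase segmentations."""
    hyps = defaultdict(dict)                 # cardinality -> {(C, e~): score}
    hyps[0][(frozenset(), "")] = 0.0
    for c in range(src_len):
        # beam pruning among hypotheses of equal cardinality
        level = dict(heapq.nlargest(beam, hyps[c].items(),
                                    key=lambda kv: kv[1]))
        for (cov, out), score in level.items():
            for (j1, j2), options in phrases.items():
                span = frozenset(range(j1, j2 + 1))
                if span & cov:
                    continue                 # no overlaps
                for tgt, logp in options:
                    state = (cov | span, (out + " " + tgt).strip())
                    s = score + logp
                    bucket = hyps[len(state[0])]
                    if s > bucket.get(state, float("-inf")):
                        bucket[state] = s    # DP recombination: keep best
    goal = hyps[src_len]
    return max(goal.items(), key=lambda kv: kv[1])[0][1] if goal else None
```

Note how recombination (keeping only the best score per state) and pruning per cardinality together keep the exponential coverage space manageable, which is the point of the DP formulation.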

DP-based Algorithm for Statistical MT

Dynamic programming beam search: build up hypotheses of increasing cardinality. Each hypothesis (C, e~) has two parts: coverage hypothesis (C) + lexical hypothesis (e~). Consider and prune competing hypotheses:
- with the same coverage vector
- with the same cardinality
- additional: observation pruning

Effect of Phrase Length

How does translation accuracy depend on the length of the matching phrases? Experimental analysis: measure BLEU separately for each sentence; plot BLEU vs. the average length of the matching phrases. Experimental result: as the average phrase length goes from 1 to 3, BLEU rises from about 20% to 40%.

Effect of Phrase Length (Chinese-English NIST)

[Figure: per-sentence BLEU (0-1) plotted against average source phrase length (1-4.5), for all phrases and for MaxLen 3, each with a linear regression line.]

Conclusions about Statistical MT

Memory effect: more and longer matching phrases help improve translation accuracy; today's SMT is closer to example/memory-based MT than 10 years ago. The most important differences to example/memory-based MT:
- consistent scoring (handles weak interdependencies and conflicting requirements)
- fully automatic training (starting from a sentence-aligned bilingual corpus)

4 Recent Extensions
- system combination
- gappy phrases
- statistical MT without data?

4.1 System Combination

Concept for combining translations from several MT engines:
- align the system outputs: non-monotone alignment (as in training)
- construct a confusion network from the aligned hypotheses
- use weights and a language model to select the best translation
- use of an adapted language model: adaptation to the translated test sentences
- 10-best lists of each individual system as input

First work presented at EACL 2006 (similar approaches in GALE).

Build Confusion Network

Example: system hypotheses with weights:

    0.25  would your like coffee or tea   (1+3)
    0.35  have you tea or coffee
    0.10  would like your coffee or
    0.30  I have some coffee tea would you like

Alignment and re-ordering against the primary hypothesis (pairs of primary word | system word, $ = empty):

    would|have   your|you   like|$    coffee|coffee  or|or  tea|tea
    would|would  your|your  like|like coffee|coffee  or|or  $|tea
    I|$  would|would  you|your  like|like  have|$  some|$  coffee|coffee  $|or  tea|tea

Extract Consensus Translation

Introduce confidence factors for each system and vote.

Confusion network:

    $  would  your  like  $     $     coffee  or  tea
    $  have   you   $     $     $     coffee  or  tea
    $  would  your  like  $     $     coffee  or  $
    I  would  you   like  have  some  coffee  $   tea

Voting (weight per slot):

    $/0.7 I/0.3 | would/0.65 have/0.35 | you/0.65 your/0.35 | like/0.65 $/0.35 |
    $/0.7 have/0.3 | $/0.7 some/0.3 | coffee/1.0 | or/0.7 $/0.3 | tea/0.9 $/0.1

Refinements: use each system output as the primary reference (combine several confusion networks); include a language model.
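Once the hypotheses are aligned into columns, the voting step itself is simple: in each slot every system votes with its weight, the best word wins, and empty symbols are dropped. A minimal sketch (column/weight data below are invented; the slide's own example additionally uses a language model):

```python
from collections import defaultdict

def consensus(network, weights):
    """Consensus translation from an aligned confusion network.
    `network` is a list of columns; each column maps a system index
    to its word in that slot ('$' = empty arc)."""
    out = []
    for column in network:
        votes = defaultdict(float)
        for sys, word in column.items():
            votes[word] += weights[sys]          # weighted vote per slot
        best = max(votes, key=lambda w: votes[w])
        if best != "$":                          # drop empty arcs
            out.append(best)
    return " ".join(out)
```

This makes visible why the method tends to improve error measures like PER: the vote fixes lexical choices slot by slot, while global sentence structure is only as good as the underlying alignment.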

Results

Combination of 5 MT systems developed for the GALE 2007 evaluation (Arabic, NIST05, case-insensitive):

             | PER [%] | BLEU [%] | TER [%]
-------------|---------|----------|--------
worst system | 33.9    | 44.2     | 47.4
best system  | 28.4    | 55.3     | 38.9
combination  | 27.7    | 57.1     | 36.8

Observations:
- often: improvements, in particular for ERROR measures (like PER)
- word re-ordering and alignment: sentence structure is not always preserved
- the adapted language model gives a bonus to n-grams present in the original phrases

Question: what is the human performance?

Experimental Results

Effect of individual system combination components (TC-STAR 2007 evaluation data, English-to-Spanish, verbatim condition):

                                          | BLEU [%] | WER [%] | PER [%] | NIST
------------------------------------------|----------|---------|---------|------
worst single system                       | 49.3     | 39.8    | 30.0    |  9.95
best single system                        | 52.4     | 36.7    | 27.9    | 10.45
system combination:                       |          |         |         |
  single confusion net (uniform weights)  | 53.0     | 35.3    | 27.1    | 10.60
  + manual weight                         | 53.4     | 35.5    | 27.0    | 10.62
  + union of all confusion nets           | 53.8     | 35.6    | 26.8    | 10.60
  + adapted LM                            | 54.3     | 35.2    | 27.4    | 10.65
  + automatic weight optimization         | 54.5     | 35.5    | 27.5    | 10.62

Shortcomings of the Present MT ROVER

Task: TC-STAR 2006 Spanish-to-English evaluation data, 300 sentences. "Human MT ROVER": human experts generate the output sentence.

System                   | BLEU [%] | WER [%] | PER [%] | NIST
-------------------------|----------|---------|---------|-----
worst single system      | 52.0     | 35.8    | 27.2    | 9.33
best single system       | 54.1     | 34.2    | 25.5    | 9.47
system combination       | 55.2     | 32.9    | 25.1    | 9.63
human system combination | 58.2     | 31.5    | 24.3    | 9.85

Result: room for improvement: BLEU from 54.1% to 58.2% (human) vs. 55.2% (automatic), both in lexical choices (PER) and word order.

4.2 Gappy Phrases

Concept: allow for gaps in the phrase pairs. Effect: long-distance dependencies.

History:
- McTait & Trujillo 1999: discontiguous translation patterns
- U. Block 2000 (Verbmobil): (translation) pattern pairs
- R. Zens: diploma thesis 2002, RWTH Aachen (unpublished)
- D. Chiang 2005: hierarchical phrases

So far: (source, target) phrase pairs (α, β) without gaps: p(β | α).

Discontiguous phrase pairs (α₁ A α₂, β₁ B β₂) WITH gaps (A, B):

    p(β₁ B β₂ | α₁ A α₂) = p(B | A) · p(β₁ _ β₂ | α₁ _ α₂)


Ongoing work:
- heuristics for gappy phrase extraction
- scoring of phrase models
- generation (search): top-down vs. bottom-up, efficiency, ...

Preliminary Experimental Results

IWSLT 2007, Chinese-to-English task:

System    | BLEU | TER  | WER  | PER
----------|------|------|------|-----
mono. PBT | 29.6 | 56.0 | 58.3 | 48.9
best PBT  | 37.2 | 48.0 | 48.7 | 44.3
gappy PBT | 35.0 | 50.5 | 51.3 | 46.4

Examples:

    best PBT:  Please tell me how to get there.
    gappy PBT: Do you have any cancellation, please let me know.
    Reference: If there is a cancellation, please let me know.

    best PBT:  Take me to a hospital?
    gappy PBT: What should I take to go to the hospital?
    Reference: What should I take with me to the hospital?

4.3 Statistical MT With No/Scarce Resources

Two aspects of statistical MT:
- decision process (from source F to target E): Ê = argmax_E { p(E) · p(F|E) }
- learning the probability models: language model p(E): monolingual corpus; lexicon/translation model p(F|E): bilingual corpus

Idea: a bilingual corpus is sometimes difficult to get; substitute: a conventional bilingual dictionary (and use uniform probability distributions). Consequence: morphology and morphosyntax are helpful (all SMT systems use full-form words!)
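The dictionary substitute with uniform probabilities is easy to picture: each source entry spreads its probability mass evenly over its listed translations. A minimal sketch (the dictionary content in the test is invented):

```python
def dictionary_lexicon(dictionary):
    """Turn a conventional bilingual dictionary into a translation
    lexicon with uniform probabilities: each source word distributes
    its mass evenly over its listed translations."""
    return {(f, e): 1.0 / len(translations)
            for f, translations in dictionary.items()
            for e in translations}
```

Such a lexicon carries no corpus statistics at all, which is why the results below improve so markedly once even 1k of bilingual data is added.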

Observations: Spanish-English

Bilingual data          | WER  | PER  | BLEU | OOVs
------------------------|------|------|------|------
dictionary              | 60.4 | 49.3 | 19.4 | 20.7
 + adjective treatment  | 56.4 | 46.8 | 23.8 | 18.9
1k                      | 52.4 | 40.7 | 30.0 | 10.6
 + dictionary           | 48.0 | 36.5 | 36.0 |  6.8
 + adjective treatment  | 44.5 | 34.8 | 40.9 |  5.9
13k                     | 41.8 | 30.7 | 43.2 |  2.8
 + dictionary           | 40.6 | 29.6 | 46.3 |  2.4
 + adjective treatment  | 38.3 | 29.0 | 49.6 |  2.2
1.3M                    | 34.5 | 25.5 | 54.7 |  0.14
 + adjective treatment  | 33.5 | 25.2 | 56.4 |  0.14

- significant effect of OOV words: the difference in PER is largely caused by the OOV effect!
- reasonable translation quality using small corpora
- dictionary and morpho-syntactic information are helpful

Summary

Today's statistical MT:
- IBM models for word alignment: learning from bilingual data
- from words to phrases: phrase extraction, scoring models, and generation (search) algorithms
- experience with various tasks and distant language pairs, text + speech

Helpful conditions:
- availability of bilingual corpora
- automatic evaluation measures
- public evaluation campaigns
- more powerful computers and algorithms/implementations

THE END