Improving the IBM Alignment Models Using Variational Bayes


Darcey Riley and Daniel Gildea
Computer Science Dept.
University of Rochester
Rochester, NY 14627

Abstract

Bayesian approaches have been shown to reduce the amount of overfitting that occurs when running the EM algorithm, by placing prior probabilities on the model parameters. We apply one such Bayesian technique, variational Bayes, to the IBM models of word alignment for statistical machine translation. We show that using variational Bayes improves the performance of the widely used GIZA++ software, as well as improving the overall performance of the Moses machine translation system in terms of BLEU score.

1 Introduction

The IBM Models of word alignment (Brown et al., 1993), along with the Hidden Markov Model (HMM) (Vogel et al., 1996), serve as the starting point for most current state-of-the-art machine translation systems, both phrase-based and syntax-based (Koehn et al., 2007; Chiang, 2005; Galley et al., 2004). Both the IBM Models and the HMM are trained using the EM algorithm (Dempster et al., 1977).

Recently, Bayesian techniques have become widespread in applications of EM to natural language processing tasks, as a very general method of controlling overfitting. For instance, Johnson (2007) showed the benefits of such techniques when applied to HMMs for unsupervised part-of-speech tagging. In machine translation, Blunsom et al. (2008) and DeNero et al. (2008) use Bayesian techniques to learn bilingual phrase pairs. In this setting, which involves finding a segmentation of the input sentences into phrasal units, it is particularly important to control the tendency of EM to choose longer phrases, which explain the training data well but are unlikely to generalize. However, most state-of-the-art machine translation systems today are built on the basis of word-level alignments of the type generated by GIZA++ from the IBM Models and the HMM.
Overfitting is also a problem in this context, and improving these word alignment systems could be of broad utility in machine translation research. Moore (2004) discusses details of how EM overfits the data when training IBM Model 1. He finds that the EM algorithm is particularly susceptible to overfitting in the case of rare words, due to the garbage collection phenomenon. Suppose a sentence contains an English word e1 that occurs nowhere else in the data, along with its French translation f1. Suppose that same sentence also contains a word e2 which occurs frequently in the overall data but whose translation in this sentence, f2, co-occurs with it infrequently. If the translation t(f2 | e2) occurs with probability 0.1, then the sentence will have a higher probability if EM assigns the rare word and its actual translation a probability of t(f1 | e1) = 0.5 and assigns the rare word's translation to f2 a probability of t(f2 | e1) = 0.5 than if it assigns a probability of 1 to the correct translation t(f1 | e1). Moore suggests a number of solutions to this issue, including add-n smoothing and initializing the probabilities based on a heuristic rather than choosing uniform probabilities. When combined, his solutions cause a significant decrease in alignment error rate (AER). More recently, Mermer and Saraclar (2011) have added a Bayesian prior to IBM Model 1 using Gibbs sampling for inference, showing improvements in BLEU scores.

In this paper, we describe the results of incorporating variational Bayes (VB) into the widely used GIZA++ software for word alignment. We use VB both because it converges more quickly than Gibbs sampling and because it can be applied in a fairly straightforward manner to all of the models implemented by GIZA++. In Section 2, we describe VB in more detail. In Section 3, we present results for VB for the various models, in terms of perplexity of held-out test data, alignment error rate (AER), and the BLEU scores which result from using our version of GIZA++ in the end-to-end phrase-based machine translation system Moses.

[Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 306-310, Jeju, Republic of Korea, 8-14 July 2012. (c) 2012 Association for Computational Linguistics]

2 Variational Bayes and GIZA++

Beal (2003) gives a detailed derivation of a variational Bayesian algorithm for HMMs. The result is a very slight change to the M step of the original EM algorithm. During the M step of the original algorithm, the expected counts collected in the E step are normalized to give the new values of the parameters:

    θ_{x_i | y} = E[c(x_i | y)] / Σ_j E[c(x_j | y)]                       (1)

The variational Bayesian M step performs an inexact normalization, where the resulting parameters add up to less than one. It does this by passing the expected counts collected in the E step through the function f(v) = exp(ψ(v)), where ψ is the digamma function and α is the hyperparameter of the Dirichlet prior (Johnson, 2007):

    θ_{x_i | y} = f(E[c(x_i | y)] + α) / f(Σ_j (E[c(x_j | y)] + α))       (2)

This modified M step can be applied to any model which uses a multinomial distribution; for this reason, it works for the IBM Models as well as HMMs, and is thus what we use for GIZA++. In practice, the digamma function has the effect of subtracting approximately 0.5 from its argument. When α is set to a low value, this results in anti-smoothing: because about 0.5 is subtracted from the expected counts of the translation probabilities, small counts corresponding to rare co-occurrences of words are penalized heavily, while larger counts are not affected very much.
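To make equation (2) concrete, the modified M step can be sketched in a few lines of Python. This is our own minimal illustration, not the GIZA++ code; the digamma approximation, function names, and toy counts below are all our own choices.

```python
import math

def digamma(x):
    """Digamma function psi(x) for x > 0, via the recurrence
    psi(x) = psi(x + 1) - 1/x and an asymptotic series for large x."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    result += (math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x**2)
               + 1.0 / (120 * x**4) - 1.0 / (252 * x**6))
    return result

def vb_m_step(expected_counts, alpha):
    """Variational-Bayes M step for one multinomial t(. | y), eq. (2).
    expected_counts: dict mapping outcome x_i -> E[c(x_i | y)] from the
    E step. Returns unnormalized parameters that sum to less than one."""
    f = lambda v: math.exp(digamma(v))
    total = sum(expected_counts.values()) + alpha * len(expected_counts)
    return {x: f(c + alpha) / f(total) for x, c in expected_counts.items()}
```

Since exp(ψ(v)) ≈ v - 0.5 for v larger than about 2, a hypothetical expected count of 0.5 is mapped to roughly 0.14 while a count of 10 is mapped to roughly 9.5, which is exactly the anti-smoothing effect described above.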
Thus, low values of α cause the algorithm to favor words which co-occur frequently and to distrust words that co-occur rarely. In this way, VB controls the overfitting that would otherwise occur with rare words. On the other hand, higher values of α can be chosen if smoothing is desired, for instance in the case of the alignment probabilities, which state how likely a word in position i of the English sentence is to align to a word in position j of the French sentence. For these probabilities, smoothing is important because we do not want to rule out any alignment altogether, no matter how infrequently it occurs in the data.

We implemented VB for the translation probabilities as well as for the position alignment probabilities of IBM Model 2. We discovered that adding VB for the translation probabilities improved the performance of the system. However, including VB for the alignment probabilities had relatively little effect, because the alignment table in its original form does some smoothing during normalization by interpolating the counts with a uniform distribution. Because VB can itself be a form of smoothing, the two versions of the code behave similarly. We did not experiment with VB for the distortion probabilities of the HMM or Models 3 and 4, as these distributions have fewer parameters and are likely to have reliable counts during EM. Thus, in Section 3, we present the results of using VB for the translation probabilities only.

3 Results

First, we ran our modified version of GIZA++ on a simple test case designed to be similar to the example from Moore (2004) discussed in Section 1. Our test case, shown in Table 1, had three different sentence pairs; we included nine instances of the first, two instances of the second, and one of the third.

Table 1: An example of data with rare words.

    Sentence pair      Count
    e2 / f3              9
    e2 / f2              2
    e1 e2 / f1 f2        1

Human intuition tells us that f2 should translate to e2 and f1 should translate to e1.
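The garbage-collection effect on this corpus can be reproduced with a short, self-contained Model 1 EM loop. This is our own minimal sketch (uniform initialization, no NULL word, no VB), not the GIZA++ implementation:

```python
from collections import defaultdict

# The Table 1 corpus: (English sentence, French sentence) pairs.
corpus = ([(["e2"], ["f3"])] * 9
          + [(["e2"], ["f2"])] * 2
          + [(["e1", "e2"], ["f1", "f2"])])

# Uniform initialization of the translation table t(f | e).
f_vocab = {f for _, fs in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(20):
    # E step: expected counts of f aligning to e under the current t.
    count = defaultdict(float)
    total = defaultdict(float)
    for es, fs in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)  # normalizer over alignments of f
            for e in es:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    # M step: exact normalization, as in equation (1).
    t = defaultdict(float,
                    {(f, e): count[(f, e)] / total[e] for (f, e) in count})
```

After training, t(f2 | e1) ends up larger than t(f2 | e2), so the rare word e1 has "collected" f2, exactly the pathology described above.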
However, the EM algorithm without VB prefers e1 as the translation of f2, due to the garbage collection phenomenon described above. The EM algorithm with VB does not overfit this data and prefers e2 as f2's translation.

[Figure 1: Determining the best value of α for the translation probabilities. Training data is 10,000 sentence pairs from each language pair. VB is used for Model 1 only. The figure plots AER against α (from 1e-10 to 1) after training is complete (five iterations each of Models 1, HMM, 3, and 4), for Chinese and German, baseline vs. variational Bayes.]

[Figure 2: Effect of variational Bayes on overfitting for Model 1. Training data is 10,000 sentence pairs. The figure contrasts the test perplexities of Model 1 with and without variational Bayes after different numbers of training iterations. Variational Bayes successfully controls overfitting.]

For our experiments with bilingual data, we used three language pairs: French and English, Chinese and English, and German and English. We used Canadian Hansard data for French-English, Europarl data for German-English, and newswire data for Chinese-English. For measuring alignment error rate, we used 447 French-English sentences provided by Hermann Ney and Franz Och containing both sure and possible alignments, while for German-English we used 220 sentences provided by Chris Callison-Burch with sure alignments only, and for Chinese-English we used the first 400 sentences of the data provided by Yang Liu, also with sure alignments only. For computing BLEU scores, we used single reference datasets for French-English and German-English, and four references for Chinese-English.
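The sure/possible distinction just described feeds directly into the AER metric used throughout this section. A minimal sketch (the function and variable names are our own):

```python
def aer(hypothesis, sure, possible):
    """Alignment error rate (Och and Ney, 2000).
    Each argument is a set of (i, j) word-index pairs linking English
    position i to French position j; `possible` is taken here to also
    contain every sure link."""
    a, s = set(hypothesis), set(sure)
    p = set(possible) | s  # sure links are always also possible
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```

For the German-English and Chinese-English data, which have sure links only, P = S and AER reduces to one minus the F-measure of the hypothesis links against the gold links.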
For minimum error rate training, we used 1000 sentences for French-English, 2000 sentences for German-English, and 1274 sentences for Chinese-English. Our test sets contained 1000 sentences each for French-English and German-English, and 686 sentences for Chinese-English. For scoring the Viterbi alignments of each system against gold-standard annotated alignments, we use the alignment error rate (AER) of Och and Ney (2000), which measures agreement at the level of pairs of words.

We ran our code on ten thousand sentence pairs to determine the best value of α for the translation probabilities t(f | e). For our training, we ran GIZA++ for five iterations each of Model 1, the HMM, Model 3, and Model 4. Variational Bayes was only used for Model 1. Figure 1 shows how VB, and different values of α in particular, affect the performance of GIZA++ in terms of AER. We discover that, after all training is complete, VB improves the performance of the overall system, lowering AER (Figure 1) for all three language pairs. We find that low values of α cause the most consistent improvements, and so we use α = 0 for the translation probabilities in the remaining experiments. Note that, while a value of α = 0 does not define a probabilistically valid Dirichlet prior, it does not cause any practical problems in the update equation for VB. Figure 2 shows the test perplexity after GIZA++ has been run for twenty-five iterations of Model 1: without VB, the test perplexity increases as training continues, but it remains stable when VB is used. Thus, VB eliminates the need for the early stopping that is often employed with GIZA++.

After choosing 0 as the best value of α for the translation probabilities, we reran the test above (five iterations each of Models 1, HMM, 3, and 4, with VB turned on for Model 1) on different amounts of data. We found that the results for larger data sizes were comparable to the results for ten thousand sentence pairs, both with and without VB (Figure 3).

[Figure 3: Performance of GIZA++ on different amounts of training data. Variational Bayes is used for Model 1 only. The figure plots AER against corpus sizes from 10,000 to 90,000 sentence pairs, after all training has completed (five iterations each of Models 1, HMM, 3, and 4), for Chinese and German, baseline vs. variational Bayes.]

We then tested whether VB should be used for the later models. In all of these experiments, we ran Models 1, HMM, 3, and 4 for five iterations each, training on the same ten thousand sentence pairs that we used in the previous experiments. In Table 2, we show the performance of the system when no VB is used, when it is used for each of the four models individually, and when it is used for all four models simultaneously. We saw the most overall improvement when VB was used only for Model 1; using VB for all four models simultaneously caused the most improvement to the test perplexity, but at the cost of the AER.

Table 2: Effect of adding variational Bayes to specific models.

    AER          French   Chinese   German
    Baseline      0.14      0.42     0.43
    M1 Only       0.12      0.39     0.41
    HMM Only      0.14      0.42     0.42
    M3 Only       0.14      0.42     0.43
    M4 Only       0.14      0.42     0.43
    All Models    0.19      0.44     0.45

Table 3: BLEU scores.

    BLEU Score   French   Chinese   German
    Baseline      26.34    21.03     21.14
    M1 Only       26.54    21.58     21.73
    All Models    26.46    22.08     21.96

For the MT experiments, we ran GIZA++ through Moses, training Model 1, the HMM, and Model 4 on 100,000 sentence pairs from each language pair.
We ran three experiments: one with VB turned on for all models, one with VB turned on for Model 1 only, and one (the baseline) with VB turned off for all models. When VB was turned on, we ran GIZA++ for five iterations per model as in our earlier tests, but when VB was turned off, we ran GIZA++ for only four iterations per model, having determined that this was the optimal number of iterations for the baseline system. VB was used for the translation probabilities only, with α set to 0. As can be seen in Table 3, using VB increases the BLEU score for all three language pairs. For French, the best results were achieved when VB was used for Model 1 only; for Chinese and German, on the other hand, using VB for all models caused the most improvement. For French, the BLEU score increased by 0.20; for German, it increased by 0.82; for Chinese, it increased by 1.05. Overall, VB seems to have the greatest impact on the language pairs that are most difficult to align and translate to begin with.

4 Conclusion

We find that applying variational Bayes with a Dirichlet prior to the translation models implemented in GIZA++ improves alignments, both in terms of AER and the BLEU score of an end-to-end translation system. Variational Bayes is especially beneficial for IBM Model 1, because its lack of fertility and position information makes it particularly susceptible to the garbage collection phenomenon. Applying VB to Model 1 alone tends to improve the performance of later models in the training sequence. Model 1 is an essential stepping stone in avoiding local minima when training the following models, and improvements to Model 1 lead to improvements in the end-to-end system.

References

Matthew J. Beal. 2003. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, University College London.

Phil Blunsom, Trevor Cohn, and Miles Osborne. 2008. Bayesian synchronous grammar induction. In Neural Information Processing Systems (NIPS).

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL-05, pages 263-270, Ann Arbor, MI.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-21.

John DeNero, Alexandre Bouchard-Côté, and Dan Klein. 2008. Sampling alignment structure under a Bayesian translation model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 314-323, Honolulu, Hawaii, October.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Proceedings of NAACL-04, pages 273-280, Boston.

Mark Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 296-305, Prague, Czech Republic, June. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL, Demonstration Session, pages 177-180.

Coskun Mermer and Murat Saraclar. 2011. Bayesian word alignment for statistical machine translation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-11), pages 182-187.

Robert C. Moore. 2004. Improving IBM word alignment Model 1. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume, pages 518-525, Barcelona, Spain, July.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of ACL-00, pages 440-447, Hong Kong, October.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In COLING-96, pages 836-841.