Machine Translation 09: Monolingual Data
Rico Sennrich, University of Edinburgh


Refresher: why monolingual data?
- language models are an important component in statistical machine translation
- monolingual data is far more abundant than parallel data
- phrase-based SMT models suffer from independence assumptions; LMs can mitigate this
- monolingual data may better match the target domain

Outline
1. Language Models in NMT
2. Training End-to-End NMT Model with Monolingual Data
3. "Unsupervised" MT from Monolingual Data

Language Models in NMT [Gülçehre et al., 2015]
- shallow fusion: rescore the beam with a language model (similar to ensembling)
- deep fusion: extra, LM-specific hidden layer

Results (BLEU) for De-En and Cs-En translation [Gülçehre et al., 2015, Table 4]:

                  De-En           Cs-En
system            Dev    Test     Dev    Test
NMT Baseline      25.51  23.61    21.47  21.89
Shallow Fusion    25.53  23.69    21.95  22.18
Deep Fusion       25.88  24.00    22.49  22.36
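As a rough illustration of shallow fusion, the sketch below rescores beam hypotheses by interpolating the NMT log-probability with an LM log-probability. The interpolation weight and the toy scoring functions are assumptions for illustration, not the exact setup of Gülçehre et al.

```python
# Minimal sketch of shallow fusion: rescore beam hypotheses with an external LM.
# The scoring functions below are toy stand-ins, not a real NMT model or LM.

def nmt_log_prob(hypothesis: str) -> float:
    # Stand-in for log p_NMT(y | x); a real system scores with the translation model.
    return -0.1 * len(hypothesis.split())

def lm_log_prob(hypothesis: str) -> float:
    # Stand-in for log p_LM(y); a real system uses an LM trained on monolingual data.
    return -0.05 * len(hypothesis.split())

def shallow_fusion_rescore(beam, lm_weight=0.3):
    """Re-rank beam hypotheses by log p_NMT(y|x) + lm_weight * log p_LM(y)."""
    scored = [(nmt_log_prob(h) + lm_weight * lm_log_prob(h), h) for h in beam]
    return [h for _, h in sorted(scored, reverse=True)]

beam = ["das ist ein Test", "das ist Test ein", "dies ist ein kleiner Test"]
print(shallow_fusion_rescore(beam))
```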

2. Training End-to-End NMT Model with Monolingual Data

Monolingual Data in NMT
- NMT is a conditional language model: $p(u_i) = f(z_i, u_{i-1}, c_i)$
- problem: for monolingual training instances, the source context $c_i$ is missing

Monolingual Training Instances
solutions: missing data imputation for $c_i$
- missing data indicator: use 0 for $c_i$; works, but danger of catastrophic forgetting
- impute $c_i$ with a neural network: we do this indirectly by back-translating the target sentence
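A minimal sketch of the back-translation idea follows: target-language monolingual sentences are translated into the source language with a reverse-direction model, and the resulting synthetic pairs are mixed with the real parallel data. The `translate_target_to_source` argument is a hypothetical stand-in for a trained target-to-source NMT system, and the 1:1 mixing ratio is an illustrative choice.

```python
# Sketch of creating synthetic parallel data via back-translation (assumption:
# a trained target->source model is available behind translate_target_to_source).
from typing import Callable, List, Tuple

def back_translate(
    target_monolingual: List[str],
    translate_target_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (synthetic_source, real_target) pairs."""
    return [(translate_target_to_source(t), t) for t in target_monolingual]

def build_training_data(parallel, synthetic, synthetic_ratio=1.0):
    """Mix real and synthetic pairs; the ratio here is an illustrative choice."""
    n_synth = int(len(parallel) * synthetic_ratio)
    return parallel + synthetic[:n_synth]

# Toy usage with a dummy reverse model (identity translation, for illustration only).
mono = ["das ist ein Beispiel", "mehr einsprachige Daten"]
synthetic = back_translate(mono, lambda s: s)  # real systems use a trained DE->EN model here
parallel = [("this is an example", "das ist ein Beispiel")]
print(build_training_data(parallel, synthetic))
```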

Evaluation: English-German (BLEU)
- NMT trained on parallel data only: 23.6
- + missing data indicator: 24.6
- + back-translation: 26.5

Back-Translation: Comparison to Phrase-based SMT
- back-translation of monolingual data has been proposed for phrase-based SMT [Schwenk, 2008, Bertoldi and Federico, 2009, Lambert et al., 2011]
- PBSMT already has an LM; main rationale there: phrase-table domain adaptation
- rationale in NMT: train the end-to-end model on monolingual data

Table: BLEU gains on English-German from adding back-translated News Crawl data.

system       WMT (in-domain)   IWSLT (out-of-domain)
PBSMT gain   +0.7              +0.1
NMT gain     +2.9              +1.2

Autoencoders
- general principle: train a network that encodes the input, and learns to reconstruct the input from the encoded representation
- unsupervised representation learning

[Figure: sequence autoencoder. The encoder reads "john likes his cat" into hidden states; the decoder reconstructs "john likes his cat" from them.]

Autoencoders in Neural Machine Translation
- autoencoders are used via multi-task learning: shared models, multiple task-specific objectives
- [Luong et al., 2016]: many-to-many setting with multiple encoders and multiple decoders; a single translation task plus two unsupervised autoencoder tasks (one per language) to exploit large monolingual corpora on both sides
- does the idea still work if we use an attention mechanism? (far less of a representation bottleneck)
- apparently, yes, for low-resource language pairs [Currey et al., 2017]:

source      Les Dissonances a aparut pe scena muzicala în 2004...
reference   Les Dissonances appeared on the music scene in 2004...
baseline    Les Dissonville appeared on the music scene in 2004...
+ copied    Les Dissonances appeared on the music scene in 2004...

analysis: the BPE-based system gets better at copying unknown names
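The copied-data idea of Currey et al. (2017) can be sketched very simply: target-language monolingual sentences are added to the training data with the target sentence copied to the source side, so the model also learns an identity, autoencoder-like task. The helper below is an illustrative sketch, not the authors' exact preprocessing.

```python
# Sketch: augment parallel data with "copied" monolingual data [Currey et al., 2017].
# Each monolingual target sentence becomes a (target, target) training pair.
from typing import List, Tuple

def add_copied_data(
    parallel: List[Tuple[str, str]],
    target_monolingual: List[str],
) -> List[Tuple[str, str]]:
    copied = [(t, t) for t in target_monolingual]  # identical source and target side
    return parallel + copied

parallel = [("this is an example", "das ist ein Beispiel")]
mono_de = ["Les Dissonances erschien 2004 in der Musikszene"]
print(add_copied_data(parallel, mono_de))
```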

Dual Learning [He et al., 2016]
dual-learning game: closed loop of two translation systems
- translate a sentence from language A into language B and back
- loss functions:
  - is the sentence in language B natural? loss is the negative log-probability under a (static) LM
  - is the second translation similar to the original? loss is standard cross-entropy, with the original as reference
- use reinforcement learning to update the weights
- we can also start with a sentence in language B
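A rough sketch of the dual-learning reward follows: for an original sentence, an intermediate translation is scored by a language model (naturalness) and by the reverse model's reconstruction likelihood, and the combined reward would drive a policy-gradient update. The scoring functions here are toy stand-ins and the interpolation weight `alpha` is an assumption, not the paper's exact setting.

```python
# Sketch of the dual-learning reward [He et al., 2016]; toy scorers, not real models.

def lm_log_prob(sentence_b: str) -> float:
    # Stand-in for log p_LM(s_B): how natural is the intermediate translation?
    return -0.05 * len(sentence_b.split())

def reconstruction_log_prob(original_a: str, sentence_b: str) -> float:
    # Stand-in for log p_{B->A}(s_A | s_B): how well can we translate back to the original?
    return -0.1 * abs(len(original_a.split()) - len(sentence_b.split()))

def dual_learning_reward(original_a: str, intermediate_b: str, alpha: float = 0.5) -> float:
    """Combine language-model reward and reconstruction reward (alpha is illustrative)."""
    r_lm = lm_log_prob(intermediate_b)
    r_rec = reconstruction_log_prob(original_a, intermediate_b)
    return alpha * r_lm + (1.0 - alpha) * r_rec

# In training, this reward would feed a policy-gradient (REINFORCE-style) update of the
# A->B model; the closed loop can equally be started from a sentence in language B.
print(dual_learning_reward("machine translation is fun", "maschinelle Übersetzung macht Spaß"))
```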

Parameter Pre-Training [Ramachandran et al., 2017]
- core idea: pre-train the encoder and decoder on a language modelling task; two monolingual datasets are collected, one for the source-side language and one for the target-side language, and a language model is trained on each
- models are then fine-tuned with the translation objective, along with continued use of the LM objective (with shared parameters)

[Figure: pretrained sequence-to-sequence model. Parameters in shaded boxes are pretrained, either from the source-side or the target-side language model; the remaining parameters are randomly initialized.]
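As a minimal sketch of this initialization step, the snippet below copies parameters from two hypothetical language-model checkpoints (source-side and target-side) into a translation model's parameter dictionary before fine-tuning. The parameter names and dictionary layout are assumptions for illustration, not the authors' implementation.

```python
# Sketch: initialize encoder/decoder parameters from pretrained LMs (names are illustrative).
from typing import Dict, List

def init_from_lms(
    nmt_params: Dict[str, List[float]],
    src_lm_params: Dict[str, List[float]],
    tgt_lm_params: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Copy LM parameters into the NMT parameter dict; unmatched parameters stay random."""
    initialized = dict(nmt_params)
    for name, value in src_lm_params.items():
        key = "encoder." + name
        if key in initialized:
            initialized[key] = value
    for name, value in tgt_lm_params.items():
        key = "decoder." + name
        if key in initialized:
            initialized[key] = value
    return initialized

nmt = {"encoder.embedding": [0.0], "decoder.embedding": [0.0], "decoder.output": [0.0]}
src_lm = {"embedding": [1.0]}
tgt_lm = {"embedding": [2.0], "output": [3.0]}
print(init_from_lms(nmt, src_lm, tgt_lm))
```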

3. "Unsupervised" MT from Monolingual Data

Bilingual Lexicon Induction
- learn lexical correspondences from monolingual data
- correspondences are based on various types of similarity:
  - contextual similarity
  - temporal similarity
  - orthographic similarity
  - frequency similarity
- today we look at distributional word representations (contextual similarity)

Embedding Space Similarities Across Languages [Mikolov et al., 2013]

Learning to Map Between Vector Spaces
supervised mapping [Mikolov et al., 2013]
- we can learn a linear transformation between embedding spaces with a small dictionary
- given a linear transformation matrix $W$ and two vector representations $x_i$, $y_i$ in the source and target language, the training objective (optimized with SGD) is:

$$\arg\min_W \sum_{i=1}^{n} \|W x_i - y_i\|^2$$

- training requires a small seed lexicon of $(x, y)$ pairs
- after mapping, induce a bilingual lexicon via nearest-neighbor search
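The sketch below learns such a linear mapping with plain SGD on a toy seed lexicon and then induces translations by cosine nearest-neighbor search. The embedding dimension, learning rate, and random data are illustrative assumptions.

```python
# Sketch: learn W minimizing sum_i ||W x_i - y_i||^2 by SGD, then nearest-neighbor lookup.
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # toy embedding dimension
X = rng.normal(size=(100, d))             # source-language embeddings (seed lexicon)
true_W = rng.normal(size=(d, d))
Y = X @ true_W.T + 0.01 * rng.normal(size=(100, d))  # corresponding target embeddings

W = np.zeros((d, d))
lr = 0.01
for epoch in range(200):
    for x, y in zip(X, Y):
        grad = 2 * np.outer(W @ x - y, x)  # gradient of ||W x - y||^2 w.r.t. W
        W -= lr * grad

def nearest_neighbor(x_vec, target_matrix, W):
    """Index of the target embedding closest (cosine) to the mapped source vector."""
    mapped = W @ x_vec
    sims = target_matrix @ mapped / (
        np.linalg.norm(target_matrix, axis=1) * np.linalg.norm(mapped) + 1e-9
    )
    return int(np.argmax(sims))

print(nearest_neighbor(X[0], Y, W))  # ideally 0: the seed pair maps onto itself
```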

Learning to Map Between Vector Spaces
unsupervised mapping [Miceli Barone, 2016, Conneau et al., 2017]
- adversarial training: co-train a classifier (adversary) that predicts whether an embedding represents a source-language or a target-language word
- objective of the linear, orthogonal transformation: fool the classifier by making the embeddings as similar as possible

[Figure 1 from Conneau et al., 2017: (A) two distributions of word embeddings, English words (X) and Italian words (Y), to be aligned/translated; dot size is proportional to word frequency. (B) Adversarial learning finds a rotation matrix W that roughly aligns the two distributions. (C) The mapping W is further refined via Procrustes, using frequent words aligned by the previous step as anchor points. (D) The refined mapping W plus a distance metric are used to translate.]
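The Procrustes refinement mentioned in the figure has a closed-form solution: given anchor pairs stacked as row matrices X and Y, the best orthogonal W minimizing the mapping error is obtained from an SVD. The sketch below shows this step only (the adversarial stage is omitted), with toy random data as an assumption.

```python
# Sketch: Procrustes refinement of an embedding mapping (adversarial stage omitted).
# X and Y hold anchor-pair embeddings (rows are words); the data here is toy/random.
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Closed-form orthogonal W minimizing sum_i ||W x_i - y_i||^2."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(200, d))                   # "source" anchor embeddings
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # hidden orthogonal rotation
Y = X @ Q.T + 0.01 * rng.normal(size=(200, d))  # "target" anchors = rotated + noise

W = procrustes(X, Y)
print(round(float(np.linalg.norm(W - Q) / np.linalg.norm(Q)), 4))        # ~0: rotation recovered
print(round(float(np.linalg.norm(X @ W.T - Y) / np.linalg.norm(Y)), 4))  # small mapping error
```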

Learning to Map Between Vector Spaces
warning: these are recent research results, and open questions remain
- under what conditions will this method succeed / fail?
- the method was tested with typologically relatively similar languages
- the method was tested with similar monolingual data (same domains and genres)

Improving Word Order [Lample et al., 2017]
- joint training of both translation directions
- use the translation model to back-translate monolingual data
- learn the encoder-decoder to reconstruct the original sentence from the noisy translation
- iterate several times
- use various other tricks and objectives to improve learning:
  - pre-trained embeddings
  - denoising autoencoder as additional objective
  - shared encoder / decoder parameters in both directions
  - adversarial objective

BLEU:
system                                en-fr   en-de
supervised                            28.0    21.3
word-by-word [Conneau et al., 2017]    6.3     7.1
[Lample et al., 2017]                 15.1     9.6
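The denoising objective relies on a simple noise model. A common choice, sketched below as an assumption rather than the exact recipe of Lample et al., drops some words and locally shuffles the rest; the model is then trained to reconstruct the clean sentence from this corrupted input.

```python
# Sketch of a noise function for the denoising autoencoder objective:
# random word dropout plus limited local shuffling (parameters are illustrative).
import random

def add_noise(sentence: str, drop_prob: float = 0.1, shuffle_window: int = 3) -> str:
    words = sentence.split()
    # 1) drop each word with probability drop_prob (but keep at least one word)
    kept = [w for w in words if random.random() > drop_prob] or words[:1]
    # 2) local shuffle: each word may move at most shuffle_window - 1 positions
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(kept))]
    shuffled = [w for _, w in sorted(zip(keys, kept))]
    return " ".join(shuffled)

random.seed(1)
clean = "the cat sat on the mat"
noisy = add_noise(clean)
print(noisy)  # training pair: reconstruct `clean` from `noisy`
```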

Conclusion
- there are various ways to learn from monolingual data:
  - combination with a language model
  - pre-training and parameter sharing
  - creating synthetic training data
- methods are especially useful when:
  - parallel data is sparse
  - monolingual data is highly relevant (in-domain)
- hot research topic: learning to translate without parallel data

Bibliography I

Bertoldi, N. and Federico, M. (2009). Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation, StatMT '09. Association for Computational Linguistics.

Conneau, A., Lample, G., Ranzato, M., Denoyer, L., and Jégou, H. (2017). Word Translation Without Parallel Data. CoRR, abs/1710.04087.

Currey, A., Miceli Barone, A. V., and Heafield, K. (2017). Copied Monolingual Data Improves Low-Resource Neural Machine Translation. In Proceedings of the Second Conference on Machine Translation, pages 148-156, Copenhagen, Denmark. Association for Computational Linguistics.

Gülçehre, Ç., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H., Bougares, F., Schwenk, H., and Bengio, Y. (2015). On Using Monolingual Corpora in Neural Machine Translation. CoRR, abs/1503.03535.

He, D., Xia, Y., Qin, T., Wang, L., Yu, N., Liu, T., and Ma, W.-Y. (2016). Dual Learning for Machine Translation. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems 29, pages 820-828. Curran Associates, Inc.

Lambert, P., Schwenk, H., Servan, C., and Abdul-Rauf, S. (2011). Investigations on Translation Model Adaptation Using Monolingual Data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 284-293, Edinburgh, Scotland. Association for Computational Linguistics.

Bibliography II

Lample, G., Denoyer, L., and Ranzato, M. (2017). Unsupervised Machine Translation Using Monolingual Corpora Only. CoRR, abs/1711.00043.

Luong, M., Le, Q. V., Sutskever, I., Vinyals, O., and Kaiser, L. (2016). Multi-task Sequence to Sequence Learning. In ICLR 2016.

Miceli Barone, A. V. (2016). Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 121-126, Berlin, Germany. Association for Computational Linguistics.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation. CoRR, abs/1309.4168.

Ramachandran, P., Liu, P., and Le, Q. (2017). Unsupervised Pretraining for Sequence to Sequence Learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 383-391, Copenhagen, Denmark. Association for Computational Linguistics.

Schwenk, H. (2008). Investigations on Large-Scale Lightly-Supervised Training for Statistical Machine Translation. In International Workshop on Spoken Language Translation, pages 182-189.