
Machine Translation 09: Monolingual Data
Rico Sennrich, University of Edinburgh (MT 2018)

Refresher: why monolingual data?
- language models are an important component in statistical machine translation
- monolingual data is far more abundant than parallel data
- phrase-based SMT models suffer from independence assumptions; LMs can mitigate this
- monolingual data may better match the target domain

Outline
1 Language Models in NMT
2 Training an End-to-End NMT Model with Monolingual Data
3 "Unsupervised" MT from Monolingual Data

Language Models in NMT [Gülçehre et al., 2015]
- shallow fusion: rescore the beam with a language model (similar to ensembling)
- deep fusion: extra, LM-specific hidden layer; a gating controller regulates the magnitude of the LM signal
- the gate bias b_g is initialised to a small negative value, so the decoder draws on the LM only when it is deemed necessary; during fine-tuning of the deep-fusion model only the output parameters are updated
[Figure 1 of Gülçehre et al. (2015): graphical illustrations of (a) shallow fusion and (b) deep fusion.]

Results for De-En and Cs-En translation on the WMT 15 data (Table 4 of Gülçehre et al., 2015), BLEU:

                     De-En              Cs-En
  system           Dev    Test        Dev    Test
  NMT baseline     25.51  23.61       21.47  21.89
  shallow fusion   25.53  23.69       21.95  22.18
  deep fusion      25.88  24.00       22.49  22.36
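A minimal sketch of the shallow-fusion scoring rule, assuming the per-step NMT and LM output distributions are available as log-probability arrays; the function name and the weight beta are illustrative choices, not taken from the lecture or the paper:

```python
import numpy as np

def shallow_fusion_step(nmt_log_probs: np.ndarray,
                        lm_log_probs: np.ndarray,
                        beta: float = 0.2) -> np.ndarray:
    """Fused per-token scores for one decoding step.

    Beam candidates are ranked by the NMT log-probability plus a weighted
    LM log-probability instead of the NMT score alone.
    """
    return nmt_log_probs + beta * lm_log_probs

# toy usage over a 5-word vocabulary: the LM shifts the ranking
nmt = np.log(np.array([0.50, 0.30, 0.10, 0.05, 0.05]))
lm = np.log(np.array([0.05, 0.60, 0.15, 0.10, 0.10]))
print(shallow_fusion_step(nmt, lm, beta=0.5).argmax())
```

Deep fusion, by contrast, feeds the LM hidden state into the decoder through the learned gate, so the combination is trained rather than applied only at scoring time.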

Monolingual Data in NMT

NMT is a conditional language model:

    p(u_i) = f(z_i, u_{i-1}, c_i)

with decoder state z_i, previous target word u_{i-1}, and source context c_i.

Problem: for monolingual training instances, the source context c_i is missing.

Solutions: missing data imputation for c_i
- missing data indicator (set c_i to 0): works, but danger of catastrophic forgetting
- impute c_i with a neural network: we do this indirectly by back-translating the target sentence

Evaluation: English→German (BLEU)
  NMT, parallel only         23.6
  + missing data indicator   24.6
  + back-translation         26.5
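A sketch of the back-translation recipe, assuming a trained target→source model is available behind a hypothetical `translate_t2s` callable (nothing here corresponds to a specific toolkit):

```python
from typing import Callable, List, Tuple

def back_translate(mono_target: List[str],
                   translate_t2s: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Turn target-side monolingual sentences into synthetic (source, target) pairs."""
    return [(translate_t2s(t), t) for t in mono_target]

def build_training_data(parallel: List[Tuple[str, str]],
                        mono_target: List[str],
                        translate_t2s: Callable[[str], str]) -> List[Tuple[str, str]]:
    # The synthetic pairs are simply mixed with the genuine parallel data;
    # the source side is imputed, the target side is real text.
    return parallel + back_translate(mono_target, translate_t2s)

if __name__ == "__main__":
    dummy_t2s = lambda s: "<back-translation of: %s>" % s
    print(build_training_data([("hallo", "hello")], ["good morning"], dummy_t2s))
```

Because the target side of every synthetic pair is genuine text, the decoder is still trained on real in-domain language even when the imputed source is noisy.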

Back-Translation: Comparison to Phrase-Based SMT
- back-translation of monolingual data into synthetic parallel data has been proposed for phrase-based SMT [Schwenk, 2008, Bertoldi and Federico, 2009, Lambert et al., 2011]
- PBSMT already has an LM; the main rationale there is phrase-table domain adaptation
- the rationale in NMT: train the end-to-end model on monolingual data

Gains on English→German from adding back-translated News Crawl data (BLEU):

  system   WMT (in-domain)   IWSLT (out-of-domain)
  PBSMT        +0.7                +0.1
  NMT          +2.9                +1.2

Autoencoders
- general principle: train a network that encodes its input and learns to reconstruct the input from the encoded representation
- unsupervised representation learning
[Figure: encoder-decoder autoencoder reconstructing the sentence "john likes his cat".]

Autoencoders in Neural Machine Translation
- autoencoders are used via multi-task learning: shared models, multiple task-specific objectives
  - German→English (translation)
  - English→English (unsupervised)
  - German→German (unsupervised)
[Figure 4 of Luong et al. (2016): many-to-many setting with multiple encoders and multiple decoders; one translation task is combined with two unsupervised autoencoder tasks to exploit large monolingual corpora in both languages, with mixing ratios α_i determining how many parameter updates each task receives.]
- does the idea still work if we use an attention mechanism (far less of a representation bottleneck)?
- apparently, yes (for low-resource language pairs) [Currey et al., 2017]:

  source     Les Dissonances a apărut pe scena muzicală în 2004...
  reference  Les Dissonances appeared on the music scene in 2004...
  baseline   Les Dissonville appeared on the music scene in 2004...
  + copied   Les Dissonances appeared on the music scene in 2004...

- analysis: the BPE-based system gets better at copying unknown names

Dual Learning [He et al., 2016]
- dual-learning game: a closed loop of two translation systems
- translate a sentence from language A into language B and back
- loss functions:
  - is the sentence in language B natural? loss is the negative log-probability under a (static) LM
  - is the second translation similar to the original? loss is standard cross-entropy, with the original as reference
- use reinforcement learning to update the weights
- we can also start with a sentence in language B
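A sketch of one round of the dual-learning game, with hypothetical callables standing in for real translation and language models; the mixing weight alpha for combining the two rewards is an assumption, not taken from He et al. (2016):

```python
from typing import Callable, Tuple

def dual_learning_reward(sentence_a: str,
                         sample_ab: Callable[[str], str],
                         log_prob_ba: Callable[[str, str], float],
                         log_prob_lm_b: Callable[[str], float],
                         alpha: float = 0.5) -> Tuple[float, str]:
    """One round of the dual-learning game, starting from a language-A sentence."""
    sentence_b = sample_ab(sentence_a)            # stochastic translation A -> B
    r_lm = log_prob_lm_b(sentence_b)              # is the B sentence natural?
    r_rec = log_prob_ba(sentence_b, sentence_a)   # log p(A | B): reconstruction score
    reward = alpha * r_lm + (1.0 - alpha) * r_rec
    return reward, sentence_b
```

The reward is plugged into a REINFORCE-style gradient estimate to update both translation models, and the same game can be started from a language-B sentence.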

Parameter Pre-Training [Ramachandran et al., 2017]
- core idea: pre-train the encoder and decoder on a language modelling task
- models are then fine-tuned with the translation objective, along with continued use of the LM objective (with shared parameters)
[Figure 1 of Ramachandran et al. (2017): pretrained sequence-to-sequence model. The embedding and first RNN layer of the encoder are initialised from a source-side LM; those of the decoder, plus its softmax, from a target-side LM. All other parameters are randomly initialised.]
- the two language models are trained independently on large monolingual corpora (one source-side, one target-side)
- fine-tuning on labelled data alone can lead to catastrophic forgetting of the LM task and to overfitting on small labelled datasets; the pretrained parameters are therefore regularised by continuing to train with the monolingual language modelling losses, weighted equally with the seq2seq loss
- in the authors' ablation study, this regularisation is complementary to pretraining and important for high performance

Bilingual Lexicon Induction
- learn lexical correspondences from monolingual data
- correspondences are based on various types of similarity:
  - contextual similarity
  - temporal similarity
  - orthographic similarity
  - frequency similarity
- today we look at distributional word representations (contextual similarity)

Embedding Space Similarities Across Languages [Mikolov et al., 2013]
[Figure from Mikolov et al. (2013): word embedding spaces of different languages exhibit similar geometric structure.]
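As an illustration of the initialisation scheme, here is a PyTorch-flavoured sketch; the sub-module names (`embedding`, `rnn1`, `softmax`) are assumptions about how the models are organised, not the authors' code:

```python
def init_seq2seq_from_lms(seq2seq, src_lm, tgt_lm):
    """Copy pretrained LM weights into the corresponding seq2seq sub-modules.

    The embedding and first RNN layer of the encoder come from the source-side
    LM; those of the decoder, plus the output softmax, come from the target-side
    LM. All remaining parameters stay randomly initialised.
    """
    seq2seq.encoder.embedding.load_state_dict(src_lm.embedding.state_dict())
    seq2seq.encoder.rnn1.load_state_dict(src_lm.rnn1.state_dict())
    seq2seq.decoder.embedding.load_state_dict(tgt_lm.embedding.state_dict())
    seq2seq.decoder.rnn1.load_state_dict(tgt_lm.rnn1.state_dict())
    seq2seq.decoder.softmax.load_state_dict(tgt_lm.softmax.state_dict())

def joint_loss(loss_mt, loss_lm_src, loss_lm_tgt):
    # The seq2seq and LM losses are weighted equally, keeping the LM losses
    # as regularisers during fine-tuning.
    return loss_mt + loss_lm_src + loss_lm_tgt
```

Keeping the LM losses in the fine-tuning objective is what guards against the catastrophic forgetting mentioned above.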

Learning to Map Between Vector Spaces

Supervised mapping [Mikolov et al., 2013]
- we can learn a linear transformation between embedding spaces from a small dictionary
- given a linear transformation matrix W and vector representations x_i, y_i in the source and target language, the training objective (optimised with SGD) is

    argmin_W Σ_{i=1}^{n} ||W x_i − y_i||²

- training requires a small seed lexicon of (x, y) pairs
- after mapping, induce a bilingual lexicon via nearest-neighbour search

Unsupervised mapping [Miceli Barone, 2016, Conneau et al., 2017]
- adversarial training: co-train a classifier (adversary) that predicts whether an embedding represents a source- or target-language word
- objective of the linear, orthogonal transformation W: fool the classifier by making the mapped source embeddings and the target embeddings as similar as possible
[Figure 1 of Conneau et al. (2017): toy illustration. (A) Two embedding distributions, English words X and Italian words Y, to be aligned. (B) Adversarial learning finds a rotation W that roughly aligns them. (C) W is refined via Procrustes, using frequent words aligned in the previous step as anchor points. (D) Translation uses W together with the CSLS distance metric, which reduces hubness in dense regions of the space.]
- Xing et al. (2015) showed that enforcing an orthogonality constraint on W improves results; the objective then becomes the Procrustes problem, with the closed-form solution

    W* = argmin_{W ∈ O_d(ℝ)} ||W X − Y||_F = U V^T,  where U Σ V^T = SVD(Y X^T)

- the discriminator with parameters θ_D is trained with the loss

    L_D(θ_D | W) = −(1/n) Σ_{i=1}^{n} log P_{θ_D}(source = 1 | W x_i) − (1/m) Σ_{i=1}^{m} log P_{θ_D}(source = 0 | y_i)

  while W is trained to prevent the discriminator from making accurate predictions

Warning
- these are recent research results; open questions remain
- under what conditions will this method succeed or fail?
- the method was tested with typologically relatively similar languages
- the method was tested with similar monolingual data (same domains and genres)

"Unsupervised" MT from Monolingual Data [Lample et al., 2017]
- joint training of both translation directions
- use the translation model to back-translate monolingual data
- learn the encoder-decoder to reconstruct the original sentence from its noisy translation; iterate several times
- use various other tricks and objectives to improve learning:
  - pre-trained embeddings
  - denoising autoencoder as an additional objective
  - shared encoder / decoder parameters in both directions
  - adversarial objective

  system                                en-fr   en-de  (BLEU)
  supervised                            28.0    21.3
  word-by-word [Conneau et al., 2017]    6.3     7.1
  [Lample et al., 2017]                 15.1     9.6
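A numpy sketch of the mapping with the orthogonality constraint: X and Y hold the seed-lexicon embeddings as rows (source and target respectively), and W is the Procrustes solution quoted above; the nearest-neighbour lookup is the plain cosine version, not CSLS, and all names are illustrative:

```python
import numpy as np

def procrustes_mapping(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal W minimising ||W x_i - y_i|| over the seed lexicon.

    X, Y: (n, d) arrays whose i-th rows are translations of each other.
    """
    U, _, Vt = np.linalg.svd(Y.T @ X)   # SVD of Y^T X, shape (d, d)
    return U @ Vt

def nearest_neighbor(x: np.ndarray, W: np.ndarray, Y_vocab: np.ndarray) -> int:
    """Index of the target word whose embedding is closest (cosine) to W x."""
    mapped = W @ x
    sims = Y_vocab @ mapped / (np.linalg.norm(Y_vocab, axis=1)
                               * np.linalg.norm(mapped) + 1e-9)
    return int(np.argmax(sims))
```

Conneau et al. (2017) obtain W without any seed lexicon via adversarial training followed by Procrustes refinement, and replace the cosine retrieval above with CSLS to reduce hubness.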

Conclusion
- there are various ways to learn from monolingual data:
  - combination with a language model
  - pre-training and parameter sharing
  - creating synthetic training data
- the methods are especially useful when:
  - parallel data is sparse
  - monolingual data is highly relevant (in-domain)
- hot research topic: learning to translate without parallel data

Bibliography

Bertoldi, N. and Federico, M. (2009). Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation (StatMT '09). Association for Computational Linguistics.

Conneau, A., Lample, G., Ranzato, M., Denoyer, L., and Jégou, H. (2017). Word Translation Without Parallel Data. CoRR, abs/1710.04087.

Currey, A., Miceli Barone, A. V., and Heafield, K. (2017). Copied Monolingual Data Improves Low-Resource Neural Machine Translation. In Proceedings of the Second Conference on Machine Translation, pages 148-156, Copenhagen, Denmark. Association for Computational Linguistics.

Gülçehre, Ç., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H., Bougares, F., Schwenk, H., and Bengio, Y. (2015). On Using Monolingual Corpora in Neural Machine Translation. CoRR, abs/1503.03535.

He, D., Xia, Y., Qin, T., Wang, L., Yu, N., Liu, T., and Ma, W.-Y. (2016). Dual Learning for Machine Translation. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems 29, pages 820-828. Curran Associates, Inc.

Lambert, P., Schwenk, H., Servan, C., and Abdul-Rauf, S. (2011). Investigations on Translation Model Adaptation Using Monolingual Data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 284-293, Edinburgh, Scotland. Association for Computational Linguistics.

Lample, G., Denoyer, L., and Ranzato, M. (2017). Unsupervised Machine Translation Using Monolingual Corpora Only. CoRR, abs/1711.00043.

Luong, M., Le, Q. V., Sutskever, I., Vinyals, O., and Kaiser, L. (2016). Multi-task Sequence to Sequence Learning. In ICLR 2016.

Miceli Barone, A. V. (2016). Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 121-126, Berlin, Germany. Association for Computational Linguistics.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation. CoRR, abs/1309.4168.

Ramachandran, P., Liu, P., and Le, Q. (2017). Unsupervised Pretraining for Sequence to Sequence Learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 383-391, Copenhagen, Denmark. Association for Computational Linguistics.

Schwenk, H. (2008). Investigations on Large-Scale Lightly-Supervised Training for Statistical Machine Translation. In International Workshop on Spoken Language Translation, pages 182-189.