Institute of Computational Linguistics
Contrastive Evaluation of Larger-context Neural Machine Translation
Kolloquium Talk 2018
Mathias Müller, 4/10/18
Larger-context neural machine translation
Why larger context?
Source: However, the European Central Bank (ECB) took an interest in it in a report on virtual currencies published in October. It describes bitcoin as "the most successful virtual currency, [...]".
Target: Dennoch hat die Europäische Zentralbank (EZB) in einem im Oktober veröffentlichten Bericht über virtuelle Währungen Interesse hierfür gezeigt. Sie beschreibt Bitcoin als "die virtuelle Währung mit dem größten Erfolg [...]".
(example taken from newstest2013.{de,en})
Why larger context?
(figure slide)
Why larger context?
Source: It describes bitcoin as "the most successful virtual currency".
Target: Es beschreibt den Bitcoin als "die erfolgreichste virtuelle Währung".
Translated in isolation, the system cannot tell that "It" refers to the ECB (feminine in German), so it produces "Es" instead of the correct "Sie".
How to incorporate larger context?
Open question; preliminary works:
- gated auxiliary context or warm-start decoder initialization with a document summary (Wang et al., 2017)
- additional encoder and attention network for the previous source sentence (Jean et al., 2017)
- concatenating the previous source sentence, marked with a prefix (Tiedemann and Scherrer, 2017)
- both source and target context (Miculicich Werlen et al., submitted)
- hierarchical attention, among other solutions (Bawden et al., submitted)
Additional encoder and attention network
- on top of Nematus (Sennrich et al., 2017), which follows standard practice: an encoder-decoder framework with attention (Bahdanau et al., 2014); the framework is summarized in the equations after this list
- encoder and decoder are gated recurrent units (GRUs), a variant of RNNs
- the decoder is a GRU conditioned on the source sentence; the source-sentence context is in turn generated by the encoder and modulated by attention
- we also condition on preceding sentences, with additional encoders and separate attention networks
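As a reminder of the underlying framework (following Bahdanau et al., 2014; notation slightly simplified): at decoding step i, the model predicts the next target word from the previous word, the decoder state s_i and a context vector c_i obtained by attention over the encoder annotations h_1, ..., h_{T_x}:

    p(y_i \mid y_{<i}, x) = g(y_{i-1}, s_i, c_i)
    c_i = \sum_j \alpha_{ij} h_j
    \alpha_{ij} = \mathrm{softmax}_j(e_{ij}), \qquad e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)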
Recurrent neural networks refresher
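For reference, a plain recurrent unit updates its hidden state with a single nonlinearity,

    h_t = \tanh(W x_t + U h_{t-1} + b),

which makes long-range dependencies hard to learn; the GRU on the next slide replaces this single update with a gated interpolation between the old state and a candidate state.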
RNN variant: gated recurrent unit (GRU)
[Figure taken from Chung et al. (2014)]
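One standard formulation of the GRU shown in the figure (biases omitted; some papers swap the roles of z_t and 1 - z_t):

    z_t = \sigma(W_z x_t + U_z h_{t-1})                      (update gate)
    r_t = \sigma(W_r x_t + U_r h_{t-1})                      (reset gate)
    \tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))       (candidate state)
    h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t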
Conditional gated recurrent unit (cgru)
Detailed formulas: https://github.com/nyu-dl/dl4mt-tutorial/blob/master/docs/cgru.pdf
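Paraphrasing the linked notes: the cgru couples two GRU transitions through an attention step, so that for target position j

    s_j' = \mathrm{GRU}_1(y_{j-1}, s_{j-1})
    c_j  = \mathrm{ATT}(C, s_j') = \sum_i \alpha_{ij} h_i
    s_j  = \mathrm{GRU}_2(c_j, s_j')

where C = (h_1, ..., h_{T_x}) are the annotation vectors of the current source sentence produced by the encoder.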
Extension of cgru for n contexts
Detailed formulas: https://github.com/bricksdont/ncgru/blob/master/ct.pdf
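The exact formulation is in the linked document; as a rough sketch consistent with the description on the next slide, one can think of one additional attention and GRU transition step per context C^{(k)}:

    s_j^{(0)} = \mathrm{GRU}_1(y_{j-1}, s_{j-1})
    c_j^{(k)} = \mathrm{ATT}_k(C^{(k)}, s_j^{(k-1)}), \qquad s_j^{(k)} = \mathrm{GRU}_{k+1}(c_j^{(k)}, s_j^{(k-1)}), \qquad k = 1, ..., n
    s_j = s_j^{(n)}

with C^{(1)} the annotations of the current source sentence (so n = 1 recovers the standard cgru) and C^{(2)}, ..., C^{(n)} the annotations of additional contexts such as the previous source sentence.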
How to incorporate larger context?
Additional encoder and attention networks for previous context (Jean et al., 2017) in Nematus
- Technically: an extension of deep transition (Pascanu et al., 2013) with additional GRU steps that attend to contexts other than the current source sentence
- Intuitively: while generating the next word, the decoder has access to the previous source or target sentence
- Multiple encoders share most of their parameters because embedding matrices are tied (Press and Wolf, 2016)
(a toy sketch of one such decoder step follows)
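To make the deep-transition picture concrete, here is a toy numpy sketch of a single decoder step with one additional context. It only illustrates the idea under the assumptions sketched above and is not the Nematus implementation; all names, dimensions and initialisations are made up for the example.

    import numpy as np

    d = 4  # toy hidden size

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def make_gru(rng, d_in, d_h):
        # A GRU transition with randomly initialised weights (biases omitted).
        W = rng.normal(scale=0.1, size=(3, d_h, d_in))
        U = rng.normal(scale=0.1, size=(3, d_h, d_h))
        def step(x, h):
            z = sigmoid(W[0] @ x + U[0] @ h)               # update gate
            r = sigmoid(W[1] @ x + U[1] @ h)               # reset gate
            h_tilde = np.tanh(W[2] @ x + U[2] @ (r * h))   # candidate state
            return (1.0 - z) * h + z * h_tilde
        return step

    def make_attention(rng, d_h):
        # Bahdanau-style additive attention over an annotation matrix H (n x d).
        Wa, Ua, va = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h)), rng.normal(size=d_h)
        def attend(H, s):
            e = np.tanh(H @ Wa.T + s @ Ua.T) @ va          # one energy per annotation
            a = np.exp(e - e.max()); a /= a.sum()          # softmax
            return a @ H                                   # context vector: weighted sum
        return attend

    rng = np.random.default_rng(0)
    gru1, gru2, gru3 = (make_gru(rng, d, d) for _ in range(3))
    att_src, att_prev = make_attention(rng, d), make_attention(rng, d)  # separate attention networks

    def decoder_step(y_prev, s_prev, src_annotations, prev_annotations):
        s1 = gru1(y_prev, s_prev)                # GRU_1
        c_src = att_src(src_annotations, s1)     # attention over the current source sentence
        s2 = gru2(c_src, s1)                     # GRU_2: the standard cgru ends here
        c_prev = att_prev(prev_annotations, s2)  # attention over the previous sentence
        return gru3(c_prev, s2)                  # extra transition step for the extra context

    # Toy usage: 5 positions in the current source sentence, 3 in the previous one.
    s_new = decoder_step(np.zeros(d), np.zeros(d),
                         rng.normal(size=(5, d)), rng.normal(size=(3, d)))
    print(s_new.shape)  # (4,)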
Actual systems we have trained
- Nematus systems with standard parameters, similar to Edinburgh's WMT 17 submissions
- English to German (why?)
- Training data from WMT 17
1) Baseline system without additional context
2) + source context: 1 previous source sentence, if any
3) + target context: 1 previous target sentence, if any
How to evaluate larger-context systems?
Need: evaluation that focuses on specific linguistic phenomena, i.e. a challenge set for contrastive evaluation (a sketch of the scoring procedure follows below).
Source: Despite the fact that it is a part of China, Hong Kong determines its currency policy separately.
Target: Hongkong bestimmt, obwohl es zu China gehört, seine Währungspolitik selbst.
Contrastive: Hongkong bestimmt, obwohl er zu China gehört, seine Währungspolitik selbst.
(example taken from newstest2009)
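The scoring idea behind such a challenge set, as a minimal sketch rather than the actual evaluation script (`score` is a hypothetical stand-in for the sentence-level log-probability the NMT model assigns to a translation):

    def contrastive_accuracy(examples, score):
        # examples: iterable of (source, reference, contrastive) triples.
        # An example counts as correct if the model scores the reference
        # higher than the minimally changed contrastive variant.
        correct = total = 0
        for source, reference, contrastive in examples:
            total += 1
            if score(source, reference) > score(source, contrastive):
                correct += 1
        return correct / total

No translation is produced; the trained model is only used as a scorer, which isolates the phenomenon of interest (here, pronoun choice) from overall translation quality.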
How to evaluate larger-context systems?
- Previous work with manually constructed sets: Guillou and Hardmeier (2016); Isabelle et al. (2017); Bawden et al. (submitted)
- Larger-scale automatic sets: Sennrich (2017); Rios et al. (2017); Burlot and Yvon (2017); ours
Our test set of contrastive examples
- Sources: WMT, CS Corpus, OpenSubtitles
- Good candidates extracted automatically after linguistic processing (parsing, coreference resolution); focused on personal pronouns
- Roughly 600k examples
(the substitution step is sketched below)
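Purely as an illustration of the final substitution step (the real pipeline relies on parsing and coreference resolution to find good candidates; casing and agreement handling are omitted here, and the function name is made up):

    # Given tokenized German target text and the index of the translated pronoun,
    # yield contrastive variants with the wrong pronouns substituted.
    ALTERNATIVES = {"er": ("sie", "es"), "sie": ("er", "es"), "es": ("er", "sie")}

    def contrastive_variants(target_tokens, pronoun_index):
        pronoun = target_tokens[pronoun_index].lower()
        for wrong in ALTERNATIVES.get(pronoun, ()):
            variant = list(target_tokens)
            variant[pronoun_index] = wrong
            yield " ".join(variant)

    print(list(contrastive_variants("sie beschreibt Bitcoin".split(), 0)))
    # ['er beschreibt Bitcoin', 'es beschreibt Bitcoin']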
Results: BLEU

System    newstest2015 (dev)    newstest2017 (test)
Baseline  24.80                 23.02
C10       22.68                 21.47
C11       24.48                 22.38
Contrastive scores where the EN pronoun is "it"

                     Baseline  C10   C11
Overall performance  0.44      0.47  0.64

By German pronoun:
          Baseline  C10   C11
it : er   0.18      0.27  0.50
it : es   0.84      0.76  0.83
it : sie  0.30      0.39  0.62
Contrastive scores where the EN pronoun is "it"

                Baseline  C10   C11
intrasegmental  0.61      0.60  0.67
extrasegmental  0.41      0.45  0.64

By distance (in sentences; 0 = intrasegmental):
Distance  Baseline  C10   C11
0         0.61      0.60  0.67
1         0.36      0.43  0.64
2         0.46      0.43  0.58
3         0.53      0.53  0.66
3+        0.67      0.56  0.76
Current activities
Last steps for the contrastive evaluation experiments:
- publish our resource and work at WMT 18
Ongoing work:
- inductive biases of fully convolutional (Gehring et al., 2017) or self-attention ("Transformer") models (Vaswani et al., 2017); collaboration with Edinburgh
- low-resource experiments with Romansh: pretraining Transformer models with self-attentional language models (adaptation of Ramachandran et al., 2017)
Thanks!
Code currently here: https://gitlab.cl.uzh.ch/mt/nematus-context2
Bibliography
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
Bawden, Rachel, et al. "Evaluating Discourse Phenomena in Neural Machine Translation." (Submitted to NAACL 2018)
Burlot, Franck, and François Yvon. "Evaluating the morphological competence of Machine Translation Systems." Proceedings of the Second Conference on Machine Translation. 2017.
Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).
Gehring, Jonas, et al. "Convolutional sequence to sequence learning." arXiv preprint arXiv:1705.03122 (2017).
Guillou, Liane, and Christian Hardmeier. "PROTEST: A Test Suite for Evaluating Pronouns in Machine Translation." Proceedings of LREC. 2016.
Isabelle, Pierre, Colin Cherry, and George Foster. "A Challenge Set Approach to Evaluating Machine Translation." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.
Jean, Sébastien, et al. "Does Neural Machine Translation Benefit from Larger Context?" arXiv preprint arXiv:1704.05135 (2017).
Miculicich Werlen, Lesly, et al. "Self-Attentive Residual Decoder for Neural Machine Translation." (Submitted to NAACL 2018)
Pascanu, Razvan, et al. "How to construct deep recurrent neural networks." Proceedings of the Second International Conference on Learning Representations (ICLR 2014).
Press, Ofir, and Lior Wolf. "Using the Output Embedding to Improve Language Models." Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. 2017.
Ramachandran, Prajit, Peter Liu, and Quoc Le. "Unsupervised Pretraining for Sequence to Sequence Learning." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.
Rikters, Matīss, Mark Fishel, and Ondřej Bojar. "Visualizing neural machine translation attention and confidence." The Prague Bulletin of Mathematical Linguistics 109.1 (2017): 39-50.
Rios Gonzales, Annette, Laura Mascarell, and Rico Sennrich. "Improving Word Sense Disambiguation in Neural Machine Translation with Sense Embeddings." Proceedings of the Second Conference on Machine Translation. 2017.
Sennrich, Rico, et al. "Nematus: a Toolkit for Neural Machine Translation." Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 2017.
Sennrich, Rico. "How Grammatical is Character-level Neural Machine Translation? Assessing MT Quality with Contrastive Translation Pairs." Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. 2017.
Tiedemann, Jörg, and Yves Scherrer. "Neural Machine Translation with Extended Context." Proceedings of the Third Workshop on Discourse in Machine Translation. 2017.
Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
Wang, Longyue, et al. "Exploiting Cross-Sentence Context for Neural Machine Translation." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.
Appendix: Notions of depth in RNNs
Generally three types of depth (Pascanu et al., 2013):
- stacked layers (each layer individually recurrent)
- deep transition (units not individually recurrent)
- deep output (units not individually recurrent)
In Nematus, the decoder is implemented as a cgru with deep transition and deep output.
Crucially: attention over the source sentence vectors C is a deep transition step.
(a rough sketch of the difference follows)
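As a rough sketch of the distinction (notation is ours, not from Pascanu et al.): stacked layers give every layer its own recurrent state, whereas deep transition composes several non-recurrent steps inside a single recurrent update:

    stacked layers:   h_t^{(l)} = \mathrm{GRU}^{(l)}(h_t^{(l-1)}, h_{t-1}^{(l)})   for each layer l
    deep transition:  h_t = \mathrm{GRU}_L(\cdot, \ldots \mathrm{GRU}_2(\cdot, \mathrm{GRU}_1(x_t, h_{t-1})) \ldots)

where the inner steps of a deep transition receive the previous step's output as their state; in the cgru, the attention output c_j serves as the input to GRU_2, which is why the attention step counts as part of the deep transition.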