Machine Translation 09: Monolingual Data
Rico Sennrich, University of Edinburgh


Refresher: why monolingual data?
- language models are an important component in statistical machine translation
- monolingual data is far more abundant than parallel data
- phrase-based SMT models suffer from independence assumptions; LMs can mitigate this
- monolingual data may better match the target domain

Outline
1. Language Models in NMT
2. Training End-to-End NMT Model with Monolingual Data
3. "Unsupervised" MT from Monolingual Data

Language Models in NMT [Gülçehre et al., 2015]
- shallow fusion: rescore the beam with a language model (similar to ensembling)
- deep fusion: extra, LM-specific hidden layer

Results (BLEU) for De-En and Cs-En translation [Gülçehre et al., 2015, Table 4]:

                  De-En           Cs-En
system            Dev    Test     Dev    Test
NMT Baseline      25.51  23.61    21.47  21.89
Shallow Fusion    25.53  23.69    21.95  22.18
Deep Fusion       25.88  24.00    22.49  22.36
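As a rough illustration of shallow fusion, the sketch below rescores beam hypotheses by interpolating the NMT log-probability with an LM log-probability. The interpolation weight and the toy scoring functions are assumptions for illustration, not the exact setup of Gülçehre et al.

```python
# Minimal sketch of shallow fusion: rescore beam hypotheses with an external LM.
# The scoring functions below are toy stand-ins, not a real NMT model or LM.

def nmt_log_prob(hypothesis: str) -> float:
    # Stand-in for log p_NMT(y | x); a real system scores with the translation model.
    return -0.1 * len(hypothesis.split())

def lm_log_prob(hypothesis: str) -> float:
    # Stand-in for log p_LM(y); a real system uses an LM trained on monolingual data.
    return -0.05 * len(hypothesis.split())

def shallow_fusion_rescore(beam, lm_weight=0.3):
    """Re-rank beam hypotheses by log p_NMT(y|x) + lm_weight * log p_LM(y)."""
    scored = [(nmt_log_prob(h) + lm_weight * lm_log_prob(h), h) for h in beam]
    return [h for _, h in sorted(scored, reverse=True)]

beam = ["das ist ein Test", "das ist Test ein", "dies ist ein kleiner Test"]
print(shallow_fusion_rescore(beam))
```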

2. Training End-to-End NMT Model with Monolingual Data

Monolingual Data in NMT
- NMT is a conditional language model: $p(u_i) = f(z_i, u_{i-1}, c_i)$
- problem: for monolingual training instances, the source context $c_i$ is missing

Monolingual Training Instances
solutions: missing data imputation for $c_i$
- missing data indicator: use 0 for $c_i$; works, but danger of catastrophic forgetting
- impute $c_i$ with a neural network: we do this indirectly by back-translating the target sentence
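A minimal sketch of the back-translation idea follows: target-language monolingual sentences are translated into the source language with a reverse-direction model, and the resulting synthetic pairs are mixed with the real parallel data. The `translate_target_to_source` argument is a hypothetical stand-in for a trained target-to-source NMT system, and the 1:1 mixing ratio is an illustrative choice.

```python
# Sketch of creating synthetic parallel data via back-translation (assumption:
# a trained target->source model is available behind translate_target_to_source).
from typing import Callable, List, Tuple

def back_translate(
    target_monolingual: List[str],
    translate_target_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (synthetic_source, real_target) pairs."""
    return [(translate_target_to_source(t), t) for t in target_monolingual]

def build_training_data(parallel, synthetic, synthetic_ratio=1.0):
    """Mix real and synthetic pairs; the ratio here is an illustrative choice."""
    n_synth = int(len(parallel) * synthetic_ratio)
    return parallel + synthetic[:n_synth]

# Toy usage with a dummy reverse model (identity translation, for illustration only).
mono = ["das ist ein Beispiel", "mehr einsprachige Daten"]
synthetic = back_translate(mono, lambda s: s)  # real systems use a trained DE->EN model here
parallel = [("this is an example", "das ist ein Beispiel")]
print(build_training_data(parallel, synthetic))
```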

Evaluation: English-German (BLEU)
- NMT trained on parallel data only: 23.6
- + missing data indicator: 24.6
- + back-translation: 26.5

Back-Translation: Comparison to Phrase-based SMT
- back-translation of monolingual data has been proposed for phrase-based SMT [Schwenk, 2008, Bertoldi and Federico, 2009, Lambert et al., 2011]
- PBSMT already has an LM; main rationale there: phrase-table domain adaptation
- rationale in NMT: train the end-to-end model on monolingual data

Table: BLEU gains on English-German from adding back-translated News Crawl data.

system       WMT (in-domain)   IWSLT (out-of-domain)
PBSMT gain   +0.7              +0.1
NMT gain     +2.9              +1.2

Autoencoders
- general principle: train a network that encodes the input, and learns to reconstruct the input from the encoded representation
- unsupervised representation learning

[Figure: sequence autoencoder. The encoder reads "john likes his cat" into hidden states; the decoder reconstructs "john likes his cat" from them.]

Autoencoders in Neural Machine Translation
- autoencoders are used via multi-task learning: shared models, multiple task-specific objectives
- [Luong et al., 2016]: many-to-many setting with multiple encoders and multiple decoders; a single translation task plus two unsupervised autoencoder tasks (one per language) to exploit large monolingual corpora on both sides
- does the idea still work if we use an attention mechanism? (far less of a representation bottleneck)
- apparently, yes, for low-resource language pairs [Currey et al., 2017]:

source      Les Dissonances a aparut pe scena muzicala în 2004...
reference   Les Dissonances appeared on the music scene in 2004...
baseline    Les Dissonville appeared on the music scene in 2004...
+ copied    Les Dissonances appeared on the music scene in 2004...

analysis: the BPE-based system gets better at copying unknown names
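The copied-data idea of Currey et al. (2017) can be sketched very simply: target-language monolingual sentences are added to the training data with the target sentence copied to the source side, so the model also learns an identity, autoencoder-like task. The helper below is an illustrative sketch, not the authors' exact preprocessing.

```python
# Sketch: augment parallel data with "copied" monolingual data [Currey et al., 2017].
# Each monolingual target sentence becomes a (target, target) training pair.
from typing import List, Tuple

def add_copied_data(
    parallel: List[Tuple[str, str]],
    target_monolingual: List[str],
) -> List[Tuple[str, str]]:
    copied = [(t, t) for t in target_monolingual]  # identical source and target side
    return parallel + copied

parallel = [("this is an example", "das ist ein Beispiel")]
mono_de = ["Les Dissonances erschien 2004 in der Musikszene"]
print(add_copied_data(parallel, mono_de))
```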

Dual Learning [He et al., 2016]
dual-learning game: closed loop of two translation systems
- translate a sentence from language A into language B and back
- loss functions:
  - is the sentence in language B natural? loss is the negative log-probability under a (static) LM
  - is the second translation similar to the original? loss is standard cross-entropy, with the original as reference
- use reinforcement learning to update the weights
- we can also start with a sentence in language B
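A rough sketch of the dual-learning reward follows: for an original sentence, an intermediate translation is scored by a language model (naturalness) and by the reverse model's reconstruction likelihood, and the combined reward would drive a policy-gradient update. The scoring functions here are toy stand-ins and the interpolation weight `alpha` is an assumption, not the paper's exact setting.

```python
# Sketch of the dual-learning reward [He et al., 2016]; toy scorers, not real models.

def lm_log_prob(sentence_b: str) -> float:
    # Stand-in for log p_LM(s_B): how natural is the intermediate translation?
    return -0.05 * len(sentence_b.split())

def reconstruction_log_prob(original_a: str, sentence_b: str) -> float:
    # Stand-in for log p_{B->A}(s_A | s_B): how well can we translate back to the original?
    return -0.1 * abs(len(original_a.split()) - len(sentence_b.split()))

def dual_learning_reward(original_a: str, intermediate_b: str, alpha: float = 0.5) -> float:
    """Combine language-model reward and reconstruction reward (alpha is illustrative)."""
    r_lm = lm_log_prob(intermediate_b)
    r_rec = reconstruction_log_prob(original_a, intermediate_b)
    return alpha * r_lm + (1.0 - alpha) * r_rec

# In training, this reward would feed a policy-gradient (REINFORCE-style) update of the
# A->B model; the closed loop can equally be started from a sentence in language B.
print(dual_learning_reward("machine translation is fun", "maschinelle Übersetzung macht Spaß"))
```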

Parameter Pre-Training [Ramachandran et al., 2017]
- core idea: pre-train the encoder and decoder on a language modelling task; two monolingual datasets are collected, one for the source-side language and one for the target-side language, and a language model is trained on each
- models are then fine-tuned with the translation objective, along with continued use of the LM objective (with shared parameters)

[Figure: pretrained sequence-to-sequence model. Parameters in shaded boxes are pretrained, either from the source-side or the target-side language model; the remaining parameters are randomly initialized.]
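As a minimal sketch of this initialization step, the snippet below copies parameters from two hypothetical language-model checkpoints (source-side and target-side) into a translation model's parameter dictionary before fine-tuning. The parameter names and dictionary layout are assumptions for illustration, not the authors' implementation.

```python
# Sketch: initialize encoder/decoder parameters from pretrained LMs (names are illustrative).
from typing import Dict, List

def init_from_lms(
    nmt_params: Dict[str, List[float]],
    src_lm_params: Dict[str, List[float]],
    tgt_lm_params: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Copy LM parameters into the NMT parameter dict; unmatched parameters stay random."""
    initialized = dict(nmt_params)
    for name, value in src_lm_params.items():
        key = "encoder." + name
        if key in initialized:
            initialized[key] = value
    for name, value in tgt_lm_params.items():
        key = "decoder." + name
        if key in initialized:
            initialized[key] = value
    return initialized

nmt = {"encoder.embedding": [0.0], "decoder.embedding": [0.0], "decoder.output": [0.0]}
src_lm = {"embedding": [1.0]}
tgt_lm = {"embedding": [2.0], "output": [3.0]}
print(init_from_lms(nmt, src_lm, tgt_lm))
```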

3. "Unsupervised" MT from Monolingual Data

Bilingual Lexicon Induction
- learn lexical correspondences from monolingual data
- correspondences are based on various types of similarity:
  - contextual similarity
  - temporal similarity
  - orthographic similarity
  - frequency similarity
- today we look at distributional word representations (contextual similarity)

Embedding Space Similarities Across Languages [Mikolov et al., 2013]

Learning to Map Between Vector Spaces
supervised mapping [Mikolov et al., 2013]
- we can learn a linear transformation between embedding spaces with a small dictionary
- given a linear transformation matrix $W$ and two vector representations $x_i$, $y_i$ in the source and target language, the training objective (optimized with SGD) is:

$$\arg\min_W \sum_{i=1}^{n} \|W x_i - y_i\|^2$$

- training requires a small seed lexicon of $(x, y)$ pairs
- after mapping, induce a bilingual lexicon via nearest-neighbor search
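The sketch below learns such a linear mapping with plain SGD on a toy seed lexicon and then induces translations by cosine nearest-neighbor search. The embedding dimension, learning rate, and random data are illustrative assumptions.

```python
# Sketch: learn W minimizing sum_i ||W x_i - y_i||^2 by SGD, then nearest-neighbor lookup.
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # toy embedding dimension
X = rng.normal(size=(100, d))             # source-language embeddings (seed lexicon)
true_W = rng.normal(size=(d, d))
Y = X @ true_W.T + 0.01 * rng.normal(size=(100, d))  # corresponding target embeddings

W = np.zeros((d, d))
lr = 0.01
for epoch in range(200):
    for x, y in zip(X, Y):
        grad = 2 * np.outer(W @ x - y, x)  # gradient of ||W x - y||^2 w.r.t. W
        W -= lr * grad

def nearest_neighbor(x_vec, target_matrix, W):
    """Index of the target embedding closest (cosine) to the mapped source vector."""
    mapped = W @ x_vec
    sims = target_matrix @ mapped / (
        np.linalg.norm(target_matrix, axis=1) * np.linalg.norm(mapped) + 1e-9
    )
    return int(np.argmax(sims))

print(nearest_neighbor(X[0], Y, W))  # ideally 0: the seed pair maps onto itself
```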

Learning to Map Between Vector Spaces
unsupervised mapping [Miceli Barone, 2016, Conneau et al., 2017]
- adversarial training: co-train a classifier (adversary) that predicts whether an embedding represents a source-language or a target-language word
- objective of the linear, orthogonal transformation: fool the classifier by making the embeddings as similar as possible

[Figure 1 from Conneau et al., 2017: (A) two distributions of word embeddings, English words (X) and Italian words (Y), to be aligned/translated; dot size is proportional to word frequency. (B) Adversarial learning finds a rotation matrix W that roughly aligns the two distributions. (C) The mapping W is further refined via Procrustes, using frequent words aligned by the previous step as anchor points. (D) The refined mapping W plus a distance metric are used to translate.]
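The Procrustes refinement mentioned in the figure has a closed-form solution: given anchor pairs stacked as row matrices X and Y, the best orthogonal W minimizing the mapping error is obtained from an SVD. The sketch below shows this step only (the adversarial stage is omitted), with toy random data as an assumption.

```python
# Sketch: Procrustes refinement of an embedding mapping (adversarial stage omitted).
# X and Y hold anchor-pair embeddings (rows are words); the data here is toy/random.
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Closed-form orthogonal W minimizing sum_i ||W x_i - y_i||^2."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(200, d))                   # "source" anchor embeddings
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # hidden orthogonal rotation
Y = X @ Q.T + 0.01 * rng.normal(size=(200, d))  # "target" anchors = rotated + noise

W = procrustes(X, Y)
print(round(float(np.linalg.norm(W - Q) / np.linalg.norm(Q)), 4))        # ~0: rotation recovered
print(round(float(np.linalg.norm(X @ W.T - Y) / np.linalg.norm(Y)), 4))  # small mapping error
```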

Learning to Map Between Vector Spaces
warning: these are recent research results, and open questions remain
- under what conditions will this method succeed / fail?
- the method was tested with typologically relatively similar languages
- the method was tested with similar monolingual data (same domains and genres)

Improving Word Order [Lample et al., 2017]
- joint training of both translation directions
- use the translation model to back-translate monolingual data
- learn the encoder-decoder to reconstruct the original sentence from the noisy translation
- iterate several times
- use various other tricks and objectives to improve learning:
  - pre-trained embeddings
  - denoising autoencoder as additional objective
  - shared encoder / decoder parameters in both directions
  - adversarial objective

BLEU:
system                                en-fr   en-de
supervised                            28.0    21.3
word-by-word [Conneau et al., 2017]    6.3     7.1
[Lample et al., 2017]                 15.1     9.6
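The denoising objective relies on a simple noise model. A common choice, sketched below as an assumption rather than the exact recipe of Lample et al., drops some words and locally shuffles the rest; the model is then trained to reconstruct the clean sentence from this corrupted input.

```python
# Sketch of a noise function for the denoising autoencoder objective:
# random word dropout plus limited local shuffling (parameters are illustrative).
import random

def add_noise(sentence: str, drop_prob: float = 0.1, shuffle_window: int = 3) -> str:
    words = sentence.split()
    # 1) drop each word with probability drop_prob (but keep at least one word)
    kept = [w for w in words if random.random() > drop_prob] or words[:1]
    # 2) local shuffle: each word may move at most shuffle_window - 1 positions
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(kept))]
    shuffled = [w for _, w in sorted(zip(keys, kept))]
    return " ".join(shuffled)

random.seed(1)
clean = "the cat sat on the mat"
noisy = add_noise(clean)
print(noisy)  # training pair: reconstruct `clean` from `noisy`
```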

Conclusion
- there are various ways to learn from monolingual data:
  - combination with a language model
  - pre-training and parameter sharing
  - creating synthetic training data
- methods are especially useful when:
  - parallel data is sparse
  - monolingual data is highly relevant (in-domain)
- hot research topic: learning to translate without parallel data

Bibliography I

Bertoldi, N. and Federico, M. (2009). Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation, StatMT '09. Association for Computational Linguistics.

Conneau, A., Lample, G., Ranzato, M., Denoyer, L., and Jégou, H. (2017). Word Translation Without Parallel Data. CoRR, abs/1710.04087.

Currey, A., Miceli Barone, A. V., and Heafield, K. (2017). Copied Monolingual Data Improves Low-Resource Neural Machine Translation. In Proceedings of the Second Conference on Machine Translation, pages 148-156, Copenhagen, Denmark. Association for Computational Linguistics.

Gülçehre, Ç., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H., Bougares, F., Schwenk, H., and Bengio, Y. (2015). On Using Monolingual Corpora in Neural Machine Translation. CoRR, abs/1503.03535.

He, D., Xia, Y., Qin, T., Wang, L., Yu, N., Liu, T., and Ma, W.-Y. (2016). Dual Learning for Machine Translation. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems 29, pages 820-828. Curran Associates, Inc.

Lambert, P., Schwenk, H., Servan, C., and Abdul-Rauf, S. (2011). Investigations on Translation Model Adaptation Using Monolingual Data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 284-293, Edinburgh, Scotland. Association for Computational Linguistics.

Bibliography II

Lample, G., Denoyer, L., and Ranzato, M. (2017). Unsupervised Machine Translation Using Monolingual Corpora Only. CoRR, abs/1711.00043.

Luong, M., Le, Q. V., Sutskever, I., Vinyals, O., and Kaiser, L. (2016). Multi-task Sequence to Sequence Learning. In ICLR 2016.

Miceli Barone, A. V. (2016). Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 121-126, Berlin, Germany. Association for Computational Linguistics.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation. CoRR, abs/1309.4168.

Ramachandran, P., Liu, P., and Le, Q. (2017). Unsupervised Pretraining for Sequence to Sequence Learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 383-391, Copenhagen, Denmark. Association for Computational Linguistics.

Schwenk, H. (2008). Investigations on Large-Scale Lightly-Supervised Training for Statistical Machine Translation. In International Workshop on Spoken Language Translation, pages 182-189.