Unsupervised Machine Translation

Unsupervised Machine Translation. Alexis Conneau, 3rd-year PhD student, Facebook AI Research / Université Le Mans. Joint work with Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou.

Motivation. Neural machine translation works well for language pairs with a lot of parallel data (English-French, English-German, etc.). Performance drops when parallel data is scarce (Vietnamese, Norwegian, Basque, Ukrainian, Serbian), and the creation of parallel data is difficult and costly. Most language pairs use English as a pivot. However, monolingual data is much easier to find.

Questions. Can we use monolingual data to improve an MT system? Can we reduce the amount of supervision? Can we even learn WITHOUT ANY supervision?

Prior work: semi-supervised back-translation (Sennrich et al., 2015). Setting: a small parallel dataset (English-French) and a huge monolingual corpus in the target language (French). Train a target → source model M_t2s on the parallel data. Use M_t2s to translate the target monolingual corpus, producing noisy English paired with the French monolingual sentences. Use the two parallel datasets (real and synthetic) to train the source → target model M_s2t (see the sketch below).
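Schematically, the recipe looks like the following pseudocode sketch; `train` and `translate` are placeholders for a full NMT training and decoding pipeline, not a real API.

```python
# Semi-supervised back-translation (Sennrich et al., 2015), schematically.
# parallel: small list of (english, french) sentence pairs
# mono_fr:  large list of French sentences (monolingual only)

def back_translation(parallel, mono_fr, train, translate):
    # 1) Train a target -> source model on the small parallel data.
    m_t2s = train([(fr, en) for en, fr in parallel])
    # 2) Translate the target monolingual corpus back into the source language.
    synthetic = [(translate(m_t2s, fr), fr) for fr in mono_fr]  # (noisy en, fr)
    # 3) Train the final source -> target model on real + synthetic pairs.
    return train(parallel + synthetic)
```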

Prior work. Semi-supervised: back-translation (Sennrich et al., 2015) and dual learning (He et al., 2016), which enforces cycle consistency in both directions: M_t2s(M_s2t(x_s)) = x_s (source → target → source) and M_s2t(M_t2s(x_t)) = x_t (target → source → target). Pivot-based: related language pairs (Firat et al., 2016; Johnson et al., 2016) and images (Nakayama & Nishida, 2017; Lee et al., 2017). Fully unsupervised: Ravi & Knight (2011).

Our approach. Start with unsupervised word translation: it is an easier task to start with, there are already insights into why it could work, and it can be used as a first step towards unsupervised sentence translation.

Weakly-supervised word translation: "Exploiting similarities among languages for machine translation" (Mikolov et al., 2013). Start from two pre-trained monolingual spaces (word2vec): they are learned in a totally unsupervised way, widely used, strong systems for monolingual embeddings, semantically and syntactically relevant, and not task-specific, so they are useful across domains. Project the source space onto the target space using a small seed dictionary. A feed-forward network does not improve over a linear mapping (Mikolov et al., 2013), and an orthogonal projection works best (Xing et al., 2015; Smith et al., 2017).

Weakly-supervised word translation. Linear projection: Mikolov et al. (2013). Orthogonal projection: Xing et al. (2015), Smith et al. (2017); with the orthogonality constraint, the optimal mapping has a closed-form solution (Procrustes). Given a source word s, define its translation as the nearest neighbor of Ws in the target space, according to the cosine distance (see the sketch below).
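To make the Procrustes step and the nearest-neighbor retrieval concrete, here is a minimal numpy sketch; the embedding matrices, the seed dictionary, and the function names are illustrative assumptions, not the MUSE implementation.

```python
import numpy as np

def procrustes(X, Y):
    """Closed-form orthogonal mapping W minimizing ||X W^T - Y||_F.
    X, Y: (n_pairs, dim) embeddings of the seed-dictionary word pairs,
    so that y_i ~ W x_i. Solution: W = U V^T with U S V^T = SVD(Y^T X)."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt  # (dim, dim) orthogonal matrix

def translate(word_vec, W, tgt_matrix, tgt_words, k=1):
    """Map a source vector into the target space and return the k nearest
    target words by cosine similarity."""
    query = W @ word_vec
    query /= np.linalg.norm(query)
    keys = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    scores = keys @ query
    return [tgt_words[i] for i in np.argsort(-scores)[:k]]
```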

Unsupervised word translation. Can we find the mapping W in an unsupervised way?

Adversarial training. If WX and Y are perfectly aligned, the two spaces should be indistinguishable. Discriminator training: train a discriminator D to distinguish elements of WX from elements of Y. Mapping training: train W to prevent the discriminator from making accurate predictions. The two steps alternate, as in standard adversarial training (see the sketch below).
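A schematic PyTorch sketch of the two alternating updates; the discriminator architecture, learning rates, and label convention are simplifying assumptions rather than the exact hyper-parameters of the paper.

```python
import torch
import torch.nn as nn

dim = 300
W = nn.Linear(dim, dim, bias=False)                   # the mapping to learn
D = nn.Sequential(nn.Linear(dim, 2048), nn.LeakyReLU(0.2),
                  nn.Linear(2048, 1), nn.Sigmoid())   # discriminator
bce = nn.BCELoss()
opt_w = torch.optim.SGD(W.parameters(), lr=0.1)
opt_d = torch.optim.SGD(D.parameters(), lr=0.1)

def step(x_batch, y_batch):
    """x_batch: source embeddings (B, dim); y_batch: target embeddings (B, dim)."""
    # 1) Discriminator step: predict 1 for mapped source WX, 0 for target Y.
    opt_d.zero_grad()
    pred = D(torch.cat([W(x_batch).detach(), y_batch]))
    label = torch.cat([torch.ones(len(x_batch), 1), torch.zeros(len(y_batch), 1)])
    bce(pred, label).backward()
    opt_d.step()
    # 2) Mapping step: update W so the discriminator labels WX as target (fool it).
    opt_w.zero_grad()
    bce(D(W(x_batch)), torch.zeros(len(x_batch), 1)).backward()
    opt_w.step()
```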

Orthogonality constraint. An orthogonal mapping is an isometry: it preserves dot products, preserves the quality of the monolingual embeddings, and makes training more robust (no collapse of the mapping). After each training update, project the mapping back towards the orthogonal manifold: W ← (1 + β)W − β(WWᵀ)W, which corresponds to a gradient step on ‖WᵀW − Id‖²_F, i.e. W ← W − λ∇_W ‖WᵀW − Id‖²_F (Cisse et al., ICML 2017).
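The update can be written as a one-line numpy sketch; the value of beta here is an illustrative small constant, not necessarily the one used in the experiments.

```python
import numpy as np

def orthogonalize(W, beta=0.01):
    """One step toward the orthogonal manifold, applied after every mapping
    update: W <- (1 + beta) * W - beta * (W W^T) W."""
    return (1 + beta) * W - beta * (W @ W.T) @ W
```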

Results on word translation. [Bar chart: word translation retrieval P@1 for the supervised Procrustes baseline vs. the adversarial approach, on en-es, es-en, en-fr, fr-en, en-ru, ru-en, en-zh, zh-en; 1.5k source queries, 200k target keys (vocabulary of 200k words for all languages).]

Unsupervised word translation: summary. Given independent monolingual datasets in a source and a target language, we can create high-quality cross-lingual dictionaries and high-quality cross-lingual embeddings.

Unsupervised sentence translation. Could we apply the same unsupervised training procedure to sentences? The number of points grows exponentially with sentence length, and there are no similar embedding structures across languages, so a direct application does not work (even in a supervised setting).

Proposed architecture: denoising auto-encoding. [Diagram: input sentence → noise model C → source encoder → source decoder.] Train a source → source denoising autoencoder (DAE). It is critical to add noise to the input to avoid trivial reconstructions. Two sources of noise are used (see the sketch below):
Word dropout: each word is removed with a probability p (usually 0.1). Ref: "Arizona was the first to introduce such a requirement." → "Arizona was the first to such a requirement." / "Arizona was first to introduce such a requirement."
Word shuffle: word order is (slightly) shuffled inside sentences. Ref: "Arizona was the first to introduce such a requirement." → "Arizona the first was to introduce a requirement such." / "Arizona was the to introduce first such requirement a."
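A small sketch of a noise function implementing the two corruptions, assuming whitespace-tokenized sentences; the shuffle below is one common way to implement a "slight" permutation, and the window k is an illustrative value, not necessarily the paper's setting.

```python
import random

def add_noise(words, p_drop=0.1, k=3):
    """Corrupt a tokenized sentence for the denoising objective:
    - word dropout: drop each word with probability p_drop;
    - word shuffle: perturb word order so that each word moves at most
      about k positions from its original place."""
    kept = [w for w in words if random.random() > p_drop]
    keys = [i + random.uniform(0, k + 1) for i in range(len(kept))]
    return [w for _, w in sorted(zip(keys, kept))]
```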

Proposed architecture: denoising auto-encoding. [Diagram: the same noise → encoder → decoder pipeline for both languages, with a discriminator on the latent states.] Train a source → source DAE and a target → target DAE. Make the source and target latent states indistinguishable using adversarial training. Since we want the decoders to operate in the same latent space, the encoders share their parameters (see the sketch below).
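A compact PyTorch sketch of the weight-sharing idea: one encoder shared by both languages, one decoder per language, and a discriminator on the latent states. Layer types and sizes (a plain GRU, no attention) are simplifying assumptions; the actual models use attention-based sequence-to-sequence networks.

```python
import torch
import torch.nn as nn

class UnsupMTModel(nn.Module):
    """Schematic architecture: shared encoder, per-language decoders,
    discriminator on latent states."""
    def __init__(self, vocab_src, vocab_tgt, dim=512):
        super().__init__()
        self.emb_src = nn.Embedding(vocab_src, dim)
        self.emb_tgt = nn.Embedding(vocab_tgt, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)   # shared by both languages
        self.dec_src = nn.GRU(dim, dim, batch_first=True)
        self.dec_tgt = nn.GRU(dim, dim, batch_first=True)
        self.out_src = nn.Linear(dim, vocab_src)
        self.out_tgt = nn.Linear(dim, vocab_tgt)
        self.discriminator = nn.Sequential(
            nn.Linear(dim, dim), nn.LeakyReLU(0.2), nn.Linear(dim, 1))

    def encode(self, tokens, lang):
        emb = self.emb_src(tokens) if lang == "src" else self.emb_tgt(tokens)
        latent, _ = self.encoder(emb)        # same encoder regardless of language
        return latent                        # (batch, time, dim)

    def decode(self, latent, tokens, lang):
        emb, dec, out = ((self.emb_src, self.dec_src, self.out_src)
                         if lang == "src" else
                         (self.emb_tgt, self.dec_tgt, self.out_tgt))
        hidden = latent[:, -1:].transpose(0, 1).contiguous()   # init from last latent state
        states, _ = dec(emb(tokens), hidden)
        return out(states)                   # logits over the output vocabulary
```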

Proposed architecture: denoising auto-encoding (continued). This works on simple / small datasets, with short sentences or a small vocabulary. Problem: at test time we want (source → target) or (target → source) translation, not reconstruction. Cross-domain training addresses this: train the model to perform actual translations. Since we do not have parallel data, we generate artificial translations for training.

Proposed architecture: cross-domain training. [Diagram: source encoder → target decoder, fed with translations produced by M, the model from the previous iteration.] Train on pairs generated using a stale version of the model, starting with word-by-word translation. Example: "une photo d'une rue bondée en ville." (sentence from the monolingual corpus) → "a photo of a street crowded in a city." (word-by-word translation) vs. "a view of a crowded city street." (gold translation). Training is symmetric: the same procedure is applied in both directions (see the sketch below).
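Putting the pieces together, one training iteration can be sketched as follows; `train_step`, `translate`, and `add_noise` are placeholders for the components described above, not a real API.

```python
def training_iteration(model, stale_model, mono_src, mono_tgt, add_noise):
    """One pass over the two monolingual corpora (adversarial step on the
    latent states omitted for brevity)."""
    for s, t in zip(mono_src, mono_tgt):
        # 1) Denoising auto-encoding in each language.
        model.train_step(add_noise(s), s, src_lang="src", tgt_lang="src")
        model.train_step(add_noise(t), t, src_lang="tgt", tgt_lang="tgt")
        # 2) Cross-domain (back-translation) step: translate with the *stale*
        #    model from the previous iteration, then learn to reconstruct the
        #    original sentence from that noisy translation.
        model.train_step(stale_model.translate(s, to="tgt"), s,
                         src_lang="tgt", tgt_lang="src")
        model.train_step(stale_model.translate(t, to="src"), t,
                         src_lang="src", tgt_lang="tgt")
```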

Recap. Denoising autoencoding to learn good sentence representations. Match the distributions of latent features across the two domains, via adversarial training and parameter sharing. Cross-domain training to learn to translate; the trick is to use a stale version of the model to produce a noisy source, and a word-by-word translation model to initialize the algorithm. Pretrain the word embeddings with the aligned cross-lingual embeddings from the first part.

Examples of unsupervised translations.
Source: une femme aux cheveux roses habillée en noir parle à un homme.
Iteration 0: a woman at hair roses dressed in black speaks to a man.
Iteration 1: a woman at glasses dressed in black talking to a man.
Iteration 2: a woman at pink hair dressed in black speaks to a man.
Iteration 3: a woman with pink hair dressed in black is talking to a man.
Reference: a woman with pink hair dressed in black talks to a man.

Translations are evaluated with BLEU: log BLEU = min(1 − r/c, 0) + (1/N) Σ_{n=1}^{N} log p_n, where c is the length of the candidate translation, r is the average length of a reference over the corpus, and p_n is the number of shared n-grams between the candidate and the reference divided by the number of n-grams in the candidate.
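A small Python sketch matching this definition at the sentence level; it uses no smoothing and takes r as the single reference length instead of a corpus average, which is a simplification.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_log_bleu(candidate, reference, N=4):
    """log BLEU = min(1 - r/c, 0) + (1/N) * sum_{n=1..N} log p_n."""
    c_tok, r_tok = candidate.split(), reference.split()
    brevity = min(1 - len(r_tok) / len(c_tok), 0)
    log_prec = 0.0
    for n in range(1, N + 1):
        cand, ref = ngrams(c_tok, n), ngrams(r_tok, n)
        shared = sum((cand & ref).values())        # clipped shared n-gram counts
        log_prec += math.log(shared / max(sum(cand.values()), 1)) if shared else float("-inf")
    return brevity + log_prec / N
```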

Thank you. "Word Translation Without Parallel Data", Alexis Conneau*, Guillaume Lample*, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou (ICLR 2018). Code: https://github.com/facebookresearch/muse. "Unsupervised Machine Translation Using Monolingual Corpora Only", Guillaume Lample, Alexis Conneau, Ludovic Denoyer, Marc'Aurelio Ranzato (ICLR 2018).