Under the hood of Neural Machine Translation. Vincent Vandeghinste


Recipe for (data-driven) machine translation
Ingredients:
- 1 (or more) parallel corpus
- 1 (or more) trainable MT engine + decoder: statistical machine translation or neural machine translation
Instructions:
- Pour the parallel corpus into the engine
- Let it simmer for a day (when using SMT) or for a week (when using NMT)
- Add seasoning (optimization / tuning)

Freely Available Parallel Corpora http://opus.nlpl.eu/

Statistical machine translation (SMT) www.statmt.org STEP 1: WORD ALIGNMENT

Statistical machine translation (SMT) www.statmt.org STEP 2: EXTRACT PHRASE TABLE

Statistical machine translation (SMT) www.statmt.org STEP 3: ESTIMATE LANGUAGE MODEL

Statistical machine translation (SMT) www.statmt.org STEP 4: OPTIMIZE PARAMETERS

Statistical machine translation (SMT) www.statmt.org STEP 5: TRANSLATE

Downsides of SMT
- Everything depends on the quality of the word alignments: errors in word alignment go into the system
- Separate training of different models: translation model (phrase tables with probabilities), language model (n-grams), distortion model
- Everything happens in a local window (max phrase length: 7, max n-gram length: 5), so long-distance phenomena such as subject-verb agreement in Dutch subordinate clauses are not covered

Neural machine translation (NMT) www.opennmt.net STEP 1: PREPROCESS

Neural machine translation (NMT) www.opennmt.net STEP 2: TRAIN

Neural machine translation (NMT) www.opennmt.net STEP 3: TRANSLATE

Neural Networks: The Brain
- Used for information processing and to model the world around us
- Large interconnected network of neurons
- A neuron collects inputs from other neurons using dendrites
- Neurons sum all the inputs and, if the result is greater than a threshold, they fire
- The fired signal is sent to other neurons through the axon

Artificial Neural Networks: The Perceptron
Neurons sum all the inputs (collected by the dendrites) and, if the result is greater than a threshold, they fire (along the axon).
- Inputs are real numbers (positive or negative)
- Weights are real numbers
- Each input is individually weighted; the weighted inputs are added together and passed into the activation function
- Example activation function: the step function: output 1 if input > threshold, 0 otherwise
Example: x1 = 0.6, x2 = 1.0, w1 = 0.5, w2 = 0.8
x1*w1 = 0.6 * 0.5 = 0.3
x2*w2 = 1.0 * 0.8 = 0.8
0.3 + 0.8 = 1.1 > threshold = 1.0: FIRE
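To make this arithmetic concrete, here is a minimal Python sketch (not part of the original slides) of a single perceptron with a step activation, using the example values above; the function name and structure are illustrative choices.

def perceptron(inputs, weights, threshold):
    # weighted sum of the inputs
    total = sum(x * w for x, w in zip(inputs, weights))
    # step activation: fire (output 1) if the sum exceeds the threshold
    return 1 if total > threshold else 0

# Example from the slide: 0.6*0.5 + 1.0*0.8 = 1.1 > 1.0, so the neuron fires
print(perceptron([0.6, 1.0], [0.5, 0.8], threshold=1.0))  # -> 1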

Training
"this is a bus" / "this is not a bus"
People learn by examples (positive and negative)

Training Perceptrons: The AND function
Training data and calculations with a random initialization of the weights: w1 = 0.1, w2 = 0.2

x1  x2  target | sum of weighted input  activation (t = 0.5)  error
0   0   0      | 0.0                    0                     0
0   1   0      | 0.2                    0                     0
1   0   0      | 0.1                    0                     0
1   1   1      | 0.3                    0                     1

Minimize this error: adapt the weights.

Training Perceptrons: The AND function
Training data and calculations with the adapted weights: w1 = 0.2, w2 = 0.3

x1  x2  target | sum of weighted input  activation (t = 0.5)  error
0   0   0      | 0.0                    0                     0
0   1   0      | 0.3                    0                     0
1   0   0      | 0.2                    0                     0
1   1   1      | 0.5                    1                     0

No more errors: we have learned.
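As a sketch of how the weight adaptation on these two slides can be automated, the classic perceptron learning rule (move each weight by the learning rate times the error times its input) reproduces the adapted weights shown above; the learning rate is an assumed value, not given on the slides.

# Training data for the AND function: (x1, x2) -> target
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights = [0.1, 0.2]   # random initialization from the slide
threshold = 0.5
learning_rate = 0.1    # assumed value

for epoch in range(10):
    errors = 0
    for (x1, x2), target in data:
        s = x1 * weights[0] + x2 * weights[1]
        output = 1 if s >= threshold else 0   # fires at exactly 0.5, as in the slide's table
        error = target - output
        if error != 0:
            errors += 1
            # perceptron learning rule: adapt the weights to reduce the error
            weights[0] += learning_rate * error * x1
            weights[1] += learning_rate * error * x2
    if errors == 0:        # no more errors: we have learned
        break

print(weights)  # -> [0.2, 0.3], the adapted weights from the slide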

What is happening? The perceptron is putting all the training instances into two categories: those that fire (category 1) and those that don't fire (category 2). It draws a line in a two-dimensional space: points on one side fall into category 1, points on the other side fall into category 2.

What is happening? It is not always possible to draw such a line. Example: Exclusive OR (XOR)

x1  x2  output
0   0   0
0   1   1
1   0   1
1   1   0

What do we need to learn this? A more complex architecture than the perceptron
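As an illustrative sketch (not from the slides), a two-layer network with hand-picked weights can compute XOR: one hidden unit detects that at least one input is on (OR), another detects that both are on (AND), and the output fires only for OR-and-not-AND.

def step(s, threshold):
    return 1 if s >= threshold else 0

def xor(x1, x2):
    # hidden layer with hand-picked, purely illustrative weights
    h_or  = step(x1 * 1.0 + x2 * 1.0, 0.5)   # fires if at least one input is 1
    h_and = step(x1 * 1.0 + x2 * 1.0, 1.5)   # fires only if both inputs are 1
    # output layer: OR but not AND
    return step(h_or * 1.0 + h_and * -1.0, 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor(a, b))   # reproduces the XOR truth table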

Language Modeling
- used to predict the next word
- trained on large monolingual text
- In SMT, we represent words as a set of discrete units
- In neural models, we represent words as points in a continuous space (word embeddings: meaning representations of words as a list of numbers)

Language Modeling: n-grams
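A minimal sketch of the n-gram idea (here: bigrams estimated by relative frequency); the toy corpus and function names are illustrative, not from the talk.

from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()   # toy corpus, purely illustrative

bigram_counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(bigram_prob("the", "cat"))   # 2 of the 3 words following "the" are "cat" -> 0.666...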

Neural Language Modeling
- dictionary: 246 elements
- one-hot vector: 246 dimensions
- word embedding: 124 dimensions
- dimensionality reduction!
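A sketch of this dimensionality reduction, assuming numpy: multiplying a 246-dimensional one-hot vector by a 246 x 124 embedding matrix just selects one 124-dimensional row, the word's embedding. The matrix here is random, standing in for learned values.

import numpy as np

vocab_size, embed_dim = 246, 124             # numbers from the slide
E = np.random.randn(vocab_size, embed_dim)   # embedding matrix (learned in practice, random here)

word_id = 42                                 # arbitrary word index, purely illustrative
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

embedding = one_hot @ E                      # the one-hot vector selects row 42 of E
print(embedding.shape)                       # (124,): 246 dimensions reduced to 124
assert np.allclose(embedding, E[word_id])    # equivalent to a simple table lookup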

Word Embeddings: Properties semantics of each dimension?

Word Embeddings: Properties Words with similar meaning are close to each other

Word Embeddings: Properties Can we do word arithmetic? king - man + woman = ?
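A sketch of this word arithmetic, assuming numpy and a dictionary of pre-trained embeddings (the vectors lookup here is hypothetical): add and subtract the vectors, then return the nearest remaining word by cosine similarity; in a good embedding space the answer tends to be "queen".

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(vectors, a, b, c):
    # vectors: dict mapping words to numpy arrays (hypothetical pre-trained embeddings)
    target = vectors[a] - vectors[b] + vectors[c]          # e.g. king - man + woman
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# analogy(vectors, "king", "man", "woman") would typically return "queen"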

Word Embeddings: Properties

Recurrent Neural Network

Neural Machine Translation (NMT)

NMT: Basic model

NMT Encoding: 1-Hot vector

NMT: Word Embedding

NMT: Hidden layer

NMT Summary Vector
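Pulling the four encoding steps above together, a sketch assuming numpy (all weight matrices are random stand-ins for trained parameters): each source word id is looked up as an embedding (equivalent to multiplying its 1-hot vector by the embedding matrix), fed through a recurrent hidden layer, and the final hidden state serves as the summary vector of the sentence.

import numpy as np

vocab_size, embed_dim, hidden_dim = 1000, 64, 128   # illustrative sizes
E    = np.random.randn(vocab_size, embed_dim)       # word embeddings (learned in practice)
W_xh = np.random.randn(hidden_dim, embed_dim)
W_hh = np.random.randn(hidden_dim, hidden_dim)

def encode(word_ids):
    h = np.zeros(hidden_dim)
    for i in word_ids:
        x = E[i]                           # 1-hot vector -> word embedding (table lookup)
        h = np.tanh(W_xh @ x + W_hh @ h)   # hidden layer: update the recurrent state
    return h                               # summary vector of the whole sentence

summary = encode([3, 17, 42])              # arbitrary word ids, purely illustrative
print(summary.shape)                       # (128,)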

NMT Decoding From a vector to a sequence of words 1. Compute hidden state of the decoder

NMT Decoding From a vector to a sequence of words 2. Next word probability

NMT Decoding From a vector to a sequence of words 3. Generating the next word
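A sketch of these three decoding steps, assuming numpy and the same illustrative sizes as the encoder sketch above; softmax plus a greedy argmax are common choices, not necessarily the exact ones used in the talk.

import numpy as np

vocab_size, embed_dim, hidden_dim = 1000, 64, 128   # same illustrative sizes as the encoder sketch
E     = np.random.randn(vocab_size, embed_dim)      # target-side word embeddings (random stand-ins)
W_dec = np.random.randn(hidden_dim, hidden_dim)     # decoder recurrence
W_emb = np.random.randn(hidden_dim, embed_dim)      # feeds in the previously generated word
W_out = np.random.randn(vocab_size, hidden_dim)     # hidden state -> one score per vocabulary word

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_step(prev_word_id, h_dec):
    # 1. compute the hidden state of the decoder
    h_dec = np.tanh(W_dec @ h_dec + W_emb @ E[prev_word_id])
    # 2. next-word probability: a distribution over the whole vocabulary
    probs = softmax(W_out @ h_dec)
    # 3. generate the next word (greedy argmax here; beam search is common in practice)
    return int(np.argmax(probs)), h_dec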

The Trouble with Simple Encoder-Decoder Architectures
- The input sequence is compressed into a fixed-size list of numbers (a vector)
- The translation is generated from this vector
- This vector must contain every detail about the source sentence and be large enough to compress sentences of any length
- Translation quality decreases as source sentence length increases (with a small model)

The Trouble with Simple Encoder-Decoder Architectures

The Trouble with Simple Encoder-Decoder Architectures RNNs remember recent symbols better: the further away a symbol is, the less likely the RNN's hidden state is to remember it.

Bi-directional representation Combining the forward and backward hidden vectors represents the word in the context of the entire sentence. The set of these representations is a variable-length representation of the source sentence.
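A sketch of the bi-directional idea, assuming numpy: run one recurrent pass left-to-right and one right-to-left, then concatenate the two hidden states for each word, giving one vector per source word and hence a variable-length representation of the sentence. All function and parameter names are illustrative.

import numpy as np

def run_rnn(xs, W_xh, W_hh):
    # returns one hidden state per input vector
    h, states = np.zeros(W_hh.shape[0]), []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)
        states.append(h)
    return states

def bidirectional_encode(embeddings, W_f, U_f, W_b, U_b):
    forward  = run_rnn(embeddings, W_f, U_f)               # left-to-right pass
    backward = run_rnn(embeddings[::-1], W_b, U_b)[::-1]   # right-to-left pass, re-aligned
    # one concatenated vector per word: the word in the context of the entire sentence
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]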

How does the decoder know which part of the encoding is relevant at each step of the generation?

Attention Mechanism The y's are our translated words produced by the decoder, and the x's are our source sentence words. Each decoder output word y_t now depends on a weighted combination of all the input states, not just the last state. The a's are weights that define how much of each input should be considered for each output.
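A sketch of this weighted combination, assuming numpy: scores between the current decoder state and every encoder state are turned into the weights a by a softmax, and the context used for the next output word is the weighted sum of all input states. The dot-product scoring used here is one common variant, not necessarily the one behind the talk's figures.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(decoder_state, encoder_states):
    # encoder_states: one vector per source word x; decoder_state: state for the current output y_t
    scores = np.array([decoder_state @ h for h in encoder_states])   # dot-product scoring
    a = softmax(scores)                        # the a's: how much each input counts for this output
    context = sum(w * h for w, h in zip(a, encoder_states))          # weighted sum of ALL input states
    return context, a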

Attention Mechanism Sample translations made by the neural machine translation model with the soft-attention mechanism. Edge thicknesses represent the attention weights found by the attention model.

Advantages of NMT
1. End-to-end training: all parameters are simultaneously optimized to minimize a loss function
2. Distributed representations share strength: better exploitation of word and phrase similarities
3. Better exploitation of context: NMT can use a much bigger context (both the source and the partial target text) to translate more accurately

Why neural machine translation (NMT)?
1. Results show that NMT produces automatic translations that are significantly preferred by humans over other machine translation outputs.
2. Similar methods (often called seq2seq) are also effective for many other NLP and language-related applications, such as dialogue, image captioning, and summarization.
3. NMT has been used as a representative application of the recent success of deep-learning-based artificial intelligence.
Source: opennmt.net

NMT compared to SMT (Koehn & Knowles 2017) 1. NMT systems have lower quality out of domain, to the point that they completely sacrifice adequacy for the sake of fluency.

NMT compared to SMT (Koehn & Knowles 2017) 2. NMT systems have a steeper learning curve with respect to the amount of training data, resulting in worse quality in low-resource settings, but better performance in high-resource settings.

NMT compared to SMT (Koehn & Knowles 2017) 3. NMT systems that operate at the sub-word level perform better than SMT systems on extremely low-frequency words, but still show weakness in translating low-frequency words belonging to highly-inflected categories (e.g. verbs).

NMT compared to SMT (Koehn & Knowles 2017) 4. NMT systems have lower translation quality on very long sentences, but do comparably better up to a sentence length of about 60 words.

NMT compared to SMT (Koehn & Knowles 2017) 5. The attention model for NMT does not always fulfill the role of a word alignment model, but may in fact dramatically diverge.

Conclusions
NMT is better than SMT:
- if you have the hardware
- if you have the time
- if you have the data
NMT is work in progress and a hot research topic:
- speeding up the learning
- larger vocabularies
- introducing linguistic information (part-of-speech tags, syntax trees)
- intelligibility: understanding what is being represented
- work on low-frequency words
- what about morphology?

Sources and references
https://medium.com/technologymadeeasy/for-dummies-the-introduction-to-neural-networks-we-all-need-c50f6012d5eb
https://www.xenonstack.com/blog/data-science/overview-of-artificial-neural-networks-and-its-applications
http://www.cs.stir.ac.uk/courses/itnp4b/lectures/kms/2-perceptrons.pdf
http://blog.systransoft.com/how-does-neural-machine-translation-work/
https://sites.google.com/site/acl16nmt/home
https://devblogs.nvidia.com/introduction-neural-machine-translation-with-gpus/
Koehn & Knowles (2017). Six Challenges for Neural Machine Translation. https://arxiv.org/pdf/1706.03872.pdf