Under the hood of Neural Machine Translation. Vincent Vandeghinste

1 Under the hood of Neural Machine Translation Vincent Vandeghinste

2 Recipe for (data-driven) machine translation. Ingredients: 1 (or more) parallel corpus; 1 (or more) trainable MT engine + decoder (statistical machine translation or neural machine translation). Instructions: Pour the parallel corpus into the engine. Let it simmer: for a day (when using SMT), plus seasoning (optimization/tuning); for a week (when using NMT).

3 Freely Available Parallel Corpora

4 Statistical machine translation (SMT) STEP 1: WORD ALIGNMENT

5 Statistical machine translation (SMT) STEP 2: EXTRACT PHRASE TABLE

6 Statistical machine translation (SMT) STEP 3: ESTIMATE LANGUAGE MODEL

7 Statistical machine translation (SMT) STEP 4: OPTIMIZE PARAMETERS

8 Statistical machine translation (SMT) STEP 5: TRANSLATE

9 Downsides of SMT. Everything depends on the quality of the word alignments: errors in word alignment propagate into the system. The different models are trained separately: translation model (phrase tables with probabilities), language model (n-grams), distortion model. Everything happens in a local window (max phrase length: 7; max n-gram length: 5), which does not cover long-distance phenomena such as subject-verb agreement in Dutch subordinate clauses.

10 Neural machine translation (NMT) STEP 1: PREPROCESS

11 Neural machine translation (NMT) STEP 2: TRAIN

12 Neural machine translation (NMT) STEP 3: TRANSLATE

13 Neural Networks: The Brain. Used for information processing and to model the world around us. A large interconnected network of neurons. A neuron collects inputs from other neurons through its dendrites. Neurons sum all the inputs and, if the result is greater than a threshold, they fire. The fired signal is sent to other neurons through the axon.

14 Artificial Neural Networks: The Perceptron. Neurons sum all the inputs and, if the result is greater than a threshold, they fire. Figure labels: dendrites, axon. Inputs are real numbers (positive or negative). Weights are real numbers. Each input is individually weighted; the weighted inputs are added together and passed into the activation function. Example activation function: the step function, which outputs 1 if its input > threshold and 0 otherwise. Example: x1 = 0.6, x2 = 1.0, w1 = 0.5, w2 = 0.8. x1*w1 = 0.6 * 0.5 = 0.3; x2*w2 = 1.0 * 0.8 = 0.8; sum = 0.3 + 0.8 = 1.1 > threshold = 1.0: FIRE.
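
The perceptron computation on this slide can be written out in a few lines. The following is a minimal sketch in plain Python; the function names `step` and `perceptron` are illustrative choices, and only the input values, weights and threshold come from the slide.

```python
def step(total, threshold):
    """Step activation: 1 if the weighted sum exceeds the threshold, else 0."""
    return 1 if total > threshold else 0


def perceptron(inputs, weights, threshold):
    """Weight each input, add the products, and apply the step function."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return step(total, threshold)


# Values from the slide: x1=0.6, x2=1.0, w1=0.5, w2=0.8, threshold=1.0
fired = perceptron([0.6, 1.0], [0.5, 0.8], threshold=1.0)
print(fired)  # 1: the weighted sum (about 1.1) exceeds the threshold, so the unit fires
```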

15 Training. People learn by examples (positive and negative): "this is a bus", "this is not a bus".

16 Training Perceptrons: The AND function. Training data: x1, x2, output. Calculations: random initialization of weights (w1=0.1, w2=), sum of weighted input, activation (t = 0.5), error. Minimize this error: adapt the weights.

17 Training Perceptrons: The AND function. Training data: x1, x2, output. Calculations: adapted weights (w1=0.2, w2=), sum of weighted input, activation (t = 0.5), error. No more errors: we have learned.
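
One way to make "adapt the weights until there are no more errors" concrete is the classic perceptron learning rule. The sketch below is an illustration under stated assumptions, not necessarily the procedure behind the slides: the learning rate of 0.1 and the starting weights `[0.1, 0.1]` are hypothetical (the slides only show w1=0.1 and a threshold of 0.5), so the learned weights need not match the slide's values.

```python
# Truth table of the AND function: ((x1, x2), expected output)
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def predict(inputs, weights, threshold=0.5):
    """Fire (1) if the weighted sum exceeds the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0


def train(data, weights, learning_rate=0.1, max_epochs=100):
    """Perceptron rule: nudge each weight by learning_rate * error * input."""
    weights = list(weights)
    for _ in range(max_epochs):
        errors = 0
        for inputs, target in data:
            error = target - predict(inputs, weights)
            if error != 0:
                errors += 1
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
        if errors == 0:  # a full pass without mistakes: we have learned
            break
    return weights


# Hypothetical starting weights; the slide's own w2 is not recoverable.
learned = train(AND_DATA, weights=[0.1, 0.1])
print([predict(inputs, learned) for inputs, _ in AND_DATA])
```

With these assumed settings the loop stops after a few passes, once every row of the truth table is classified correctly.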

18 What is happening? The perceptron puts all the training instances into two categories: those that fire (category 1) and those that don't fire (category 2). It draws a line in a two-dimensional space: points on one side fall into category 1, points on the other side fall into category 2.

19 What is happening? It is not always possible to draw a line. Example: Exclusive OR (XOR). Table columns: x1, x2, output.

20 What do we need to learn this? A more complex architecture than the perceptron
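
To illustrate what "a more complex architecture" can buy, here is a small sketch of a two-layer network of step units that computes XOR. The weights and thresholds are hand-picked for illustration (they are not learned, and the slide does not specify an architecture): one hidden unit behaves like OR, another like AND, and the output unit fires when OR is on but AND is off.

```python
def unit(inputs, weights, threshold):
    """A single step-function unit, as in the perceptron slides."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0


def xor(x1, x2):
    """Two hidden units feeding one output unit (hand-set, hypothetical weights)."""
    h_or = unit([x1, x2], [1.0, 1.0], threshold=0.5)    # fires if at least one input is 1
    h_and = unit([x1, x2], [1.0, 1.0], threshold=1.5)   # fires only if both inputs are 1
    return unit([h_or, h_and], [1.0, -1.0], threshold=0.5)  # OR but not AND


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor(a, b))
```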

21 Language Modeling. Used to predict the next word; trained on large monolingual text. In SMT, we represent a set of words as discrete units. In neural models, we represent words as points in a continuous space (word embeddings: meaning representations of words as a list of numbers).

22 Language Modeling: n-grams
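
As a rough illustration of n-gram language modeling, the sketch below estimates bigram (n = 2) probabilities by counting. The three-sentence corpus and the function names are made up for the example, and no smoothing is applied, so unseen word pairs get probability 0.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, nxt):
    """Relative frequency estimate of P(nxt | prev); 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


# Hypothetical toy corpus
corpus = ["this is a bus", "this is not a bus", "this is a car"]
model = train_bigrams(corpus)
print(next_word_probability(model, "a", "bus"))
```

In this toy corpus "a" is followed by "bus" in two of its three occurrences, which is the probability the last line prints.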

23 Neural Language Modeling. Dictionary: 246 elements. One-hot vector: 246 dimensions. Word embedding: 124 dimensions. Dimensionality reduction!
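
The step from a one-hot vector to a word embedding can be sketched as a multiplication with an embedding matrix, which amounts to picking one row. The example below uses a hypothetical four-word vocabulary and two embedding dimensions instead of the slide's 246 and 124, with random numbers standing in for trained values.

```python
import random

VOCAB = ["this", "is", "a", "bus"]  # hypothetical toy vocabulary
EMBED_DIM = 2                        # hypothetical embedding size

random.seed(0)
# One row per vocabulary word; random values stand in for trained ones.
embedding_matrix = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)]
                    for _ in VOCAB]


def one_hot(word):
    """Vector with a single 1 at the word's vocabulary index."""
    return [1 if w == word else 0 for w in VOCAB]


def embed(word):
    """Multiply the one-hot vector with the matrix: this selects one row."""
    vec = one_hot(word)
    return [sum(vec[i] * embedding_matrix[i][d] for i in range(len(VOCAB)))
            for d in range(EMBED_DIM)]


print(one_hot("bus"))  # [0, 0, 0, 1]
print(embed("bus"))    # same numbers as embedding_matrix[3]
```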

24 Word Embeddings: Properties. Semantics of each dimension?

25 Word Embeddings: Properties. Words with similar meaning are close to each other.

26 Word Embeddings: Properties. Can we do word arithmetic? king - man + woman = ?
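
A small sketch of how such word arithmetic could be carried out: subtract and add the vectors, then look for the nearest remaining word by cosine similarity. The two-dimensional vectors below are invented for illustration (loosely "royalty" and "femaleness"); real embeddings have many more dimensions and are learned from data.

```python
import math

# Hypothetical hand-made vectors: [royalty, femaleness]
vectors = {
    "king": [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man": [0.1, 0.1],
    "woman": [0.1, 0.9],
    "prince": [0.7, 0.1],
    "girl": [0.05, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))  # queen
```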

27 Word Embeddings: Properties

28 Recurrent Neural Network
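
As a minimal sketch of what a recurrent network does, the code below updates a hidden state one input at a time using a simple tanh recurrence. The weight matrices `W_X` and `W_H` are hypothetical fixed numbers (a trained network would learn them), and the plain tanh cell is an assumption; the slides do not say which recurrent unit is used.

```python
import math

# Hypothetical weights: input-to-hidden and hidden-to-hidden
W_X = [[0.5, -0.3], [0.1, 0.8]]
W_H = [[0.2, 0.4], [-0.6, 0.3]]


def matvec(matrix, vector):
    """Matrix-vector product for small lists of lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rnn_step(x, h_prev):
    """New hidden state from the current input and the previous hidden state."""
    from_input = matvec(W_X, x)
    from_history = matvec(W_H, h_prev)
    return [math.tanh(a + b) for a, b in zip(from_input, from_history)]


def encode(sequence):
    """Read the sequence left to right; return the final hidden state."""
    h = [0.0, 0.0]
    for x in sequence:
        h = rnn_step(x, h)
    return h


a, b = [1.0, 0.0], [0.0, 1.0]  # two toy input vectors
print(encode([a, b]))
print(encode([b, a]))  # differs: the hidden state is sensitive to word order
```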

29 Neural Machine Translation (NMT)

30 NMT: Basic model

31 NMT Encoding: 1-Hot vector

32 NMT: Word Embedding

33 NMT: Hidden layer

34 NMT Summary Vector

35 NMT Decoding: from a vector to a sequence of words. 1. Compute the hidden state of the decoder.

36 NMT Decoding: from a vector to a sequence of words. 2. Next word probability.

37 NMT Decoding: from a vector to a sequence of words. 3. Generating the next word.
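
Steps 2 and 3 can be sketched as a softmax over per-word scores followed by picking a word. In the example below the scores are made-up numbers standing in for what a decoder would compute from its hidden state, the target vocabulary is hypothetical, and choosing the single most probable word (greedy decoding) is a simplification; practical systems often use beam search instead.

```python
import math

TARGET_VOCAB = ["dit", "is", "een", "bus", "</s>"]  # hypothetical target vocabulary


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word(scores):
    """Greedy choice: return the most probable word and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return TARGET_VOCAB[best], probs[best]


# Made-up scores, one per vocabulary word
word, prob = next_word([0.1, 0.2, 0.3, 2.5, 0.0])
print(word)  # bus
```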

38 The Trouble with Simple Encoder-Decoder Architectures. The input sequence is compressed into a fixed-size list of numbers (a vector), and the translation is generated from this vector. This vector must contain every detail about the source sentence and be large enough to compress sentences of any length. Translation quality decreases as source sentence length increases (with a small model).

39 The Trouble with Simple Encoder-Decoder Architectures

40 The Trouble with Simple Encoder-Decoder Architectures. RNNs remember recent symbols better: the further away a symbol is, the less likely the RNN's hidden state is to remember it.

41 Bi-directional representation. Combine the forward and backward hidden vectors: the result represents the word in the context of the entire sentence. The set of these representations is a variable-length representation of the source sentence.

42 How does the decoder know which part of the encoding is relevant at each step of the generation?

43 Attention Mechanism. The y's are our translated words produced by the decoder, and the x's are our source sentence words. Each decoder output word y_t now depends on a weighted combination of all the input states, not just the last state. The a's are weights that define how much of each input should be considered for each output.
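
The "weighted combination of all the input states" can be sketched as follows. The weights (the a's) are obtained here by a softmax over dot products between the decoder state and each encoder state; dot-product scoring is an assumption made for brevity, since the slide does not say how the weights are computed, and all vectors are toy values.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return attention weights and the weighted sum of encoder states."""
    # Assumed scoring function: dot product of decoder and encoder state
    scores = [sum(d * e for d, e in zip(decoder_state, h))
              for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Toy encoder states, one per source word, and a toy decoder state
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend([0.0, 2.0], encoder_states)
print(weights)  # the second source word receives the largest weight
print(context)
```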

44 Attention Mechanism Sample translations made by the neural machine translation model with the soft-attention mechanism. Edge thicknesses represent the attention weights found by the attention model.

45 Advantages of NMT. 1. End-to-end training: all parameters are simultaneously optimized to minimize a loss function. 2. Distributed representations share strength: better exploitation of word and phrase similarities. 3. Better exploitation of context: NMT can use a much bigger context, both source and partial target text, to translate more accurately.

46 Why neural machine translation (NMT) 1. Results show that NMT produces automatic translations that are significantly preferred by humans to other machine translation outputs. 2. Similar methods (often called seq2seq) are also effective for many other NLP and language-related applications such as dialogue, image captioning, and summarization. 3. NMT has been used as a representative application of the recent success of deep learning-based artificial intelligence. source: opennmt.net

47 NMT compared to SMT (Koehn & Knowles 2017) 1. NMT systems have lower quality out of domain, to the point that they completely sacrifice adequacy for the sake of fluency.

48 NMT compared to SMT (Koehn & Knowles 2017) 2. NMT systems have a steeper learning curve with respect to the amount of training data, resulting in worse quality in low-resource settings, but better performance in high-resource settings.

49 NMT compared to SMT (Koehn & Knowles 2017) 3. NMT systems that operate at the sub-word level perform better than SMT systems on extremely low-frequency words, but still show weakness in translating low-frequency words belonging to highly-inflected categories (e.g. verbs).

50 NMT compared to SMT (Koehn & Knowles 2017) 4. NMT systems have lower translation quality on very long sentences, but do comparably better up to a sentence length of about 60 words.

51 NMT compared to SMT (Koehn & Knowles 2017) 5. The attention model for NMT does not always fulfill the role of a word alignment model, but may in fact dramatically diverge.

52 Conclusions. NMT is better than SMT if you have the hardware, if you have the time, and if you have the data. NMT is work in progress, a hot research topic: speeding up the learning; larger vocabularies; introducing linguistic information (part-of-speech tags, syntax trees); intelligibility (understanding what is being represented); work on low-frequency words; what about morphology?

53 Sources and references Koehn & Knowles (2017). Six challenges for Neural Machine Translation.
