Neural Machine Translation


Neural Machine Translation. Qun Liu, Peyman Passban. ADAPT Centre, Dublin City University. 29 January 2018, at DeepHack.Babel, MIPT. The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

Content Background: Machine Translation and Neural Network Transition: From Discrete Spaces to Continuous Spaces Neural Machine Translation: MT in a Continuous Space Implementing Seq2Seq models with PyTorch Conclusion 1

Background: Machine Translation and Neural Network Statistical Machine Translation (SMT) Deep Learning (DL) and Neural Network (NN) The Gap between DL and MT 2

Parallel Corpus 3

Word Alignment 4

Phrase Table 5

Decoding Process Build translation left to right Select a phrase to translate Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 6

Decoding Process Build translation left to right Select a phrase to translate Find the translation for the phrase Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 7

Decoding Process Build translation left to right Select a phrase to translate Find the translation for the phrase Add the phrase to the end of the partial translation Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 8

Decoding Process Build translation left to right Select a phrase to translate Find the translation for the phrase Add the phrase to the end of the partial translation Mark words as translated Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 9

Decoding Process One to many translation Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 10

Decoding Process Many to one translation Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 11

Decoding Process Many to one translation Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 12

Decoding Process Reordering Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 13

Decoding Process Translation finished! Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 14

Search Space for Phrase-based SMT 约翰 Yuehan 喜欢 xihuan John loves Mary 玛丽 Mali 15

Search Space for Phrase-based SMT 约翰 Yuehan 喜欢 xihuan John loves Mary 玛丽 Mali 16 The search is directed by a weighted combination of various features: translation probability, language model probability
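For reference: in standard phrase-based SMT (a textbook formulation, not necessarily the exact one shown on this slide), the decoder picks the translation e of source f that maximizes a log-linear combination of feature functions,

e* = argmax_e Σ_i λ_i · h_i(e, f),

where the h_i include, for example, the log translation probability and the log language-model probability, and the weights λ_i are tuned on held-out data.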

Background: Machine Translation and Neural Network Statistical Machine Translation (SMT) Deep Learning (DL) and Neural Network (NN) o (slides taken from Kevin Duh's presentation) The Gap between DL and MT 17

Human Neurons - Very Loose Inspiration

Perceptrons - Linear Classifiers

Logistic Regression (1-layer net) Function model: f(x) = σ(w^T x) o Parameters: vector w ∈ R^d o σ is a non-linearity, e.g. the sigmoid: σ(z) = 1/(1 + exp(−z)) o Non-linearity will be important for the expressiveness of multi-layer nets. Other non-linearities, e.g., tanh(z) = (e^z − e^{−z})/(e^z + e^{−z}) 20 Extracted from Kevin Duh's slides in DL4MT Winter School

2-layer Neural Networks Called Multilayer Perceptron (MLP), but more like multilayer logistic regression 21 Extracted from Kevin Duh's slides in DL4MT Winter School

Expressive Power of Non-linearity A deeper architecture is more expressive than a shallow one given the same number of nodes [Bishop, 1995] o 1-layer nets only model linear hyperplanes o 2-layer nets can model any continuous function (given sufficient nodes) o >3-layer nets can do so with fewer nodes 22 Extracted from Kevin Duh's slides in DL4MT Winter School

What is Deep Learning? A family of methods that uses deep architectures to learn high-level feature representations 23 Extracted from Kevin Duh's slides in DL4MT Winter School

Automatically Trained Features in FR Automatically trained features make sense! [Lee et al., 2009] Input: Images (raw pixels) Output: Features of Edges, Body Parts, Full Faces 24 Extracted from Kevin Duh's slides in DL4MT Winter School

Current models are becoming more complex 25 Extracted from Kevin Duh's slides in DL4MT Winter School

Background: Machine Translation and Neural Network Statistical Machine Translation (SMT) Deep Learning (DL) and Neural Network (NN) The Gap between DL and MT 26

The Gap between DL and MT: MT operates on discrete symbols, while DL operates on continuous vectors 27

Content Background: Machine Translation and Neural Network Transition: From Discrete Spaces to Continuous Spaces Neural Machine Translation: MT in a Continuous Space Implementing Seq2Seq models with PyTorch Conclusion 28

Transition From Discrete Space to Continuous Space Word Embedding Express a word in a continuous space Neural Language Model 29

Express a word in a continuous space David John play Mary loves like 30

Express a word in a continuous space John David Mary loves play like 31

One-Hot Vector The dimension of the vector is the vocabulary size Each dimension corresponds to a word Each word is represented as a vector such that: o the element is equal to 1 at the dimension corresponding to that word o all the other elements are equal to 0 32
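A minimal sketch of the idea in Python (the toy vocabulary below is invented for illustration, not taken from the slides):

```python
# One-hot encoding over a toy vocabulary: dimension = vocabulary size,
# a single 1 at the word's own dimension, 0 everywhere else.
vocab = ["John", "loves", "Mary", "likes", "plays"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("loves"))  # [0, 1, 0, 0, 0]
```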

One-Hot Vector: Weakness The dimension is very high (equal to the vocabulary size, often 100k or more) Very little information is carried by a one-hot vector o No syntactic information o No semantic information o No lexical information 33

Distributional Semantic Models Assumption: Words that are used and occur in the same contexts tend to purport similar meanings A typical model: Context Window: o A word is represented as the sum/average/tf-idf of the one-hot vectors appearing in the windows surrounding its every occurrence in the corpus o Effective for word similarity measurement o LSA can be used to reduce the dimension Weakness o Not compositional o Reverse Mapping is not supported 34
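A small sketch of the context-window representation described above, using an invented toy corpus and a symmetric window of one word; real systems would use much larger corpora and windows, plus the tf-idf weighting and LSA dimensionality reduction the slide mentions:

```python
# Context-window model: a word is represented by the summed one-hot vectors
# of the words that appear within a window around each of its occurrences.
from collections import Counter

corpus = "john loves mary mary loves john john plays football".split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}

def context_vector(target, window=1):
    counts = Counter()
    for pos, word in enumerate(corpus):
        if word != target:
            continue
        start, end = max(0, pos - window), min(len(corpus), pos + window + 1)
        counts.update(w for j, w in enumerate(corpus[start:end], start=start) if j != pos)
    vec = [0] * len(vocab)
    for w, c in counts.items():
        vec[index[w]] = c
    return vec

print(vocab)
print(context_vector("loves"))  # counts of words seen next to "loves"
```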

Word2Vec: Word Embedding by Neural Networks A word is represented by a dense vector (usually several hundred dimensions) The Word2Vec matrix is trained by a 2-layer neural network 35 Extracted from Christopher Moody's slides

Word2Vec: CBOW context words current word 36 http://stats.stackexchange.com/questions/177667/input-vector-representation-vs-output-vector-representation-in-word2vec

Word2Vec: Skip-gram current word context words http://stats.stackexchange.com/questions/177667/input-vector-representation-vs-output-vector-representation-in-word2vec 37
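The slides illustrate CBOW and skip-gram with figures only. As a rough illustrative sketch (not the presenters' code, and much simpler than real Word2Vec, which uses negative sampling or a hierarchical softmax), a skip-gram-style predictor can be written in PyTorch as follows; the class name, sizes, and toy word ids are all invented:

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM = 1000, 100  # assumed sizes for illustration

class SkipGram(nn.Module):
    """Predict a context word from the current (center) word."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_dim)  # the learned word vectors
        self.out = nn.Linear(embed_dim, vocab_size)          # scores over the vocabulary

    def forward(self, center_word_ids):
        return self.out(self.in_embed(center_word_ids))      # logits for context words

model = SkipGram(VOCAB_SIZE, EMBED_DIM)
loss_fn = nn.CrossEntropyLoss()
center = torch.tensor([3, 15])     # toy (center, context) id pairs
context = torch.tensor([7, 42])
loss = loss_fn(model(center), context)
loss.backward()
```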

Transition From Discrete Space to Continuous Space Word Embedding Express a word in a continuous space Neural Language Model Express a sentence in a continuous space 38

Language Models Given a sentence w_1 w_2 w_3 … w_n, a language model is: p(w_i | w_1 … w_{i−1}) N-gram Language Model: p(w_i | w_1 … w_{i−1}) ≈ p(w_i | w_{i−N+1} … w_{i−1}) (Markov chain assumption) 39

N-Gram Model A part of the parameter matrix of a bigram language model 40

N-Gram Model Normalized over all words A part of the parameter matrix of a bigram language model 41
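For concreteness, each entry of such a bigram parameter matrix is just a relative count; a toy maximum-likelihood estimate (with an invented two-sentence corpus) looks like this:

```python
# Toy maximum-likelihood bigram model.
from collections import Counter

corpus = "<s> john loves mary </s> <s> mary loves john </s>".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def bigram_prob(word, prev):
    # p(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("loves", "john"))  # 0.5 in this toy corpus
```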

Feed Forward Neural Network LM 42 [Bengio et al., 2003]

Feed Forward Neural Network LM 43 [Bengio et al., 2003]

Feed Forward Neural Network LM softmax layer Normalizes over the vocabulary Computationally intensive 44 [Bengio et al., 2003]
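A compact sketch of a Bengio-style feed-forward LM in PyTorch: the previous N−1 word embeddings are concatenated, passed through a hidden layer, and then through the expensive softmax over the whole vocabulary mentioned on the slide. The class name and sizes are illustrative assumptions, not the original [Bengio et al., 2003] code:

```python
import torch
import torch.nn as nn

class FFNNLM(nn.Module):
    """Feed-forward LM in the spirit of Bengio et al. (2003); sizes are illustrative."""
    def __init__(self, vocab_size, embed_dim=64, context_size=3, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)   # softmax layer: normalizes over the vocabulary

    def forward(self, context_ids):                # context_ids: (batch, context_size)
        e = self.embed(context_ids).flatten(1)     # concatenate the N-1 embeddings
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.out(h), dim=1)

lm = FFNNLM(vocab_size=5000)
log_probs = lm(torch.tensor([[11, 42, 7]]))        # ids of the previous three words
print(log_probs.shape)                             # torch.Size([1, 5000])
```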

Feed Forward Neural Network LM One shortcoming of the FFNN LM is that it can only take a limited length of history, just like an N-gram LM An improved NN LM, the Recurrent Neural Network LM, was proposed to solve this problem 45

Recurrent Neural Network LM 46

Recurrent Neural Network LM Unfold the RNN LM along the timeline: 47

LSTM & GRU: Improved Implementations of the RNN Mitigate gradient vanishing and exploding Capture long-distance dependencies 48
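A minimal RNN LM sketch along the same lines; swapping nn.GRU for nn.LSTM is a one-line change, and both are the gated variants the slide refers to for mitigating vanishing/exploding gradients. Names and sizes are assumptions:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """RNN language model sketch; nn.LSTM / nn.GRU are the gated variants."""
    def __init__(self, vocab_size, embed_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden, batch_first=True)   # or nn.LSTM(...)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, word_ids, hidden=None):      # word_ids: (batch, seq_len)
        emb = self.embed(word_ids)
        output, hidden = self.rnn(emb, hidden)     # one state per position, not a fixed window
        return torch.log_softmax(self.out(output), dim=-1), hidden

lm = RNNLM(vocab_size=5000)
log_probs, _ = lm(torch.tensor([[2, 11, 42, 7]]))
print(log_probs.shape)                             # torch.Size([1, 4, 5000])
```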

Language Model for Generation Given a language model p(w_i | w_1 … w_{i−1}) and a history, we can generate the next word with the highest LM score: w_t = argmax_{w_t ∈ V} p(w_t | w_1 … w_{t−1}) 49
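A tiny sketch of this greedy choice: take the argmax of the model's distribution over the vocabulary given the history. The random tensor below is only a stand-in for the log-probabilities a model such as the RNN LM sketched above would produce:

```python
import torch

# w_t = argmax over the vocabulary of p(w_t | w_1 ... w_{t-1})
log_probs = torch.log_softmax(torch.randn(1, 5000), dim=1)  # stand-in LM output
next_word_id = log_probs.argmax(dim=1).item()               # greedy next-word choice
```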

Content Background: Machine Translation and Neural Network Transition: From Discrete Spaces to Continuous Spaces Neural Machine Translation: MT in a Continuous Space Implementing Seq2Seq models with PyTorch Conclusion 50

Neural Machine Translation: MT in a Continuous Space (figure: the Chinese words 约翰, 喜欢, 玛丽 in a Chinese space and John, loves, Mary in an English space) 51

Neural Machine Translation: MT in a Continuous Space Neural Machine Translation (NMT) Attention-based NMT 52

Neural Machine Translation What it shares with SMT: o Trained with a parallel corpus o The input and output are word sequences How it differs from SMT: o A single, large neural network o All internal computation is conducted on real values, without symbols o No word alignment o No phrase table or rule table o No n-gram language model 53

Neural Machine Translation <bos> 54 https://medium.com/@felixhill/deep-consequences-fa823a588e97#.sqlkiwvho

Neural Machine Translation: MT in a Continuous Space Neural Machine Translation (NMT) Attention-based NMT 55

Weakness of the simple NMT model The only connection between the source sentence and the target sentence is the single vector representation of the source sentence It is hard for this fixed-length vector to capture the meaning of a variable-length sentence, especially when the sentence is very long When the sentence becomes longer, the translation quality drops dramatically 56

Attention-based Model: Improvements Keep the states for all words rather than the final state only Use a bi-directional RNN instead of a single-directional RNN Use an attention mechanism as a soft alignment between the source words and target words 57

Bi-directional RNN 58 https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

Bi-directional RNN The representation for the word in the context. 59 https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

Bi-directional RNN It contains the context information of the word on both sides 60 https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
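A minimal sketch of the idea in PyTorch (sizes and inputs invented): with bidirectional=True, each source position gets a forward state and a backward state, so its annotation carries context from both sides of the word, as the slide says:

```python
import torch
import torch.nn as nn

embed_dim, hidden = 32, 64
birnn = nn.GRU(embed_dim, hidden, bidirectional=True)   # forward + backward RNN

src = torch.randn(7, 1, embed_dim)       # 7 source positions, batch of 1 (stand-in embeddings)
annotations, _ = birnn(src)              # both directions concatenated per word
print(annotations.shape)                 # torch.Size([7, 1, 128]) = (positions, batch, 2*hidden)
```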

Attention for NMT 61 https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

Attention for NMT 62 https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

Attention for NMT 63 https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/

Soft Alignments by Attention Mechanism 64 https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
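A bare-bones sketch of the soft-alignment computation shown in these figures: score each source annotation against the current decoder state, softmax the scores into weights α, and take the weighted sum as the context vector. Dot-product scoring is used here for brevity; Bahdanau-style NMT uses a small additive network instead, so treat this as an illustration of the mechanism, not the exact model. All values are stand-ins:

```python
import torch
import torch.nn.functional as F

hidden = 128
annotations = torch.randn(7, hidden)     # one annotation per source word (stand-in values)
decoder_state = torch.randn(hidden)      # current decoder hidden state (stand-in)

scores = annotations @ decoder_state     # dot-product scoring, one of several options
alpha = F.softmax(scores, dim=0)         # soft alignment weights over the source words
context = alpha @ annotations            # weighted sum: the context vector fed to the decoder
print(alpha.shape, context.shape)        # torch.Size([7]) torch.Size([128])
```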

Attention-based NMT The attention-based NMT is very successful Its performance has outperformed the state of the art in SMT The attention mechanism is used in many DL tasks, such as image caption generation 65

Content Background: Machine Translation and Neural Network Transition: From Discrete Spaces to Continuous Spaces Neural Machine Translation: MT in a Continuous Space Implementing Seq2Seq models with PyTorch Conclusion 66

Implementing Seq2Seq models with PyTorch Encoder-Decoder Model 67

cat Encoding

cat Encoding context

cat sat Encoding context

cat sat on Encoding context

cat sat on the Encoding context

cat sat on the mat Encoding context

cat sat on the mat EOS Encoding (Done!) context

cat sat on the mat EOS Encoding (Done!) context Encoder

cat sat on the mat EOS gorbeh Decoding context Encoder

cat sat on the mat EOS gorbeh ruyeh Decoding context Encoder

cat sat on the mat EOS gorbeh ruyeh hasir Decoding context Encoder

cat sat on the mat EOS gorbeh ruyeh hasir neshast Decoding context Encoder

cat sat on the mat EOS gorbeh ruyeh hasir neshast EOS Decoding context Encoder

cat sat on the mat EOS gorbeh ruyeh hasir neshast EOS Decoding (Done!) context Encoder Decoder

cat sat on the mat EOS gorbeh ruyeh Attention! context α_1 α_2 α_3 α_4 α_5 α_6

cat sat on the mat EOS Attention! α_1 α_2 α_3 α_4 α_5 α_6 + context

Encoder

Encoder

Encoder

Encoder https://stackoverflow.com/questions/222877/what-does-super-do-in-python

Encoder embedding # Unique Source Words w_i

Encoder input-th embedding

Encoder input-th embedding h_(t-1) h_t output_t input h output

Encoder input h
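The encoder code itself did not survive in this transcription; only its annotations did (the embedding table sized by the number of unique source words, the input-th embedding reshaped to 1 x 1 x -1, and the GRU step from h_(t-1) to h_t and output_t). The sketch below reconstructs an encoder in the style of the standard PyTorch seq2seq tutorial that these shapes correspond to; the class and variable names are reconstructions, not necessarily the presenters' exact code:

```python
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    """Encoder sketch: embed one source word id, run one GRU step."""
    def __init__(self, input_size, hidden_size):
        super().__init__()                                       # cf. the super() link on the slide
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)   # input_size = # unique source words
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)   # the input-th embedding, shaped 1 x 1 x -1
        output, hidden = self.gru(embedded, hidden)       # h_(t-1) -> h_t, plus output_t
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size)
```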

Decoder+Attention

Decoder+Attention

Decoder+Attention Two embedding tables!?

Decoder+Attention

Decoder+Attention index (digit)

Decoder+Attention 1 x 1-1

Decoder+Attention 1 x 1-1 embedded: 1 x 1 x -1 embedded[0]: 1 x -1

Decoder+Attention 1 x 1-1 embedded: 1 x 1 x -1 embedded[0]: 1 x -1 hidden[0]: 1 x -1

Decoder+Attention decoder's state: embedded[0] ; hidden[0] ;

Decoder+Attention decoder's state: embedded[0] ; hidden[0] ;

Decoder+Attention Softmax ( ) 1 x max_length

cat sat on the mat EOS Attention! α_1 α_2 α_3 α_4 α_5 α_6 + context

Decoder+Attention 1 x max_length unsqueeze(0) 1 x 1 x max_length

Decoder+Attention 1 x max_length x embed

Decoder+Attention context: 1 x 1 x embed

Decoder+Attention Learning Phrase Representations using RNN Encoder Decoder for Statistical Machine Translation, EMNLP, 2014.

Decoder+Attention context: 1 x 1 x embed

Decoder+Attention context: 1 x 1 x embed

Decoder+Attention context: 1 x 1 x embed

Decoder+Attention context: 1 x 1 x embed

Decoder+Attention context: 1 x 1 x embed

Decoder+Attention
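Likewise, the decoder code is missing here, but the shape annotations on these slides (embedded: 1 x 1 x -1, attention weights: 1 x max_length after a softmax over the decoder's state embedded[0];hidden[0], a bmm producing a 1 x 1 x hidden context, and the Cho et al. 2014 RNN encoder-decoder reference) match the attention decoder of the standard PyTorch seq2seq tutorial; the "two embedding tables" are simply the source-side table in the encoder and the target-side table below. The following is a reconstruction in that style, not the presenters' exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderRNN(nn.Module):
    """Attention decoder sketch in the style the slides walk through."""
    def __init__(self, hidden_size, output_size, max_length=10):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)     # target-side embedding table
        self.attn = nn.Linear(hidden_size * 2, max_length)          # scores from [embedded; hidden]
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size) # mix embedding with context
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)              # 1 x 1 x -1
        # Decoder "state" for scoring: embedded[0] (1 x -1) and hidden[0] (1 x -1), concatenated.
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)  # 1 x max_length
        context = torch.bmm(attn_weights.unsqueeze(0),               # 1 x 1 x max_length
                            encoder_outputs.unsqueeze(0))            # 1 x max_length x hidden
        # context: 1 x 1 x hidden -- the weighted sum of encoder outputs.
        output = torch.cat((embedded[0], context[0]), 1)             # 1 x 2*hidden
        output = F.relu(self.attn_combine(output).unsqueeze(0))      # back to 1 x 1 x hidden
        output, hidden = self.gru(output, hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)           # scores over target vocabulary
        return output, hidden, attn_weights
```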

Putting together

Putting together

Putting together

Putting together pair: [[a, b, c], [a, b, c, d ]]

Putting together

Putting together training_pair[0]: [a, b, c] training_pair[1]: [a, b, c, d ]

Putting together training_pair[0]: [a, b, c] training_pair[1]: [a, b, c, d ]

Putting together training_pair[0]: [a, b, c] training_pair[1]: [a, b, c, d ]

Putting together

Putting together training_pair[0]: [a, b, c] training_pair[1]: [a, b, c, d ]

Putting together

cat sat on the mat EOS gorbeh ruyeh Decoding context Encoder

Putting together training_pair[0]: [a, b, c] training_pair[0][0]: [a] word embedding

Putting together init the decoder!

Putting together

Putting together

Putting together

Putting together

Putting together

Putting together
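Putting it together, these slides step through one training pair ([a, b, c] on the source side, [a, b, c, d] on the target side), encode the source word by word while collecting the encoder outputs for attention, initialize the decoder from the encoder's final hidden state ("init the decoder!"), and accumulate the loss over target words. Below is a sketch of that loop, assuming the EncoderRNN and AttnDecoderRNN sketches above; the SOS/EOS ids, sizes, and toy id sequences are placeholders, not the presenters' data:

```python
import torch
import torch.nn as nn

SOS, EOS = 0, 1
MAX_LENGTH, HIDDEN, SRC_VOCAB, TGT_VOCAB = 10, 256, 5000, 5000

encoder = EncoderRNN(SRC_VOCAB, HIDDEN)                  # from the encoder sketch above
decoder = AttnDecoderRNN(HIDDEN, TGT_VOCAB, MAX_LENGTH)  # from the decoder sketch above
criterion = nn.NLLLoss()

def train_step(training_pair):
    src_ids, tgt_ids = training_pair                   # e.g. source [a, b, c], target [a, b, c, d]
    encoder_hidden = encoder.init_hidden()
    encoder_outputs = torch.zeros(MAX_LENGTH, HIDDEN)

    # Encode the source sentence one word at a time, keeping every output for attention.
    for i, w in enumerate(src_ids):
        output, encoder_hidden = encoder(torch.tensor([w]), encoder_hidden)
        encoder_outputs[i] = output[0, 0]

    # Init the decoder: start token plus the encoder's final hidden state.
    decoder_input = torch.tensor([[SOS]])
    decoder_hidden = encoder_hidden
    loss = torch.tensor(0.0)

    # Teacher forcing: feed the gold target word at each step.
    for w in tgt_ids:
        decoder_output, decoder_hidden, _ = decoder(
            decoder_input, decoder_hidden, encoder_outputs)
        loss = loss + criterion(decoder_output, torch.tensor([w]))
        decoder_input = torch.tensor([[w]])
    return loss / len(tgt_ids)

loss = train_step(([4, 8, 15], [16, 23, 42, EOS]))     # toy id sequences
loss.backward()                                        # an optimizer.step() would follow in practice
```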

Content Background: Machine Translation and Neural Network Transition: From Discrete Spaces to Continuous Spaces Neural Machine Translation: MT in a Continuous Space Implementing Seq2Seq models with PyTorch Conclusion 134

Conclusion MT is a task defined in a discrete space In a deep learning framework, MT is converted to a task defined in a continuous space Word embeddings are used to map a word to a vector Recurrent Neural Networks are used to model word sequences The Encoder-Decoder (or Sequence-to-Sequence) model is proposed for neural machine translation The attention mechanism is used to provide soft alignment for NMT NMT has outperformed SMT and still has huge potential 135

Further topics Subword-level and character-level models o Morphologically rich languages o Out-of-vocabulary problem Multitask and multiway models o Sharing parameters among multiple MT models o Low-resource or zero-shot language pairs Pure attention models o Higher performance 136

Thanks Q&A Speaker: Qun Liu Email: qun.liu@dcu.ie Speaker: Peyman Passban Email: pe.psbn@gmail.com 137