Neural Machine Translation

Neural Machine Translation Philipp Koehn 12 October 2017

Language Models 1 Modeling variants: feed-forward neural network, recurrent neural network, long short-term memory neural network. May include input context.

Feed Forward Neural Language Model 2 [Figure: words 1-4 are each mapped by the embedding matrix C, feed a hidden layer, and predict word 5]
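
To make slide 2 concrete, here is a minimal numpy sketch (not from the lecture; all sizes and parameter names are made up) of one forward pass of a feed-forward language model: four context words are looked up in a shared embedding matrix C, concatenated, passed through a hidden layer, and turned into a distribution over the fifth word.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, hidden_dim, context = 1000, 32, 64, 4   # made-up sizes

# Illustrative parameters, randomly initialized for the sketch.
C = rng.normal(size=(vocab_size, emb_dim))               # shared embedding matrix C
W_h = rng.normal(size=(context * emb_dim, hidden_dim))   # hidden layer weights
b_h = np.zeros(hidden_dim)
W_o = rng.normal(size=(hidden_dim, vocab_size))          # output projection
b_o = np.zeros(vocab_size)

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def predict_word5(context_ids):
    """Probability distribution over word 5, given the ids of words 1-4."""
    x = C[context_ids].reshape(-1)      # look up and concatenate four embeddings
    h = np.tanh(x @ W_h + b_h)          # hidden layer
    return softmax(h @ W_o + b_o)       # distribution over the vocabulary

p = predict_word5(np.array([12, 7, 430, 99]))   # four made-up word ids
print(p.argmax(), round(float(p.max()), 4))
```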

Recurrent Neural Language Model 3 Predict the first word of a sentence: the given word <s> is mapped to an embedding, which feeds a hidden state, from which the first word is predicted. Same as before, just drawn top-down.

Recurrent Neural Language Model 4 Predict the second word of a sentence: the predicted first word (house) becomes the given word, and the hidden state from the first word prediction is re-used.

Recurrent Neural Language Model 5 Predict the third word of a sentence (given words so far: <s> house; predicted words so far: house is)... and so on.

Recurrent Neural Language Model 6 The full sequence: from the given words <s> house is big . the model predicts house is big . </s>
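
As a hedged companion to slides 3-6 (my own sketch, with illustrative names and sizes), one step of a recurrent language model consumes the embedding of the previous word together with the previous hidden state; the hidden state is re-used from word to word.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, emb_dim, hidden_dim = 1000, 32, 64   # made-up sizes

E   = rng.normal(size=(vocab_size, emb_dim))     # word embeddings
W_e = rng.normal(size=(emb_dim, hidden_dim))     # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
W_o = rng.normal(size=(hidden_dim, vocab_size))  # hidden-to-output weights

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def rnn_lm_step(prev_word_id, h_prev):
    """One recurrence: consume the previous word, predict the next one."""
    h = np.tanh(E[prev_word_id] @ W_e + h_prev @ W_h)
    return h, softmax(h @ W_o)

# Run the model over a short sentence, re-using the hidden state at each step.
h = np.zeros(hidden_dim)
for word_id in [2, 15, 7, 42]:       # made-up ids standing in for <s> house is big
    h, p = rnn_lm_step(word_id, h)
    print(p.argmax())
```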

Recurrent Neural Translation Model 7 We predicted the words of a sentence. Why not also predict their translations?

Encoder-Decoder Model 8 The same recurrent network reads the source sentence and then keeps predicting its translation: given words <s> house is big . </s> das Haus ist groß ., predicted words house is big . </s> das Haus ist groß . </s>. Obviously madness. Proposed by Google (Sutskever et al. 2014).

What is missing? 9 Alignment of input words to output words Solution: attention mechanism

10 neural translation model with attention

Input Encoding 11 Inspiration: run a recurrent neural network language model on the input side (given word, embedding, hidden state, predicted word).

Hidden Language Model States 12 This gives us hidden states H1 H2 H3 H4 H5 H6, which encode the left context for each word. The same process in reverse yields right-context states Ĥ1 Ĥ2 Ĥ3 Ĥ4 Ĥ5 Ĥ6 for each word.

Input Encoder 13 Input word embeddings feed a left-to-right and a right-to-left recurrent NN. Input encoder: concatenate the bidirectional RNN states, so each word representation includes full left and right sentence context.

Encoder: Math 14 Input is a sequence of words $x_j$, mapped into embedding space $\bar{E} x_j$. Bidirectional recurrent neural networks: $\overleftarrow{h}_j = f(\overleftarrow{h}_{j+1}, \bar{E} x_j)$ and $\overrightarrow{h}_j = f(\overrightarrow{h}_{j-1}, \bar{E} x_j)$. Various choices for the function $f()$: feed-forward layer, GRU, LSTM, ...
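
The following numpy sketch mirrors the encoder math above, using a plain tanh recurrence as one possible choice for $f()$ (the slide also allows GRU or LSTM); the parameter names and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, emb_dim, hidden_dim = 1000, 32, 64   # made-up sizes

E_bar = rng.normal(size=(vocab_size, emb_dim))   # embedding matrix Ē
W_x   = rng.normal(size=(emb_dim, hidden_dim))
W_h   = rng.normal(size=(hidden_dim, hidden_dim))

def f(h_prev, x_emb):
    """Recurrence f(); here a simple tanh layer instead of GRU/LSTM."""
    return np.tanh(x_emb @ W_x + h_prev @ W_h)

def encode(word_ids):
    """One state per word: concatenation of left-to-right and right-to-left states."""
    embs = E_bar[word_ids]
    n = len(word_ids)
    fwd = [np.zeros(hidden_dim)]          # left-to-right states (right-arrow h)
    for j in range(n):
        fwd.append(f(fwd[-1], embs[j]))
    bwd = [np.zeros(hidden_dim)]          # right-to-left states (left-arrow h)
    for j in reversed(range(n)):
        bwd.append(f(bwd[-1], embs[j]))
    bwd = bwd[1:][::-1]                   # align right-to-left states with positions
    return [np.concatenate([fwd[j + 1], bwd[j]]) for j in range(n)]

states = encode(np.array([5, 17, 42, 3]))
print(len(states), states[0].shape)       # 4 positions, each of size 2 * hidden_dim
```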

Decoder 15 We want to have a recurrent neural network predicting output words.

Decoder 16 We want to have a recurrent neural network predicting output words. We feed decisions on output words back into the decoder state.

Decoder 17 We want to have a recurrent neural network predicting output words. We feed decisions on output words back into the decoder state. The decoder state is also informed by the input context.

More Detail 18 The decoder is also a recurrent neural network over a sequence of hidden states $s_i$, with input context $c_i$: $s_i = f(s_{i-1}, E y_{i-1}, c_i)$. Again, various choices for the function $f()$: feed-forward layer, GRU, LSTM, ... The output word $y_i$ is selected by computing a vector $t_i$ (same size as the vocabulary), $t_i = W(U s_{i-1} + V E y_{i-1} + C c_i)$, and finding the highest value in $t_i$. If we normalize $t_i$, we can view it as a probability distribution over words. $E y_i$ is the embedding of output word $y_i$.
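
A sketch of a single decoder step following the two formulas above, again with a tanh recurrence standing in for $f()$ and with illustrative parameter shapes (none of these variable names come from the lecture).

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, emb_dim, hidden_dim, ctx_dim = 1000, 32, 64, 128   # made-up sizes

E = rng.normal(size=(vocab_size, emb_dim))       # output word embeddings E
# Parameters of the recurrence f() (a plain tanh layer in this sketch).
Ws = rng.normal(size=(hidden_dim, hidden_dim))
We = rng.normal(size=(emb_dim, hidden_dim))
Wc = rng.normal(size=(ctx_dim, hidden_dim))
# Parameters of the word prediction t_i = W (U s_{i-1} + V E y_{i-1} + C c_i).
U = rng.normal(size=(hidden_dim, hidden_dim))
V = rng.normal(size=(emb_dim, hidden_dim))
C = rng.normal(size=(ctx_dim, hidden_dim))
W = rng.normal(size=(hidden_dim, vocab_size))

def decoder_step(s_prev, y_prev, c_i):
    """One decoder step: new state s_i and word prediction vector t_i."""
    s_i = np.tanh(s_prev @ Ws + E[y_prev] @ We + c_i @ Wc)
    t_i = (s_prev @ U + E[y_prev] @ V + c_i @ C) @ W
    return s_i, t_i

s_i, t_i = decoder_step(np.zeros(hidden_dim), 7, np.zeros(ctx_dim))
y_i = int(t_i.argmax())   # highest value in t_i; softmax(t_i) gives probabilities
```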

Attention 19 Given what we have generated so far (the decoder hidden state), which words in the input should we pay attention to (the encoder states)?

Attention 20 Given: the previous hidden state of the decoder $s_{i-1}$ and the representation of input words $h_j = (\overleftarrow{h}_j, \overrightarrow{h}_j)$. Predict an alignment probability $a(s_{i-1}, h_j)$ for each input word $j$ (modeled with a feed-forward neural network layer).

Attention 21 Normalize attention (softmax): $\alpha_{ij} = \frac{\exp(a(s_{i-1}, h_j))}{\sum_k \exp(a(s_{i-1}, h_k))}$. Relevant input context: weigh the input words according to attention: $c_i = \sum_j \alpha_{ij} h_j$.
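
A minimal sketch of this attention computation; the concrete form of the alignment model $a()$ is an assumption (a one-hidden-layer feed-forward scorer in the style of Bahdanau et al.), since the slide only says it is a feed-forward neural network layer.

```python
import numpy as np

rng = np.random.default_rng(4)
hidden_dim, enc_dim, att_dim = 64, 128, 32   # made-up sizes

# Feed-forward alignment model a(s_{i-1}, h_j) = va . tanh(Wa s_{i-1} + Ua h_j)
Wa = rng.normal(size=(hidden_dim, att_dim))
Ua = rng.normal(size=(enc_dim, att_dim))
va = rng.normal(size=(att_dim,))

def attention(s_prev, encoder_states):
    """Return the attention weights alpha_ij and the input context c_i."""
    scores = np.array([va @ np.tanh(s_prev @ Wa + h_j @ Ua)
                       for h_j in encoder_states])
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                    # softmax over input positions
    c_i = sum(a * h_j for a, h_j in zip(alphas, encoder_states))
    return alphas, c_i

encoder_states = [rng.normal(size=enc_dim) for _ in range(5)]
alphas, c_i = attention(np.zeros(hidden_dim), encoder_states)
print(alphas.round(2), c_i.shape)
```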

Attention 22 Use the input context to predict the next hidden state and output word.

Encoder-Decoder with Attention 23 [Figure: the full architecture: input word embeddings, left-to-right and right-to-left recurrent NNs, attention, input context, hidden state, output words]

24 training

Computation Graph 25 The math behind neural machine translation defines a computation graph. Forward and backward computation over this graph yields the gradients needed for model training. [Figure: a small example graph: x and W1 feed a product node, b1 is added, then a sigmoid; the result and W2 feed another product, b2 is added, then another sigmoid]
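
As an illustration of forward and backward computation on the small graph in the figure (two linear-plus-sigmoid layers, with a squared-error loss added on top purely for the example), here is a hand-derived sketch; a real toolkit constructs and differentiates such graphs automatically.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)
target = np.array([1.0, 0.0])

# Forward pass: x, W1 -> prod -> sum (b1) -> sigmoid -> prod (W2) -> sum (b2) -> sigmoid
h = sigmoid(x @ W1 + b1)
y = sigmoid(h @ W2 + b2)
loss = 0.5 * np.sum((y - target) ** 2)    # squared error, added for the example

# Backward pass: apply the chain rule node by node to get the gradients.
d_z2 = (y - target) * y * (1 - y)         # through the loss and the second sigmoid
dW2, db2 = np.outer(h, d_z2), d_z2
d_z1 = (W2 @ d_z2) * h * (1 - h)          # through the second prod and first sigmoid
dW1, db1 = np.outer(x, d_z1), d_z1

# One gradient-descent update on every parameter.
for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
    p -= 0.1 * g
```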

Problem: Recurrent Neural Networks 26 RNNs imply a dynamically sized graph: the size of the graph depends on the length of the input and output sentence.

Unrolling RNNs 27 For a given training example, the lengths of the input and output sentences are known, so the entire computation graph can be built out (input word embeddings, left-to-right and right-to-left recurrent NNs).

Fully Computed Graph 28 [Figure: the fully unrolled graph: input word embeddings, left-to-right and right-to-left recurrent NNs, attention, input context, hidden state, predicted output words, and the error against the given output words]

Update from Word 1 29 [Figure: the error at the first output word is propagated back through the graph to update the parameters]

Update from Word 2 30 [Figure: the same update, driven by the error at the second output word]

Update from Word 3 31 [Figure: the same update, driven by the error at the third output word]

Batching 32 There is already a large degree of parallelism: most computations are on vectors and matrices, with efficient implementations for CPU and GPU. Further parallelism comes from batching: processing several sentence pairs at once turns scalar operations into vector operations, vector operations into matrix operations, and matrix operations into 3d tensor operations. Typical batch sizes are 50-100 sentence pairs.

Batches 33 Sentences have different lengths. When batching, unneeded cells in the tensors have to be filled up (padding), which wastes a lot of computation.

Mini-Batches 34 Sort sentences by length and break them up into mini-batches. Example: maxi-batch of 1600 sentence pairs, mini-batches of 80 sentence pairs.
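
A hedged sketch of this maxi-batch / mini-batch scheme: shuffle, take a large chunk of sentence pairs, sort it by source length so similarly sized sentences land together, slice it into mini-batches, and pad only within each mini-batch (function and variable names are my own).

```python
import random

def minibatches(sentence_pairs, maxi_size=1600, mini_size=80, pad_id=0):
    """Yield padded mini-batches; sorting by length keeps padding waste small."""
    random.shuffle(sentence_pairs)                       # shuffle the corpus
    for start in range(0, len(sentence_pairs), maxi_size):
        maxi = sentence_pairs[start:start + maxi_size]   # one maxi-batch
        maxi.sort(key=lambda pair: len(pair[0]))         # sort by source length
        for i in range(0, len(maxi), mini_size):
            mini = maxi[i:i + mini_size]
            src_len = max(len(s) for s, _ in mini)
            tgt_len = max(len(t) for _, t in mini)
            yield ([s + [pad_id] * (src_len - len(s)) for s, _ in mini],
                   [t + [pad_id] * (tgt_len - len(t)) for _, t in mini])

# Toy corpus of (source, target) word-id sequences.
corpus = [([1] * random.randint(3, 20), [2] * random.randint(3, 20))
          for _ in range(5000)]
for src_batch, tgt_batch in minibatches(corpus):
    pass   # one forward/backward pass and parameter update per mini-batch
```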

Overall Organization of Training 35 Shuffle the corpus, break it into maxi-batches, break up each maxi-batch into mini-batches, process each mini-batch and update the parameters; once done, repeat. Typically 5-15 epochs are needed (passes through the entire training corpus).

36 inference

Inference 37 Given a trained model, we now want to translate test sentences. We only need to execute the forward step in the computation graph.
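
The simplest inference procedure is greedy decoding: run forward steps and always feed the single best word back in. The sketch below uses a random stand-in for the real forward step (the actual model would be the attention and decoder computation above), so only the control flow is meaningful; all ids and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(6)
vocab_size, hidden_dim, BOS, EOS = 1000, 64, 2, 3   # made-up sizes and word ids

def forward_step(state, prev_word):
    """Stand-in for one decoder forward step (state update + word scores);
    in the real model this is the attention + decoder computation above."""
    state = np.tanh(state + rng.normal(size=hidden_dim) * 0.1)
    scores = rng.normal(size=vocab_size)
    return state, scores

def greedy_decode(max_len=50):
    """Translate by always selecting the single highest-scoring next word."""
    state, word, output = np.zeros(hidden_dim), BOS, []
    for _ in range(max_len):
        state, scores = forward_step(state, word)
        word = int(scores.argmax())       # select the best word
        if word == EOS:
            break
        output.append(word)               # this word is fed back in the next step
    return output

print(greedy_decode())
```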

Word Prediction 38 [Figure: one decoder step with context ci, state si, word prediction ti, selected word yi, and embedding Eyi; the word prediction scores candidate words such as cat, this, of, fish, re, dog, se]

Selected Word 39 [Figure: the same decoder step, highlighting the selected word yi]

Embedding 40 [Figure: the same decoder step, highlighting the embedding Eyi of the selected word, which feeds the next step]

Distribution of Word Predictions 41 [Figure: probability distribution over candidate next words, e.g. cat, this, of, fish, re, dog, se]

Select Best Word 42 [Figure: the highest-probability word is selected]

Select Second Best Word 43 [Figure: the second-best word (this) is kept as well]

Select Third Best Word 44 [Figure: the third-best word (se) is kept as well]

Use Selected Word for Next Predictions 45 [Figure: each selected word is fed back into the decoder to predict its next word]

Select Best Continuation 46 [Figure: the best continuation of each hypothesis is selected]

Select Next Best Continuations 47 [Figure: further high-scoring continuations (e.g. cat, cats, dog) are kept as well]

Continue... 48 [Figure: the process continues, extending each hypothesis word by word]

Beam Search 49 [Figure: the search graph starts at <s> and extends a fixed number of hypotheses per step until </s> is reached]

Best Paths 50 [Figure: the highest-scoring complete paths through the search graph]

Beam Search Details 51 Normalize score by length No recombination (paths cannot be merged)
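
A compact sketch of beam search with these details (length-normalized final scores, no recombination); as in the greedy sketch, forward_step is a random stand-in for the real model and all ids and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(7)
vocab_size, hidden_dim, BOS, EOS = 1000, 64, 2, 3   # made-up ids and sizes

def forward_step(state, prev_word):
    """Stand-in for the model: returns a new state and log-probabilities."""
    state = np.tanh(state + rng.normal(size=hidden_dim) * 0.1)
    logits = rng.normal(size=vocab_size)
    return state, logits - np.log(np.exp(logits).sum())   # log softmax

def beam_search(beam_size=5, max_len=50):
    # Each hypothesis: (sum of word log-probs, word sequence, decoder state).
    beam = [(0.0, [BOS], np.zeros(hidden_dim))]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, words, state in beam:
            new_state, logp = forward_step(state, words[-1])
            for w in np.argsort(logp)[-beam_size:]:       # top continuations only
                candidates.append((score + logp[w], words + [int(w)], new_state))
        # Keep the best hypotheses; no recombination (paths are never merged).
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates[:beam_size]:
            (finished if cand[1][-1] == EOS else beam).append(cand)
        if not beam:
            break
    # Normalize scores by length before picking the final translation.
    pool = finished or beam
    return max(pool, key=lambda c: c[0] / len(c[1]))[1]

print(beam_search())
```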

Output Word Predictions 52
Input Sentence: ich glaube aber auch, er ist clever genug um seine Aussagen vage genug zu halten, so dass sie auf verschiedene Art und Weise interpretiert werden können.

Best | Alternatives
but (42.1%) | however (25.3%), I (20.4%), yet (1.9%), and (0.8%), nor (0.8%), ...
I (80.4%) | also (6.0%), , (4.7%), it (1.2%), in (0.7%), nor (0.5%), he (0.4%), ...
also (85.2%) | think (4.2%), do (3.1%), believe (2.9%), , (0.8%), too (0.5%), ...
believe (68.4%) | think (28.6%), feel (1.6%), do (0.8%), ...
he (90.4%) | that (6.7%), it (2.2%), him (0.2%), ...
is (74.7%) | s (24.4%), has (0.3%), was (0.1%), ...
clever (99.1%) | smart (0.6%), ...
enough (99.9%) |
to (95.5%) | about (1.2%), for (1.1%), in (1.0%), of (0.3%), around (0.1%), ...
keep (69.8%) | maintain (4.5%), hold (4.4%), be (4.2%), have (1.1%), make (1.0%), ...
his (86.2%) | its (2.1%), statements (1.5%), what (1.0%), out (0.6%), (0.6%), ...
statements (91.9%) | testimony (1.5%), messages (0.7%), comments (0.6%), ...
vague (96.2%) | v@@ (1.2%), in (0.6%), ambiguous (0.3%), ...
enough (98.9%) | and (0.2%), ...
so (51.1%) | , (44.3%), to (1.2%), in (0.6%), and (0.5%), just (0.2%), that (0.2%), ...
they (55.2%) | that (35.3%), it (2.5%), can (1.6%), you (0.8%), we (0.4%), to (0.3%), ...
can (93.2%) | may (2.7%), could (1.6%), are (0.8%), will (0.6%), might (0.5%), ...
be (98.4%) | have (0.3%), interpret (0.2%), get (0.2%), ...
interpreted (99.1%) | interpre@@ (0.1%), constru@@ (0.1%), ...
in (96.5%) | on (0.9%), differently (0.5%), as (0.3%), to (0.2%), for (0.2%), by (0.1%), ...
different (41.5%) | a (25.2%), various (22.7%), several (3.6%), ways (2.4%), some (1.7%), ...
ways (99.3%) | way (0.2%), manner (0.2%), ...
. (99.2%) | </S> (0.2%), , (0.1%), ...
</s> (100.0%) |