Deep Learning. Mohammad Ali Keyvanrad Lecture 17: Neural Text Generation


OUTLINE
- Introduction
- Machine Translation
- Bidirectional LSTM
- Attention Mechanism
- Google's Multilingual NMT

12/24/2017 M.A Keyvanrad Deep Learning (Lecture17-Neural Text Generation)


Introduction
- Predominant techniques for text generation
  - Template or rule-based systems
  - Require infeasible amounts of hand-engineering
- Deep learning has recently achieved great empirical success on some text generation tasks
  - Uses end-to-end neural network models
  - An encoder model produces a hidden representation of the source text
  - A decoder model then generates the target

Introduction
- Modeling discrete sequences of text tokens
  - Given a sequence $U = (u_1, u_2, \ldots, u_S)$
- General form of the model
  - Input sequence $X$
  - Output sequence $Y$
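The sequence model above can be made concrete with a small sketch. It assumes the common autoregressive factorization $p(U) = \prod_{s} p(u_s \mid u_{<s})$, which the slide text does not spell out, and it uses a hypothetical toy bigram table (`BIGRAM`) in place of a neural network, so each token is conditioned only on the previous one.

```python
import math

# Hypothetical toy conditional distribution p(next | previous). A real model
# would compute this with a neural network conditioned on the whole prefix.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"ball": 0.5, "field": 0.5},
    "a": {"ball": 1.0},
    "ball": {"</s>": 1.0},
    "field": {"</s>": 1.0},
}


def sequence_log_prob(tokens):
    """Sum log p(u_s | u_{s-1}) over the sequence (chain rule, bigram assumption)."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens + ["</s>"]:
        log_p += math.log(BIGRAM[prev][tok])
        prev = tok
    return log_p


# p("the ball") = 0.6 * 0.5 * 1.0 under the toy table
prob = math.exp(sequence_log_prob(["the", "ball"]))
```

Working in log space avoids numerical underflow when the sequence is long.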

Introduction
- Example: machine translation
  - X might be a sentence in English
  - Y is the translated sentence in Chinese

Introduction
- Other examples

Task                  X (example)                                   Y (example)
language modeling     none (empty sequence)                         tokens from news corpus
machine translation   source sequence in English                    target sequence in French
grammar correction    noisy, ungrammatical sentence                 corrected sentence
summarization         body of news article                          headline of article
dialogue              conversation history                          next response in turn
speech transcription  audio / speech features                       text transcript
image captioning      image                                         caption describing image
question answering    supporting text + knowledge base + question   answer


Machine Translation
- The classic test of language understanding
  - Requires both language analysis and generation
- Translation is a US$40 billion a year industry
- Huge commercial use
  - Google translates over 100 billion words a day
  - Facebook
  - eBay

Machine Translation
- A naive word-based system would fail completely
  - Location of subject, verb, ...
- Historical approaches were based on probabilistic models
  - Translation model: tells us what a sentence/phrase in a source language most likely translates into
  - Language model: tells us how likely a given sentence/phrase is overall
- LSTMs can generate arbitrary output sequences after seeing the entire input
  - They can even focus on specific parts of the input automatically

Progress in Machine Translation (figure)

Neural Machine Translation
- The approach of modeling the entire MT process via one big artificial neural network
  - Sometimes we compromise this goal a little

Neural MT: The Bronze Age
- En-Es translator [Allen 1987 IEEE 1st ICNN]
  - Constructed on 31 En, 40 Es words
  - Max 10-word sentence
  - Binary encoding of words: 50 inputs, 66 outputs
  - 1 or 3 hidden 150-unit layers
  - Ave WER: 1.3 words

Neural Machine Translation
- Sequence-to-sequence (Seq2Seq) model
  - An end-to-end model made up of two recurrent neural networks (or LSTMs)
  - Encoder: takes the model's input sequence and encodes it into a fixed-size "context vector"
  - Decoder: uses the context vector as a "seed" from which to generate an output sequence
- Seq2Seq models are often referred to as "encoder-decoder models"

Neural Machine Translation
- Seq2Seq architecture: encoder
  - Reads the input sequence to the Seq2Seq model and generates a fixed-dimensional context vector C
  - The encoder uses a recurrent neural network cell, usually an LSTM, to read the input tokens
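A minimal sketch of the encoder idea follows. To stay short it swaps the LSTM cell for a plain tanh recurrent cell, and the sizes, random weights, and helper names (`make_matrix`, `matvec`, `encode`) are all hypothetical. The point it illustrates is that the context vector C has the same size no matter how long the input is.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode(token_ids, embeddings, w_xh, w_hh):
    """Read tokens left to right; return the final hidden state as context C."""
    hidden = [0.0] * len(w_hh)
    for tok in token_ids:
        x = embeddings[tok]
        pre = [a + b for a, b in zip(matvec(w_xh, x), matvec(w_hh, hidden))]
        hidden = [math.tanh(p) for p in pre]  # simple tanh cell, not an LSTM
    return hidden


rng = random.Random(0)
VOCAB, EMB, HID = 10, 4, 6  # hypothetical sizes
embeddings = make_matrix(VOCAB, EMB, rng)
w_xh = make_matrix(HID, EMB, rng)
w_hh = make_matrix(HID, HID, rng)

c_short = encode([1, 2], embeddings, w_xh, w_hh)
c_long = encode([1, 2, 3, 4, 5, 6, 7], embeddings, w_xh, w_hh)
```

Both `c_short` and `c_long` have length `HID`, which is exactly the fixed-size bottleneck that later motivates attention.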

Neural Machine Translation
- It is very difficult to compress an arbitrary-length sequence into a single fixed-size vector
  - The encoder will usually consist of stacked LSTMs
  - The final layer's LSTM hidden state is used as C [Sutskever et al. 2014]

Neural Machine Translation
- A deep recurrent neural network (figure) [Sutskever et al. 2014]

Neural Machine Translation
- Process the input sequence in reverse
  - The last thing the encoder sees will (roughly) correspond to the first thing the model outputs
  - This makes it easier for the decoder to "get started" on the output
  - Once it has the first few words translated correctly, it is much easier to go on to construct a correct sentence
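As a preprocessing step, the reversal trick can be sketched in a few lines. The function name `reverse_source` and the placeholder tokens are hypothetical; only the source side is reversed, while the target keeps its natural order.

```python
def reverse_source(pairs):
    """Reverse each source token list; leave the target untouched."""
    return [(list(reversed(src)), tgt) for src, tgt in pairs]


# Placeholder tokens stand in for real source and target words.
training_pairs = [(["a", "b", "c"], ["x", "y", "z"])]
reversed_pairs = reverse_source(training_pairs)
```

After reversal the first source token `a` is the last thing the encoder reads, so it sits right next to the first target token `x` in the unrolled network.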

Neural Machine Translation
- Seq2Seq architecture: decoder
  - The decoder is also an LSTM network
  - We run all layers of the LSTM, one after the other, following up with a softmax on the final layer's output
  - We pass the output word back into the first layer as the next input
  - Both the encoder and decoder are trained at the same time
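The decoding loop described above can be sketched as follows. This is a greedy decoder under stated assumptions: `toy_step` is a hypothetical stand-in for the stacked decoder LSTM (it just prefers the token after the previous one), and the token ids for the start and end markers are made up for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(step, context, bos_id, eos_id, max_len=20):
    """Feed each predicted token back in until the end token or max_len."""
    state, prev, output = context, bos_id, []
    for _ in range(max_len):
        logits, state = step(state, prev)
        probs = softmax(logits)
        prev = max(range(len(probs)), key=probs.__getitem__)  # most likely token
        if prev == eos_id:
            break
        output.append(prev)
    return output


def toy_step(state, prev_token, vocab_size=5):
    """Hypothetical decoder step: strongly prefers token (prev_token + 1)."""
    logits = [0.0] * vocab_size
    logits[(prev_token + 1) % vocab_size] = 5.0
    return logits, state


decoded = greedy_decode(toy_step, context=None, bos_id=0, eos_id=4)
```

Greedy selection is the simplest choice here; the slides do not say which search strategy is used, and beam search is a common alternative.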

Four big wins of Neural MT
- End-to-end training
  - All parameters are simultaneously optimized to minimize a loss function on the network's output
- Distributed representations share strength
  - Better exploitation of word and phrase similarities
- Better exploitation of context
  - NMT can use a much bigger context, both source and partial target text, to translate more accurately
- More fluent text generation
  - Deep learning text generation is much higher quality

Neural Machine Translation
- NMT aggressively rolled out by industry!
  - 2016/02: Microsoft launches deep neural network MT running offline on Android/iOS
  - 2016/08: Systran launches purely NMT model
    - One of the oldest machine translation companies, which has done extensive work for the United States Department of Defense
  - 2016/09: Google launches NMT


Bidirectional LSTM
- A word can have a dependency on another word before or after it
- A bidirectional LSTM addresses this problem
  - It traverses the sequence in both directions
  - The hidden states are concatenated to get the final context vector
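The two-direction traversal and the concatenation can be sketched as below. As a simplification, each direction is an elementwise (diagonal) tanh recurrent cell rather than an LSTM, and the scalar weights in `fwd` and `bwd` as well as the input vectors are hypothetical.

```python
import math


def run_rnn(vectors, w_x, w_h):
    """Elementwise tanh RNN; returns the hidden state at every step."""
    hidden = [0.0] * len(vectors[0])
    states = []
    for x in vectors:
        hidden = [math.tanh(w_x * xi + w_h * hi) for xi, hi in zip(x, hidden)]
        states.append(hidden)
    return states


def bidirectional_encode(vectors, fwd=(0.8, 0.5), bwd=(0.6, 0.4)):
    """Run one RNN per direction and concatenate their hidden states."""
    forward = run_rnn(vectors, *fwd)
    backward = run_rnn(vectors[::-1], *bwd)[::-1]  # re-align to input order
    per_token = [f + b for f, b in zip(forward, backward)]
    # Final state of each direction: forward ends at the last token,
    # backward ends at the first token.
    context = forward[-1] + backward[0]
    return per_token, context


inputs = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
per_token, context = bidirectional_encode(inputs)
```

The per-token states are twice the size of a single direction, and each one summarizes both the words before and the words after that position.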


Attention Mechanism
- Vanilla seq2seq and long sentences
  - Problem: fixed-dimensional representation

Attention Mechanism
- Solution: pool of source states

Attention Mechanism
- Word alignments
  - Phrase-based SMT aligned words in a preprocessing step, usually using EM

Attention Mechanism
- Learning both translation and alignment

Attention Mechanism
- Different parts of an input have different levels of significance
  - Example: "the ball is on the field"
  - "ball", "on", and "field" are the most important words
- Different parts of the output may even consider different parts of the input "important"
  - The first word of the output is usually based on the first few words of the input
  - The last word is likely based on the last few words of the input
- Attention mechanisms make use of this observation

Attention Mechanism
- Attention mechanisms
  - The decoder network looks at the entire input sequence at every decoding step
  - The decoder can then decide which input words are important at any point in time

Attention Mechanism
- Our input is a sequence of words $x_1, \ldots, x_n$ that we want to translate
- Our target sentence is a sequence of words $y_1, \ldots, y_m$
- Encoder
  - Captures a contextual representation of each word in the sentence
  - $h_1, \ldots, h_n$ are the hidden vectors representing the input sentence
  - These vectors are the output of a bi-LSTM, for instance

Attention Mechanism
- Decoder
  - We want to compute the hidden states $s_i$ of the decoder: $s_i = f(s_{i-1}, y_{i-1}, c_i)$
  - $s_{i-1}$ is the previous hidden vector
  - $y_{i-1}$ is the generated word at the previous step
  - $c_i$ is a context vector that captures the context from the original sentence
  - The context vector captures the information relevant to the i-th decoding time step, unlike the standard Seq2Seq model, in which there is only one context vector

Attention Mechanism
- For each hidden vector from the original sentence, compute a score $e_{i,j} = a(s_{i-1}, h_j)$
  - Alignment model: $a$ is any function with values in $\mathbb{R}$, for instance a single-layer fully-connected neural network
- Computing the context vector $c_i$
  - Weighted average of the hidden vectors from the original sentence: $c_i = \sum_{j=1}^{n} \alpha_{i,j} h_j$
  - The weights are a softmax over the scores: $\alpha_{i,j} = \exp(e_{i,j}) / \sum_{k=1}^{n} \exp(e_{i,k})$
  - The vector $\alpha_i$ is called the attention vector
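One decoding step of this computation can be sketched as follows: score every encoder state, normalize the scores with a softmax to get the attention vector, then take the weighted average. The slide allows any real-valued alignment function; a dot product (`dot_score`) is assumed here purely to keep the example short, and the vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_score(s_prev, h_j):
    """Assumed alignment function a(s_{i-1}, h_j): a plain dot product."""
    return sum(a * b for a, b in zip(s_prev, h_j))


def attention_context(s_prev, encoder_states, score=dot_score):
    """Return the attention vector alpha_i and the context vector c_i."""
    scores = [score(s_prev, h) for h in encoder_states]
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(a * h[d] for a, h in zip(alphas, encoder_states)) for d in range(dim)
    ]
    return alphas, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # h_1..h_3 (toy values)
s_prev = [2.0, 0.0]                                    # s_{i-1} (toy value)
alphas, context = attention_context(s_prev, encoder_states)
```

With these toy values the first and third encoder states receive equal weight and the second receives less, so the context vector leans toward the first dimension.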

Attention Mechanism
- Graphical illustration of the proposed model (figure)
  - Generating the t-th target word $y_t$ given a source sentence $(x_1, x_2, \ldots, x_T)$

Attention Mechanism
- Attention vector for machine translation, English to French (figure)
  - Each pixel shows the weight $\alpha_{ij}$ of the annotation of the j-th source word for the i-th target word

Attention Mechanism
- Alignment model
  - Needs to be evaluated $T_x \times T_y$ times for each sentence
  - To reduce computation, we use a single-layer multilayer perceptron
  - Figure: hidden layer of the perceptron producing the score $e_{t,1}$
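A single-hidden-layer alignment model of this kind can be sketched in the additive form used by Bahdanau et al., $e = v_a^\top \tanh(W_a s + U_a h)$. The weight values, dimensions, and names below are hypothetical; the sketch only shows the shape of the computation and that it runs once per (target step, source position) pair.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_scores(s_prev, encoder_states, w_a, u_a, v_a):
    """Score every encoder state against one decoder state."""
    ws = matvec(w_a, s_prev)  # shared across all source positions
    scores = []
    for h in encoder_states:
        uh = matvec(u_a, h)
        hidden = [math.tanh(a + b) for a, b in zip(ws, uh)]  # the hidden layer
        scores.append(sum(v * x for v, x in zip(v_a, hidden)))
    return scores


# Hypothetical weights: hidden layer of 3 units, 2-dimensional states.
w_a = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
u_a = [[0.3, 0.1], [-0.2, 0.4], [0.2, 0.2]]
v_a = [1.0, -1.0, 0.5]

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # T_x = 4
decoder_states = [[0.2, 0.1], [0.0, -0.3]]                         # T_y = 2

score_matrix = [
    additive_scores(s, encoder_states, w_a, u_a, v_a) for s in decoder_states
]
```

Because `matvec(u_a, h)` does not depend on the decoder state, it could be computed once per sentence and reused at every target step, which keeps the per-pair cost small.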

Attention Mechanism
- Global vs. local attention
  - Local attention avoids focusing on everything at each time step
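The difference can be sketched by restricting the softmax to a window of source positions. This is a simplified illustration, not the full local attention of Luong et al.: the window center is passed in by the caller rather than predicted, there is no Gaussian weighting, and the function name and scores are hypothetical.

```python
import math


def local_attention_weights(scores, center, half_width):
    """Softmax over positions in [center - D, center + D]; zero elsewhere."""
    lo = max(0, center - half_width)
    hi = min(len(scores), center + half_width + 1)
    window = scores[lo:hi]
    m = max(window)
    exps = [math.exp(s - m) for s in window]
    total = sum(exps)
    weights = [0.0] * len(scores)
    for offset, e in enumerate(exps):
        weights[lo + offset] = e / total
    return weights


scores = [0.1, 2.0, 0.3, 1.5, 0.2, 0.9]
weights = local_attention_weights(scores, center=3, half_width=1)
```

Global attention corresponds to a window that covers the whole sentence; the local variant only touches `2 * half_width + 1` source states per target step.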

Attention Mechanism
- The major advantage of attention-based models is their ability to efficiently translate long sentences [Minh-Thang Luong, 2015]


Google's Multilingual NMT
- State-of-the-art in Neural Machine Translation (NMT): bilingual (figure)

Google's Multilingual NMT
- State-of-the-art in Neural Machine Translation (NMT): multilingual (figure)

Google's Multilingual NMT
- Google's Multilingual NMT System
  - Simplicity: single model
  - Low-resource language improvements
  - Zero-shot translation
    - Translate between language pairs it has never seen in this combination
    - Train: Portuguese -> English + English -> Spanish
    - Test: Portuguese -> Spanish

Google's Multilingual NMT
- Architecture (figure)

Google's Multilingual NMT
- A token at the beginning of the input sentence indicates the target language
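On the data side this amounts to a one-line preprocessing step, sketched below. The `<2es>` token format follows the examples in the Johnson et al. paper listed in the references; the helper name `add_target_token` is hypothetical.

```python
def add_target_token(source_sentence, target_lang):
    """Prepend a token such as <2es> telling the model which language to produce."""
    return f"<2{target_lang}> {source_sentence}"


tagged = add_target_token("Hello, how are you?", "es")
```

Because the language choice lives in the input rather than in the architecture, the same model and the same parameters can serve every language pair.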

Dealing with the large output vocabulary
- NMT systems have a hard time dealing with a large vocabulary
  - The softmax can be quite expensive to compute
- Scaling softmax
  - Hierarchical softmax
- Reducing vocabulary
  - Simply limit the vocabulary size to a small number and replace words outside the vocabulary with the tag <UNK>
- Handling unknown words
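The vocabulary-reduction option can be sketched as follows: keep the most frequent words and map everything else to `<UNK>`. The toy corpus, the size limit, and the helper names are hypothetical.

```python
from collections import Counter

UNK = "<UNK>"


def build_vocab(sentences, max_size):
    """Keep the max_size most frequent tokens across all sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}


def replace_unknown(sentence, vocab):
    """Map every out-of-vocabulary token to the <UNK> tag."""
    return [tok if tok in vocab else UNK for tok in sentence]


corpus = [
    ["the", "ball", "is", "on", "the", "field"],
    ["the", "ball", "is", "red"],
]
vocab = build_vocab(corpus, max_size=3)
cleaned = replace_unknown(corpus[0], vocab)
```

The softmax then only needs `max_size + 1` outputs, at the cost of losing every rare word, which is why handling unknown words is listed as a separate problem.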

References
- Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
- Luong, Thang, Hieu Pham, and Chris Manning. "Effective approaches to attention-based neural machine translation." EMNLP 2015.
- Johnson, Melvin, et al. "Google's multilingual neural machine translation system: Enabling zero-shot translation." arXiv preprint arXiv:1611.04558 (2016).
- Wu, Yonghui, et al. "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv preprint arXiv:1609.08144 (2016).

