Supervised Learning with Neural Networks and Machine Translation with LSTMs. Ilya Sutskever, in collaboration with Minh-Thang Luong, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. Google Brain.
Deep Neural Networks: (1) can perform an astonishingly wide range of computations; (2) can be learned automatically. [Diagram labels: powerful models, learnable models, deep neural networks.]
Powerful models are necessary: A weak model will never get good performance. Examples of weak models: single-layer logistic regression, linear SVMs, small neural nets, small conv nets. A neural network needs to be large and deep to be powerful.
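Why are deep nets powerful? A single neuron can implement boolean logic, and thus arbitrary computation. [Diagram: threshold neurons for OR (input weights +1, +1; bias -0.5), AND (input weights +1, +1; bias -1.5), and NOT (input weight -1; bias +0.5).]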
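The gates on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming a hard-threshold neuron that fires when its weighted sum plus bias is positive; the function names and the XOR composition are illustrative and not from the talk.

```python
def neuron(weights, bias, inputs):
    """Threshold unit: outputs 1 when the weighted sum plus bias is positive."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

def OR(a, b):
    return neuron([1, 1], -0.5, [a, b])

def AND(a, b):
    return neuron([1, 1], -1.5, [a, b])

def NOT(a):
    return neuron([-1], 0.5, [a])

def XOR(a, b):
    # Illustrative composition: (a OR b) AND NOT (a AND b), built only from neurons.
    return AND(OR(a, b), NOT(AND(a, b)))

bits = [(0, 0), (0, 1), (1, 0), (1, 1)]
or_table = [OR(a, b) for a, b in bits]
and_table = [AND(a, b) for a, b in bits]
xor_table = [XOR(a, b) for a, b in bits]
```

Because AND, OR, and NOT are enough to build any boolean circuit, stacking such neurons gives arbitrary boolean functions, as the XOR composition suggests.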
Why are deep nets powerful? A single neuron can implement boolean logic, and thus general computation and computers. A mid-sized neural network with 2 hidden layers can sort N N-bit numbers. Intuitively, sorting requires log N parallel steps. Backpropagation can find this circuit. Neurons are more economical than boolean logic.
The Deep Learning Hypothesis: Human perception is fast. Neurons fire at most 100 times a second, and humans solve perception in 0.1 seconds, so our neurons fire 10 times, at most. Anything a human can do in 0.1 seconds, a big 10-layer neural network can do, too! 20-layer neural networks can be trained well in practice; two years ago we could only train 10-layer networks.
Implication: DNNs, once trained, should do well on all perception problems: vision, speech, emotion, face recognition, instinct. If there exists a human expert who can solve hard problems in a fraction of a second, then large deep neural networks could do so, too: instantaneous translation, speed reading, identifying the obvious thing to do in a complicated situation or a game.
Learning: Powerful models are useless unless we can train them. Supervised backpropagation works! It is not clear why 20-layer neural nets are easily trainable with backprop.
Learning Algorithm: While not done: pick an example (x, y); run the network on x to get a prediction p; use the gradient to bring p slightly closer to y. Theory must make nontrivial assumptions about the dataset.
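The loop on this slide can be sketched in plain Python. The example below is an illustration only: it assumes a single logistic unit as the "network", a cross-entropy loss, and a toy OR dataset, none of which come from the talk; the step count and learning rate are arbitrary choices.

```python
import math
import random

def predict(w, b, x):
    """Logistic unit: probability that the label is 1."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def sgd(data, steps=5000, lr=0.5, seed=0):
    rng = random.Random(seed)
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(steps):           # "while not done" (here: a fixed step budget)
        x, y = rng.choice(data)      # pick an example (x, y)
        p = predict(w, b, x)         # run the network on x to get a prediction p
        err = p - y                  # cross-entropy gradient w.r.t. the pre-activation
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        b -= lr * err                # small step that brings p closer to y
    return w, b

# Toy dataset (hypothetical): the OR function.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = sgd(data)
predictions = [round(predict(w, b, x)) for x, _ in data]
```

A deep network replaces `predict` and the hand-written gradient with a forward pass and backpropagation, but the outer loop keeps the same shape.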
How to solve hard problems? Use a lot of good AND labelled training data. Use a big deep neural network. Success is the only possible outcome.
The deep learning hypothesis is true! This is not a controversial statement. Big deep nets get the best results ever on speech recognition, object recognition, and language modelling.
Summary: Big deep nets with 10-20 layers can do great things. Supervised backpropagation can train 10-20-layer nets. Ergo: we can do a whole lot if we have a large, good supervised training set.
Deep nets can't solve all problems???
Deep nets can't solve all problems: DNNs couldn't solve problems where the input and the output are very structured, such as sequence to sequence or graph to graph. So far, we've addressed the most basic form of supervised learning.
Key limitation: Inputs and outputs must be fixed-sized vectors. Great for images: the input is a big image of a fixed size and the output is a 1-of-N encoding of the category. Bad news for machine translation, question answering, squiggle recognition. The enemy: unit-specific connections. [Diagram labels: input, output.]
Our contribution: solving the sequence to sequence problem. It's a fundamental capability. Applications: MT, Q&A, ASR, squiggle recognition. Nice feature: our approach has minimal innovation. We demonstrate that the approach is viable by matching state-of-the-art results on machine translation. State-of-the-art in MT is strong, so the approach should do well on many other tasks too.
Recurrent Neural Networks (RNNs): RNNs can work with sequences. Key idea: each timestep has a layer with the same weights. Problem is solved. [Diagram: an RNN unrolled over time for t=1 to t=6, with an input (inp), hidden (hid), and output (out) layer at each timestep.]
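The key idea, one layer per timestep with shared weights, can be sketched as a forward pass in plain Python. This is a minimal sketch of a vanilla tanh RNN without biases; the weight names and the tiny hand-picked values are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(inputs, W_xh, W_hh, W_hy, h0):
    """Run a vanilla RNN; the same three weight matrices are reused at every timestep."""
    h, outputs = h0, []
    for x in inputs:                                   # one "layer" per timestep
        pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h))]
        h = [math.tanh(p) for p in pre]                # new hidden state overwrites the old one
        outputs.append(matvec(W_hy, h))
    return outputs, h

# Hypothetical sizes: 2 inputs, 2 hidden units, 1 output.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [-0.4, 0.3]]
W_hy = [[1.0, -1.0]]
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outs, h_last = rnn_forward(seq, W_xh, W_hh, W_hy, [0.0, 0.0])
outs_long, _ = rnn_forward(seq * 2, W_xh, W_hh, W_hy, [0.0, 0.0])
```

Because no weight is tied to a particular position, the same function handles sequences of any length, although it still emits exactly one output per input.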
Recurrent Neural Networks (RNNs): They are neural networks that can process sequences well. Very expressive models. Backpropagation is applicable. Fun fact: recurrent neural networks were trained in the original backpropagation paper in 1986. They have trouble learning long-term dependencies: the vanishing gradient problem (Hochreiter 1991; Bengio et al., 1994). There are ways to train RNNs, but they are complicated.
Long Short-Term Memory (LSTM): Modify / hack the RNN architecture so that the vanishing gradient problem goes away. Do so without sacrificing expressive power. A model that achieves this purpose will be useful.
Long Short-Term Memory (LSTM): RNNs overwrite the hidden state; LSTMs add to the hidden state. Addition has nice gradients: all terms in a sum get a nice gradient. The LSTM is good at noticing long-range correlations because it uses sums instead of overwriting. Main advantage: requires little tuning. Hugely important.
RNN. [Diagram: unrolled RNN for t=1 to t=6 with inp, hid, and out layers at each timestep.]
LSTM. [Diagram: unrolled LSTM for t=1 to t=6 with inp, hid, and out layers at each timestep; a + node sits between consecutive hidden states.]
The heart of the LSTM: the memory cell. [Diagram labels: memory M, hidden state H, gates I1, I2, F, O, multiplication nodes (X), an addition node (+), and the output.]
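To make the additive memory update concrete, here is a sketch of one step of a single scalar memory cell. It assumes the commonly used LSTM formulation with input, forget, and output gates; the slide gives no equations, so the gate names, the parameter layout, and the numbers below are assumptions rather than the authors' exact model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_step(x, h_prev, m_prev, p):
    """One step of a single scalar LSTM memory cell.

    p maps each (illustrative) gate name to a tuple (w_x, w_h, bias).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input_gate"))    # how much new content to write
    g = math.tanh(pre("candidate"))   # the new content itself
    f = sigmoid(pre("forget_gate"))   # how much of the old memory to keep
    o = sigmoid(pre("output_gate"))   # how much of the memory to expose
    m = f * m_prev + i * g            # additive update instead of an overwrite
    h = o * math.tanh(m)
    return h, m

# Hypothetical parameters: input gate nearly closed, forget gate nearly open.
params = {
    "input_gate": (0.0, 0.0, -10.0),
    "candidate": (1.0, 0.0, 0.0),
    "forget_gate": (0.0, 0.0, 10.0),
    "output_gate": (0.0, 0.0, 10.0),
}
h, m = 0.0, 1.0
for _ in range(100):
    h, m = lstm_cell_step(0.5, h, m, params)
```

With these settings the memory value is carried across 100 timesteps almost unchanged, which is the behaviour a plain RNN that overwrites its hidden state at every step cannot easily reproduce.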
Sequence to sequence: Length of input sequence = length of output sequence = bad; not good for either ASR or MT. Existing strategies for mapping sequences to sequences have an HMM-like component. Normal ASR approaches have a big complicated transducer. Connectionist Temporal Classification (CTC) assumes monotonic alignments and uses an HMM. But we want something simpler and more generic. It should be applicable to any sequence-to-sequence problem, including MT, where words can be reordered in many ways.
Main idea: Neural nets are excellent at learning very complicated functions. Coerce a neural network / LSTM to read one sequence and produce another. Learning will take care of the rest.
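Main idea. [Diagram: the network reads the input sequence A B C D and then produces the target sequence X Y Z Q, with the emitted words X Y Z fed back in as inputs.]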
That's it! The LSTM needs to read the entire input sequence, and then produce the target sequence from memory. The input sequence is stored by a single LSTM hidden state.
Relevant Related Work: Independently and simultaneously, Kyunghyun Cho et al. (Bengio lab) invented basically the same approach, with a model that's related to the LSTM. More interestingly and impressively, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio developed a model that could learn to attend to different parts of the input sentence: no need to remember the entire input sequence, and maximal benefit with smaller hidden states.
Step 1: can the LSTM reconstruct the input sentence? Can this scheme learn the identity function? Answer: it can, and it can do it very easily. It just does it effortlessly and perfectly. [Diagram: the network reads A B C D and reproduces A B C D as the target sequence.]
Step 2: small dataset experiments: EuroParl, French to English. Low-entropy parliament language; 20M words in total; small vocabulary; sentence length no longer than 25. The net was doing something non-trivial. We were inspired.
Digression: decoding. Formally, given an input sentence, the LSTM defines a distribution over output sentences. Therefore, we should produce the sentence with the highest probability. But there are exponentially many sentences, so how do we find it? We use a simple greedy strategy.
Decoding in a nutshell: Proceed left to right. Maintain N partial translations. Expand each translation with possible next words. Discard all but the top N new partial translations. Example with N = 2: the 2 partial hypotheses "I" and "My" are expanded to "I decided", "My decision", "I thought", "I tried", "My thinking", "My direction"; after sorting and pruning, the 2 new partial hypotheses are "I decided" and "My decision".
Why does simple beam-search work? The LSTM is trained to predict the next word given previous words. Maintain a list of partial translations. Extend each partial translation, evaluate each extension, and discard all but the top-k. Most of the search improvement is obtained with a beam of size 2: a full 1 BLEU point.
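The decoding procedure described on these slides can be sketched as follows. This is an illustration, not the authors' decoder: the `next_log_probs` callback stands in for the LSTM, the end-of-sentence token `</s>` is an assumed convention, and the probabilities in `toy_model` are invented so that a beam of size 2 finds a better sentence than a beam of size 1.

```python
import math

def beam_search(next_log_probs, beam_size, max_len, eos="</s>"):
    """Left-to-right beam search.

    next_log_probs(prefix) returns {word: log-probability} for the next word.
    """
    beam = [(0.0, [])]                     # (total log-prob, partial translation)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:         # expand each partial translation
            for word, lp in next_log_probs(prefix).items():
                candidates.append((score + lp, prefix + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, hyp in candidates[:beam_size]:   # discard all but the top N
            (finished if hyp[-1] == eos else beam).append((score, hyp))
        if not beam:
            break
    return max(finished + beam, key=lambda c: c[0])

def toy_model(prefix):
    # Hypothetical next-word distributions; anything else ends the sentence.
    table = {
        (): {"I": math.log(0.6), "My": math.log(0.4)},
        ("I",): {"decided": math.log(0.55), "thought": math.log(0.45)},
        ("My",): {"decision": math.log(0.9), "thinking": math.log(0.1)},
    }
    return table.get(tuple(prefix), {eos_token: 0.0})

eos_token = "</s>"
greedy_score, greedy_hyp = beam_search(toy_model, beam_size=1, max_len=5)
beam_score, beam_hyp = beam_search(toy_model, beam_size=2, max_len=5)
```

In this toy setting the size-1 beam commits to "I" and ends with probability 0.33, while the size-2 beam keeps "My" alive long enough to find "My decision" at 0.36.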
Model for big experiments: 160k input words; 80k output words; 4 layers of 1000D LSTM; different LSTMs for input and output language; 384M parameters.
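One way to account for the 384M figure is the breakdown below. It rests on assumptions the slide does not state: 1000-dimensional word embeddings, no bias terms, and a standard LSTM layer with four weight blocks, each connecting the layer input and the previous hidden state to the 1000 cells.

```python
def lstm_layer_params(input_dim, hidden_dim):
    # Four gate/candidate blocks, each with input-to-hidden and
    # hidden-to-hidden weights. Assumption: bias terms are ignored.
    return 4 * hidden_dim * (input_dim + hidden_dim)

hidden = 1000                            # from the slide: 1000D LSTM
embed = 1000                             # assumption: embeddings match the LSTM width
layers = 4                               # from the slide
src_vocab, tgt_vocab = 160_000, 80_000   # from the slide

recurrent = 2 * layers * lstm_layer_params(embed, hidden)  # input and output LSTMs
embeddings = (src_vocab + tgt_vocab) * embed               # both word-embedding tables
softmax = tgt_vocab * hidden                               # output projection
total = recurrent + embeddings + softmax
```

Under these assumptions only 64M of the parameters sit in the recurrent layers; the remaining 320M are in the vocabulary-sized embedding and softmax matrices.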
The model: 160k vocab in input language. 1000 LSTM cells, 2000 dims per timestep; 2000 x 4 = 8k dims per sentence. 80k softmax by 1000 dims: this is very big! [Diagram: the network reads A B C D and produces A B C D.]
Parallelization: Parallelization is important. More parallelization is better (ongoing work). 8 GPUs. Speed: 6,700 words per second. Idea: layer per GPU.
Learning parameters: The learning parameters are very simple and straightforward: learning rate = 0.7 / batch_size; init scale = -0.08 to 0.08; norm of the gradient is clipped to 5; learning rate is halved every 0.5 epochs after 5 epochs; no momentum (may be a mistake).
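These settings can be written down as three small helpers. This is a sketch under stated assumptions: the initialisation is taken to be uniform over the stated range, the first halving is taken to happen at epoch 5.5, and the batch size of 128 used in the example is hypothetical since the slide does not give one.

```python
import math
import random

def init_weights(n, scale=0.08, seed=0):
    """Draw n initial weights in [-scale, scale] (assumption: uniformly)."""
    rng = random.Random(seed)
    return [rng.uniform(-scale, scale) for _ in range(n)]

def clip_by_norm(grad, max_norm=5.0):
    """Rescale the gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        return [g * max_norm / norm for g in grad]
    return list(grad)

def learning_rate(epoch, batch_size, base=0.7):
    """Constant for 5 epochs, then halved every half epoch.

    Assumption: the first halving takes effect at epoch 5.5.
    """
    halvings = max(0, math.floor((epoch - 5.0) / 0.5))
    return base / batch_size * 0.5 ** halvings

weights = init_weights(1000)
clipped = clip_by_norm([30.0, 40.0])     # norm 50 is rescaled to norm 5
lr_early = learning_rate(4.0, batch_size=128)
lr_late = learning_rate(6.0, batch_size=128)
```

Clipping leaves the gradient direction unchanged and only shrinks its length, so it limits the size of a step without changing where the step points.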
[Diagram: the model with its parts labelled GPU1 to GPU6.] 160k vocab in input language. 1000 LSTM cells, 2000 dims per timestep; 2000 x 4 = 8k dims per sentence. 80k softmax by 1000 dims: this is very big! Split softmax into 4 GPUs.
Representations. [Diagram residue: the sequences "A B C D" and "X Y Z", with a pointer labelled "This thing".]
Reversing source sentences: It is natural to train the LSTM on sentences in the source language followed by the target language: A, B, C, A', B', C' (primes mark the target-language words). It turns out that reversing the words in the source sentence substantially improves performance. Why? Because we introduce many short-term dependencies to the dataset that make the learning problem easier.
Reversing source sentences: With C, B, A, A', B', C' (primes mark the target-language words), A is very close to A' and B is fairly close to B' (close = close in time). Backpropagation can notice that A is connected to A' and establish this connection. It makes it easier to learn that B is connected to B'. Natural bootstrapping: backprop can notice the short-term dependencies first, and slowly extend them to long-range dependencies.
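The effect of the reversal on distances can be checked directly. The sketch below builds the training sequence both ways and measures how many timesteps separate each source word from its translation; the `<EOS>` separator and the primed token names are assumptions made for illustration.

```python
def make_training_sequence(source, target, reverse_source=True, eos="<EOS>"):
    """Concatenate source and target, optionally reversing the source words."""
    src = list(reversed(source)) if reverse_source else list(source)
    return src + [eos] + list(target)

def lag(seq, src_word, tgt_word):
    """Number of timesteps between a source word and its translation."""
    return seq.index(tgt_word) - seq.index(src_word)

source, target = ["A", "B", "C"], ["A'", "B'", "C'"]
pairs = list(zip(source, target))

plain = make_training_sequence(source, target, reverse_source=False)
flipped = make_training_sequence(source, target, reverse_source=True)

plain_lags = [lag(plain, s, t) for s, t in pairs]
flipped_lags = [lag(flipped, s, t) for s, t in pairs]
```

In this example the average distance is the same in both orderings, but after reversal the first pair is only two steps apart, which gives backpropagation a short dependency to latch onto first.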
Results on a big dataset: Corpus: WMT'14 English to French; 384M French words, 303M English words; 70K test words. BLEU score (larger is better): Bahdanau et al.: 28.45; one LSTM: 30.6; phrase-based SMT baseline: 33.3; ensemble of 5 LSTMs: 34.8; state of the art: 37.0. We are doing OK, but we're still far from state of the art.
Performance vs sentence length
Performance on rare words
Big Problem: Performance deteriorates on sentences that have many rare words. The vocabulary is limited, and many words cannot be translated for that reason. Most obvious flaw, worth fixing.
Solving the rare word problem: Example: input: "I trained a veryrareword"; output: "I trained a <unk>". "veryrareword" is a very rare word, but the model could know where it came from.
Our solution: Use a traditional word alignment algorithm. Replace each <unk> in the target translation with an <unk-d>, where d is an integer that indicates the position of the aligned word in the source sentence, if it exists.
Our solution: The model no longer needs to have every word in its vocabulary. It only needs to know where the word originated from!
Procedure: Train time: annotate the unknown tokens with an alignment algorithm, then train the LSTM to emit the annotations. Test time: use the LSTM to produce the annotated translations, then translate each annotated unknown word with the word dictionary.
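The two phases of the procedure can be sketched with plain token lists. Several details here are assumptions made for illustration: d is taken to be the absolute index of the aligned source word, the alignment is supplied as a ready-made mapping rather than computed, and a source word missing from the dictionary is copied through unchanged. The example tokens and the dictionary entry are invented.

```python
def annotate(target, alignment, vocab):
    """Train time: replace out-of-vocabulary target words with <unk-d>.

    alignment maps a target position to its aligned source position.
    """
    out = []
    for j, word in enumerate(target):
        if word in vocab:
            out.append(word)
        elif j in alignment:
            out.append(f"<unk-{alignment[j]}>")
        else:
            out.append("<unk>")          # no aligned source word exists
    return out

def postprocess(output, source, dictionary):
    """Test time: resolve each <unk-d> through the aligned source word."""
    result = []
    for tok in output:
        if tok.startswith("<unk-") and tok.endswith(">"):
            src_word = source[int(tok[5:-1])]
            # Assumption: fall back to copying the source word.
            result.append(dictionary.get(src_word, src_word))
        else:
            result.append(tok)
    return result

# Hypothetical example sentence pair and vocabulary.
source = ["I", "trained", "a", "veryrareword"]
target = ["j'ai", "entraine", "un", "motrare"]
vocab = {"j'ai", "entraine", "un"}
annotated = annotate(target, {3: 3}, vocab)
restored = postprocess(annotated, source, {"veryrareword": "motrare"})
copied = postprocess(annotated, source, {})
```

The model only has to emit the pointer token, so the rare word itself never needs to be in the output vocabulary.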
Example
It works really well! We match state-of-the-art
Using only one third of the training data! The model takes many days to train, so we trained on one third of the data. Additional models are being trained on the entire dataset, and results are expected to get better. Much larger datasets exist, and much larger neural nets should get much better.
Depth helps
Depth helps: Most gains are due to the larger hidden state. Better models benefit more from the postprocessing. Why? Because better models compute the annotations more accurately.
Conclusions: We match state-of-the-art in MT and will likely exceed it. Near future: train a huge, giant model on much more translation data. Apply these models to other sequence to sequence problems. This work brings us closer to the complete solution to the problem of supervised learning.
Conclusions: If you have a large dataset, and you train a very big neural network, then success is guaranteed!
Questions?
Thank you!