Supervised Learning with Neural Networks and Machine Translation with LSTMs
1 Supervised Learning with Neural Networks and Machine Translation with LSTMs
Ilya Sutskever, in collaboration with: Minh-Thang Luong, Quoc Le, Oriol Vinyals, Wojciech Zaremba
Google Brain
2-7 [Title/introductory slides with no transcribed text]
8 Deep Neural Networks
1. Can perform an astonishingly wide range of computations
2. Can be learned automatically
[Diagram: powerful models ∩ learnable models = deep neural networks]
9 Powerful models are necessary
A weak model will never get good performance. Examples of weak models:
- Single-layer logistic regression
- Linear SVM
- Small neural nets
- Small conv nets
A neural network needs to be large and deep to be powerful.
[Diagram: powerful models ∩ learnable models = deep neural networks]
10 Why are deep nets powerful?
A single neuron can implement boolean logic (OR, AND, NOT), and thus arbitrary computation.
11 Why are deep nets powerful?
A single neuron can implement boolean logic, and thus general computation, like computers.
A mid-sized 2-hidden-layer neural network can sort N N-bit numbers.
Intuitively, sorting requires log N parallel steps.
Backpropagation can find this circuit.
12 Why are deep nets powerful?
A single neuron can implement boolean logic, and thus general computation, like computers.
A mid-sized 2-hidden-layer neural network can sort N N-bit numbers.
Intuitively, sorting requires log N parallel steps.
Backpropagation can find this circuit.
Neurons are more economical than boolean logic.
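To make the boolean-logic claim concrete, here is a minimal sketch (not from the talk) of a single threshold neuron implementing OR, AND, and NOT with hand-picked weights:

```python
import numpy as np

def neuron(x, w, b):
    # A single threshold neuron: fires iff the weighted input sum exceeds 0.
    return int(np.dot(w, x) + b > 0)

# Hand-picked weights implementing boolean gates.
AND = lambda a, b: neuron([a, b], w=[1, 1], b=-1.5)   # fires only if both inputs fire
OR  = lambda a, b: neuron([a, b], w=[1, 1], b=-0.5)   # fires if at least one input fires
NOT = lambda a:    neuron([a],    w=[-1],  b=0.5)     # fires if the input does not

assert [AND(a, b) for a in (0, 1) for b in (0, 1)] == [0, 0, 0, 1]
assert [OR(a, b)  for a in (0, 1) for b in (0, 1)] == [0, 1, 1, 1]
assert [NOT(a) for a in (0, 1)] == [1, 0]
```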
13 The Deep Learning Hypothesis
Human perception is fast: neurons fire at most 100 times a second, and humans solve perception in 0.1 seconds, so our neurons fire at most 10 times.
Anything a human can do in 0.1 seconds, a big 10-layer neural network can do, too!
20-layer neural networks can be trained well in practice. Two years ago we could only train 10-layer networks.
14 Implication
DNNs, once trained, should do well on all perception problems: vision, speech, emotion, face recognition, instinct.
If there exists a human expert that can solve hard problems in a fraction of a second, then large deep neural networks could do so, too:
- Instantaneous translation
- Speed reading
- Identifying the obvious thing to do in a complicated situation or a game
15 Learning
Powerful models are useless unless we can train them.
Supervised backpropagation works! It is not clear why 20-layer neural nets are easily trainable with backprop.
[Diagram: powerful models ∩ learnable models = deep neural networks]
16 Learning Algorithm
While not done:
- Pick an example (x, y)
- Run the network on x to get a prediction p
- Use the gradient to bring p slightly closer to y
Theory must make nontrivial assumptions about the dataset. (A runnable sketch of this loop follows below.)
[Diagram: powerful models ∩ learnable models = deep neural networks]
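A minimal runnable version of the loop, using a single logistic neuron learning OR as a toy stand-in for the "network" (not one of the talk's models):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: learn OR. Each example is (x, y).
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

w, b, lr = rng.normal(size=2), 0.0, 0.5

for step in range(2000):
    x, y = data[rng.integers(len(data))]            # pick an example (x, y)
    p = 1 / (1 + np.exp(-(np.dot(w, x) + b)))       # run the network on x -> prediction p
    grad = p - y                                    # gradient of log loss w.r.t. pre-activation
    w -= lr * grad * np.asarray(x, dtype=float)     # bring p slightly closer to y
    b -= lr * grad

print([round(1 / (1 + np.exp(-(np.dot(w, x) + b)))) for x, _ in data])  # -> [0, 1, 1, 1]
```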
17 How to solve hard problems?
- Use a lot of good AND labelled training data
- Use a big deep neural network
[Diagram: powerful models ∩ learnable models = deep neural networks]
18 How to solve hard problems?
- Use a lot of good AND labelled training data
- Use a big deep neural network
Success is the only possible outcome.
[Diagram: powerful models ∩ learnable models = deep neural networks]
19 The deep learning hypothesis is true!
Not a controversial statement. Big deep nets get the best results ever on:
- Speech recognition
- Object recognition
- Language modelling
20 Summary
Big deep nets with 20 layers can do great things.
Supervised backpropagation can train 20-layer nets.
Ergo: we can do a whole lot if we have a large, good supervised training set.
21 Deep nets can't solve all problems???
22 Deep nets can't solve all problems
DNNs couldn't solve problems where the input and the output are very structured:
- Sequence to sequence
- Graph to graph
So far, we've addressed only the most basic form of supervised learning.
23 Key limitation
Inputs and outputs must be fixed-sized vectors.
Great for images:
- input is a big image of a fixed size
- output is a 1-of-N encoding of the category
24 Key limitation
Inputs and outputs must be fixed-sized vectors.
Great for images: input is a big image of a fixed size; output is a 1-of-N encoding of the category.
Bad news for machine translation, question answering, squiggle recognition.
The enemy: unit-specific connections.
[Diagram: a fixed-size network with unit-specific connections from input to output]
25 Our contribution: solving the sequence-to-sequence problem
It's a fundamental capability. Applications: MT, Q&A, ASR, squiggle recognition.
Nice feature: our approach has minimal innovation.
We demonstrate that the approach is viable by matching state-of-the-art results on machine translation.
State-of-the-art in MT is strong, so the approach should do well on many other tasks too.
26 Recurrent Neural Networks (RNNs)
RNNs can work with sequences.
[Diagram: an RNN unrolled over timesteps t=1..6, with input, hidden, and output layers at each step]
Key idea: each timestep has a layer with the same weights. Problem is solved.
27 Recurrent Neural Networks (RNNs)
Are neural networks that can process sequences well:
- Very expressive models
- Backpropagation is applicable
- Fun fact: recurrent neural networks were trained in the original backpropagation paper in 1986
But they have trouble learning long-term dependencies: the vanishing gradient problem (Hochreiter, 1991; Bengio et al., 1994).
There are ways to learn RNNs, but they are complicated.
28 Long Short-Term Memory (LSTM)
Modify / hack the RNN architecture so that the vanishing gradient problem goes away.
Do so without sacrificing expressive power.
A model that achieves this purpose will be useful.
29 Long Short-Term Memory (LSTM)
RNNs overwrite the hidden state; LSTMs add to the hidden state.
Addition has nice gradients: all terms in a sum get a nice gradient.
The LSTM is good at noticing long-range correlations because it uses sums instead of overwriting.
Main advantage: requires little tuning. Hugely important.
30 RNN
[Diagram: an unrolled RNN over timesteps t=1..6; at each step the input and previous hidden state feed a new hidden state, which produces the output]
31 LSTM
[Diagram: an unrolled LSTM over timesteps t=1..6; same layout as the RNN, but each hidden state is updated additively (+) rather than overwritten]
32 The heart of the LSTM: the memory cell
[Diagram: the LSTM memory cell M, with gates (I1, I2, F, O) controlling multiplicative interactions around an additive update to the cell state]
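For reference, the standard LSTM equations behind this diagram (the usual formulation; the slide itself does not spell them out). The key line is the additive cell update, which is what gives "all terms in a sum a nice gradient":

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate update} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{additive memory update} \\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state / output}
\end{aligned}
```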
33 Sequence to sequence
Requiring the length of the input sequence to equal the length of the output sequence is bad: not good for either ASR or MT.
Existing strategies for mapping sequences to sequences have an HMM-like component:
- Normal ASR approaches have a big, complicated transducer
- Connectionist Temporal Classification (CTC) assumes monotonic alignments and uses an HMM
But we want something simpler and more generic, applicable to any sequence-to-sequence problem, including MT, where words can be reordered in many ways.
34 Main idea
Neural nets are excellent at learning very complicated functions.
Coerce a neural network / LSTM to read one sequence and produce another.
Learning will take care of the rest.
35 Main idea
[Diagram: the LSTM reads the input sequence A B C D, then produces the target sequence X Y Z Q]
36 That's it!
The LSTM needs to read the entire input sequence, and then produce the target sequence from memory.
The input sequence is stored in a single LSTM hidden state.
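A sketch of how a single training pair might be laid out as one token stream (hypothetical tokens; the <EOS> end-of-sequence marker is an assumption of this sketch):

```python
# One (source, target) pair flattened into a single sequence the LSTM steps through.
EOS = "<EOS>"
source = ["A", "B", "C", "D"]
target = ["X", "Y", "Z", "Q"]

inputs  = source + [EOS] + target                 # tokens the LSTM reads, one per step
outputs = [None] * len(source) + target + [EOS]   # tokens it must predict at each step

for x, y in zip(inputs, outputs):
    print(f"read {x!r:>7} -> predict {y!r}")      # loss applies only where y is not None
```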
37 Relevant Related Work
Independently and simultaneously, Kyunghyun Cho et al. (Bengio lab) invented basically the same approach, with a model that's related to the LSTM.
More interestingly and impressively, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio developed a model that could learn to attend to different parts of the input sentence:
- No need to remember the entire input sequence
- Maximal benefit with smaller hidden states
38 Step 1: can the LSTM reconstruct the input sentence?
Can this scheme learn the identity function?
[Diagram: the LSTM reads A B C D and is trained to reproduce A B C D]
Answer: it can, and it can do so very easily, effortlessly, and perfectly.
39 Step 2: small dataset experiments: EuroParl French to English
- Low-entropy parliament language
- 20M words in total
- Small vocabulary
- Sentence length no longer than 25
The net was doing something non-trivial. We were inspired.
40 Digression: decoding
Formally, given an input sentence, the LSTM defines a distribution over output sentences.
Therefore, we should produce the sentence with the highest probability.
But there are exponentially many sentences; how do we find it?
We use a simple greedy strategy.
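Concretely, the distribution factorizes over next-word predictions (standard notation, not spelled out on the slide):

```latex
p(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid y_1, \dots, y_{t-1}, x),
\qquad \hat{y} = \arg\max_y \, p(y \mid x)
```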
41 Decoding in a nutshell
- Proceed left to right
- Maintain N partial translations
- Expand each translation with possible next words
- Discard all but the top N new partial translations
[Diagram: with a beam of 2, the partial hypotheses "I" and "My" expand to "I decided", "I thought", "I tried", "My decision", "My thinking", "My direction"; sorting and pruning keeps the 2 best new partial hypotheses, "I decided" and "My decision"]
42 Why does simple beam search work?
The LSTM is trained to predict the next word given the previous words.
Maintain a list of partial translations. Extend each partial translation, evaluate each extension, and discard all but the top k.
Most of the search improvement is obtained with a beam of size 2: a full 1 BLEU point.
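A minimal sketch of that beam search. The function next_word_logprobs, standing in for the trained LSTM, is a hypothetical interface returning a dict of next-word log-probabilities given a prefix:

```python
import heapq

def beam_search(next_word_logprobs, beam_size=2, max_len=20, eos="<EOS>"):
    beam = [(0.0, [])]                                   # (log-prob, partial translation)
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beam:
            if prefix and prefix[-1] == eos:             # finished hypotheses pass through
                candidates.append((logp, prefix))
                continue
            for word, lp in next_word_logprobs(prefix).items():
                candidates.append((logp + lp, prefix + [word]))   # expand
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])  # sort and prune
        if all(p and p[-1] == eos for _, p in beam):
            break
    return max(beam, key=lambda c: c[0])                 # best complete hypothesis
```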
43 Model for big experiments
- 160K input vocabulary
- 80K output vocabulary
- 4 layers of 1000-dimensional LSTMs, with different LSTMs for the input and output languages
- 384M parameters
44 The model
[Diagram: the full model. Output side: 80K softmax by 1000 dims (this is very big!). 1000 LSTM cells, 2000 dims per timestep, 2000 × 4 = 8K dims per sentence. Input side: 160K vocab in the input language.]
45 Parallelization
Parallelization is important; more parallelization is better (ongoing work).
8 GPUs. Speed: 6,700 words per second.
Idea: one layer per GPU.
46 Learning parameters
Learning parameters are very simple and straightforward:
- learning rate = 0.7 / batch_size
- init scale =
- norm of gradient is clipped to 5
- learning rate is halved every 0.5 epochs after 5 epochs
- no momentum (may be a mistake)
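A sketch of the gradient clipping and learning-rate schedule described above (toy code; the function and parameter names are ours):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    # Rescale the whole gradient so its global L2 norm is at most max_norm.
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    return [g * (max_norm / norm) for g in grads] if norm > max_norm else grads

def learning_rate(epoch, batch_size, base=0.7, decay_start=5, half_every=0.5):
    # lr = 0.7 / batch_size, halved every 0.5 epochs once 5 epochs have passed.
    halvings = max(0, int((epoch - decay_start) / half_every))
    return (base / batch_size) * 0.5 ** halvings

print(learning_rate(epoch=6.0, batch_size=128))  # two halvings after epoch 5
```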
47-53 [Diagram, built up as an animation over these slides: the same model with its computation split across the 8 GPUs. The 80K softmax by 1000 dims ("this is very big!") is split into 4 GPUs (GPU3-GPU6 in the figure); the LSTM layers (1000 cells, 2000 dims per timestep, 2000 × 4 = 8K dims per sentence) each occupy another GPU; 160K vocab in the input language.]
54 Representations
[Diagram: after reading the input sentence A B C D, the LSTM's hidden state ("this thing") is a fixed-dimensional representation, which is then unfolded into the output X Y Z]
55-56 [Figures: visualizations of the learned sentence representations]
57 Reversing source sentences
It is natural to train the LSTM on sentences in the source language followed by the target language: A, B, C → α, β, γ.
It turns out that reversing the words in the source sentence substantially improves performance. Why?
Because we introduce many short-term dependencies to the dataset that make the learning problem easier.
58 Reversing source sentences
C, B, A → α, β, γ:
- A is now very close to α, and B is fairly close to β (close = close in time)
- Backpropagation can notice that A is connected to α and establish this connection
- That makes it easier to learn that B is connected to β
Natural bootstrapping: backprop can notice the short-term dependencies first, and slowly extend them to long-range dependencies. (The data transformation itself is one line; see the sketch below.)
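A sketch of the transformation with placeholder tokens (nothing else in the pipeline changes):

```python
# Reverse only the source side of each training pair; targets are untouched.
source = ["a", "b", "c"]
target = ["alpha", "beta", "gamma"]

pair = (source[::-1], target)   # "c b a" -> "alpha beta gamma"
print(pair)                     # "a" now sits right next to "alpha" in the stream
```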
59 Results on a big dataset
Corpus: WMT'14 English→French; 384M French words, 303M English words; 70K test words.
BLEU score: larger is better.
[Chart: BLEU scores for Bahdanau et al., one LSTM, the phrase-based SMT baseline, an ensemble of 5 LSTMs, and the state of the art]
We are doing OK, but we're still far from state of the art.
60 [Figure: performance vs sentence length]
61 [Figure: performance on rare words]
62 Big Problem
Performance deteriorates on sentences that have many rare words.
The vocabulary is limited, so many words cannot be translated for that reason.
This is the most obvious flaw, and worth fixing.
63 Solving the rare word problem
Example: input: "I trained a veryrareword" → output: "I trained a <unk>".
Here veryrareword is a very rare word, but the model could know where it came from.
64 Our solution
Use a traditional word alignment algorithm.
Replace each <unk> in the target translation with an <unk-d>, where d is an integer: d indicates the position of the aligned word in the source sentence, if it exists.
65 Our solution
The model no longer needs to have every word in its vocabulary.
It only needs to know where the word originated from!
66 Procedure
Train time:
- Annotate the unknown tokens with an alignment algorithm
- Train the LSTM to emit the annotations
Test time:
- Use the LSTM to produce the annotated translations
- Translate each word with the word dictionary
(Both phases are sketched in code below.)
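A sketch of both phases; the alignment map, word dictionary, and exact <unk-d> token format are assumptions of this sketch:

```python
def annotate_unknowns(target_tokens, vocab, alignment):
    # Train time: replace out-of-vocabulary target words with positional <unk-d> tokens,
    # where `alignment` maps a target position to its aligned source position.
    out = []
    for t, word in enumerate(target_tokens):
        if word in vocab:
            out.append(word)
        elif t in alignment:
            out.append(f"<unk-{alignment[t]}>")   # d = position of the aligned source word
        else:
            out.append("<unk>")                   # no alignment found
    return out

def postprocess(output_tokens, source_tokens, dictionary):
    # Test time: translate each <unk-d> with a word-dictionary lookup of source word d.
    result = []
    for word in output_tokens:
        if word.startswith("<unk-") and word.endswith(">"):
            src = source_tokens[int(word[5:-1])]
            result.append(dictionary.get(src, src))   # fall back to copying the source word
        else:
            result.append(word)
    return result
```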
67-69 [Examples: sample translations showing <unk-d> annotations and their post-processed forms]
70 It works really well! We match state-of-the-art
71 Using only 1/3rd of the training data!
The model takes many days to train, so we trained on 1/3rd of the data.
Additional models are being trained on the entire dataset; results are expected to get better.
Much larger datasets exist, and much larger neural nets should get much better.
72 Depth helps
73 Depth helps
Most gains are due to the larger hidden state.
Better models benefit more from the post-processing. Why? Because better models compute the annotations more accurately.
74 Conclusions
We match state-of-the-art in MT and will likely exceed it.
Near future: train a huge, giant model on much more translation data; apply these models to other sequence-to-sequence problems.
This work brings us closer to the complete solution to the problem of supervised learning.
75 Conclusions If you have a large big dataset
76 Conclusions If you have a large big dataset And you train a very big neural network
77 Conclusions If you have a large big dataset And you train a very big neural network Then success is guaranteed!
78 Questions?
79 Thank you!