Neural Machine Translation


1 Neural Machine Translation Qun Liu, Peyman Passban ADAPT Centre, Dublin City University 29 January 2018, at DeepHack.Babel, MIPT The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

2 Content Background: Machine Translation and Neural Network Transition: From Discrete Spaces to Continuous Spaces Neural Machine Translation: MT in a Continuous Space Implementing Seq2Seq models with PyTorch Conclusion 1

3 Background: Machine Translation and Neural Network Statistical Machine Translation (SMT) Deep Learning (DL) and Neural Network (NN) The Gap between DL and MT 2

4 Parallel Corpus 3

5 Word Alignment 4

6 Phrase Table 5

7 Decoding Process Build translation left to right Select a phrase to translate Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 6

8 Decoding Process Build translation left to right Select a phrase to translate Find the translation for the phrase Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 7

9 Decoding Process Build translation left to right Select a phrase to translate Find the translation for the phrase Add the phrase to the end of the partial translation Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 8

10 Decoding Process Build translation left to right Select a phrase to translate Find the translation for the phrase Add the phrase to the end of the partial translation Mark words as translated Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 9

11 Decoding Process One to many translation Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 10

12 Decoding Process Many to one translation Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 11

13 Decoding Process Many to one translation Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 12

14 Decoding Process Reordering Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 13

15 Decoding Process Translation finished! Maria no dio una bofetada a la bruja verde Mary did not slap the witch green 14
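The decoding steps on the preceding slides (select a phrase, find its translation, append it, mark the words as translated) can be sketched in a few lines of Python. This is only an illustration: the phrase table entries and the order in which spans are visited are assumptions made for the slide's example sentence, and a real decoder searches over many such orders instead of being handed one.

```python
# Toy phrase table; the entries are hypothetical and cover only the example sentence.
PHRASE_TABLE = {
    ("Maria",): "Mary",
    ("no",): "did not",
    ("dio", "una", "bofetada"): "slap",
    ("a", "la"): "the",
    ("verde",): "green",
    ("bruja",): "witch",
}


def decode(source, spans):
    """Translate `source` by visiting the (start, end) spans in the given order."""
    covered = [False] * len(source)
    output = []
    for start, end in spans:
        if any(covered[start:end]):
            raise ValueError("span overlaps words that are already translated")
        phrase = tuple(source[start:end])             # select a phrase to translate
        output.append(PHRASE_TABLE[phrase])           # find its translation and append it
        covered[start:end] = [True] * (end - start)   # mark words as translated
    if not all(covered):
        raise ValueError("some source words were left untranslated")
    return " ".join(output)


source = "Maria no dio una bofetada a la bruja verde".split()
# Visiting "verde" before "bruja" gives the reordering step.
spans = [(0, 1), (1, 2), (2, 5), (5, 7), (8, 9), (7, 8)]
translation = decode(source, spans)
print(translation)
```

The span list covers one-to-many ("no" to "did not"), many-to-one ("dio una bofetada" to "slap") and reordering in a single pass.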

16 Search Space for Phrase-based SMT 约翰 Yuehan 喜欢 xihuan John loves Mary 玛丽 Mali 15

17 Search Space for Phrase-based SMT 约翰 Yuehan 喜欢 xihuan John loves Mary 玛丽 Mali The search is directed by a weighted combination of various features: Translation probability Language model probability 16
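A minimal sketch of scoring hypotheses with a weighted combination of features, assuming the common log-linear form (a weighted sum of log probabilities). The weights and feature probabilities below are made-up numbers for illustration, not values from the talk.

```python
import math

# Hypothetical feature weights; in practice they are tuned on held-out data.
WEIGHTS = {"translation": 1.0, "language_model": 0.5}


def score(features, weights=WEIGHTS):
    """Weighted sum of log feature probabilities for one hypothesis."""
    return sum(weights[name] * math.log(prob) for name, prob in features.items())


# Two competing hypotheses with made-up feature probabilities.
hypotheses = {
    "John loves Mary": {"translation": 0.4, "language_model": 0.05},
    "John Mary loves": {"translation": 0.4, "language_model": 0.001},
}
best = max(hypotheses, key=lambda h: score(hypotheses[h]))
print(best)
```

Both hypotheses share the same translation probability here, so the language model feature decides the ranking.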

18 Background: Machine Translation and Neural Network Statistical Machine Translation (SMT) Deep Learning (DL) and Neural Network (NN) o (slides taken from Kevin Duh's presentation) The Gap between DL and MT 17

19 Human Neurons - Very Loose Inspiration

20 Perceptrons - Linear Classifiers

21 Logistic Regression (1-layer net) Function model: $f(x) = \sigma(w^T x)$ o Parameters: vector $w \in \mathbb{R}^d$ o $\sigma$ is a non-linearity, e.g. sigmoid: $\sigma(z) = 1/(1 + \exp(-z))$ o Non-linearity will be important for the expressiveness of multi-layer nets. Other non-linearities, e.g., $\tanh(z) = (e^{z} - e^{-z})/(e^{z} + e^{-z})$ 20 Extracted from Kevin Duh's slides in DL4MT Winter School
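The function model on this slide can be written out directly. The sketch below uses plain Python lists; the parameter values are illustrative only and are not learned from data.

```python
import math


def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))"""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_regression(w, x):
    """f(x) = sigma(w^T x) for plain Python lists w and x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


w = [2.0, -1.0]                  # illustrative parameters
x = [0.5, 1.0]
y = logistic_regression(w, x)    # w^T x = 0 here, so f(x) = 0.5

# tanh written out from its definition, compared with the library function.
t_manual = (math.exp(1.0) - math.exp(-1.0)) / (math.exp(1.0) + math.exp(-1.0))
t_library = math.tanh(1.0)
```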

22 2-layer Neural Networks Called Multilayer Perceptron (MLP), but more like multilayer logistic regression 21 Extracted from Kevin Duh's slides in DL4MT Winter School

23 Expressive Power of Non-linearity A deeper architecture is more expressive than a shallow one given the same number of nodes [Bishop, 1995] o 1-layer nets only model linear hyperplanes o 2-layer nets can model any continuous function (given sufficient nodes) o >3-layer nets can do so with fewer nodes 22 Extracted from Kevin Duh's slides in DL4MT Winter School
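XOR is a standard small example of the gap between 1-layer and 2-layer nets: no single linear hyperplane separates its classes, but one hidden layer is enough. The sketch below uses hand-set weights and a step activation instead of a sigmoid to keep the arithmetic exact; both choices are assumptions made for illustration.

```python
def step(z):
    """Hard threshold used here in place of a smooth non-linearity."""
    return 1 if z > 0 else 0


def xor_net(x1, x2):
    # Hidden layer: two linear units followed by the non-linearity.
    h_or = step(x1 + x2 - 0.5)     # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)    # fires only if both inputs are 1
    # Output layer: "OR but not AND", which is XOR.
    return step(h_or - h_and - 0.5)


table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
print(table)
```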

24 What is Deep Learning? A family of methods that uses deep architectures to learn high-level feature representations 23 Extracted from Kevin Duh's slides in DL4MT Winter School

25 Automatically Trained Features in FR Automatically trained features make sense! [Lee et al., 2009] Input: Images (raw pixels) Output: Features of Edges, Body Parts, Full Faces 24 Extracted from Kevin Duh's slides in DL4MT Winter School

26 Current models are becoming more complex 25 Extracted from Kevin Duh's slides in DL4MT Winter School

27 Background: Machine Translation and Neural Network Statistical Machine Translation (SMT) Deep Learning (DL) and Neural Network (NN) The Gap between DL and MT 26

28 The Gap between DL and MT Discrete symbols Continuous vectors 27

29 Content Background: Machine Translation and Neural Network Transition: From Discrete Spaces to Continuous Spaces Neural Machine Translation: MT in a Continuous Space Implementing Seq2Seq models with PyTorch Conclusion 28

30 Transition From Discrete Space to Continuous Space Word Embedding Express a word in a continuous space Neural Language Model 29

31 Express a word in a continuous space David John play Mary loves like 30

32 Express a word in a continuous space John David Mary loves play like 31

33 One-Hot Vector The dimension of the vector is the vocabulary size Each dimension corresponds to a word Each word is represented as a vector in which: o the element at the dimension corresponding to that word is equal to 1 o all the other elements are equal to 0 32

34 One-Hot Vector: Weakness The dimension is very high (equal to the vocabulary size, on the order of 100k) Very little information is carried by a one-hot vector o No syntactic information o No semantic information o No lexical information 33
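A small sketch of one-hot vectors and of their weakness. The six-word vocabulary is taken from the words shown on the earlier slides; the alphabetical index order is an arbitrary assumption.

```python
vocab = ["David", "John", "Mary", "like", "loves", "play"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Vector of vocabulary size with a single 1 at the word's own dimension."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


john, david, loves = one_hot("John"), one_hot("David"), one_hot("loves")
print(john)
# Any two different words are orthogonal, so the vectors say nothing about similarity.
print(dot(john, david), dot(john, loves))
```

The dot product between "John" and "David" is the same as between "John" and "loves", which is the lack of semantic information that the slide points out.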

35 Distributional Semantic Models Assumption: Words that are used and occur in the same contexts tend to purport similar meanings A typical model: Context Window: o A word is represented as the sum/average/tf-idf of the one-hot vectors of the words appearing in the windows surrounding each of its occurrences in the corpus o Effective for word similarity measurement o LSA can be used to reduce the dimension Weakness o Not compositional o Reverse mapping is not supported 34
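The context-window model can be sketched as follows, using the "sum" variant: each word's vector is the sum of the one-hot vectors of its neighbours, stored sparsely as counts. The toy corpus and the window size of 1 are assumptions chosen to keep the example readable.

```python
import math
from collections import Counter, defaultdict

corpus = [
    "john loves mary",
    "david loves mary",
    "john likes mary",
    "david likes mary",
]


def context_vectors(sentences, window=1):
    """Sum of the one-hot vectors of the neighbours, stored as sparse counts."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


vectors = context_vectors(corpus)
sim_names = cosine(vectors["john"], vectors["david"])
sim_mixed = cosine(vectors["john"], vectors["loves"])
print(sim_names, sim_mixed)
```

"john" and "david" occur in identical contexts in this corpus, so their vectors point in the same direction, while "john" and "loves" share no context words.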

36 Word2Vec: Word Embedding by Neural Networks A word is represented by a dense vector (usually several hundred dimensions) The Word2Vec matrix is trained by a 2-layer neural network 35 Extracted from Christopher Moody's slides

37 Word2Vec: CBOW context words current word 36 http://stats.stackexchange. … t-vector-representation-vs-output-vectorrepresentation-in-word2vec

38 Word2Vec: Skip-gram current word context words 37 http://stats.stackexchange. … t-vector-representation-vs-output-vectorrepresentation-in-word2vec
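The two Word2Vec variants differ in which side is the input and which is the prediction target. The sketch below only builds the training examples for each variant (not the network itself); the function names and the window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-gram: the current word is the input, each context word is a target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=1):
    """CBOW: the context words are the input, the current word is the target."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


sentence = ["john", "loves", "mary"]
sg = skipgram_pairs(sentence)
cb = cbow_examples(sentence)
print(sg)
print(cb)
```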

39 Transition From Discrete Space to Continuous Space Word Embedding Express a word in a continuous space Neural Language Model Express a sentence in a continuous space 38

40 Language Models Given a sentence $w_1 w_2 w_3 \ldots w_n$, a language model is: $p(w_i \mid w_1 \ldots w_{i-1})$ N-gram Language Model: $p(w_i \mid w_1 \ldots w_{i-1}) \approx p(w_i \mid w_{i-N+1} \ldots w_{i-1})$ (Markov chain assumption) 39

41 N-Gram Model A part of the parameter matrix of a bigram language model 40

42 N-Gram Model Normalize on all words A part of the parameter matrix of a bigram language model 41
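A minimal bigram model estimated by relative frequency illustrates the parameter matrix and its normalization: each row (one history word) is normalized over all possible next words. The three-sentence corpus and the `<s>` / `</s>` boundary markers are assumptions for the example, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Return p(cur | prev) as a nested dict, estimated by relative frequency."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    # Normalize each row so that the probabilities of all next words sum to 1.
    return {
        prev: {word: c / sum(row.values()) for word, c in row.items()}
        for prev, row in counts.items()
    }


model = train_bigram(["john loves mary", "john likes mary", "david loves mary"])
p_loves_given_john = model["john"]["loves"]
print(p_loves_given_john)
```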

43 Feed Forward Neural Network LM 42 [Bengio et al., 2003]

44 Feed Forward Neural Network LM 43 [Bengio et al., 2003]

45 Feed Forward Neural Network LM softmax layer Normalize on the vocabulary size Computationally intensive 44 [Bengio et al., 2003]
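A rough PyTorch sketch of a feed-forward LM in the spirit of [Bengio et al., 2003]: the embeddings of a fixed number of history words are concatenated, passed through a hidden layer, and normalized over the whole vocabulary by a softmax. The class name, layer sizes and the tanh hidden layer are assumptions for illustration and are not taken from the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size, context_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):
        # context: (batch, context_size) indices of the history words
        embedded = self.embedding(context)                # (batch, context_size, embed_dim)
        flat = embedded.view(context.size(0), -1)         # concatenate the embeddings
        h = torch.tanh(self.hidden(flat))
        # The softmax normalizes over the whole vocabulary, which is the costly step.
        return F.log_softmax(self.output(h), dim=-1)      # (batch, vocab_size)


torch.manual_seed(0)
model = FeedForwardLM(vocab_size=10, context_size=3, embed_dim=8, hidden_dim=16)
context = torch.tensor([[1, 2, 3], [4, 5, 6]])
log_probs = model(context)
print(log_probs.shape)
```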

46 Feed Forward Neural Network LM One shortcoming of the FFNN LM is that it can only use a history of limited length, just like the N-gram LM An improved NN LM was proposed to solve this problem: the Recurrent Neural Network LM 45

47 Recurrent Neural Network LM 46

48 Recurrent Neural Network LM Unfold the RNN LM along the timeline: 47

49 LSTM & GRU: Improved Implementation of RNN Mitigating vanishing and exploding gradients Capturing long-distance dependencies 48
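A recurrent LM can be sketched in PyTorch with the recurrent cell left as a parameter, so that a plain RNN, a GRU or an LSTM can be swapped in. The class name and sizes are assumptions; this is an illustration of the idea, not the model used in the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, cell="lstm"):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        rnn_cls = {"rnn": nn.RNN, "gru": nn.GRU, "lstm": nn.LSTM}[cell]
        self.rnn = rnn_cls(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len); the recurrent state carries the whole history.
        embedded = self.embedding(tokens)
        outputs, state = self.rnn(embedded, state)        # (batch, seq_len, hidden_dim)
        return F.log_softmax(self.output(outputs), dim=-1), state


torch.manual_seed(0)
tokens = torch.tensor([[1, 2, 3, 4, 5]])
shapes = {}
for cell in ("rnn", "gru", "lstm"):
    model = RecurrentLM(vocab_size=10, embed_dim=8, hidden_dim=16, cell=cell)
    log_probs, _ = model(tokens)
    shapes[cell] = tuple(log_probs.shape)
print(shapes)
```

Unlike the feed-forward LM, the prediction at each position depends on all earlier words through the recurrent state, not on a fixed-size window.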

50 Language Model for Generation Given a language model $p(w_i \mid w_1 \ldots w_{i-1})$ and a history, we can generate the next word with the highest LM score: $w_t = \operatorname{argmax}_{w_t \in V} p(w_t \mid w_1 \ldots w_{t-1})$ 49
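Greedy generation with the argmax rule can be sketched with a toy model. For brevity the model here is a hand-written bigram table, so it conditions only on the previous word instead of the full history; the table entries are made-up probabilities.

```python
# Hypothetical bigram probabilities p(next | previous).
BIGRAM = {
    "<s>": {"john": 0.6, "mary": 0.4},
    "john": {"loves": 0.7, "likes": 0.3},
    "loves": {"mary": 0.9, "john": 0.1},
    "mary": {"</s>": 0.8, "loves": 0.2},
}


def generate(max_len=10):
    """Repeatedly append the word with the highest LM score given the history."""
    history = ["<s>"]
    while len(history) < max_len:
        distribution = BIGRAM[history[-1]]
        next_word = max(distribution, key=distribution.get)   # argmax over the vocabulary
        if next_word == "</s>":
            break
        history.append(next_word)
    return history[1:]


generated = generate()
print(generated)
```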

51 Content Background: Machine Translation and Neural Network Transition: From Discrete Spaces to Continuous Spaces Neural Machine Translation: MT in a Continuous Space Implementing Seq2Seq models with PyTorch Conclusion 50

52 Neural Machine Translation: MT in a Continuous Space 喜欢约翰玛丽 Mary John loves Chinese Space English Space 51

53 Neural Machine Translation: MT in a Continuous Space Neural Machine Translation (NMT) Attention-based NMT 52

54 Neural Machine Translation Similarities to SMT: o Trained with a parallel corpus o The input and output are word sequences Differences from SMT: o A single, large neural network o All internal computation is carried out on real values, without symbols o No word alignment o No phrase table or rule table o No n-gram language model 53

55 Neural Machine Translation <bos> 54 https://medium.

56 Neural Machine Translation: MT in a Continuous Space Neural Machine Translation (NMT) Attention-based NMT 55

57 Weakness of the simple NMT model The only connection between the source sentence and the target sentence is the single vector representation of the source sentence. It is hard for this fixed-length vector to capture the meaning of a variable-length sentence, especially when the sentence is very long. As sentences become longer, the translation quality drops dramatically. 56

58 Attention-based Model: Improvement Keep the states for all words rather than the final state only Use a bi-directional RNN instead of a uni-directional RNN Use an attention mechanism as a soft alignment between the source words and target words 57

59 Bi-directional RNN 58 on-neural-machine-translation-gpus-part-3/

60 Bi-directional RNN The representation for the word in the context on-neural-machine-translation-gpus-part-3/

61 Bi-directional RNN It contains the context information of the word in both sides 60 on-neural-machine-translation-gpus-part-3/
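A short PyTorch sketch of a bi-directional encoder, assuming a GRU cell and random vectors in place of real word embeddings. Each output position concatenates the forward state (left context) and the backward state (right context) for that word.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, hidden_dim, seq_len = 4, 3, 5
rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

words = torch.randn(1, seq_len, embed_dim)    # one sentence of 5 random "embeddings"
outputs, h_n = rnn(words)

# outputs: (1, seq_len, 2 * hidden_dim); position j holds [forward_j ; backward_j].
forward_last = outputs[0, -1, :hidden_dim]    # forward pass has read the whole sentence
backward_first = outputs[0, 0, hidden_dim:]   # backward pass has read it in reverse
print(outputs.shape, h_n.shape)
```

Keeping `outputs` for every position, not just the final states in `h_n`, is what later allows attention over all source words.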

62 Attention for NMT 61 https://devblogs.nvidia. … on-neural-machine-translation-gpus-part-3/

63 Attention for NMT 62 on-neural-machine-translation-gpus-part-3/

64 Attention for NMT 63 https://devblogs.nvidia. … on-neural-machine-translation-gpus-part-3/

65 Soft Alignments by Attention Mechanism 64 on-neural-machine-translation-gpus-part-3/
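The soft alignment can be illustrated numerically: score every encoder state against the current decoder state, normalize the scores with a softmax to get the weights α, and take the weighted sum as the context vector. The sketch uses dot-product scoring, which is a simplification chosen for this example and not necessarily the scoring function in the figures; the vectors are made up.

```python
import math


def softmax(scores):
    m = max(scores)                                   # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return the attention weights and the context vector."""
    scores = [sum(d * h for d, h in zip(decoder_state, state)) for state in encoder_states]
    alphas = softmax(scores)                          # soft alignment over source words
    dim = len(encoder_states[0])
    context = [sum(a * state[k] for a, state in zip(alphas, encoder_states)) for k in range(dim)]
    return alphas, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one state per source word
alphas, context = attend([2.0, 0.0], encoder_states)
print(alphas, context)
```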

66 Attention-based NMT Attention-based NMT is very successful. Its performance has surpassed the state of the art (SoA) of SMT. The attention mechanism is used in many DL tasks, such as image caption generation. 65

67 Content Background: Machine Translation and Neural Network Transition: From Discrete Spaces to Continuous Spaces Neural Machine Translation: MT in a Continuous Space Implementing Seq2Seq models with PyTorch Conclusion 66

68 Implementing Seq2Seq models with PyTorch Encoder-Decoder Model 67

69 cat Encoding

70 cat Encoding context

71 cat sat Encoding context

72 cat sat on Encoding context

73 cat sat on the Encoding context

74 cat sat on the mat Encoding context

75 cat sat on the mat EOS Encoding (Done!) context

76 cat sat on the mat EOS Encoding (Done!) context Encoder

77 cat sat on the mat EOS gorbeh Decoding context Encoder

78 cat sat on the mat EOS gorbeh ruyeh Decoding context Encoder

79 cat sat on the mat EOS gorbeh ruyeh hasir Decoding context Encoder

80 cat sat on the mat EOS gorbeh ruyeh hasir neshast Decoding context Encoder

81 cat sat on the mat EOS gorbeh ruyeh hasir neshast EOS Decoding context Encoder

82 cat sat on the mat EOS gorbeh ruyeh hasir neshast EOS Decoding (Done!) context Encoder Decoder

83 cat sat on the mat EOS gorbeh ruyeh Attention! context α 1 α 2 α 3 α 4 α 5 α 6

84 cat sat on the mat EOS Attention! α 1 α 2 α 3 α 4 α 5 α 6 + context

85 Encoder

86 Encoder

87 Encoder

88 Encoder w.com/questions/ /what-doessuper-do-in-python

89 Encoder embedding # Unique Source Words w_i

90 Encoder input-th embedding

91 Encoder input-th embedding h_(t-1) h_t output_t input h output

92 Encoder input h
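The code shown on the encoder slides did not survive in this transcript. The sketch below is a guess that is consistent with the surviving labels (an embedding table with one row per unique source word, the input-th embedding, and a recurrent step from h_(t-1) to h_t and output_t). The class name, the GRU cell and the `init_hidden` helper are assumptions.

```python
import torch
import torch.nn as nn


class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()                            # initialize nn.Module internals
        self.hidden_size = hidden_size
        # One embedding row per unique source word.
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        # Look up the input-th embedding and shape it as (seq=1, batch=1, features).
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)   # h_(t-1) -> h_t, output_t
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, 1, self.hidden_size)


torch.manual_seed(0)
encoder = EncoderRNN(input_size=10, hidden_size=8)
hidden = encoder.init_hidden()
for word_index in [3, 1, 4]:                          # feed the sentence word by word
    output, hidden = encoder(torch.tensor([word_index]), hidden)
print(output.shape, hidden.shape)
```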

93 Decoder+Attention

94 Decoder+Attention

95 Decoder+Attention Two embedding tables!?

96 Decoder+Attention

97 Decoder+Attention index (digit)

98 Decoder+Attention 1 x 1-1

99 Decoder+Attention 1 x 1-1 embedded: 1 x 1 x -1 embedded[0]: 1 x -1

100 Decoder+Attention 1 x 1-1 embedded: 1 x 1 x -1 embedded[0]: 1 x -1 hidden[0]: 1 x -1

101 Decoder+Attention decoder's state: embedded[0] ; hidden[0] ;

102 Decoder+Attention decoder's state: embedded[0] ; hidden[0] ;

103 Decoder+Attention Softmax ( ) 1 x max_length

104 cat sat on the mat EOS Attention! α 1 α 2 α 3 α 4 α 5 α 6 + context

105 Decoder+Attention 1 x max_length unsqueeze(0) 1 x 1 x max_length

106 Decoder+Attention 1 x max_length x embed

107 Decoder+Attention context: 1 x 1 x embed

108 Decoder+Attention Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP, 2014.

109 Decoder+Attention context: 1 x 1 x embed

110 Decoder+Attention context: 1 x 1 x embed

111 Decoder+Attention context: 1 x 1 x embed

112 Decoder+Attention context: 1 x 1 x embed

113 Decoder+Attention context: 1 x 1 x embed

114 Decoder+Attention
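The decoder code itself is also missing from the transcript. The sketch below is a reconstruction that follows the shape annotations that did survive: the decoder's state is the concatenation of embedded[0] and hidden[0], a softmax gives weights of size 1 x max_length, and a batched matrix product with the encoder outputs gives a 1 x 1 x embed context. The class name, layer names and the ReLU are assumptions, and the embedding size is taken to equal the hidden size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, max_length):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)   # target-side table
        self.attn = nn.Linear(hidden_size * 2, max_length)
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)            # 1 x 1 x hidden
        state = torch.cat((embedded[0], hidden[0]), dim=1)         # 1 x 2*hidden
        attn_weights = F.softmax(self.attn(state), dim=1)          # 1 x max_length
        context = torch.bmm(attn_weights.unsqueeze(0),             # 1 x 1 x max_length
                            encoder_outputs.unsqueeze(0))          # 1 x max_length x hidden
        combined = torch.cat((embedded[0], context[0]), dim=1)     # 1 x 2*hidden
        gru_input = F.relu(self.attn_combine(combined)).unsqueeze(0)
        output, hidden = self.gru(gru_input, hidden)
        log_probs = F.log_softmax(self.out(output[0]), dim=1)      # 1 x output_size
        return log_probs, hidden, attn_weights


torch.manual_seed(0)
decoder = AttnDecoderRNN(hidden_size=8, output_size=12, max_length=6)
encoder_outputs = torch.randn(6, 8)        # stand-in for the saved encoder states
hidden = torch.zeros(1, 1, 8)
log_probs, hidden, attn_weights = decoder(torch.tensor([0]), hidden, encoder_outputs)
print(log_probs.shape, attn_weights.shape)
```

Because the attention layer outputs exactly `max_length` weights, source sentences shorter than `max_length` need their encoder outputs padded (for example with zeros) up to that length.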

115 Putting together

116 Putting together

117 Putting together

118 Putting together pair: [[a, b, c], [a, b, c, d ]]

119 Putting together

120 Putting together training_pair[0]: [a, b, c] training_pair[1]: [a, b, c, d ]

121 Putting together training_pair[0]: [a, b, c] training_pair[1]: [a, b, c, d ]

122 Putting together training_pair[0]: [a, b, c] training_pair[1]: [a, b, c, d ]

123 Putting together

124 Putting together training_pair[0]: [a, b, c] training_pair[1]: [a, b, c, d ]

125 Putting together

126 cat sat on the mat EOS gorbeh ruyeh Decoding context Encoder

127 Putting together training_pair[0]: [a, b, c] training_pair[0][0]: [a] word embedding

128 Putting together init the decoder!

129 Putting together

130 Putting together

131 Putting together

132 Putting together

133 Putting together

134 Putting together
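A compact sketch of how the pieces might be put together for one training pair: encode the source word by word, initialize the decoder with the encoder's final state, then decode with teacher forcing and accumulate the loss. To stay short and self-contained it uses a small decoder without attention; the class names, the SOS/EOS indices, the sizes and the SGD settings are all assumptions and are not taken from the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SOS, EOS = 0, 1                     # assumed indices of the special tokens
HIDDEN, VOCAB = 8, 6


class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(VOCAB, HIDDEN)
        self.gru = nn.GRU(HIDDEN, HIDDEN)

    def forward(self, token, hidden):
        return self.gru(self.embedding(token).view(1, 1, -1), hidden)


class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(VOCAB, HIDDEN)
        self.gru = nn.GRU(HIDDEN, HIDDEN)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, token, hidden):
        output, hidden = self.gru(self.embedding(token).view(1, 1, -1), hidden)
        return F.log_softmax(self.out(output[0]), dim=1), hidden


def train_step(pair, encoder, decoder, optimizer):
    source, target = pair                          # training_pair[0], training_pair[1]
    optimizer.zero_grad()
    hidden = torch.zeros(1, 1, HIDDEN)
    for index in source:                           # encode the source word by word
        _, hidden = encoder(torch.tensor([index]), hidden)
    decoder_input = torch.tensor([SOS])            # init the decoder
    loss = torch.zeros(())
    for index in target:
        log_probs, hidden = decoder(decoder_input, hidden)
        loss = loss + F.nll_loss(log_probs, torch.tensor([index]))
        decoder_input = torch.tensor([index])      # teacher forcing: feed the reference word
    loss.backward()
    optimizer.step()
    return loss.item() / len(target)


torch.manual_seed(0)
encoder, decoder = Encoder(), Decoder()
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
training_pair = ([2, 3, 4, EOS], [2, 3, 4, 5, EOS])
losses = [train_step(training_pair, encoder, decoder, optimizer) for _ in range(20)]
print(losses[0], losses[-1])
```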

135 Content Background: Machine Translation and Neural Network Transition: From Discrete Spaces to Continuous Spaces Neural Machine Translation: MT in a Continuous Space Implementing Seq2Seq models with PyTorch Conclusion 134

136 Conclusion
MT is a task defined in a discrete space
In a deep learning framework, MT is converted to a task defined in a continuous space
Word embedding is used to map a word to a vector
A Recurrent Neural Network is used to model the word sequence
The Encoder-Decoder (or Sequence-to-Sequence) model is proposed for neural machine translation
An attention-based mechanism is used to provide soft alignment for NMT
NMT has outperformed SMT and still has huge potential

137 Further topics
Subword level and character level models
o Morphologically rich languages
o Out-of-Vocabulary problem
Multitask and Multiway models
o Sharing parameters among multiple MT models
o Low resource or zero-shot language pairs
Pure attention models
o Higher performance

138 Thanks Q&A Speaker: Qun Liu Speaker: Peyman Passban 137
