Neural models in NLP. Natural Language Processing: Lecture Kairit Sirts




2 The goal of today's lecture
- Explain word embeddings
- Explain the recurrent neural models used in NLP

3 Log-linear language model
- y: the next word to predict
- x: the context sequence (words, annotations, etc.)
- v: model parameters
- f(x, y): feature vector for the input-output pair (x, y)
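
In its standard form (assumed here; the slide's own notation may differ), a log-linear model scores each candidate next word y with the dot product of v and f(x, y) and normalizes over all candidates. The symbol for the candidate set, \mathcal{Y}, is an assumption:

```latex
p(y \mid x; v) = \frac{\exp\big(v \cdot f(x, y)\big)}{\sum_{y' \in \mathcal{Y}} \exp\big(v \cdot f(x, y')\big)}
```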

4 The problem with log-linear models
- Feature engineering
- Developing feature templates
- Which features are relevant to which problems?
- Experiment with subsets of features
- Features can be very complex

5 What if we could let the model learn the relevant features automatically?
- Neural networks

6 1-hot representation
[Figure: one-hot vectors over the vocabulary (the, girl, with, flowers, is, cute, are, were, flower) for the words of "The girl with the flowers is cute" and for "flower"]

7 What is the similarity between vectors for flower and flowers?
[Figure: the one-hot vectors for "flowers" and "flower" over the vocabulary (the, girl, with, flowers, is, cute, are, were, flower)]
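
A minimal sketch of the answer, assuming cosine similarity as the measure and an arbitrary ordering of the slide's vocabulary. Two different one-hot vectors never share a non-zero position, so their similarity is zero however related the words are:

```python
import math

# Vocabulary taken from the slide; the ordering is an arbitrary choice.
VOCAB = ["the", "girl", "with", "flowers", "is", "cute", "are", "were", "flower"]

def one_hot(word, vocab=VOCAB):
    """Return a vector with 1.0 at the word's index and 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_different = cosine(one_hot("flower"), one_hot("flowers"))  # 0.0
sim_same = cosine(one_hot("flower"), one_hot("flower"))        # 1.0
```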

8 Features as distributed representations
Source: Deep Learning: What is meant by a distributed representation?

9 Distributed word representations
[Table: the words flower and flowers described by the dense features f1, f2, f3, f4]
What is the cosine similarity between flower and flowers now?
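
As an illustration, with dense feature values that are purely hypothetical (the numbers below are not from the slide), related words can end up with a cosine similarity close to 1:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical values for the features f1..f4; chosen only for illustration.
flower = [0.9, 0.1, 0.8, 0.3]
flowers = [0.8, 0.2, 0.9, 0.3]

similarity = cosine(flower, flowers)  # close to 1, unlike the one-hot case
```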

10 Learning distributed word representations
- The girl with the flowers is cute.
- She has the flowers in her hand.
- I picked these flowers myself.
- The girl with a flower is cute.
- She has a flower in her hand.
- I picked this flower myself.
Contexts of "flowers": with the _ is cute; has the _ in her; picked the _ myself
Contexts of "flower": with a _ is cute; has a _ in her; picked a _ myself
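
The contexts above can be collected mechanically. The following is a small sketch, assuming whitespace tokenization, lowercasing, and a window of two words on each side; the function name `contexts` is made up for this example:

```python
SENTENCES = [
    "The girl with the flowers is cute",
    "She has the flowers in her hand",
    "I picked these flowers myself",
    "The girl with a flower is cute",
    "She has a flower in her hand",
    "I picked this flower myself",
]

def contexts(sentences, target, window=2):
    """Collect (left, right) context windows around every occurrence of target."""
    found = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                found.append((left, right))
    return found

flowers_contexts = contexts(SENTENCES, "flowers")
flower_contexts = contexts(SENTENCES, "flower")
```

Words that occur in similar contexts, such as flower and flowers here, are the ones an embedding model is expected to place close together.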


13 Word2Vec
Mikolov et al., Efficient Estimation of Word Representations in Vector Space

14 CBOW: continuous bag of words
- w(t-2), w(t-1), w(t+1), w(t+2): one-hot vectors
- [symbol not preserved]: a row in the parameter matrix
- C: the set of context vectors
- c: the size of the context window
- [symbol not preserved]: linear projection
- d: embedding size
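
A rough sketch of the CBOW forward pass, under several assumptions: the context rows are averaged (as in the word2vec paper), the matrix names `emb_in` and `emb_out` are invented for this example, and the parameters are random rather than trained:

```python
import math
import random

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_probs(context_ids, emb_in, emb_out):
    """Probability of every vocabulary word being the middle word."""
    dim = len(emb_in[0])
    # Linear projection: average the embedding rows of the context words.
    hidden = [
        sum(emb_in[i][k] for i in context_ids) / len(context_ids)
        for k in range(dim)
    ]
    # Score each candidate middle word by a dot product, then normalize.
    scores = [sum(h * o for h, o in zip(hidden, row)) for row in emb_out]
    return softmax(scores)

rng = random.Random(0)
VOCAB_SIZE, DIM = 9, 4  # DIM plays the role of the embedding size d
emb_in = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]
emb_out = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]

# Ids standing in for w(t-2), w(t-1), w(t+1), w(t+2); the ids are arbitrary.
probs = cbow_probs([0, 1, 3, 4], emb_in, emb_out)
```

Selecting a row of `emb_in` by index is equivalent to multiplying the one-hot vector by the matrix, which is why no explicit one-hot vectors appear in the code.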

15 Skip-gram model
- Predict the context words
- w(t): one-hot vector
- Maximize: [equation not preserved]

16 Training word embeddings
- General principle: maximize the probability of the:
  - Middle word, given the context words (CBOW)
  - Context words, given the middle word (skip-gram)
- In case of skip-gram, given T training words in context:
  - Maximize: [equation not preserved]
  - Minimize: [equation not preserved]
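
For reference, the skip-gram objective as stated in Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, with T the number of training words and c the size of the context window. The slide's own notation may differ. Maximizing the average log-probability is the same as minimizing its negative:

```latex
\text{Maximize: } \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)
\qquad
\text{Minimize: } -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)
```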

17 Training word embeddings
- Typically trained with gradient descent
  - You will learn more sophisticated methods in other courses
- Initialize the parameter vectors/matrices (somehow)
- Repeat until convergence: [update rule not preserved]
  - [symbol not preserved]: the set of all trainable parameters
  - [symbol not preserved]: learning rate
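
The loop can be sketched as follows, assuming the usual update in which each parameter moves against its gradient, scaled by the learning rate. The names `theta` and `learning_rate`, the stopping rule, and the toy objective are all choices made for this example:

```python
def gradient_descent(grad, theta, learning_rate=0.1, tol=1e-8, max_steps=10_000):
    """Repeat theta <- theta - learning_rate * grad(theta) until the step is tiny."""
    for _ in range(max_steps):
        g = grad(theta)
        new_theta = [t - learning_rate * gi for t, gi in zip(theta, g)]
        # Treat a negligible change in every parameter as convergence.
        if max(abs(n - t) for n, t in zip(new_theta, theta)) < tol:
            return new_theta
        theta = new_theta
    return theta

def grad_j(theta):
    """Gradient of the toy objective J = (theta0 - 3)^2 + (theta1 + 1)^2."""
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]

# Start from an arbitrary initialization; the minimum is at (3, -1).
theta_star = gradient_descent(grad_j, [0.0, 0.0])
```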

18 Softmax vs log-linear model
- Softmax is a log-linear model
- Log-linear: [equation not preserved]
- Softmax: [equation not preserved]
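
Written out in standard notation (an assumption about what the slide showed), the two coincide once the softmax input for candidate y is taken to be the score z_y = v · f(x, y):

```latex
\text{Log-linear: } p(y \mid x; v) = \frac{\exp\big(v \cdot f(x, y)\big)}{\sum_{y'} \exp\big(v \cdot f(x, y')\big)}
\qquad
\text{Softmax: } \operatorname{softmax}(z)_y = \frac{\exp(z_y)}{\sum_{y'} \exp(z_{y'})}, \quad z_y = v \cdot f(x, y)
```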

19 The gradient of a log-linear model
[Equation: the gradient, with its two terms labelled "empirical count" and "expected count"]
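
For a single training pair (x, y), the standard gradient of the log-probability with respect to the parameters v has exactly these two terms. This is the textbook form, assumed to match the slide:

```latex
\frac{\partial}{\partial v} \log p(y \mid x; v)
= \underbrace{f(x, y)}_{\text{empirical count}}
- \underbrace{\sum_{y'} p(y' \mid x; v)\, f(x, y')}_{\text{expected count}}
```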

20 The gradients in skip-gram model
- c: context word
- w: middle word
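
A sketch of these gradients under assumed notation: v_w is the embedding of the middle word w, u_c is the output vector of the context word c, and \mathcal{V} is the vocabulary. None of these symbols are confirmed by the slide:

```latex
p(c \mid w) = \frac{\exp(u_c \cdot v_w)}{\sum_{c' \in \mathcal{V}} \exp(u_{c'} \cdot v_w)}

\frac{\partial \log p(c \mid w)}{\partial v_w} = u_c - \sum_{c' \in \mathcal{V}} p(c' \mid w)\, u_{c'}

\frac{\partial \log p(c \mid w)}{\partial u_{c'}} = \big(\mathbb{1}[c' = c] - p(c' \mid w)\big)\, v_w
```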

21 The problem with softmax gradients
- Computing the softmax gradients is computationally very expensive. Why?
- The gradients always include the sum over the whole vocabulary
- This makes computation very inefficient


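22 Negative sampling
- The general idea: maximize the probability of the (word, context) pairs that came from the training data (instead of the probability of the context given the word)
- Previously: maximize the probability of the context given the word
- Now: maximize the probability that the (word, context) pair came from the training data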

23 Skip-gram objective with negative sampling
- Maximize: [equation not preserved]
- [symbol not preserved]: the set of random negative samples
- In practice, the number of negative samples per positive sample is between [range not preserved]
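
A minimal sketch of the objective for one positive pair, following the form derived in Goldberg and Levy: the log-sigmoid of the score for the true (word, context) pair plus the log-sigmoid of the negated score for each negative sample. The function names are invented here, and negatives are drawn uniformly for simplicity, whereas word2vec samples them from a smoothed unigram distribution:

```python
import math
import random

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_objective(word_vec, context_vec, negative_vecs):
    """log sigma(u_c . v_w) + sum over negatives n of log sigma(-u_n . v_w)."""
    positive = log_sigmoid(dot(context_vec, word_vec))
    negative = sum(log_sigmoid(-dot(n, word_vec)) for n in negative_vecs)
    return positive + negative

def sample_negatives(context_vectors, k, rng):
    """Draw k context vectors uniformly at random (a simplifying assumption)."""
    return [rng.choice(context_vectors) for _ in range(k)]

rng = random.Random(0)
context_vectors = [[0.0, 0.0] for _ in range(5)]  # untrained, all-zero vectors
negatives = sample_negatives(context_vectors, k=2, rng=rng)
value = sgns_objective([0.0, 0.0], [0.0, 0.0], negatives)
```

Each term touches only one context vector, so the cost per training pair depends on the number of negative samples rather than on the vocabulary size.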

24 Tools for training word embeddings
- Word2vec
- Gensim: includes both CBOW and skip-gram implementations
- GloVe: optimizes the predictions of co-occurrence counts between words
- Polyglot
- Dependency-based word embeddings

25 Further reading on word embeddings
- Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality
- Mikolov et al., Efficient Estimation of Word Representations in Vector Space
- Goldberg and Levy, word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method
- Pennington et al., GloVe: Global Vectors for Word Representation
- Al-Rfou et al., Polyglot: Distributed Word Representations for Multilingual NLP
- Levy and Goldberg, Dependency-Based Word Embeddings

26 Regularities between word embeddings
Source: Vector Representations of Words


27 Word embedding models as neural networks
- One-hot vector of the input word
- Word embeddings: the row corresponding to the input word in the parameter matrix
- Prediction of the context word (softmax), or whether the (context, word) pair belongs to the data (negative sampling)
Figure source: CS231n Convolutional Neural Networks for Visual Recognition
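
A small sketch of why the embedding is "the row corresponding to the input word": multiplying a one-hot row vector by the parameter matrix simply selects that row. The matrix values below are placeholders, not trained embeddings:

```python
def one_hot(index, size):
    """Vector of length size with 1.0 at index and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, matrix):
    """Row vector times matrix: result[j] = sum_i vec[i] * matrix[i][j]."""
    cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(cols)]

# Placeholder embedding matrix: 3 words, 2 dimensions.
EMBEDDINGS = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

looked_up = vec_mat(one_hot(1, 3), EMBEDDINGS)  # same values as EMBEDDINGS[1]
```

In practice the multiplication is skipped and the row is read by index.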

28 Recurrent Neural Networks

29 RNN Language Model

30 Machine Translation with RNN

31 RNN music generation
Source: Music Language Modeling with Recurrent Neural Networks

32 Sequence Models
Source: The Unreasonable Effectiveness of Recurrent Neural Networks

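33 Recurrent Neural Networks
- [symbol not preserved]: initial state
- [symbol not preserved]: a nonlinear function
- and so on
[Figure: RNN diagram; the input sequence starts with the symbol <s>]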
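
A minimal sketch of the recurrence, assuming the simple (Elman-style) form h_t = tanh(U x_t + W h_{t-1} + b). Here U holds one row per input word, so a one-hot input selects a row, W is the recurrent matrix, and tanh stands in for the nonlinear function. All names and values are assumptions for illustration:

```python
import math

def rnn_step(h_prev, word_id, U, W, b):
    """One step: h_t = tanh(U[word_id] + W h_{t-1} + b)."""
    dim = len(h_prev)
    return [
        math.tanh(
            U[word_id][k]
            + sum(W[k][j] * h_prev[j] for j in range(dim))
            + b[k]
        )
        for k in range(dim)
    ]

def rnn_forward(word_ids, U, W, b, h0):
    """Run the recurrence over a sequence and return every hidden state."""
    states = []
    h = h0
    for word_id in word_ids:
        h = rnn_step(h, word_id, U, W, b)
        states.append(h)
    return states

# Toy parameters: 3 words, hidden size 2.
U = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]

states = rnn_forward([1, 2], U, W, b, h0=[0.0, 0.0])
```

The same parameters are reused at every position, and each state depends on the whole prefix of the sequence through the previous state.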

34 Non-linear activation functions
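
Three commonly used activation functions, as a sketch. The slide does not say which functions it showed, so this selection is an assumption:

```python
import math

def sigmoid(x):
    """Logistic function; squashes any real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow for large negative x
    return z / (1.0 + z)

def tanh(x):
    """Hyperbolic tangent; squashes any real number into (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Rectified linear unit; passes positive values, zeroes out the rest."""
    return max(0.0, x)
```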

35 Cross-entropy loss function
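
As a sketch: when the target is a single correct word (a one-hot target), the cross-entropy between the target and the predicted distribution reduces to the negative log-probability assigned to the correct word. The function names are made up for this example:

```python
import math

def cross_entropy(predicted_probs, target_index):
    """Negative log-probability that the model assigns to the correct class."""
    return -math.log(predicted_probs[target_index])

def sequence_loss(predictions, targets):
    """Average cross-entropy over the positions of a sequence."""
    losses = [cross_entropy(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)

# A model that gives the correct word probability 0.5 pays log(2) at that step.
loss = cross_entropy([0.5, 0.25, 0.25], target_index=0)
```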

36 Training neural networks
- Typically with stochastic or mini-batch gradient descent
  - (Full-batch) GD: gradients are computed based on all training items
  - Mini-batch GD: at each step, compute the gradients based on a small number (a mini-batch) of training samples, for instance 20, 32, or 128
  - Stochastic GD: gradients are computed based on a single training item
- Gradients are computed using back-propagation
  - BP is an algorithm for an efficient application of the chain rule
- There are several versions of gradient descent that set the learning rates in a clever way
  - RMSProp, AdaGrad, AdaDelta, Momentum, Adam
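
The three variants differ only in how many items feed each gradient estimate. The sketch below makes that explicit with a `batch_size` argument on a one-parameter model y = w * x. The toy data, the hand-derived gradient, and the function name are all assumptions for illustration; a real network would obtain its gradients by back-propagation:

```python
import random

def minibatch_gd(data, batch_size, learning_rate=0.01, epochs=200, seed=0):
    """Fit y = w * x by gradient descent on the squared error 0.5 * (w*x - y)^2."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the items in a new order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the loss with respect to w, averaged over the batch.
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w

DATA = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # generated with w = 2

w_full = minibatch_gd(DATA, batch_size=len(DATA))  # full-batch GD
w_mini = minibatch_gd(DATA, batch_size=2)          # mini-batch GD
w_sgd = minibatch_gd(DATA, batch_size=1)           # stochastic GD
```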


37 Gated units
- RNNs are supposed to remember long contexts, but in practice they don't
- Gated units, such as LSTM or GRU, include gates that control:
  - How much from the next input is read in
  - How much from the previous hidden state is remembered or forgotten
  - How much from the cell state is used in the output
Figure 12 from Herath et al., Going Deeper into Action Recognition: A Survey.
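
For the LSTM, the three gates are commonly written as below. This is the standard formulation rather than anything taken from the slide, and the symbols are assumptions: U matrices act on the input x_t, W matrices on the previous hidden state h_{t-1}, \sigma is the logistic function, and \odot is element-wise multiplication. In this formulation the forget gate scales the previous cell state c_{t-1}:

```latex
\begin{aligned}
i_t &= \sigma(U_i x_t + W_i h_{t-1} + b_i) && \text{input gate} \\
f_t &= \sigma(U_f x_t + W_f h_{t-1} + b_f) && \text{forget gate} \\
o_t &= \sigma(U_o x_t + W_o h_{t-1} + b_o) && \text{output gate} \\
\tilde{c}_t &= \tanh(U_c x_t + W_c h_{t-1} + b_c) && \text{candidate cell state} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state} \\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state}
\end{aligned}
```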

38 Tools for creating and training neural networks
- Python libraries that perform symbolic gradient computation
  - Keras
  - TensorFlow
  - Theano
  - PyTorch
  - DyNet
- The field is developing rapidly

39 RNN LM and word embeddings
- Inputs x: one-hot vectors
- Parameter matrix U: word embeddings
- Training embeddings with word2vec or a similar model is faster than with RNNLM
- Pretrained word embeddings can be used to initialise the U matrix in RNNLM
  - Transfer learning
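
One way this initialisation might look, as a sketch: rows of U are copied from a table of pretrained vectors where available, and drawn at random for words the table does not cover. The function name, the toy vocabulary, and the vector values are hypothetical:

```python
import random

def init_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Build U with one row per vocabulary word, reusing pretrained rows."""
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))  # copy the pretrained vector
        else:
            # No pretrained vector: fall back to a small random initialisation.
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix

VOCAB = ["<s>", "the", "girl", "flower"]
# Hypothetical pretrained vectors, e.g. the output of a word2vec run.
PRETRAINED = {"the": [0.1, 0.2, 0.3], "flower": [0.7, 0.8, 0.9]}

U = init_embedding_matrix(VOCAB, PRETRAINED, dim=3)
```

Whether U is then kept fixed or updated further during RNNLM training is a separate design choice that the slide does not specify.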

40 Further reading
- Understanding LSTM networks
- Mikolov et al., Linguistic Regularities in Continuous Space Word Representations

41 Recap
- Word embeddings are dense distributed representations of words
- Word embeddings are trained from (word, context) pairs using neural models
- Word embeddings can be viewed as automatically learned feature vectors
- Recurrent neural networks are neural sequence models often used in NLP
- Pretrained word embeddings can be used to initialize the embedding layer of the recurrent neural models with textual input
