TTIC 31210: Advanced Natural Language Processing. Kevin Gimpel, Spring 2017. Lecture 3: Word Embeddings


1 TTIC 31210: Advanced Natural Language Processing. Kevin Gimpel, Spring 2017. Lecture 3: Word Embeddings

2 Assignment 1: Assignment 1 due tonight

3 Roadmap: review of TTIC (week 1); deep learning for NLP (weeks 2-4); generative models & Bayesian inference (week 5); Bayesian nonparametrics in NLP (week 6); EM for unsupervised NLP (week 7); syntax/semantics and structure prediction (weeks 8-9); applications (week 10)

4 Neural Similarity Modeling: Siamese networks (Bromley et al., 1993): two identical networks with shared parameters; at the end, similarity is computed between the two representations
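A minimal sketch of the Siamese idea, under assumptions not in the slides: the encoder is a plain average of word vectors (any network could stand in for it), the similarity is cosine, and the names `encode`, `cosine`, `siamese_similarity`, and the toy table `EMB` are hypothetical. The point it illustrates is only that both inputs pass through the same function with the same parameters.

```python
import math

def encode(tokens, embeddings):
    """Average the word vectors of `tokens` (a stand-in for any encoder network)."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        for i, x in enumerate(embeddings[tok]):
            total[i] += x
    return [x / len(tokens) for x in total]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def siamese_similarity(tokens_a, tokens_b, embeddings):
    # The same `encode` and the same `embeddings` are used on both sides:
    # this is what "two identical networks with shared parameters" means.
    return cosine(encode(tokens_a, embeddings), encode(tokens_b, embeddings))

# Toy parameters, made up for illustration only.
EMB = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
print(siamese_similarity(["cat"], ["dog"], EMB))
print(siamese_similarity(["cat"], ["car"], EMB))
```

In training, gradients from both branches would accumulate into the one shared parameter set.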

5 Similarity Functions: many choices for similarity functions; we talked about some during Lecture 2

6 Learning for Similarity: We want to learn the input representation function as well as any parameters of the similarity function. We'll just write all these parameters as [symbol missing]. How about this loss? (loss A on your handout) Any potential problems with this?

7 (Better) Learning for Similarity: Contrastive hinge loss (loss B on handout): [symbol missing] is a negative example. Any potential problems with this?

8 (Better) Learning for Similarity: Large-margin contrastive hinge loss: [symbol missing] is the margin

9 (Better) Learning for Similarity: Large-margin contrastive hinge loss: How should we choose negative examples?

10 (Better) Learning for Similarity: Large-margin contrastive hinge loss: How should we choose negative examples? random: just pick v randomly from the data; max: [formula missing]; many other ways depending on problem
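The loss formulas on these slides did not survive extraction, so the following is only a sketch under a stated assumption: that the large-margin contrastive hinge loss has the common form max(0, margin - sim(anchor, positive) + sim(anchor, negative)). The names `hinge_loss`, `pick_random_negative`, and `pick_max_negative`, the dot-product similarity, and the toy vectors are all hypothetical. It shows the two negative-selection strategies named on the slide: a random negative and the "max" negative (the candidate currently most similar to the anchor).

```python
import random

def dot(u, v):
    """Dot-product similarity (one of many possible similarity functions)."""
    return sum(a * b for a, b in zip(u, v))

def hinge_loss(anchor, positive, negative, margin=1.0, sim=dot):
    """Assumed form: max(0, margin - sim(anchor, positive) + sim(anchor, negative))."""
    return max(0.0, margin - sim(anchor, positive) + sim(anchor, negative))

def pick_random_negative(candidates, rng):
    """'random' strategy: any candidate drawn from the data."""
    return rng.choice(candidates)

def pick_max_negative(anchor, candidates, sim=dot):
    """'max' strategy: the candidate that currently looks most similar to the anchor."""
    return max(candidates, key=lambda c: sim(anchor, c))

# Toy vectors; `candidates` is assumed to exclude the true positive.
anchor, positive = [1.0, 0.0], [1.0, 0.0]
candidates = [[0.0, 1.0], [0.8, 0.0]]

rng = random.Random(0)
print(hinge_loss(anchor, positive, pick_random_negative(candidates, rng)))
print(hinge_loss(anchor, positive, pick_max_negative(anchor, candidates)))
```

The "max" negative tends to give a nonzero loss more often than a random one, which is one reason to consider it, at the cost of scoring every candidate.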

11 Aside:

12 Recurrent Neural Networks: hidden vector

13 Recurrent Neural Networks; Multiplicative Integration Recurrent Neural Networks

16 RNN, MI-RNN
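The equations on these slides were images and are not recoverable, so this is a sketch under an assumption: that the plain RNN update is h_t = tanh(W x_t + U h_{t-1} + b), and that the simplest multiplicative-integration variant replaces the sum of the two projections with their elementwise product, h_t = tanh((W x_t) * (U h_{t-1}) + b). The function names and toy weights are hypothetical, and more general MI forms with extra gating vectors are not shown.

```python
import math

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def rnn_step(W, U, b, x, h_prev):
    """Additive update: tanh(W x + U h_prev + b)."""
    wx, uh = matvec(W, x), matvec(U, h_prev)
    return [math.tanh(a + c + bi) for a, c, bi in zip(wx, uh, b)]

def mi_rnn_step(W, U, b, x, h_prev):
    """Multiplicative integration (simplest form): tanh((W x) * (U h_prev) + b)."""
    wx, uh = matvec(W, x), matvec(U, h_prev)
    return [math.tanh(a * c + bi) for a, c, bi in zip(wx, uh, b)]

# Toy weights: identity matrices and zero bias, for illustration only.
I2 = [[1.0, 0.0], [0.0, 1.0]]
x, h_prev, b = [1.0, 2.0], [0.5, 0.0], [0.0, 0.0]
print(rnn_step(I2, I2, b, x, h_prev))
print(mi_rnn_step(I2, I2, b, x, h_prev))
```

In the multiplicative version, the previous hidden state rescales the input projection per dimension instead of shifting it.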

17 Word Embeddings: Turian et al. (2010)

18 Journal of Machine Learning Research 3 (2003). Submitted 4/02; Published 2/03. "A Neural Probabilistic Language Model". Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. Département d'Informatique et Recherche Opérationnelle, Centre de Recherche Mathématiques, Université de Montréal, Montréal, Québec, Canada. BENGIOY@IRO.UMONTREAL.CA, DUCHARME@IRO.UMONTREAL.CA, VINCENTP@IRO.UMONTREAL.CA, JAUVINC@IRO.UMONTREAL.CA. Idea: use a neural network for n-gram language modeling

19 Bengio et al. (2003), "A Neural Probabilistic Language Model": this is not the earliest paper on using neural networks for n-gram language models, but it's the best known and the first to scale up; see the paper for citations of earlier work

20 Neural Probabilistic Language Models (Bengio et al., 2003): "1.1 Fighting the Curse of Dimensionality with Distributed Representations. In a nutshell, the idea of the proposed approach can be summarized as follows: 1. associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in R^m), 2. express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence, and 3. learn simultaneously the word feature vectors and the parameters of that probability function."

21 Model (Bengio et al., 2003). Figure (network architecture, bottom to top): indices for words w_{t-n+1}, ..., w_{t-2}, w_{t-1}; table look-up in matrix C (parameters shared across words), giving C(w_{t-n+1}), ..., C(w_{t-2}), C(w_{t-1}); tanh hidden layer; softmax output layer (most computation here), where the i-th output = P(w_t = i | context)
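The architecture in the figure can be sketched as a forward pass: look up each context word in the shared table C, concatenate the vectors, apply a tanh hidden layer, and normalize the output scores with a softmax. This is an illustrative sketch, not the paper's implementation: the optional direct connections from word features to outputs are left out, and the parameter names `H`, `d`, `U`, `b`, the function `nplm_probs`, and all toy values are assumptions.

```python
import math

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def nplm_probs(context_ids, C, H, d, U, b):
    """Return P(w_t = i | context) for every word i in the vocabulary."""
    # Table look-up in C, then concatenation of the context word vectors.
    x = [v for i in context_ids for v in C[i]]
    hidden = [math.tanh(a + di) for a, di in zip(matvec(H, x), d)]
    scores = [a + bi for a, bi in zip(matvec(U, hidden), b)]
    return softmax(scores)

# Toy sizes: vocabulary of 3 words, m = 2 features, 2 context words, 2 hidden units.
C = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]]
H = [[0.5, -0.2, 0.1, 0.0], [0.3, 0.3, -0.1, 0.2]]
d = [0.0, 0.1]
U = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b = [0.0, 0.0, 0.0]
probs = nplm_probs([0, 2], C, H, d, U, b)
print(probs)
```

The output layer has one score per vocabulary word, which is why the figure marks it as the place where most computation happens.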

22 Bengio et al. (2003) Experiments: they minimized log loss of the next word conditioned on a fixed number of previous words; no RNNs here, just a feed-forward network; ~800k training tokens, vocab size of 17k; they trained for 5 epochs, which took 3 weeks on 40 CPUs!

23 Experiments (Bengio et al., 2003). Table (numeric entries not recovered): columns n, c, h, m, direct, mix, train., valid., test.; ten MLP rows with (direct, mix) = (yes, no), (yes, yes), (yes, no), (yes, yes), (yes, no), (yes, yes), (yes, no), (yes, yes), (no, no), (no, yes). Legend: n: order of the model. c: number of word classes in class-based n-grams. h: number of hidden units. m: number of word features for MLPs, number of classes for class-based n-grams. direct: whether there are direct connections from word features to outputs. mix: whether the output probabilities of the neural network are mixed with the output of the trigram (with a weight of 0.5 on each). The last three columns give perplexity on the training, validation and test sets.
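Two quantities from the table legend are easy to make concrete: the "mix" setting, which averages the neural network's probability with the trigram's using weight 0.5 on each, and perplexity, which is the exponential of the average negative log probability per token. The function names and the toy probabilities below are hypothetical; only the 0.5 weight comes from the legend.

```python
import math

def mix(p_neural, p_ngram, weight=0.5):
    """Linear interpolation of two model probabilities for the same word."""
    return weight * p_neural + (1.0 - weight) * p_ngram

def perplexity(token_probs):
    """exp of the average negative log probability assigned to each token."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# Made-up per-token probabilities from two models on a 3-token text.
neural = [0.20, 0.05, 0.40]
trigram = [0.10, 0.15, 0.20]
mixed = [mix(pn, pt) for pn, pt in zip(neural, trigram)]
print(perplexity(neural), perplexity(trigram), perplexity(mixed))
```

A model that assigns probability 1/k to every token has perplexity k, so lower values in the last three columns are better.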

24 Experiments (Bengio et al., 2003). Same table as on slide 23 (numeric entries not recovered). Observations: a hidden layer (h > 0) helps; interpolating with the n-gram model ("mix") helps; using higher word embedding dimensionality helps; the 5-gram model is better than the trigram

25 Experiments. Same table as on slides 23 and 24, extended with baseline rows (numeric entries not recovered): Del. Int.; Kneser-Ney back-off (three rows); class-based back-off (four rows)

26 Bengio et al. (2003): they discuss how the word embedding space might be interesting to examine, but they don't do this; they suggest that a good way to visualize/interpret word embeddings would be to use 2 dimensions :) ; they discussed handling polysemous words, unknown words, inference speed-ups, etc.

27 Collobert et al. (2011). Journal of Machine Learning Research 12 (2011). Submitted 1/10; Revised 11/10; Published 8/11. "Natural Language Processing (Almost) from Scratch". Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa. NEC Laboratories America, 4 Independence Way, Princeton, NJ

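28 Figure (window approach network, top to bottom): Input Window with the word of interest at its center; Text: "cat sat on the mat"; Feature 1 through Feature K for each of the N words in the window; Lookup Table (one table per feature, LT_W^1 ... LT_W^K), outputs concatenated; Linear (M^1, n^1_hu hidden units); HardTanh; Linear (M^2, n^2_hu = #tags)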
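The stack of layers in the figure can be sketched as a forward pass over one window: look up each word, concatenate, apply a linear layer, HardTanh, and a second linear layer with one output per tag. This sketch assumes a single feature per word (K = 1), HardTanh defined as clipping to [-1, 1], and made-up parameters; the names `window_scores`, `LT`, `M1`, `b1`, `M2`, `b2` are hypothetical.

```python
def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def hard_tanh(x):
    """Clip to the interval [-1, 1]."""
    return max(-1.0, min(1.0, x))

def window_scores(window_ids, LT, M1, b1, M2, b2):
    """Return one score per tag for the word at the center of the window."""
    # Lookup table, then concatenation of the window's word vectors.
    x = [v for i in window_ids for v in LT[i]]
    hidden = [hard_tanh(a + bi) for a, bi in zip(matvec(M1, x), b1)]
    return [a + bi for a, bi in zip(matvec(M2, hidden), b2)]

# Toy sizes: 4 words, 2-dimensional vectors, window of 3, 2 hidden units, 3 tags.
LT = [[0.1, 0.0], [0.4, -0.2], [0.0, 0.3], [0.2, 0.2]]
M1 = [[1.0, 0.0, 2.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, -3.0, 0.0, 1.0]]
b1 = [0.0, 0.0]
M2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0, 0.1]
scores = window_scores([0, 1, 2], LT, M1, b1, M2, b2)
print(scores)
```

For the language-model training discussed on the following slides, the final layer would output a single score for the window instead of one score per tag.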

29 Collobert et al. Pairwise Ranking Loss: [symbol missing] is the training set of 11-word windows; [symbol missing] is the vocabulary. What is going on here? (loss C on handout)

30 Collobert et al. Pairwise Ranking Loss: [symbol missing] is the training set of 11-word windows; [symbol missing] is the vocabulary. What is going on here? Make the actual text window score higher than every window whose center word is replaced by w

31 Collobert et al. Pairwise Ranking Loss: [symbol missing] is the training set of 11-word windows; [symbol missing] is the vocabulary. This still sums over the entire vocabulary, so it should be as slow as log loss. Why can it be faster? When using SGD, summation → sample
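The "summation → sample" point can be sketched as follows. The loss formula itself is missing from the slide, so this assumes the usual hinge form of the pairwise ranking criterion, max(0, 1 - score(window) + score(corrupted window)), summed over replacement words. The full version loops over the whole vocabulary; the SGD version draws one replacement word per update. The names `corrupt`, `full_ranking_loss`, `sampled_ranking_loss`, and `toy_score` are hypothetical, and `toy_score` stands in for the network.

```python
import random

def corrupt(window, word):
    """Copy of `window` with its center word replaced by `word`."""
    center = len(window) // 2
    return window[:center] + [word] + window[center + 1:]

def full_ranking_loss(window, vocab, score):
    """Sum of hinge terms over every replacement word: cost grows with vocabulary size."""
    s = score(window)
    return sum(max(0.0, 1.0 - s + score(corrupt(window, w))) for w in vocab)

def sampled_ranking_loss(window, vocab, score, rng):
    """One sampled hinge term, as used for a single SGD update."""
    w = rng.choice(vocab)
    return max(0.0, 1.0 - score(window) + score(corrupt(window, w)))

def toy_score(window):
    """Stand-in for the network: likes windows centered on 'sat'."""
    return 2.0 if window[len(window) // 2] == "sat" else 0.0

window = ["the", "cat", "sat", "on", "mat"]  # 5 words here; the slides use 11
vocab = ["sat", "dog", "ran"]
print(full_ranking_loss(window, vocab, toy_score))
print(sampled_ranking_loss(window, vocab, toy_score, random.Random(0)))
```

Each sampled update needs only two scorer evaluations, whereas log loss needs a normalizer over the whole vocabulary for every example.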
