Efficient Estimation of Word Representations in Vector Space

1 Efficient Estimation of Word Representations in Vector Space. Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (Google Brain, 2013). Sung Min Yang, Master in Language Technology, University of Gothenburg

2 Basics: distributed vs. sparse representations. A sparse representation (e.g., a one-hot vector) uses one dimension per concept. A distributed representation is dense: 1. One concept is represented by more than one neuron firing. 2. One neuron represents more than one concept. (Source: Rangan Majumder, Microsoft)

3 Basics: distributed representation. To represent a new shape with a sparse representation, we would have to increase the dimensionality. With a distributed representation, we can represent a new shape within the existing dimensionality. Because of this efficient reuse, distributed representations are used more often than sparse representations. (Source: Rangan Majumder, Microsoft)
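
The contrast between the two representations can be sketched in a few lines of Python. The shape names and the four features below are made up for illustration (they are not taken from the slide); the point is only that the one-hot code has to grow when a new shape arrives, while the distributed code reuses its existing dimensions.

```python
def one_hot(index, size):
    """Sparse code: a single 1 at `index`, zeros everywhere else."""
    vec = [0] * size
    vec[index] = 1
    return vec

# Hypothetical toy concepts (assumption, for illustration only).
shapes = ["horizontal rectangle", "vertical rectangle",
          "horizontal ellipse", "vertical ellipse"]

# Sparse: one dimension per concept.
sparse = {name: one_hot(i, len(shapes)) for i, name in enumerate(shapes)}

# Distributed: each dimension is a feature shared by several concepts.
features = ["horizontal", "vertical", "rectangle", "ellipse"]

def distributed_code(active):
    """Dense-style code: 1 for every feature the concept has."""
    return [1 if f in active else 0 for f in features]

distributed = {name: distributed_code(name.split()) for name in shapes}

# A new concept arrives: the sparse code needs a 5th dimension ...
sparse_with_circle = {name: one_hot(i, 5)
                      for i, name in enumerate(shapes + ["circle"])}
# ... while the distributed code reuses the 4 existing ones
# (assumption: a circle is "horizontal + vertical + ellipse").
distributed["circle"] = distributed_code(["horizontal", "vertical", "ellipse"])

print(len(sparse_with_circle["circle"]), len(distributed["circle"]))
```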

4 Background
Background knowledge required:
- Negative sampling, subsampling
- Neural networks (we don't need the recurrent concept here)
- SGD (stochastic gradient descent) + backpropagation [these two techniques are important in word2vec]. For now, we can read them as "updating the weights".
- Softmax, cross-entropy, activation function (ReLU)

5 Introduction Why so popular? Because they (the Google Brain team) built a real tool and released it. It is lightweight, and it is faster and simpler than previous work.

6 Introduction Main point: word2vec is not a single algorithm. Word2vec has two different model architectures (CBOW and skip-gram), and each model relies on several algorithms. Why word2vec? Because previous neural-network-based methods for finding word vectors were computationally expensive (RAM, time, etc.). Goal of word2vec? Computing continuous vector representations of words 1. from VERY LARGE data sets and 2. quickly.

8 Big picture Two models

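9 Inside word2vec: CBOW (Continuous Bag of Words). Input: the, quick, brown, fox. Goal: predict the next word from the given context. Output (candidates): runs, eats, jumps, chases, goes. Note: in the paper, CBOW predicts the current word from the words on both sides of it; this example uses only the preceding words.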

10 Inside word2vec: CBOW (Continuous Bag of Words). Let's say our whole data set consists of two sentences: 1. the quick brown fox jumps over the lazy dog 2. the dog runs and chases mouse then it goes not well, so dog eats nothing. Input: the quick brown fox. Goal: predict the next word from the given context. Output (candidates): runs, eats, jumps, chases, goes
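
A minimal sketch of how CBOW training examples could be generated from the first sentence. It assumes the symmetric window described in the paper (context words on both sides of the target) rather than the "preceding words only" simplification used on the slide; the function name `cbow_pairs` and the window size of 2 are illustrative choices, not part of the original tool.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a CBOW-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]     # up to `window` words before
        right = tokens[i + 1:i + 1 + window]    # up to `window` words after
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = cbow_pairs(sentence, window=2)

# The example whose target is "jumps":
print(pairs[4])
```

Each pair says "given these context words, the model should predict this target word".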

11 Inside word2vec: skip-gram. Given one word, predict its surrounding words. Let's say our whole data set consists of two sentences: 1. the quick brown fox jumps over the lazy dog 2. the dog runs and chases mouse then it goes not well, so dog eats nothing. [Here, we take the surrounding words to be the words just before and after the target word.] Input: fox. Goal: predict the surrounding words of the given word. Output (candidates): brown, eats, jumps, chases, goes, the, quick, dog, then, jumps, ...
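
The skip-gram setup can be sketched the same way: every word is paired with each of its neighbours inside the window. The helper name `skipgram_pairs` is hypothetical; the window of 1 follows the slide's "just before and after" convention.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center_word, surrounding_word) pairs for a skip-gram-style model."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window=1)

fox_pairs = [p for p in pairs if p[0] == "fox"]
print(fox_pairs)
```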

12 Origin of word2vec. word2vec is not deep learning: both models, CBOW and skip-gram, are "shallow" neural models. Difference: deep learning uses a neural network with many hidden layers; a shallow model uses a neural network with 1 hidden layer.

13 Origin of word2vec. The company that the authors (team) of word2vec belonged to (Google) has been investing in a number of AI teams. Thanks to the huge amount of data it owns, it can use neural networks with many hidden layers, a.k.a. deep learning: "the study of artificial neural networks and related machine learning algorithms that contain more than one hidden layer" (Wikipedia). It then released an open neural network library.

14 Origin of word2vec. In short, word2vec draws on: one-hot encoding, negative sampling, SGD (stochastic gradient descent), backpropagation, hierarchical softmax, cross-entropy, activation function (ReLU), logistic classifier, t-SNE (not SVD), dropout, etc. But we can build the exact same word2vec tool with specific algorithms, which is why word2vec is also supported by other projects: from Google; Word2vec in Python by Radim Rehurek in gensim (plus a tutorial and a demo that uses the above model trained on Google News); Word2vec in Java as part of the deeplearning4j project; another Java version from Medallia; a Word2vec implementation in Spark MLlib.

15 Where to use. OK, so where can we use it? To capture similarity! (Source: [1] Efficient Estimation of Word Representations in Vector Space)

16 How to build. OK, how do we build it? The first concept is that word2vec uses random values for the initial weights. The initial weight values are not important, because training (unsupervised machine learning with our neural network) will fix them for us. [Figure: both weight matrices are labeled "randomly initialized"] In the code: "`hashfxn` = hash function to use to randomly initialize weights, for increased training reproducibility." (gensim)

17 How to build. Suppose that we have only 3 sentences: "the dog saw a cat", "the dog chased the cat", "the cat climbed a tree". Note: the word dimension(ality) is called "neurons" or "number of neurons in the hidden layer" in many papers. In fact, the author of the original word2vec paper never used the term "neuron" in his papers, so don't get confused. The alphabetically sorted bag of words is {1. a 2. cat 3. chased 4. climbed 5. dog 6. saw 7. the 8. tree}. Suppose each word [1, 2, ..., 8] gets a 3-dimensional vector (a.k.a. vector dimensionality, hidden neurons); in other words, 3 hidden neurons. Now we randomly initialize the input matrix and the output matrix (each element of a matrix is called a weight). [Figure: input and output weight matrices, indexed by the words a, cat, chased, climbed, dog, saw, the, tree and by dimension1, dimension2, dimension3]
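
The setup on this slide can be sketched as follows. The seed, the value range of the random weights, and the names `W_in` / `W_out` are assumptions made for the example; the slide only states that both matrices start out random.

```python
import random

sentences = ["the dog saw a cat",
             "the dog chased the cat",
             "the cat climbed a tree"]

# Alphabetically sorted vocabulary, as on the slide.
vocab = sorted({word for s in sentences for word in s.split()})
word_to_index = {word: i for i, word in enumerate(vocab)}

dim = 3                      # vector dimensionality ("hidden neurons")
rng = random.Random(0)       # fixed seed (assumption) so runs are repeatable

def random_matrix(rows, cols):
    """Matrix of small random weights (range chosen arbitrarily)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_in = random_matrix(len(vocab), dim)    # 8 x 3: one row per word
W_out = random_matrix(dim, len(vocab))   # 3 x 8: one column per word

print(vocab)
```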

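18 How to build. Our target word is "cat". We can select the vector for "cat" by taking the dot product of its one-hot vector [0,1,0,0,0,0,0,0] with the input matrix. [Figure: one-hot vector for "cat" times the weight matrix indexed by a, cat, chased, climbed, dog, saw, the, tree and dimension1, dimension2, dimension3]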
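
A short sketch of why the one-hot dot product "selects" a word: multiplying the one-hot row vector by the 8 x 3 input matrix returns exactly the row that belongs to "cat". The matrix values below are placeholders invented for the example.

```python
def vec_mat(vec, mat):
    """Multiply a row vector (length R) by an R x C matrix."""
    cols = len(mat[0])
    return [sum(vec[r] * mat[r][c] for r in range(len(mat)))
            for c in range(cols)]

# Placeholder 8 x 3 input matrix (made-up values); row i belongs to word i.
W_in = [[0.1 * r + 0.01 * c for c in range(3)] for r in range(8)]

cat_one_hot = [0, 1, 0, 0, 0, 0, 0, 0]   # "cat" is word 2 of 8
hidden = vec_mat(cat_one_hot, W_in)

print(hidden == W_in[1])
```

In practice implementations skip the multiplication and simply look up the row, which gives the same result.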

19 How to build. The given data is a word-word co-occurrence frequency matrix. Suppose that the window size is 1 (one word to the left and one word to the right of the target word). From the sentences "the dog saw a cat", "the dog chased the cat", "the cat climbed a tree" we then get this matrix. [Table: rows = target words, columns = output words, both over a, cat, chased, climbed, dog, saw, the, tree; the cell values were lost in extraction]
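
As a hedged sketch, the co-occurrence counts for a window of 1 can be recomputed directly from the three sentences. The function name `cooccurrence` and the (target, neighbour) key layout are choices made for this example.

```python
from collections import Counter

sentences = ["the dog saw a cat",
             "the dog chased the cat",
             "the cat climbed a tree"]

def cooccurrence(sentences, window=1):
    """Count how often each (target, neighbour) pair occurs within the window."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(target, tokens[j])] += 1
    return counts

counts = cooccurrence(sentences)

# Row for the target word "cat":
for (target, neighbour), n in sorted(counts.items()):
    if target == "cat":
        print(neighbour, n)
```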

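20 How to build. Sentences: "the dog saw a cat", "the dog chased the cat", "the cat climbed a tree". Suppose we want the network to learn the relationship between the words "cat" and "climbed". [Figure: weight matrix indexed by dimension1, dimension2, dimension3 and a, cat, chased, climbed, dog, saw, the, tree] The next step (softmax) does nothing but turn real-valued elements into probabilities: [real-valued scores] -> [probabilities]. $\Pr(word_{target} \mid word_{context})$: $P(\text{climbed (target)} \mid \text{cat (context)}) = 1/22 \approx 0.045$, $P(\text{the (target)} \mid \text{cat (context)}) = 1/22$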

21 How to build. Sentences: "the dog saw a cat", "the dog chased the cat", "the cat climbed a tree". Suppose we want the network to learn the relationship between the words "cat" and "climbed". The output elements sum to 1 (a probability distribution). Look at the entry for "climbed", $P(\text{climbed (target)} \mid \text{cat (context)})$. Is this the proper probability? Yes => OK, do nothing. No, it is too high => make it lower. No, it is too low => make it higher. $\Pr(word_{target} \mid word_{context})$: $P(\text{climbed (target)} \mid \text{cat (context)}) = 1/22 \approx 0.045$, $P(\text{the (target)} \mid \text{cat (context)}) = 1/22$

22 How to build. Sentences: "the dog saw a cat", "the dog chased the cat", "the cat climbed a tree". Suppose we want the network to learn the relationship between the words "cat" and "climbed", i.e. $P(\text{climbed (target)} \mid \text{cat (context)})$. How do we make it lower (than $1/22 \approx 0.045$)? By changing the values of the weight matrices. OK, so how? The answer is backpropagation + SGD (stochastic gradient descent). In short, we update WI and Wo to reduce the error. $\Pr(word_{target} \mid word_{context})$: $P(\text{climbed (target)} \mid \text{cat (context)}) = 1/22 \approx 0.045$, $P(\text{the (target)} \mid \text{cat (context)}) = 1/22$

23 How to build. Sentences: "the dog saw a cat", "the dog chased the cat", "the cat climbed a tree". Suppose we want the network to learn the relationship between the words "cat" and "climbed". The goal of backpropagation is to optimize the weights so that the neural network can learn how to correctly map arbitrary inputs to outputs. [Figure: training loop with steps 1 to 4; the weights are changed at steps 3 and 4, then the loop repeats]
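
The loop above can be sketched as a single training example ("cat" in, "climbed" out) passed through a full softmax and updated with plain SGD. This is a simplified illustration under stated assumptions: it uses a full softmax with cross-entropy loss instead of the hierarchical softmax or negative sampling of the real tool, and the seed, learning rate, and names `W_in`, `W_out`, `forward`, `sgd_step` are invented for the example.

```python
import math
import random

vocab = ["a", "cat", "chased", "climbed", "dog", "saw", "the", "tree"]
dim = 3
rng = random.Random(0)  # fixed seed (assumption) for repeatable runs
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]    # 8 x 3
W_out = [[rng.uniform(-0.5, 0.5) for _ in vocab] for _ in range(dim)]   # 3 x 8

def softmax(scores):
    """Turn real-valued scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(word_idx):
    """Step 1-2: look up the hidden vector, score every word, apply softmax."""
    hidden = W_in[word_idx]
    scores = [sum(hidden[d] * W_out[d][k] for d in range(dim))
              for k in range(len(vocab))]
    return hidden, softmax(scores)

def sgd_step(input_idx, target_idx, lr=0.1):
    """Step 3-4: backpropagate the error and update W_out, then W_in."""
    hidden, probs = forward(input_idx)
    hidden = list(hidden)  # copy, because the W_in row is modified below
    # Error at the output: predicted probability minus the one-hot target.
    errors = [p - (1.0 if k == target_idx else 0.0)
              for k, p in enumerate(probs)]
    # Gradient flowing back to the hidden vector (uses the old W_out).
    grad_hidden = [sum(errors[k] * W_out[d][k] for k in range(len(vocab)))
                   for d in range(dim)]
    for d in range(dim):
        for k in range(len(vocab)):
            W_out[d][k] -= lr * errors[k] * hidden[d]
        W_in[input_idx][d] -= lr * grad_hidden[d]
    return -math.log(probs[target_idx])  # cross-entropy loss before the update

cat, climbed = vocab.index("cat"), vocab.index("climbed")
p_before = forward(cat)[1][climbed]
losses = [sgd_step(cat, climbed) for _ in range(50)]   # "repeat"
p_after = forward(cat)[1][climbed]
print(p_before, p_after)
```

After the repeated updates, the probability assigned to "climbed" given "cat" is higher than at the start, which is the "make it higher" branch of the decision on the earlier slide.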

24 Evaluation. The goal of word2vec is to find high-quality vector representations of words. Low quality? lab8 (http://www.petrkeil.…). Query word: Woman. High quality? word2vec in gensim: [('woman', …), ('man', …), ('girl', 0.89133962858176452), ('child', 0.89053309984881468), ('boy', 0.8668296321482909), ('friends', …), ('parents', …), ('herself', …), ('mother', …), ('person', …)]
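
Similarity lists like the one above are typically ranked by cosine similarity between word vectors. A minimal sketch, using three made-up 3-dimensional vectors (not real word2vec output) and a hypothetical `most_similar` helper that mimics the shape of the gensim result:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, for illustration only.
vectors = {
    "woman": [0.9, 0.8, 0.1],
    "girl":  [0.8, 0.9, 0.2],
    "tree":  [0.1, 0.0, 0.9],
}

def most_similar(word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scored = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("woman"))
```

With high-quality vectors, semantically related words end up near the top of this ranking.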

25 How to build. Performance of word2vec (source: [1] Efficient Estimation of Word Representations in Vector Space)

26 Further: translation. Linear relationships between languages: "We noticed that the vector representations of similar words in different languages were related by a linear transformation. For instance, Figure 1 shows that the word vectors for English numbers one to five and the corresponding Spanish words uno to cinco have similar geometric arrangements." ([3] "Exploiting similarities among languages for machine translation.")
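
A toy sketch of the idea: if the vectors of one language are (roughly) a linear transformation of the other's, a matrix W can be learned from a few known word pairs by minimizing the squared error between W x and z with SGD. All vectors here are made-up 2-dimensional points, and the "Spanish" side is constructed as an exact rotated and scaled copy of the "English" side, which real embeddings are not.

```python
import math

def apply(W, x):
    """Multiply a 2 x 2 matrix by a 2-dimensional vector."""
    return [W[0][0] * x[0] + W[0][1] * x[1],
            W[1][0] * x[0] + W[1][1] * x[1]]

# Hidden "true" map used only to fabricate the toy data (assumption).
theta = math.pi / 6
true_W = [[1.5 * math.cos(theta), -1.5 * math.sin(theta)],
          [1.5 * math.sin(theta),  1.5 * math.cos(theta)]]

# Made-up source-language vectors (think: one ... five).
english = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-0.6, 0.8], [-1.0, 0.2]]
# Target-language vectors (think: uno ... cinco).
spanish = [apply(true_W, x) for x in english]

# Learn W by SGD on the squared error ||W x - z||^2.
W = [[0.0, 0.0], [0.0, 0.0]]
lr = 0.1
for epoch in range(500):
    for x, z in zip(english, spanish):
        pred = apply(W, x)
        err = [pred[0] - z[0], pred[1] - z[1]]
        for r in range(2):
            for c in range(2):
                W[r][c] -= lr * 2 * err[r] * x[c]

# Map a vector that was not in the training pairs.
new_word = [0.5, -0.5]
print(apply(W, new_word), apply(true_W, new_word))
```

Once W is learned, a new source-language vector can be mapped into the target space and matched to its nearest target-language word.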

27 Application. OK, so what is observing similarity good for? Is there a real application field? Is it useful? Related models: word2vec, doc2vec, paragraph2vec, item2vec, etc.

28 Application tweet_w2v.most_similar('good') Out[52]: [(u'goood', 0.7355118989944458), (u'great', 0.7164269685745239), (u'rough', 0.656904935836792), (u'gd', 0.6395257711410522), (u'goooood', …), (u'tough', …), (u'fantastic', …), (u'terrible', …), (u'gooood', …), (u'gud', …)]

29 Application. A nice application of word2vec is item recommendation, e.g., movies, music, games, market basket analysis, etc. Here the context is the event history of users: a data set of the items that all users clicked, selected, and installed. In: item vectors such as [0.2, 0.4, 0.1, …], [0.2, 0.3, 0.1, …], [0.2, 0.7, 0.2, …], [0.2, 0.5, 0.6, …], etc. Out: relations between items (item1, item2, ...). ([5] Item2Vec: Neural Item Embedding for Collaborative Filtering)
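
A hedged sketch of how user histories could be turned into training pairs for an item-embedding model. It assumes, as a simplification of the item2vec idea, that a user's history is treated as an unordered set, so every pair of distinct items in the same history counts as a co-occurrence. The user names, item names, and the helper `item_pairs` are invented for the example.

```python
from itertools import permutations

# Hypothetical event histories: items each user clicked, selected, or installed.
histories = {
    "user1": ["movieA", "movieB", "gameC"],
    "user2": ["movieB", "gameC"],
}

def item_pairs(history):
    """Return every ordered (item, context_item) pair from one user's history.

    The history plays the role of a sentence and the items play the role of
    words; order within the history is ignored (assumption).
    """
    items = sorted(set(history))        # drop duplicates, fix an order
    return list(permutations(items, 2))

training_pairs = [pair
                  for history in histories.values()
                  for pair in item_pairs(history)]
print(len(training_pairs))
```

These pairs can then be fed to the same kind of skip-gram training used for words.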

30 Final. "Tomas Mikolov told me that the whole idea behind word2vec was to demonstrate that you can get better word representations if you trade the model's complexity for efficiency, i.e. the ability to learn from much bigger datasets." (Omer Levy, at MIT in machine learning)

31 Final

32 References
[1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations (ICLR 2013), 2013.
[2] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. Distributed Representations of Words and Phrases and their Compositionality. NIPS.
[3] Mikolov, T., Le, Q. V., & Sutskever, I. Exploiting Similarities among Languages for Machine Translation. arXiv preprint, 2013.
[4] Levy, O., & Goldberg, Y. Linguistic Regularities in Sparse and Explicit Word Representations. CoNLL.
[5] Barkan, O., & Koenigstein, N. Item2vec: Neural Item Embedding for Collaborative Filtering. 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2016.
