Explorations in vector space: the continuous-bag-of-words model from word2vec
Jesper Segeblad
January 2016

Contents

1 Introduction
  1.1 Purpose
2 The continuous bag of words model
  2.1 Measuring similarity
  2.2 Evaluating the models
  2.3 Extracting training samples
  2.4 Hyperparameters
3 Experiments
4 Results
5 Conclusions

1 Introduction

Representing words as vectors in a relatively low-dimensional space has become increasingly popular in the natural language processing community over the last few years. In this representation, we can think of words as points in a vector space where words that are semantically similar lie close to each other. This approach has several advantages over representing words as atomic symbols. Words do have relations to one another, and a representation that can capture this can be very beneficial to many natural language applications.

Traditionally, these models have used word co-occurrence matrices, counting how many times words co-occur in some text corpus. Other ways of creating these vectors have also been explored, such as trying to predict a word from the words that surround it. Although these models differ in many ways, they are all trained on raw, unannotated text, and rest on the distributional hypothesis: the hypothesis that words with similar meaning occur in similar contexts (Sahlgren, 2008). This is the assumption that drives these models and also makes them successful.

Word2vec is a highly popular software package that provides two algorithms for creating such word vectors, the skip-gram and continuous-bag-of-words models (Mikolov et al., 2013a). The main difference between the two is the objective: while the skip-gram model tries to predict the surrounding words given a target word, the continuous-bag-of-words model tries to predict the target word given the surrounding words. In a follow-up paper, the authors recommended using the skip-gram model with negative sampling (Mikolov et al., 2013b), and this is the model from word2vec that has gained the most attention. Less attention has been given to the continuous-bag-of-words model. That does not make it less interesting, however, and it is the model studied in greater detail here.

1.1 Purpose

The purpose here is to gain a deeper understanding of the continuous-bag-of-words model from word2vec: both the theory and math behind the model and why it actually works, as well as how it might be implemented in practice.

2 The continuous bag of words model

The continuous bag of words model (CBOW) is inspired by neural network architecture, with an input layer, a hidden layer, and an output layer. Similarly, it is trained using gradient descent and backpropagation, a common technique for training neural networks (Russell and Norvig, 2009). It does, however, lack the non-linear activation function at the hidden layer traditionally used in neural networks, and instead passes on the linear activation of the hidden layer (from now on called the projection layer, as in the original article by Mikolov et al. (2013a)) to the output.

First of all, we need a vocabulary, which contains the words that we want to create word vectors for. The input layer has the dimension of the size of the vocabulary (denote this size by $V$). Between the input layer and the projection layer we have a matrix of dimensions $V \times N$, where $N$ is the desired dimensionality of the word vectors. This matrix will be called $W_{in}$. Each row in this matrix corresponds to a word in our vocabulary. Between the projection and output layer we have a matrix of size $N \times V$, which we call $W_{out}$. With these matrices we have two vector representations of each word in the vocabulary, one coming from the rows of $W_{in}$ and one coming from the columns of $W_{out}$. These will be referred to as the input vectors and output vectors respectively, following Rong (2014). When the model is initialized, all these word vectors have random values, and the objective is to find better values for them so that words that are semantically similar have similar vectors.

We can think of the input as a number of one-hot encoded column vectors, meaning that one of the entries in each vector is 1 while the rest are 0. How many of these vectors are taken as input is decided by the context size. The goal of the model is to predict the word at the output layer given the context words fed into the model at the input layer. This is done by modeling the conditional probability of the output word given the input words, $p(word_{out} \mid words_{in})$. The model takes two inputs at this stage, the context words and the target word.

The context words are projected onto the projection layer by combining their respective input vectors, taken from the rows of the input weight matrix, or more formally, by multiplying them with the input weight matrix. But since they are one-hot encoded, this is equivalent to just copying them. They are combined by taking the average vector (denote this transformation by $h$):

$$h = \frac{1}{C}(v_{w_1} + v_{w_2} + \dots + v_{w_C}) \quad (1)$$

Here, $C$ is the number of words in the context and $v_{w_i}$ is the input vector of the $i$:th word in the context. We thus end up with an average vector of dimensionality $N$. With this vector we can compute a score $u_j$ for each output word (Rong, 2014):

$$u_j = W_{out_j}^T h \quad (2)$$

$W_{out_j}$ is the $j$:th column of the output matrix, the output vector of word $j$. It is transposed from a column vector to a row vector so that we can multiply it by the averaged vector $h$. With this score we can compute the conditional probability of word $j$ being the actual output word using the softmax function:

$$P(word_{out} \mid words_{in}) = y_j = \frac{\exp(u_j)}{\sum_{k=1}^{V} \exp(W_{out_k}^T h)} \quad (3)$$

The score $u_j$ is exponentiated and divided by the sum of the exponentiated scores of all other words in the vocabulary, giving us a probability of word $j$ being the actual output word. The next step is to update the weights of the model based on the error of the prediction.
Since we expect only one output word, the objective is to maximize equation (3) for the actual output word. Given this we have a corresponding loss function $E = -\log p(word_{out} \mid words_{in})$ (Rong, 2014). The error at the output layer can be computed as $e_j = y_j - t_j$, where $y_j$ is the prediction of the model and $t_j$ is the actual probability of word $j$ being the output word: $t_j$ is 1 if word $j$ is the actual output word and 0 otherwise.
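To make the forward pass and loss above concrete, here is a minimal NumPy sketch of equations (1)-(3); the function and variable names (`cbow_forward`, `W_in`, `W_out`, `context_ids`, `target_id`) are assumptions made for illustration, not the original word2vec code, and the softmax is shifted by the maximum score purely for numerical stability.

```python
import numpy as np

def cbow_forward(W_in, W_out, context_ids, target_id):
    """Forward pass of CBOW: equations (1)-(3) and the loss E.

    W_in:  V x N input weight matrix (rows are input vectors)
    W_out: N x V output weight matrix (columns are output vectors)
    context_ids: indices of the context words
    target_id:   index of the word to predict
    """
    # Equation (1): average the input vectors of the context words.
    h = W_in[context_ids].mean(axis=0)            # shape (N,)

    # Equation (2): a score u_j for every word in the vocabulary.
    u = W_out.T @ h                               # shape (V,)

    # Equation (3): softmax over the scores (shifted for numerical stability).
    y = np.exp(u - u.max())
    y /= y.sum()

    # Loss E = -log p(word_out | words_in) for the actual output word.
    loss = -np.log(y[target_id])
    return h, y, loss
```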

To update the weights between the projection and output layer, this error is multiplied with $h$, the averaged vector of the input words, and with the learning rate (denoted $\eta$). This results in another vector, which is then subtracted from the output vector of the word:

$$W_{out_j}^T = W_{out_j}^T - \eta \, e_j \, h \quad (4)$$

When the output vectors have been updated, the error is propagated backwards to update the input vectors in the context. In order to do this, the sum of all output vectors in the vocabulary, weighted by the prediction error, is calculated:

$$EH = \sum_{j=1}^{V} e_j W_{out_j} \quad (5)$$

This gives an $N$-dimensional vector $EH$, which is used to update the input vectors in the context:

$$W_{in_i} = W_{in_i} - \eta \, \frac{1}{C} \, EH \quad \text{for each word } i \text{ in the context} \quad (6)$$

Here, each vector (i.e. row of the input matrix $W_{in}$) of the words in the context is updated by subtracting the averaged error multiplied by the learning rate $\eta$.

The computations for updating the vectors are rather costly, since for each possible output word we have to compute its probability and compare it to the actual probability (0 or 1). The number of output probabilities is the same as the number of words in the vocabulary, meaning that the execution time becomes quite substantial if we have a large vocabulary. With a vocabulary of one million words, a total of one million such computations would have to be made for just one training example. There are, however, optimization tricks that can be applied, such as hierarchical softmax and negative sampling, described in Mikolov et al. (2013b) (though in the context of the skip-gram model). Hierarchical softmax makes use of a binary tree that represents the output layer. Negative sampling recasts the task from a multiclass classification problem to a binary one: given a training sample, the task is to predict whether the word and its context come from the real training data or were sampled from a random distribution.

While the math behind this can seem rather complex and involved, the intuitive understanding of the model is somewhat simpler and builds on the distributional hypothesis previously mentioned. The objective of the model is to maximize the probability of the output target word given the context words, and the weights of the model (the word vectors) are moved in order to maximize this probability. If the probability of a word is overestimated, the input vectors will move further away from the output vector. If the probability is underestimated, the input vectors will move closer to the output vector (Rong, 2014).

2.1 Measuring similarity

With the resulting vectors, the distance between the words in the vocabulary can be measured. If we think of the words as points in a two-dimensional space, words that are close to each other should have a similar meaning (given that our model actually works and that word similarity is, as assumed, reflected by distributional properties). Mikolov et al. (2013a) use the cosine similarity to measure the similarity between them. Geometrically, cosine similarity corresponds to the angle between two vectors, and it is defined as the dot product divided by the product of their respective lengths (Jurafsky and Martin, 2009):

$$\text{cos\_sim}(v, u) = \frac{v \cdot u}{\|v\| \, \|u\|} \quad (7)$$
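As a complement to equations (4)-(7), the sketch below shows one way the plain (unoptimized) update step and the cosine similarity could be written with NumPy, reusing the hypothetical `cbow_forward` helper from above; the names and the choice to apply both updates as a single joint gradient step are assumptions, not the author's actual implementation.

```python
def cbow_train_step(W_in, W_out, context_ids, target_id, lr=0.1):
    """One CBOW training step: equations (4)-(6), updating W_in and W_out in place."""
    h, y, loss = cbow_forward(W_in, W_out, context_ids, target_id)

    # Error at the output layer: e_j = y_j - t_j (t_j is 1 only for the target word).
    e = y.copy()
    e[target_id] -= 1.0

    # Equation (5): sum of output vectors weighted by the prediction error,
    # computed before the output vectors are changed (one joint gradient step).
    EH = W_out @ e                                # shape (N,)

    # Equation (4): update every output vector with its own error.
    W_out -= lr * np.outer(h, e)                  # shape (N, V)

    # Equation (6): update the input vector of each context word
    # (a word occurring twice in the context is updated once here).
    W_in[context_ids] -= lr * EH / len(context_ids)
    return loss

def cosine_similarity(v, u):
    """Equation (7): dot product divided by the product of the vector lengths."""
    return v @ u / (np.linalg.norm(v) * np.linalg.norm(u))
```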

2.2 Evaluating the models

These types of vector representations are often evaluated on various word similarity tasks. Mikolov et al. (2013a) evaluate the models on two types of similarity tasks, semantic and syntactic, and on one sentence completion task. The similarity tasks are questions of the type "a is to b as c is to d", where the task is to fill in d. This is accomplished with vector addition and subtraction: $X = vec(a) - vec(b) + vec(c)$. With this resulting vector $X$, a search over all the vectors in the vocabulary can be made to find the vector that is closest to it. The sentence completion task consists of trying to complete a sentence that has one missing word. A list is given with five words that can all be said to be reasonable choices (Mikolov et al., 2013a).

2.3 Extracting training samples

Since these models are trained on unannotated text, obtaining large amounts of training data is no problem. From this text, training samples can be extracted. These training samples are words together with the context they appear in (from here on called word and context pairs). The context is defined as a symmetric window around the focus word. One can view this as sliding a window over the text collection and looking at a word and the n words around it (a sketch of this sliding-window extraction is given at the end of this section). If, for example, a window size of 2 is chosen, the two preceding and the two following words are used as the context. Exactly how many words should be used as context is not well established, and different window sizes capture different types of semantic similarity (Goldberg, 2015). This might be something to choose depending on the task that the resulting vectors are to be used for. Goldberg (2015) also describes some types of preprocessing that can be applied before extracting the word-context pairs, which can include removing words that appear too frequently or too infrequently, and removing sentences that are either too long or too short.

2.4 Hyperparameters

The continuous-bag-of-words model relies on a few hyperparameters. They won't be covered in much detail, but the most important ones used in the implementation further on are described. First, we have to decide on the dimensionality of the resulting word vectors ($N$ from the previous description). The results presented in Mikolov et al. (2013a) indicate that a higher dimensionality might give better word representations, since vectors of a higher dimensionality performed better on the tasks they were evaluated on. Another parameter is how the vectors are initialized. Since they are initialized to random values, one has to decide how these random values are chosen. The approach used in word2vec is to use values sampled uniformly between $-\frac{1}{2N}$ and $\frac{1}{2N}$ (Goldberg, 2015). Yet another hyperparameter is the learning rate. Mikolov et al. (2013a) used an initial learning rate of 0.025, and during training decreased it so that it approached zero further into the training.
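As referenced in Section 2.3, here is a minimal sketch of how word-context training pairs could be extracted with a symmetric window; the function name and the representation of sentences as lists of tokens are assumptions made for illustration.

```python
def extract_training_pairs(sentences, window=2):
    """Slide a symmetric window over each sentence and yield (context, target) pairs.

    sentences: iterable of lists of tokens
    window:    number of words taken on each side of the focus word
    """
    for tokens in sentences:
        for i, target in enumerate(tokens):
            # Words near a sentence boundary simply get a smaller context.
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            if context:
                yield context, target

# Example: a window of 2 over a short sentence.
pairs = list(extract_training_pairs([["we", "play", "football", "outside"]], window=2))
# yields (["play", "football"], "we"), (["we", "football", "outside"], "play"), ...
```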

3 Experiments

To get a clearer understanding of the inner workings of the model, a small implementation was made in Python together with the NumPy library (Van Der Walt et al., 2011). This implementation is rather simple and not very efficient, because of the inefficient weight updating, and none of the optimization tricks were implemented.

The model as implemented is initialized together with a vocabulary. This vocabulary is a Python dictionary with the words that we want to create word vectors for. Each word also has a corresponding index number, which is how we know which vector is assigned to which word. The weight matrices are initialized randomly with values between $-\frac{1}{2N}$ and $\frac{1}{2N}$ drawn from a uniform distribution, as in the original implementation. The learning rate was set to a default value of 0.1, and it is not adjusted during training of the model. The default size of the projection layer is set to 100.

As the original intention was to train the model on the first 1,000,000 tokens from Swedish Wikipedia, some methods to preprocess that corpus were implemented. This preprocessing removes all words appearing fewer than 10 times in the corpus, so that they appear neither as target nor as context words. This was done before extracting the word and context pairs used as training samples. In practice this means that the context window is expanded. The context window size was set to the two previous words and the two following words. For words at the beginning of a sentence, which have no preceding words, only the following words are used as context; the same holds for words at the end of a sentence. Following Levy et al. (2015), sentences shorter than 5 words were also removed. This resulted in around 9,000 unique words and 55,000 sentences for training. Even with that relatively small number of words, training was too slow to complete over the acquired 1,000,000 words from Swedish Wikipedia. These limitations make it hard to evaluate how well the implemented model actually performs the way it was originally intended, even though the code necessary for doing so was implemented.

Instead, a small testing corpus was built to see how well this implementation performed and whether the resulting vector similarities make sense. These experiments were done with a model having the default parameters described above, except that the window size was set to 1+1 instead of 2+2, i.e. one word to the left and one word to the right. The testing corpus is given below:

drink apple juice there. drink orange juice there. drink tomato juice there. drink beer there. drink milk there.
they eat beans here. they eat meat here. they eat pasta here. they eat pork here. they eat rice here.
we play football outside. we play ice hockey outside. we play soccer outside. we play golf outside.

As can be seen, this testing corpus is quite small. The small size is made up for by doing multiple passes over the training data.
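A minimal sketch of how a model matching the description above could be set up is shown below, assuming hypothetical names (`build_vocab`, `CBOW`), the stated defaults (N = 100, learning rate 0.1, uniform initialization in ±1/(2N), minimum word count 10), and reusing the `cbow_train_step` sketch from Section 2; this is an illustration under those assumptions, not the author's exact code.

```python
from collections import Counter
import numpy as np

def build_vocab(sentences, min_count=10):
    """Map each sufficiently frequent word to an index, as described above."""
    counts = Counter(w for sent in sentences for w in sent)
    words = [w for w, c in counts.items() if c >= min_count]
    return {w: i for i, w in enumerate(words)}

class CBOW:
    def __init__(self, vocab, n_dim=100, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        V, N = len(vocab), n_dim
        self.vocab, self.lr = vocab, lr
        # Uniform initialization in [-1/(2N), 1/(2N)], as in the original implementation.
        self.W_in = rng.uniform(-1 / (2 * N), 1 / (2 * N), size=(V, N))
        self.W_out = rng.uniform(-1 / (2 * N), 1 / (2 * N), size=(N, V))

    def train(self, pairs, epochs=200):
        """Run several passes over (context, target) pairs using cbow_train_step."""
        for _ in range(epochs):
            for context, target in pairs:
                ids = [self.vocab[w] for w in context if w in self.vocab]
                if ids and target in self.vocab:
                    cbow_train_step(self.W_in, self.W_out, ids,
                                    self.vocab[target], self.lr)
```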

4 Results

Some results can be found in the table below. These results were retrieved using a vector dimensionality of 100 and 200 passes over the training data. This high number of passes was necessary because of the small size of the training data. Although these results might not be the most interesting, they do reflect some regularities in the training data. If we for example look at the first three entries in the table, we see that the closest words are words separated from the query word by one other word. In other words, they are words that appear in the same contexts.

Word     Closest word   Cosine similarity
play     outside        0.987
drink    juice          0.919
eat      here           0.999
juice    drink          0.919
apple    there          0.454
beans    they           0.412

To see the effect of a smaller vector dimensionality, another test was done with this parameter set to 50 instead of the default 100. These results can be seen below. The only thing that has changed is the cosine similarities (and not even by that much); the closest words are still the same.

Word     Closest word   Cosine similarity
play     outside        0.982
drink    juice          0.968
eat      here           0.998
juice    drink          0.968
apple    there          0.347
beans    they           0.531

200 passes over the training data is quite a lot. To see the effects of a smaller number of passes, this number was reduced to 50. The results are presented below.

Word     Closest word   Cosine similarity
play     outside        0.353
drink    juice          0.497
eat      here           0.689
juice    drink          0.497
apple    pork           0.208
beans    they           0.315

The cosine similarities are significantly lower. The most interesting thing, however, is that the nearest word to apple is now pork, and they are not that close. This is quite surprising since they do not share any context in the training data. It might be attributed to the small number of actual training instances and the smaller number of passes over the data.
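The closest-word entries above are nearest-neighbour queries by cosine similarity; a sketch of how such a query could be written over the input vectors is given below (the function name, the choice to query the input vectors, and the `CBOW` model object from the Section 3 sketch are assumptions).

```python
def closest_word(model, word, exclude_self=True):
    """Return the vocabulary word whose input vector is most cosine-similar to `word`."""
    query = model.W_in[model.vocab[word]]
    best_word, best_sim = None, -1.0
    for other, idx in model.vocab.items():
        if exclude_self and other == word:
            continue
        sim = cosine_similarity(query, model.W_in[idx])
        if sim > best_sim:
            best_word, best_sim = other, sim
    return best_word, best_sim

# Example usage: per the reported results, a call like closest_word(model, "play")
# would be expected to return "outside" on the small test corpus.
```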

5 Conclusions

Vector representations of words are very popular among natural language processing researchers and practitioners today. They perform very well on a number of tasks and can capture many types of word similarity. Here, the continuous-bag-of-words model from the popular word2vec package has been implemented. Even though the model could not be tested as originally intended, the tests that were made did show that the resulting word vectors can reflect regularities in the training data. And while the provided implementation does not scale to large vocabularies, it does provide a starting point for exploring the model further. One such exploration might be to look at the optimizations regarding weight updating.

References

Yoav Goldberg. A primer on neural network models for natural language processing. arXiv preprint arXiv:1510.00726, 2015.

Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Education, 2nd edition, 2009. ISBN 0130950696.

Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211-225, 2015.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR, 2013a.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013b.

Xin Rong. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738, 2014.

Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, 3rd edition, 2009. ISBN 9780136042594.

Magnus Sahlgren. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33-53, 2008.

Stéfan van der Walt, S. Chris Colbert, and Gaël Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22-30, 2011.