CSCI 315: Artificial Intelligence through Deep Learning W&L Winter Term 2017 Prof. Levy Autoencoder Networks: Embedding and Representation Learning (Chapter 6)
Motivation Representing words and other data as arbitrary one-hot codes is convenient for building a simple dictionary, but creates two problems: (1) it ignores similarity between word meanings (cat should be closer to mouse than either is to bell), making tasks like language learning much harder; (2) one-hot codes become computationally impractical once we approach a realistic (20-30k words) vocabulary: for each word, softmax would need to compute the probability distribution over tens of thousands of words: $y_i = f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$
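To make that cost concrete, here is a minimal, numerically stable softmax in Python (a sketch for illustration, not part of the course code): with a 30,000-word vocabulary, every prediction must exponentiate and normalize 30,000 scores.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector of scores."""
    e = np.exp(x - np.max(x))   # subtract the max to avoid overflow
    return e / e.sum()

# Toy illustration (the vocabulary size is the only "real" number here):
scores = np.random.randn(30_000)   # one score per vocabulary word
probs = softmax(scores)
print(probs.shape, probs.sum())    # (30000,) 1.0
```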
Data Compression So what we'd like to be able to do is pick an arbitrary, reasonable size for our codes (maybe a few hundred units), and compress the entire vocabulary into codes of that size. Such a compression (a.k.a. embedding) should also put similar words near each other in the vector space.
[Figure: 2-D embedding space with similar words clustered together: cat / dog / mouse, chase / frighten / speak, bell / collar, the / if]
Elman (1990) Revisited Elman's SRN showed that a neural network can "find structure in time" (discover semantic similarities among words) by being trained to predict the next word from the current and previous words. The SRN is a variety of auto-encoder network: a network where the input and output layers represent the same thing (hence, same # of units), and the cool stuff happens on the hidden layer.
The Simplest Autoencoder Although Buduma cites Hinton & Salakhutdinov (2006) for autoencoders, a three-layer backprop network (available since 1986) can serve as a simple autoencoder: such networks used to be called auto-associators (e.g., Pollack's RAAM, 1990). Let's build a simple 8-3-8 auto-associator using our Backprop class.
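A minimal sketch of such an 8-3-8 auto-associator in plain numpy (not the course's Backprop class, whose interface isn't reproduced here): train it to reproduce the eight one-hot patterns and then inspect the 3-unit hidden codes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.eye(8)                        # eight one-hot input (and target) patterns
W1 = rng.normal(0, 0.5, (8, 3))      # input -> hidden weights
b1 = np.zeros(3)
W2 = rng.normal(0, 0.5, (3, 8))      # hidden -> output weights
b2 = np.zeros(8)
lr = 0.5

for epoch in range(20_000):
    H = sigmoid(X @ W1 + b1)         # 3-unit hidden codes
    Y = sigmoid(H @ W2 + b2)         # reconstructed outputs
    dY = (Y - X) * Y * (1 - Y)       # output delta (squared-error loss)
    dH = (dY @ W2.T) * H * (1 - H)   # hidden delta, backpropagated
    W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(0)

# Hidden codes tend to look roughly binary, one distinct code per pattern.
print(np.round(sigmoid(X @ W1 + b1), 2))
```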
As we can see, our simple 8-3-8 auto-associator has "discovered" (devised, learned) a binary representation for the one-hot codes on its hidden layer. Now we are ready to look at Hinton & Salakhutdinov (2006). First, let's predict: How did their network differ from our simple one? What dataset did they use?
All layers are fully connected, with a sigmoidal activation function. Why only two units in the innermost hidden layer?
Dimensionality Reduction People can only visualize data in two or three dimensions. Several techniques exist for reducing high-dimensional data to such low dimensions; the most popular, dating back to 1933, is Principal Component Analysis (PCA). The figure above shows a 2 → 1 dimension reduction, but PCA can be used for any number of dimensions.
Limitations of PCA As Fig. 6-2 shows, PCA is a fundamentally linear technique: it works by re-aligning the data along a few mutually orthogonal (right-angled) axes, like X,Y or X,Y,Z. Works well for many kinds of data; however, a simple example shows the limitations of the linearity assumption. Does this non-linearity problem remind you of anything?
PCA vs. DL: The MNIST Challenge Reducing 784 dimensions to two: (On the other hand, PCA can be coded in two or three lines of Python!)
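For example, here is a hedged sketch of that few-line PCA, done via SVD (a random array stands in for the real 784-dimensional MNIST data, whose loading isn't shown):

```python
import numpy as np

X = np.random.rand(1000, 784)    # stand-in for (n_samples, 784) flattened MNIST digits
Xc = X - X.mean(axis=0)          # center each pixel dimension
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T               # project onto the top two principal components
print(X2.shape)                  # (1000, 2) -- ready for a 2-D scatter plot
```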
Autoencoder as De-noiser (a.k.a. Cleanup Memory) Give the trained network a degraded image, and see what comes out the other end:
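Here is a hedged Keras sketch of the idea (layer sizes, noise level, and training settings are illustrative, not the book's exact model): corrupt MNIST digits with noise, train the autoencoder to recover the clean originals, then feed it degraded test images.

```python
import numpy as np
from keras.datasets import mnist
from keras.layers import Input, Dense
from keras.models import Model

# Load and flatten MNIST, then make noise-corrupted copies.
(x_train, _), (x_test, _) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0
noisy_train = np.clip(x_train + 0.5 * np.random.randn(*x_train.shape), 0, 1)
noisy_test = np.clip(x_test + 0.5 * np.random.randn(*x_test.shape), 0, 1)

inp = Input(shape=(784,))
code = Dense(32, activation='relu')(inp)        # bottleneck ("embedding") layer
out = Dense(784, activation='sigmoid')(code)    # reconstruction of the clean image
autoencoder = Model(inp, out)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Train on noisy inputs with clean targets, then de-noise unseen images.
autoencoder.fit(noisy_train, x_train, epochs=10, batch_size=256, validation_split=0.1)
cleaned = autoencoder.predict(noisy_test)       # degraded images in, cleaned-up images out
```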
word2vec: Word Prediction Revisited Predicting the next word from the current word may be too narrow a view of how to find structure in time. Looking at a window (bag) of a few words before and after the current word can be an even better way of discovering relationships: We saw some LIONS and ELEPHANTS at the ZOO. The ZOO had no LIONS, but lots of ELEPHANTS. ELEPHANTS and LIONS live in the ZOO. Since sequence order no longer matters, we don't need a recurrent net (and its associated complexity) to learn the relationships. This is not a novel idea! Before looking more at word2vec, let's look at an earlier approach...
Latent Semantic Analysis: The Ultimate Bag of Words Algorithm LSA: An extremely simple matrix-based method: one word per row, one document per column; each cell j,k shows the number of times word j appears in document k. Applying a clever transformation (Singular Value Decomposition) reveals latent ("hidden") relationships between words and the documents in which they should have appeared!
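A toy sketch of that pipeline (the word-document counts below are made up): build the word-by-document count matrix, apply SVD, and keep only the top few latent dimensions as word representations.

```python
import numpy as np

words = ['lions', 'elephants', 'zoo', 'dollar', 'clock']
# Rows = words, columns = documents; cell j,k counts word j in document k.
A = np.array([[2, 1, 1, 0],      # lions
              [1, 2, 1, 0],      # elephants
              [1, 1, 1, 0],      # zoo
              [0, 0, 0, 2],      # dollar
              [0, 0, 0, 1]])     # clock

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                            # keep only the top-k latent dimensions
word_vectors = U[:, :k] * S[:k]  # low-dimensional word representations
for w, v in zip(words, word_vectors):
    print(w, np.round(v, 2))     # zoo-related words end up near each other
```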
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.66.7422&rep=rep1&type=pdf LSA
word2vec Architecture is simple: a classic input-hidden-output (n-h-n) auto-encoder. Two flavors: Continuous Bag Of Words (CBOW): Predict a single word (LIONS) from neighboring words (ZOO, ELEPHANTS). Useful for smaller datasets. Skip-Gram: Predict neighbors (ZOO, ELEPHANTS) from a single word (LIONS). Useful for larger datasets. But this still leaves us with the second original problem: given the low-dimensional embedding of a word, how can the output layer (decoder) compute the softmax over n > 10K words?
word2vec: Noise-Contrastive Estimation Instead of trying to compute the softmax over all n vocabulary words, word2vec compares the actual input/target pattern (MONKEY → ZOO, LIONS) with a randomly selected bogus ("noise") pattern (MONKEY → DOLLAR, CLOCK). The closer the obtained output is to the bogus output, the higher the value of the loss function, and the greater the adjustment on the weights from the hidden (embedding) layer to the output (one-hot) layer.
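A hedged numpy sketch of the resulting objective, written in the negative-sampling form used by word2vec (a simplified variant of NCE; the variable names and sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(v_center, u_true, u_noise):
    """v_center: embedding of the input word; u_true: output vector of the actual
    context word; u_noise: output vectors of the randomly drawn "bogus" words."""
    pos = -np.log(sigmoid(u_true @ v_center))               # pull the true pair together
    neg = -np.sum(np.log(sigmoid(-(u_noise @ v_center))))   # push the noise pairs apart
    return pos + neg

d = 50                                       # embedding size (arbitrary for the sketch)
rng = np.random.default_rng(1)
loss = neg_sampling_loss(rng.normal(size=d),
                         rng.normal(size=d),
                         rng.normal(size=(5, d)))            # 5 noise words, not all 30k
print(loss)
```

Note that the cost of one update now scales with the handful of noise words, not with the full vocabulary.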
word2vec: Analogy as Linear Transformation https://www.tensorflow.org/images/linear-relationships.png
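The figure's point is that analogies become vector arithmetic in the embedding space. A toy illustration with made-up 3-D embeddings (real word2vec vectors have hundreds of dimensions):

```python
import numpy as np

emb = {'king':  np.array([0.9, 0.8, 0.1]),
       'man':   np.array([0.5, 0.1, 0.1]),
       'woman': np.array([0.5, 0.1, 0.9]),
       'queen': np.array([0.9, 0.8, 0.9])}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "man is to king as woman is to ?" -- answered by vector arithmetic.
target = emb['king'] - emb['man'] + emb['woman']
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)   # 'queen' with these toy vectors
```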