Incorporating Semantic Information into Image Classifiers

Osbert Bastani and Hamsa Sridhar
Advised by Richard Socher
December 14, 2012

1 Introduction

In this project, we investigate the incorporation of semantic information into image classifiers. In particular, we are interested in using neural networks and techniques from natural language processing to train an image classifier that exploits the semantic information contained in word vectors that have already been trained in an unsupervised fashion.

Traditional image classifiers ignore the semantic content of the names of the image classes. Especially when classifying fine-grained image classes, it would be hugely beneficial to jointly train representations of the labels along with representations of the images. One can even imagine learning to recognize a class for which there are few or no training examples, based solely on training on semantically similar words. For example, if the classifier can learn from training examples that boat and ship represent similar objects, then it could learn to identify boat images using ship training examples.

The goal of our project is to show that we can leverage semantic information when training image classifiers. We incorporate semantic information into deep learning algorithms, especially neural networks, to solve the classification problem.

1.1 Acknowledgements

Osbert contributed much of the code on neural networks, while Hamsa contributed much of the code for preprocessing the images, scripting the training procedures, and visualizing the results. Most of the time was spent running the code and analyzing the results, which we did jointly. Richard was incredibly helpful throughout the project, always suggesting new things to try whenever we ran into a wall.

Submitting this project for both CS 224n and CS 229. Submitting this project for CS 229 only.
2 Background

Collobert et al. successfully used neural networks to solve various standard problems in NLP [1]. They achieve this by using unsupervised learning to learn representations of words in a high-dimensional vector space. The neural networks they use have k inputs, where k is the length of the window of words being used. The first layer of the network is a lookup table that maps atomic words to word vectors, generally in $\mathbb{R}^{50}$ or $\mathbb{R}^{100}$. Next comes a layer that computes an affine transformation of the outputs of the previous layer, followed by a hard tanh applied component-wise. The error function computed by the final layer for a single k-word training phrase $s$ is the pairwise ranking cost, i.e. the sum

$$J(f; s) = \sum_{s' \in S_c(s)} \max\{0,\ 1 - f(s) + f(s')\},$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is some function mapping word vectors to scores, usually taken to be a composition of affine maps with a nonlinearity such as tanh. Here $s$ is the window of the original k words, and $S_c(s)$ is the set of windows with the same k words as $s$ except the middle word, which is replaced with another word. The idea is for $f(s)$ to be significantly greater than $f(s')$, so the neural network trains word vectors that describe which words belong in a given window. The word vectors obtained by [1] tend to cluster words with similar meanings, since such words can replace one another in the same context while preserving the meaning of the sentence.

Our goal is to use the semantic information encoded in these word vectors to enhance the capabilities of image classifiers. Given a set of images $I$, where each $x \in I$ corresponds to a label $l(x) \in L$, the basic image classification problem is to predict $l(x)$ given $x$ based on previous training examples. We assume further that each label $l \in L$ corresponds to a word vector $w_l$ that has been trained by the method outlined above to carry semantic information.
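The pairwise ranking cost described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of [1]: the scoring function `score` (a single affine map followed by tanh), the names `ranking_cost`, `weights`, and `bias`, and the toy 2-dimensional word vectors are all assumptions made for the example.

```python
import math


def score(window, weights, bias):
    """Toy score f(s): tanh of one affine map over the concatenated word vectors."""
    flat = [value for vector in window for value in vector]
    return math.tanh(sum(w * v for w, v in zip(weights, flat)) + bias)


def ranking_cost(window, replacements, weights, bias):
    """Sum over corrupted windows s' of max(0, 1 - f(s) + f(s'))."""
    middle = len(window) // 2
    f_s = score(window, weights, bias)
    total = 0.0
    for replacement in replacements:
        # s' keeps every word of s except the middle one.
        corrupted = window[:middle] + [replacement] + window[middle + 1:]
        total += max(0.0, 1.0 - f_s + score(corrupted, weights, bias))
    return total


# Hypothetical 3-word window of 2-dimensional word vectors.
window = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
weights = [0.0, 0.0, 5.0, 5.0, 0.0, 0.0]
cost = ranking_cost(window, [[-1.0, -1.0], [0.0, 0.0]], weights, 0.0)
print(cost)
```

A replacement word contributes nothing to the cost once the original window outscores the corrupted one by a margin of at least 1; replacing the middle word with itself contributes exactly the margin.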
The most ambitious goal would be to train a zero-shot classifier, which can classify images of a given label $l$ that did not occur in the training set. For example, given training images of apples and carrots, we can imagine classifying oranges without having seen an orange, given only the fact that apples and oranges are both spherical, and that carrots and oranges have the same color. The hope is to identify semantic correspondences that yield similar intuitions and allow us to classify oranges, without having seen a training example of an orange, with some nontrivial degree of accuracy. At a minimum, we hope to demonstrate that semantic information allows the classifier to learn a new class more quickly, akin to how a description of an object using synonyms would improve a human's ability to recognize it.

3 Methodology

We implemented and tested image classifiers using neural networks that incorporate semantic information encoded by word vectors. Our first approach is an extension of the cross entropy method used in [1] to train word vectors, and the second computes a linear map from the image feature space to the word vector space, trained to map image features close to their labels. We additionally implemented and tested two more neural networks: (i) a map from image feature space to word vector space with a single hidden layer, and (ii) a map that also acts as a sort of autoencoder. However, both of these performed poorly, so we focused on the cross entropy classifier and the mapping classifier.

Cross Entropy Classifier. Our first approach is to replace the middle word in the input to the neural network by the image annotation, and to replace the remainder of the window by the image features $x$. This way, the score for a word describing the image will be significantly higher than the score for a word that doesn't describe the image at all. Let

$$f(x, l; W, U, b, c) = U\, g\!\left(W \begin{bmatrix} x \\ w_l \end{bmatrix} + b\right) + c,$$

where $g(z)$ is the sigmoid function applied coordinatewise.
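As a rough illustration of the score defined above, the following Python sketch evaluates $f(x, l)$ for a concatenated image feature vector and label word vector, and picks the highest-scoring label. The tiny dimensions, the parameter values, and the names `score` and `predict` are assumptions chosen for the example; they are not the trained parameters or code used in the project.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(x, w_l, W, U, b, c):
    """f(x, l) = U g(W [x; w_l] + b) + c, with g the coordinatewise sigmoid."""
    v = x + w_l  # list concatenation stacks image features on the word vector
    hidden = [
        sigmoid(sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i)
        for row, b_i in zip(W, b)
    ]
    return sum(u_i * h_i for u_i, h_i in zip(U, hidden)) + c


def predict(x, word_vectors, W, U, b, c):
    """Return the label whose word vector gives the highest score for x."""
    return max(word_vectors, key=lambda l: score(x, word_vectors[l], W, U, b, c))


# Hypothetical 2-dimensional image features and word vectors, 2 hidden units.
word_vectors = {"boat": [1.0, 0.0], "dog": [0.0, 1.0]}
W = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, -2.0]]
U = [1.0, 1.0]
b = [0.0, 0.0]
c = 0.0
x = [1.0, 0.0]
print(predict(x, word_vectors, W, U, b, c))
```

Because the word vector enters the hidden layer through the same weights for every label, two labels with nearby word vectors necessarily receive nearby scores for the same image.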
Then the error for a single training example is the same sum as before, except this time $S_c$ ranges over the image paired with incorrect labels:

$$J(W, U, b, c; I, l) = \sum_{x \in I} \sum_{l' \in L \setminus \{l(x)\}} J(W, U, b, c; x, l, l') + \lambda\left(\|U\|^2 + \|W\|^2\right),$$

where

$$J(W, U, b, c; x, l, l') = \max\{0,\ 1 - f(x, l(x); W, U, b, c) + f(x, l'; W, U, b, c)\}$$

and $L \setminus \{l(x)\}$ denotes the set of all labels in $L$ except for the label $l(x)$ corresponding to $x$. The predicted label for a new image is then the label with the highest score, i.e.

$$l_{pred}(x) = \arg\max_{l \in L} f(x, l; W, U, b, c).$$

Since labels that are semantically similar will have similar word vectors, the hope is that training images can essentially be trained for multiple word vectors at the same time, since a higher score $f(x, l)$ also results in a higher score $f(x, l')$ so long as $\|w_l - w_{l'}\|$ is small.

Mapping Classifier. Our second approach is to compute a mapping from the image feature space to the word vector space, trained to map images close to their label vectors. The cost function is

$$J(U, b; I, l) = \sum_{x \in I} \|w_{l(x)} - (Ux + b)\|^2 + \lambda \|U\|^2.$$

The predicted label is the one whose word vector is closest to $Ux + b$, i.e.

$$l_{pred}(x) = \arg\min_{l \in L} \|w_l - (Ux + b)\|^2.$$

The hope is that features map in some structured way so that unseen labels are also mapped approximately correctly.

In addition to zero-shot learning, where we exclude all images with a given label from the training set, we test whether including semantic information helps the neural network learn more efficiently. We include a small number of image features of the excluded class, $I_d$, with the following cost function:

$$J(\ldots, I, I_d, l) = J(\ldots, I, l) + \alpha J(\ldots, I_d, l).$$

Furthermore, to analyze the usefulness of the included semantic information, we train both on word vectors $w$ from [1] and on word vectors $w \sim U[0, 1]$ chosen uniformly at random (and fixed for the duration of the tests).
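The mapping classifier's cost, its nearest-word-vector prediction rule, and the α-weighted objective for the excluded class can be sketched as follows. This is only an illustration under stated assumptions: the function names, the 2-dimensional toy data, and the identity map `U` are hypothetical, and the combined cost follows the formula above literally, so the regularizer also appears inside the α-weighted term.

```python
def affine(U, b, x):
    """Compute Ux + b for a matrix U given as a list of rows."""
    return [sum(u * xj for u, xj in zip(row, x)) + bi for row, bi in zip(U, b)]


def sq_dist(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def mapping_cost(U, b, images, labels, word_vectors, lam):
    """Sum over x of ||w_l(x) - (Ux + b)||^2, plus lam * ||U||^2."""
    data = sum(
        sq_dist(word_vectors[l], affine(U, b, x)) for x, l in zip(images, labels)
    )
    reg = lam * sum(u * u for row in U for u in row)
    return data + reg


def combined_cost(U, b, seen, held_out, word_vectors, lam, alpha):
    """J(I, l) + alpha * J(I_d, l), with each argument a pair (images, labels)."""
    seen_cost = mapping_cost(U, b, seen[0], seen[1], word_vectors, lam)
    held_cost = mapping_cost(U, b, held_out[0], held_out[1], word_vectors, lam)
    return seen_cost + alpha * held_cost


def predict_label(U, b, x, word_vectors):
    """Label whose word vector is closest to the mapped image Ux + b."""
    mapped = affine(U, b, x)
    return min(word_vectors, key=lambda l: sq_dist(word_vectors[l], mapped))


# Hypothetical 2-dimensional word vectors and an identity mapping.
word_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
U = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
print(predict_label(U, b, [0.9, 0.1], word_vectors))
print(combined_cost(U, b, ([[1.0, 0.0]], ["cat"]), ([[0.0, 0.0]], ["dog"]),
                    word_vectors, lam=0.5, alpha=0.5))
```

Note that `predict_label` can return a label that never appeared in training, since it only needs that label's word vector; this is what makes the mapping approach a candidate for zero-shot classification.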
4 Results

In order to test and analyze the neural networks, we train the classifiers on the CIFAR-10 dataset, which consists of ten categories of images: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. For our trials we used the preinitialized word vectors from [1]. We implemented and tested the cross entropy classifier and the mapping classifier for various parameters. The basic parameters are given in Table 1.

Parameter              Value
Hidden Layer Size      100
λ                      10^{-3}
Training Images        8093
Cat Images             807
Test Images            1000
α for cross entropy    0.05
α for mapping          1.0

Table 1: Neural network parameters.

We deleted all cat images (label 4) from the training set in our attempt to achieve zero-shot learning. We also tested both the semantic word vectors from [1] and preinitialized randomized word vectors to compare performance.

Unfortunately, as of now we have been unable to obtain good results on zero-shot learning, so we decided to focus on including a small number of cat images in the training sets and comparing how performance scaled with training set size when using semantic word vectors versus randomized word vectors. We tuned the α parameter that weights the cat images independently for each classifier, and found that 0.05 was a good value for the cross entropy classifier whereas 1.0 was a good value for the mapping classifier. We show only the accuracy, since that gives the most transparent explanation of what is happening. Results are shown in Table 2.

Label (l)     Mapping   C. E.   1 / ||w_cat - w_l||
airplane      65        27      0.115
automobile    8         18      0.106
bird          179       135     0.159
cat           31        0       -
deer          79        64      0.162
dog           376       392     0.193
frog          114       158     0.105
horse         13        72      0.0985
ship          31        25      0.120
truck         10        15      0.0887

Table 3: Distribution of predicted labels for cat vectors for the zero-shot classifiers, along with inverse distances between each word vector and the cat vector.

5 Discussion

While we failed to achieve zero-shot learning in our results, the mapping classifier did have a nonzero score for zero-shot learning. Moreover, in both classifiers, we were still able to demonstrate the value of semantic word vectors vs. randomized word vectors.

5.1 Zero-Shot Learning

We begin with a discussion of the results and issues with our attempt at zero-shot image classification.

Mapping Classifier. The zero-shot mapping classifier achieved an accuracy of 3.4% (0.034 in Table 2) on the test set of cat images. Even though this is worse than random, a trained classifier would typically classify images only into the classes it has already seen, so this result, while not promising, is still noteworthy. One thing we noticed here is that when not testing for zero-shot learning, semantic information actually decreased performance, indicating that we need to further tune α. In Table 3, we give the distribution of the labels predicted by the zero-shot mapping classifier for cat vectors.
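The numbers in Table 3 allow two quick checks, sketched below in Python: the zero-shot cat accuracy of the mapping classifier recomputed from its predicted-label counts, and a Pearson correlation between those counts and the inverse word-vector distances. The data are copied from Table 3; the choice of Pearson correlation and the helper name `pearson` are assumptions made for illustration, not an analysis performed in the project.

```python
import math

# Table 3, mapping classifier: predicted labels for the held-out cat images.
mapping_counts = {
    "airplane": 65, "automobile": 8, "bird": 179, "cat": 31, "deer": 79,
    "dog": 376, "frog": 114, "horse": 13, "ship": 31, "truck": 10,
}
# Table 3: inverse distance between each word vector and the cat vector.
inverse_distance = {
    "airplane": 0.115, "automobile": 0.106, "bird": 0.159, "deer": 0.162,
    "dog": 0.193, "frog": 0.105, "horse": 0.0985, "ship": 0.120,
    "truck": 0.0887,
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


total = sum(mapping_counts.values())
cat_accuracy = mapping_counts["cat"] / total
print(round(cat_accuracy, 3))  # 0.034, the Test Cats value reported in Table 2

# The cat label itself has no finite inverse distance, so it is left out.
labels = sorted(inverse_distance)
r = pearson([mapping_counts[l] for l in labels],
            [inverse_distance[l] for l in labels])
print(r)
```

A positive coefficient here means that labels whose word vectors lie closer to the cat vector tend to absorb more of the cat images.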
We also computed the inverse distances between each word vector and the cat word vector, which measure the semantic similarity between two words, shown in Table 3 and plotted in Figure 1. It is clear from the plot that there is a correlation between the inverse distance of a label's word vector to the cat vector and the number of cat images misclassified as that label. As expected, there is a bias from the image features themselves, so cats are still often misclassified as dogs. This indicates that there is a signal in the data that could be exploited for zero-shot learning, though more work must be done to extract the signal.

Figure 1: Plot of the distribution of predicted labels for cat vectors for the zero-shot classifiers, along with inverse distances between each word vector and the cat vector.

Finally, in Figure 2, we give a visualization of the word vectors and the mapping of the image features into the word vector space. It shows a scatter plot of the first two principal components of the 50-dimensional vectors. The word vectors are red, the image vectors are blue, and each class is represented by a different shape. Even with just two components, we can see the projected images cluster around the word vectors representing their label. Most significantly, this is even the case for the cat vector. This gives us an indication that if the mapping from image feature space to word vector space were computed with more labels, the accuracy of the mapping would improve and allow for zero-shot learning. It would also be helpful to include a nonlinearity in the mapping, though since we did not have enough image labels or training examples, including a nonlinearity led to overfitting. We are also investigating the possibility of using a one-shot classifier to better center the cat word vector.

Figure 2: Projection of the word vectors and mapped image features.

Cross Entropy Classifier. The cross entropy classifier completely failed at zero-shot learning, but exhibited the same correlation with inverse distance as the mapping classifier, as shown in Table 3 and Figure 1. We concluded that the mapping classifier was more promising and focused on that approach. While we failed to achieve zero-shot learning, we feel that we have some promising approaches that may at least lead to a one-shot classifier.

Classifier  Word Vectors  # Cat Images  Training  Test  Training Cats  Test Cats
C. E.       Semantic      0             0.74      0.60  -              0.0
C. E.       Randomized    0             0.71      0.56  -              0.0
C. E.       Semantic      10            0.72      0.58  0.3            0.0
C. E.       Randomized    10            0.67      0.54  1.0            0.0
C. E.       Semantic      100           0.74      0.60  0.0            0.0
C. E.       Randomized    100           0.73      0.58  0.78           0.085
Mapping     Semantic      0             0.73      0.60  -              0.034
Mapping     Randomized    0             0.76      0.60  -              0.0
Mapping     Semantic      10            0.66      0.56  1.0            0.10
Mapping     Randomized    10            0.68      0.57  1.0            0.019
Mapping     Semantic      100           0.57      0.55  1.0            0.67
Mapping     Randomized    100           0.61      0.59  1.0            0.54

Table 2: Results for various classifier configurations.

5.2 Semantic vs. Randomized Word Vectors

While our focus was on understanding how to achieve zero-shot learning, we also performed some preliminary investigations into how semantic information affects performance.

Mapping Classifier. Once it had 10 cat images, the mapping classifier was at least performing on par with random on the test cat images, and at 100 training examples it was performing very well. Overall accuracy on the test set suffered, indicating that the effect of the cat image examples was too strong, though this could be fixed with further tuning of α. Still, performance was sufficient for our purposes to show that semantic information helped the classifier learn to classify cat images more effectively. The effect was especially pronounced with only 10 cat images, where the randomized word vectors performed abysmally on the set of test cat images. We compare the misclassification of cat vectors when using semantic word vectors vs. randomized word vectors in Table 4.
We noticed that in both cases, most cat vectors are misclassified as either birds or dogs, since birds and especially dogs look similar to cats. This effect is more pronounced in the case of semantic word vectors, indicating that in this case the semantic information is boosting the signal given by the images themselves. However, we have to be careful because the semantic information also causes more dog and bird images to be misclassified as cats, indicating that semantic information makes the neural network more sensitive to overtraining. Since the randomized word vectors are uniformly distributed, rather than clustered by meaning, they are less likely to be overtrained. This indicates an interesting tradeoff between precision and recall, where semantic information improves recall at the cost of precision, which we did not have the chance to investigate.

Label (l)     Semantic   Randomized
airplane      65         25
automobile    8          34
bird          179        62
cat           31         0
deer          79         56
dog           376        413
frog          114        137
horse         13         60
ship          31         22
truck         10         97

Table 4: Distribution of predicted labels for cat vectors for the zero-shot mapping classifier with semantic and randomized word vectors.

Cross Entropy Classifier. The cross entropy classifier was unable to pick up the cat signal, even after 100 training examples of cat images. As before, more fine-tuning of α may be needed, though for the higher values of α we tested, performance on the test set suffered significantly. Still, on the cat images in the training set, accuracy was noticeably higher with randomized vectors than with semantic vectors (Table 2).

In overall training and test accuracy, the cross entropy classifier improved noticeably going from randomized word vectors to semantic word vectors, indicating that semantic word vectors can improve performance. More importantly, there was a marked improvement in the performance of the mapping classifier on test cat images when trained on 10 cat images, showing that semantic information can be very valuable, especially when training data is scarce.

6 Conclusions

In this project we were able to obtain good classification results and demonstrate the value of semantic information to image classification. We are planning on continuing our investigation of the following:

1. Since we were focused on achieving zero-shot learning, we did not have time to investigate how performance was affected as we scaled to more images, so it would be interesting to investigate how performance scales with the size of the training set.

2. We would be interested in investigating how our techniques scale when more fine-grained labels are available, though our initial attempts were not very successful because the baseline classifier performed poorly. There is an opportunity here to train multiple labels from a single training image based on semantic content, both because there are fewer training examples per label and because there is greater chance for semantic correlation.

3. We hope to consider whether image information can be used to improve word vectors, since semantic information often misses visual information that could be useful in some semantic tasks. For example, it would allow identification of synonyms based on word vectors, since words that describe similar pictures are more likely to be synonyms, though this may only work for objects.

4. We plan to continue our project of building a zero-shot image classifier, and in particular find the mapping approach promising.

5. We hope to apply our techniques to cases where images are labeled by descriptions rather than single words. Websites like Flickr have a huge number of annotated images, and these annotations do not come in the simple binned labeling style required by traditional image classifiers.

We are particularly excited about the possibility of classifying images for classes that have zero training examples, which would demonstrate the power of incorporating semantic features in the image classification task, though we also hope to continue to explore the possibility of using semantic information to augment traditional image classifiers.

References

[1] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12, pp. 2493-2537, 2011.

[2] Coates, A., Lee, H., and Ng, A. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. Advances in Neural Information Processing Systems, 2010.