Generating Chinese Captions for Flickr30K Images


Hao Peng, Indiana University, Bloomington, penghao@iu.edu
Nianhen Li, Indiana University, Bloomington, li514@indiana.edu

Abstract

We trained a multimodal recurrent neural network on the Flickr30K dataset with Chinese sentences, using the RNN model of Karpathy and Fei-Fei, 2015 [6]. Because Chinese sentences have no spaces between words, we applied the model to the Flickr30K dataset in two settings. In the first, we tokenized each Chinese sentence into a list of words and fed them to the RNN; in the second, we split each Chinese sentence into a list of characters and fed those into the same model. We compared the BLEU scores achieved by our two methods to those achieved by [6]. We found that the RNN model trained at the character level on Chinese captions outperforms the word-level one, and the character-level method performs very close to the model trained on English captions by [6]. This suggests that the RNN model works comparably well for image captioning across different languages.

1. Introduction

Humans are good at describing and understanding the visual scene in an image with just a glance, but it is a difficult task for computers to describe the context of an image, or even just to recognize all the objects in it. An automated image captioning system is therefore helpful in many ways: self-driving cars and VR glasses both need this technology to build up their functionality, and such tools could also provide richer descriptions of images for people who are blind or visually impaired.

The majority of previous work in visual recognition has focused on labeling images with a fixed set of visual categories, and great progress has been achieved in these endeavors [4, 11]. However, while closed vocabularies of visual words constitute a reasonable modeling assumption, they are vastly limited compared to the descriptions articulated by humans.

Recently, much research on image captioning has been devoted to RNN models, as they are effective at modeling sequential data and at capturing context and semantic relations in language. However, these models have been trained on images with English captions, so their performance in other languages is unknown, and it is unclear whether the method works universally. In this paper we test it on a Chinese captioning system. Because Chinese sentences have no spaces between words, Chinese is very different from English. We implemented the RNN model with the same architecture used by [6] on the Flickr30K dataset with Chinese captions in two scenarios. The Chinese captions were obtained by translating the original English captions with the Google Translation API. Our experiments show that the generated Chinese sentences align well with the translated Chinese captions. We also report the BLEU [10] score computed with the coco-caption code [1], a metric that evaluates a candidate sentence by measuring how well it matches a set of five reference sentences written by humans.
2. Related Work

Researchers have explored vision-to-language extensively, examining image captioning (Lin et al., 2014; Karpathy and Fei-Fei, 2015; Vinyals et al., 2015; Xu et al., 2015; Chen et al., 2015; Young et al., 2014; Elliott and Keller, 2013), question answering (Antol et al., 2015; Ren et al., 2015; Gao et al., 2015; Malinowski and Fritz, 2014), visual phrases (Sadeghi and Farhadi, 2011), video understanding (Ramanathan et al., 2013), and visual concepts (Krishna et al., 2016; Fang et al., 2015).

To build visual description systems, recent state-of-the-art work [6, 14] has used a multimodal recurrent neural network (RNN) to create a sequence-to-sequence machine learning system similar to the kind other researchers have used for machine translation. In this case, however, instead of translating from, say, French to English, the system is trained to translate from images to sentences. Several closely related works have also used RNNs to generate image descriptions [9, 14, 3, 8, 5, 2], but [6] claim their model is simpler than most of the previous approaches. We therefore decided to apply their model to our Chinese captioning task on the same image dataset, Flickr30K. We also quantify the performance and compare against their original results in our experiments.

3. Our Model

As noted above, the architecture of our RNN model is the same as the one used in [6], because we want to make a direct performance comparison in this paper; some of the descriptions in this section (the training and testing processes, and the optimization) are therefore borrowed from [6]. The RNN model accepts an image vector and outputs a corresponding sentence description. Each sentence is split into a sequence of elements that are fed into the RNN (since we implement both a word-level and a character-level method, we refer to words or characters generically as elements). The model generates elements by defining a probability distribution over the next element in a sequence, given the current element and the context from previous time steps. At the first time step, it conditions the probability of the element only on the input image vector. At test time, the model can predict a variable-length sequence of elements given an image.

Specifically, our RNN model takes the image pixels I and a sequence of one-hot encoded word vectors (x_1, x_2, ..., x_T). It then computes a sequence of hidden states (h_1, h_2, ..., h_T) and a sequence of outputs (y_1, y_2, ..., y_T) by iterating the following formulas for t = 1 to T:

    b_v = W_{hi} [CNN_{\theta_c}(I)]                                  (1)
    h_t = f(W_{hx} x_t + W_{hh} h_{t-1} + b_h + 1(t = 1) b_v)         (2)
    y_t = softmax(W_{oh} h_t + b_o)                                   (3)

[Figure 1. The image vector produced from the VGG net.]

In the equations above, W_{hi}, W_{hx}, W_{hh}, W_{oh}, b_h and b_o are learnable parameters updated during training, and CNN_{\theta_c}(I) is the output of the last layer of the VGG net [12] (as shown in Figure 1). In our training, the image encoding size, word encoding size and hidden size are all set to 256, so x_t (after encoding), b_v, h_t, b_h and b_o are all 256-dimensional vectors. The output vector y_t holds the log probabilities of the words in the vocabulary, plus one additional dimension for a special END token. We feed the image encoding vector b_v into the RNN only at the first iteration.

3.1. Training process

Our RNN model is trained to predict the next word y_t given the input word x_t and the previous context (hidden state) h_{t-1}; the image encoding vector b_v is simply treated as a bias term at the first iteration. The training process is illustrated in Figure 2: we set h_0 = 0 and x_1 to a special START vector, and we expect y_1 to be close to the first word in the sequence. Similarly, we set x_2 to the first word vector and expect the network to predict the second word, and so on. Finally, x_T is the last word vector in the sequence, and we expect the RNN to predict the special END token. The goal is to maximize the log probability assigned to the target labels. A minimal sketch of this recurrence appears below.

[Figure 2. Illustration of the RNN sentence generating process.]
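For concreteness, the following is a minimal NumPy sketch of Equations (1)-(3) and the training-time next-element loss. It is an illustration, not the exact code we ran: the random initialization, the 4096-dimensional VGG feature, the vocabulary size V, and the choice of f = tanh are all illustrative assumptions.

```python
import numpy as np

H, V = 256, 10000          # hidden/encoding size (as in Section 3) and an illustrative vocabulary size

# Learnable parameters from Equations (1)-(3); random init for the sketch.
rng = np.random.default_rng(0)
W_hi = rng.normal(0, 0.01, (H, 4096))   # maps the VGG feature to the image bias b_v
W_hx = rng.normal(0, 0.01, (H, V))      # input-to-hidden
W_hh = rng.normal(0, 0.01, (H, H))      # hidden-to-hidden
W_oh = rng.normal(0, 0.01, (V + 1, H))  # hidden-to-output (+1 dimension for the END token)
b_h = np.zeros(H)
b_o = np.zeros(V + 1)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(cnn_feature, xs, f=np.tanh):
    """Iterate Equations (1)-(3) over a sequence of one-hot vectors xs."""
    b_v = W_hi @ cnn_feature                    # Eq. (1): image encoding vector
    h = np.zeros(H)
    ys = []
    for t, x_t in enumerate(xs):
        bias = b_h + (b_v if t == 0 else 0.0)   # image vector enters only at t = 1
        h = f(W_hx @ x_t + W_hh @ h + bias)     # Eq. (2)
        ys.append(softmax(W_oh @ h + b_o))      # Eq. (3)
    return ys

def sequence_loss(cnn_feature, xs, target_ids):
    """Negative log-likelihood of the target elements; training maximizes the log probability."""
    ys = forward(cnn_feature, xs)
    return -sum(np.log(y[i]) for y, i in zip(ys, target_ids))
```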
3.2. Testing process

To predict a sentence, we compute the image encoding vector b_v, set h_0 = 0 and x_1 to the START vector, and compute the distribution over the first word y_1. We sample a word from that distribution, set its embedding vector as the next input word x_2, and repeat this process until the END token is generated or the length of the generated sequence exceeds 20. We also report BLEU scores for different beam sizes. A sketch of this sampling loop follows.
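Below is a minimal sketch of the decoding loop just described, reusing the parameters and softmax from the sketch in Section 3.1. The START and END token indices and the sample-vs-argmax switch are illustrative assumptions, and the beam search we use for the reported numbers is omitted here.

```python
def generate(cnn_feature, start_id, end_id, max_len=20, f=np.tanh, greedy=False):
    """Feed START, then feed each emitted element back in until END or max_len."""
    b_v = W_hi @ cnn_feature
    h = np.zeros(H)
    x = np.zeros(V)
    x[start_id] = 1.0                              # one-hot START vector
    out = []
    for t in range(max_len):
        bias = b_h + (b_v if t == 0 else 0.0)
        h = f(W_hx @ x + W_hh @ h + bias)
        y = softmax(W_oh @ h + b_o)
        i = int(np.argmax(y)) if greedy else int(rng.choice(len(y), p=y))
        if i == end_id:                            # stop when END is generated
            break
        out.append(i)
        x = np.zeros(V)
        x[i] = 1.0                                 # non-END indices lie in [0, V)
    return out
```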

3.3. Optimization

Because we compare the performance of the RNN model on Chinese sentence generation to its performance on English sentence generation in [6], we keep the RNN architecture and training parameters the same as in [6]. We use SGD with mini-batches of 100 image-sentence pairs and a momentum of 0.9 to optimize the alignment model. We cross-validate the learning rate and the weight decay, and we use dropout regularization in all layers except the recurrent layers. We achieved our best results using RMSprop [13]; the update rule is sketched below.
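For reference, a minimal sketch of the RMSprop update [13] as it is typically written; the learning rate, decay rate, and epsilon here are common defaults, not values from our runs.

```python
import numpy as np

def rmsprop_update(param, grad, cache, lr=1e-3, decay=0.99, eps=1e-8):
    """RMSprop [13]: scale each step by a running average of squared gradients."""
    cache = decay * cache + (1 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache
```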

4. Experiments

4.1. Dataset processing

We experiment on the Flickr30K dataset [15], which contains 31,000 images, each paired with 5 Chinese sentences. Note that the original captions are in English; we obtained the Chinese captions using the Google Translation API. Some examples are shown in Figure 3. For Flickr30K, we use 1,000 images for validation, another 1,000 for testing, and the rest for training (the same split as [6]).

[Figure 3. Examples of English captions and their Chinese translations, obtained with the Google Translation API.]

4.2. Methods

As Figure 3 shows, Chinese is very different from English: there are no spaces between Chinese characters. We therefore trained our model with two different methods. In the first, we tokenized each Chinese sentence into a list of words (an example is shown in Figure 4). In the second, we split each Chinese caption into a list of Chinese characters (an example is shown in Figure 5). A sketch of both segmentations follows.

[Figure 4. An example of sentence segmentation in the word-level method.]

[Figure 5. An example of sentence segmentation in the character-level method.]
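The two segmentations can be sketched as follows. The paper does not name the word segmenter we used, so the jieba tokenizer below is an illustrative stand-in; character-level splitting needs no external tool.

```python
import jieba  # a common Chinese word segmenter, used here as an illustrative stand-in

sentence = "一个年轻的女孩穿着红色的衬衫"  # "a young girl is wearing a red shirt"

# Word-level method: segment the sentence into a list of words.
words = list(jieba.cut(sentence))
# e.g. ['一个', '年轻', '的', '女孩', '穿着', '红色', '的', '衬衫']

# Character-level method: every character is its own element.
chars = list(sentence)
# ['一', '个', '年', '轻', '的', '女', '孩', '穿', '着', '红', '色', '的', '衬', '衫']
```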
4.3. Model evaluation and comparison

We trained the RNN model in the two ways described, and it produces reasonable descriptions of the test images with both methods. Figure 6 shows a Chinese caption generated for a test image by the character-level RNN model, and Figure 7 shows a Chinese caption generated by the word-level RNN model.

[Figure 6. Chinese caption generated by the character-level RNN during testing. The Chinese sentence at the bottom of the figure means "a young girl is wearing a red shirt and black trousers".]

[Figure 7. Chinese caption generated by the word-level RNN during testing. The Chinese sentence at the bottom of the figure means "a man and a woman are dancing".]

We also compared the captions generated by the two methods on the same test images. Interestingly, although the two captions differ slightly, each generated sentence still makes sense; an example is shown in Figure 8.

[Figure 8. An example of different captions generated by the word-level (left) and character-level (right) methods, with the corresponding English sentences shown at the bottom of the figure.]

Before drawing a conclusion, we also compared their performance quantitatively against the result achieved by [6] (Fei-Fei's model) on English caption generation. We report the BLEU [10] scores (see Figure 9) for both methods with a beam size of 7, using the coco-caption code [1], the same setting as in [6]; a sketch of how corpus-level BLEU can be computed appears after this section.

[Figure 9. BLEU scores on the Flickr30K dataset for the RNN model on English caption generation (Fei-Fei's), and for the word-level and character-level methods on Chinese caption generation.]

From the BLEU scores in Figure 9, we can see that the RNN model trained with the character-level method on Chinese captions outperforms the model trained with the word-level method. The character-level method performs very close to the original model trained on English captions [6], while the word-level method performs slightly worse. We therefore conclude that this RNN model works comparably well for image captioning across different languages.

Before our work in this paper, we had not seen any application of this RNN model to image caption generation in a language other than English, so it was not clear whether the sequential model transfers to other languages. We tested the model on Chinese and reached a conclusion that we believe is fair. One surprise is that the character-level method works better than the word-level one on this task. Some researchers have shown that character-level convolutional neural networks can work better than word-level ones for text and sentiment classification [16, 7]. Our finding opens further research into the performance of character-level methods with RNN models on other tasks, such as LSTMs for text classification.
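For reference, a minimal sketch of corpus-level BLEU using NLTK. We actually used the coco-caption code [1], so this is an illustrative stand-in; for the character-level model, both hypotheses and references are tokenized into characters, as below, and the toy data here is made up.

```python
from nltk.translate.bleu_score import corpus_bleu

# Each hypothesis is a token list; each entry in references is the list of
# reference token lists (five per image in Flickr30K) for that image.
references = [[list("一个男人和一个女人在跳舞")] * 5]  # toy example with 5 identical refs
hypotheses = [list("一个男人和一个女人跳舞")]

# BLEU-1 through BLEU-4 via the n-gram weight vector.
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    print(f"BLEU-{n}:", corpus_bleu(references, hypotheses, weights=weights))
```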

5. Limitation and future work

We used the Google Translation API to obtain the Chinese captions due to limited resources, but automatic machine translation is currently not very accurate. The reported BLEU scores may not be affected much, since BLEU measures the relative similarity between a generated sentence and the references, but the quality of translation may compromise the quality of the generated image descriptions. In the future, instead of using sentences translated by Google, we could review a few thousand images and their translated sentences and correct the translations manually. With that small set of clean data, we could train the model again to see whether it works better, or fine-tune the hyperparameters of the RNN model to see whether it yields an even better result.

Acknowledgments

We highly appreciate the help we received from Professor David and all the AIs in this great course. Most of the knowledge and techniques used in this paper were learned in the vision course. The idea of training an RNN model for image caption generation in Chinese was inspired by Professor David, who provided us with many valuable suggestions and feedback. We thank all the course staff for hosting attentive office hours and for their efforts on the course development and poster session; we learned a lot from them.

References

[1] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[2] X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.
[3] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625-2634, 2015.
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, 2009.
[5] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473-1482, 2015.
[6] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128-3137, 2015.
[7] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[8] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[9] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014.
[10] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics, 2002.
[11] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
[12] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[13] T. Tieleman and G. Hinton. Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4:2, 2012.
[14] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156-3164, 2015.
[15] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67-78, 2014.
[16] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649-657, 2015.