Visual Question Answering Using Various Methods

Shuhui Qu
Civil and Environmental Engineering Department, Stanford University, Stanford, CA 94305
shuhuiq@stanford.edu

Abstract

This project applies deep learning tools to enable a computer to answer questions about images. The Visual Question Answering (VQA) dataset [1] is used; it consists of 204,721 real images with 614,164 questions and 50,000 abstract scenes with 150,000 questions. Several previously published methods are reproduced, and an analysis of the different models is presented.

1 Introduction

Thanks to the rapid development of deep learning methods for visual recognition and natural language processing, computers can now perform a variety of complex and difficult tasks. One of the most important is combining these tools for high-level scene interpretation, such as image captioning and visual question answering. With the emergence of large datasets of images, text, and questions, visual question answering by computers has become possible. In general, a visual question answering system is required to answer any kind of question people might ask, whether or not it is related to the image. Building a system that can properly answer such questions is important to the development of artificial intelligence.

Various methods have been proposed for this problem. Given the time limit, in this project I reproduce, analyze, and compare previous work. The methods include: an all-"yes" baseline, an LSTM question-only model, CNN + LSTM, and a dynamic memory network (DMN). The accuracy of these models is evaluated on the VQA dataset described above. Section 2 discusses related work on image captioning and question answering. Section 3 introduces the modules of the question answering system. Section 4 presents the tests and discusses the results, and Section 5 concludes the report.
2 Related Work

With the development of deep learning, many studies have addressed high-level scene interpretation. Karpathy and Fei-Fei [3] developed a model that generates natural language descriptions of images and regions by learning the inter-modal correspondences between language and visual data; they align a convolutional neural network with a bidirectional recurrent neural network. Johnson et al. [2] presented a convolutional localization network for dense captioning that processes an image with a single efficient forward pass. More recently, after Antol et al. [1] introduced the task of free-form and open-ended visual question answering, a number of works have emerged to tackle it; Antol et al. also provided some initial baselines. Zhou et al. [6] developed a bag-of-words model (iBOWIMG) as a simple baseline for this task. Kumar et al. [4] presented dynamic memory networks for natural language processing, and Xiong et al. [5] further developed the dynamic memory network to enable both visual and textual question answering. In this project, I attempt to reproduce most of these works.

3 Methodology

To address the task, the system is organized into four main modules: image feature extraction, question understanding, answer generation, and feature filters. The first three modules are essential: they interpret the image and the question and reason toward the correct answer. Depending on the technique, a fourth, filtering module can be applied to improve the final accuracy; possible techniques include normalization, bag-of-words (BoW) features, and an episodic memory network. Due to the time limit, only the episodic memory network is applied in this project.

3.1 Visual Feature Extraction

Figure 1: Image feature extraction

A pretrained convolutional neural network, VGGNet, is used to extract image features. The feature layer is selected according to the architecture of each question answering model. For the CNN + LSTM model, the fully connected layer fc3 is selected, as shown in the red box of Figure 1; this layer has 4096 units and its output can be fed to the answer generation module directly. For the DMN model, the conv5_3 layer is selected, as shown in the blue box of Figure 1; this layer produces 14 x 14 x 512 activations. While extracting the conv5_3 features, I encountered various problems: running out of memory, running out of disk space, and being unable to write such a large matrix to disk. I also tried extracting the features at runtime, but this resulted in unreasonably long training times.

3.2 Question Understanding

Figure 2: Question feature extraction

This module uses an LSTM to extract the question feature q, taken as the final hidden state of the LSTM. Pretrained GloVe embeddings are used to represent each word.

3.3 Answer Generation

The answer generation module receives both the question feature (or filtered question feature) and the image feature (or filtered image feature).
These two features are then concatenated and fed to an LSTM that generates a sequence of answer words. A cross-entropy loss on the answers is used to train the network.

3.4 Episodic Memory Network

Various filters can be applied to the image and question features. In this project, the episodic memory network [5] is applied as an attention mechanism over the image input. No filter (e.g., iBOW) is applied to the question features.
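The answer generation step of Section 3.3 can be sketched in a few lines of numpy. The 512-dimensional question feature, the answer vocabulary size, and the single projection layer standing in for the LSTM decoder are illustrative assumptions, so this shows only the feature concatenation and the cross-entropy loss, not the project's actual decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature sizes: 4096-d fc image feature as in the text; 512 for the
# question feature q (final LSTM hidden state) is an assumed size.
img_feat = rng.standard_normal(4096)
q_feat = rng.standard_normal(512)

# The two features are concatenated before answer generation.
joint = np.concatenate([img_feat, q_feat])           # shape (4608,)

# Hypothetical single-layer stand-in for the LSTM decoder: project the
# joint feature to logits over a small answer vocabulary.
vocab_size = 1000
W_out = rng.standard_normal((vocab_size, joint.size)) * 0.01
logits = W_out @ joint                               # shape (1000,)

def cross_entropy(logits, target):
    """Softmax cross-entropy against the ground-truth answer token."""
    z = logits - logits.max()                        # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

loss = cross_entropy(logits, target=7)               # target index is arbitrary
print(joint.shape, logits.shape, float(loss) > 0.0)
```

In the full model the decoder emits one word per LSTM step and the per-step cross-entropy losses are summed; the single projection above is only the simplest instance of that loss.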

Figure 3: Answer generation

3.4.1 Input Module for the EMN

Figure 4: Image input module

This input module has two components rather than the three described in [5]: visual feature extraction and an input fusion layer. The visual feature extraction is the same as in Section 3.1. I use numpy.reshape to flatten the feature map into the 196 local region vectors f_i, rather than the snake-like traversal of [5]. After local feature extraction, the vectors are passed through a bidirectional GRU to produce globally aware input facts F; the bidirectional GRU propagates information among all neighboring regions. The GRU update is:

    u_i = \sigma(W^{(u)} x_i + U^{(u)} h_{i-1} + b^{(u)})
    r_i = \sigma(W^{(r)} x_i + U^{(r)} h_{i-1} + b^{(r)})
    \tilde{h}_i = \tanh(W x_i + r_i \circ U h_{i-1} + b^{(h)})        (1)
    h_i = u_i \circ \tilde{h}_i + (1 - u_i) \circ h_{i-1}

where \sigma is the sigmoid activation and \circ is element-wise multiplication. For each local feature, the forward and backward hidden states are then combined as the output of the input module:

    \overrightarrow{h}_i = GRU_{fwd}(f_i, \overrightarrow{h}_{i-1})
    \overleftarrow{h}_i = GRU_{bwd}(f_i, \overleftarrow{h}_{i+1})     (2)
    F_i = \overrightarrow{h}_i + \overleftarrow{h}_i

3.4.2 EMN Mechanism

The outputs of the input module, F = [F_1, F_2, ..., F_i, ..., F_N] with N = 196 in this case, are the input to the episodic memory module, which focuses attention on the facts related to the question feature q. The goal is to find the relation between the question feature and the input F; this relation is represented by the attention gate g^t_i:

    z^t_i = [F_i \circ q;\ F_i \circ m^{t-1};\ |F_i - q|;\ |F_i - m^{t-1}|]
    Z^t_i = W^{(2)} \tanh(W^{(1)} z^t_i + b^{(1)}) + b^{(2)}          (3)
    g^t_i = \exp(Z^t_i) / \sum_{k=1}^{M_i} \exp(Z^t_k)

where F_i is a fact, m^{t-1} is the previous episodic memory with m^0 = q, and M_i = 196 in this case. The attention gate is then used to extract a contextual vector c^t as a weighted summation of the facts:

    c^t = \sum_{i=1}^{N} g^t_i F_i                                    (4)

The contextual vector could also be generated by an attention-based GRU; however, due to the time limit, I was not able to reproduce it (I tried, but it took too long to get working, so I gave it up). The episodic memory is then updated using ReLU, chosen for its simplicity:

    m^t = ReLU(W^t [m^{t-1}; c^t; q] + b)                             (5)

The memory could also be updated with a GRU instead; this remains to be done if more time is available.

3.5 Test Models

In addition to the two baseline models, all-"yes" and LSTM question-only, I also reproduce the CNN + LSTM and DMN models.

3.5.1 CNN + LSTM

Figure 5: CNN + LSTM model

For the CNN + LSTM model, I directly chain the visual feature extraction, question understanding, and answer generation modules: the outputs of the first two modules are fed directly into the answer generation module.

3.5.2 Dynamic Memory Network (DMN)

Figure 6: DMN model

For the DMN model, I apply only the weighted-sum and ReLU variants of the attention mechanism. The filtered visual features are concatenated with the question features and fed directly into the answer generation module.

4 Test and Discussion

I tested several methods: the all-"yes" baseline, LSTM question-only, CNN + LSTM, and DMN. The LSTM question-only and CNN + LSTM models were trained for 60 epochs; due to the time limit, the DMN was trained for far fewer epochs, so its benefit may not be fully apparent. The results are shown in Table 1.

Table 1: VQA test accuracy on all questions

    Model            Test Accuracy
    Pure Yes             27.12
    LSTM Question        42.99
    CNN+LSTM             51.63
    DMN                  50.98

The results show that some questions do not require the image at all: the LSTM question-only model can generate good answers directly from prior knowledge. The CNN + LSTM model, which feeds the fully connected layer directly into the answer module, gives relatively good results, and techniques such as normalization or iBOW could improve its accuracy further. The DMN takes local features and tries to focus on those related to the question; the tested DMN model performs relatively well despite the few training epochs, and I believe that with more time and better computational resources it would outperform the CNN + LSTM model. In implementing the DMN, I encountered many computational problems, such as running out of memory and hard drive space. Sample outputs of the system and the VisualQA interface are shown in Figures 7 and 8.

Figure 7: Sample output

Figure 8: VisualQA interface
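The EMN computations of Sections 3.4.1 and 3.4.2 can be summarized in a compact numpy sketch. All layer sizes, the number of episodes, and the random parameters are illustrative assumptions; the sketch matches equations (1)-(5) in form rather than reproducing the project's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d = 196, 512, 64     # regions, conv5_3 channels, hidden/memory size (assumed)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_gru():
    """Random GRU parameters for equation (1)."""
    p = {k: rng.standard_normal((d, d_in)) * 0.05 for k in ("Wu", "Wr", "W")}
    p.update({k: rng.standard_normal((d, d)) * 0.05 for k in ("Uu", "Ur", "U")})
    p.update({k: np.zeros(d) for k in ("bu", "br", "bh")})
    return p

def gru_step(p, x, h):
    u = sigmoid(p["Wu"] @ x + p["Uu"] @ h + p["bu"])        # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])        # reset gate
    h_tilde = np.tanh(p["W"] @ x + r * (p["U"] @ h) + p["bh"])
    return u * h_tilde + (1 - u) * h                        # eq. (1)

# --- Section 3.4.1: input module -----------------------------------------
# 196 local region vectors f_i, standing in for the reshaped conv5_3 map.
f = rng.standard_normal((N, d_in))
fwd, bwd = make_gru(), make_gru()

h = np.zeros(d); h_fwd = []
for i in range(N):                       # forward GRU over regions
    h = gru_step(fwd, f[i], h)
    h_fwd.append(h)
h = np.zeros(d); h_bwd = [None] * N
for i in reversed(range(N)):             # backward GRU over regions
    h = gru_step(bwd, f[i], h)
    h_bwd[i] = h
F = np.array(h_fwd) + np.array(h_bwd)    # eq. (2): globally aware facts

# --- Section 3.4.2: episodic attention -----------------------------------
q = rng.standard_normal(d)               # question feature
m = q.copy()                             # m^0 = q
d_att = 128                              # attention hidden size (assumed)
W1 = rng.standard_normal((d_att, 4 * d)) * 0.05; b1 = np.zeros(d_att)
W2 = rng.standard_normal((1, d_att)) * 0.05;     b2 = np.zeros(1)
Wm = rng.standard_normal((d, 3 * d)) * 0.05;     bm = np.zeros(d)

for t in range(3):                       # a few episodes (number assumed)
    # eq. (3): interaction vector, scalar score, softmax attention gate
    z = np.concatenate([F * q, F * m, np.abs(F - q), np.abs(F - m)], axis=1)
    Z = (W2 @ np.tanh(W1 @ z.T + b1[:, None]) + b2[:, None]).ravel()
    g = np.exp(Z - Z.max()); g /= g.sum()
    c = g @ F                            # eq. (4): weighted sum of facts
    m = np.maximum(0.0, Wm @ np.concatenate([m, c, q]) + bm)   # eq. (5)

print(F.shape, g.shape, m.shape)
```

The gates g sum to one over the 196 facts, so each episode reads a soft mixture of regions; the final memory m would then be concatenated with q and passed to the answer generation module, as described in Section 3.5.2.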

5 Conclusion and Future Work

In this project, I investigated various methods for the visual question answering problem. Building on CNNs and RNNs, I tested four methods that handle the problem from different perspectives, and reproduced reasonable results. The original purpose of the project was to improve the accuracy; due to the time limit, however, I could only reproduce existing results. For future work, I will reproduce the DMN model with different attention mechanisms and work on improving the accuracy.

References

[1] Stanislaw Antol et al. "VQA: Visual question answering". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 2425-2433.
[2] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[3] Andrej Karpathy and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3128-3137.
[4] Ankit Kumar et al. "Ask me anything: Dynamic memory networks for natural language processing". In: arXiv preprint arXiv:1506.07285 (2015).
[5] Caiming Xiong, Stephen Merity, and Richard Socher. "Dynamic memory networks for visual and textual question answering". In: arXiv preprint arXiv:1603.01417 (2016).
[6] Bolei Zhou et al. "Simple baseline for visual question answering". In: arXiv preprint arXiv:1512.02167 (2015).