A Simple VQA Model with a Few Tricks and Image Features from Bottom-up Attention

A Simple VQA Model with a Few Tricks and Image Features from Bottom-up Attention
Damien Teney 1, Peter Anderson 2*, David Golub 4*, Po-Sen Huang 3, Lei Zhang 3, Xiaodong He 3, Anton van den Hengel 1
1 University of Adelaide   2 Australian National University   3 Microsoft Research   4 Stanford University
* Work performed while interning at MSR

Proposed model
Straightforward architecture:
- Joint embedding of question and image
- Single-head, question-guided attention over the image
- Element-wise product to fuse the two representations
The devil is in the details:
- Image features from Faster R-CNN
- Gated tanh activations
- Output as a regression of answer scores, with soft scores as targets
- Output classifiers initialized with pretrained representations of the answers
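For concreteness, a minimal PyTorch-style sketch of the forward pass described above. It is illustrative only: layer sizes, module names, and the plain tanh projections (the paper uses the gated tanh layers described on the next slide) are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQAModel(nn.Module):
    """Illustrative sketch of the joint-embedding VQA architecture."""
    def __init__(self, vocab_size, n_answers, d=512, v_dim=2048, q_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)        # GloVe-initialized in the paper
        self.gru = nn.GRU(300, q_dim, batch_first=True)   # question encoder
        self.att = nn.Linear(v_dim + q_dim, 1)            # single-head attention (simplified)
        self.q_proj = nn.Linear(q_dim, d)
        self.v_proj = nn.Linear(v_dim, d)
        self.out = nn.Linear(d, n_answers)                # sigmoid "regression" of answer scores

    def forward(self, v, q_tokens):
        # v: (B, K, v_dim) bottom-up region features; q_tokens: (B, T) word indices
        w = self.embed(q_tokens)
        _, h = self.gru(w)
        q = h.squeeze(0)                                  # (B, q_dim) question representation
        # question-guided attention over image regions
        q_rep = q.unsqueeze(1).expand(-1, v.size(1), -1)
        a = F.softmax(self.att(torch.cat([v, q_rep], dim=-1)), dim=1)   # (B, K, 1)
        v_att = (a * v).sum(dim=1)                        # attended image feature
        # joint embedding via element-wise product
        # (plain tanh here for brevity; the paper uses gated tanh layers)
        joint = torch.tanh(self.q_proj(q)) * torch.tanh(self.v_proj(v_att))
        return torch.sigmoid(self.out(joint))             # soft answer scores in [0, 1]
```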

Gated layers
Non-linear layers: gated hyperbolic tangent activations. For an input x, the output y is computed as
  ỹ = tanh(W x + b)      (intermediate activation)
  g = σ(W′ x + b′)       (gate)
  y = ỹ ∘ g              (combined with an element-wise product)
- Inspired by the gating in LSTMs/GRUs
- Empirically better than ReLU, tanh, gated ReLU, residual connections, etc.
- A special case of highway networks; used before in:
  [1] Dauphin et al., Language Modeling with Gated Convolutional Networks, 2016.
  [2] Teney et al., Graph-Structured Representations for Visual Question Answering, 2017.
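A minimal PyTorch sketch of one such gated tanh layer, matching the definition above (module and variable names are just illustrative):

```python
import torch
import torch.nn as nn

class GatedTanh(nn.Module):
    """Gated hyperbolic tangent activation:
    y_tilde = tanh(W x + b),  g = sigmoid(W' x + b'),  y = y_tilde * g (element-wise)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)     # produces the intermediate activation
        self.gate = nn.Linear(in_dim, out_dim)   # produces the gate

    def forward(self, x):
        y_tilde = torch.tanh(self.fc(x))
        g = torch.sigmoid(self.gate(x))
        return y_tilde * g
```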

Question encoding
Chosen implementation:
- Pretrained GloVe embeddings, d=300
- GRU encoder
Better than:
- Word embeddings learned from scratch
- GloVe of dimension 100 or 200
- Bag-of-words (sum/average of embeddings)
- Backward GRU
- Bidirectional GRU
- 2-layer GRU
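A sketch of the chosen question encoder (GloVe 300-d embeddings followed by a single forward GRU); the hidden size and variable names here are assumptions:

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """GloVe (d=300) word embeddings followed by a single forward GRU;
    the final hidden state is used as the question representation."""
    def __init__(self, glove_weights, hidden_dim=512):
        super().__init__()
        # glove_weights: (vocab_size, 300) tensor of pretrained GloVe vectors
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.gru = nn.GRU(300, hidden_dim, batch_first=True)

    def forward(self, tokens):            # tokens: (B, T) word indices
        _, h = self.gru(self.embed(tokens))
        return h.squeeze(0)               # (B, hidden_dim)
```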

Classical top-down attention on image features
Chosen implementation:
- Simple attention over the image feature maps
- One head
- Softmax normalization of the weights
Better than:
- No L2 normalization
- Multiple heads
- Sigmoid on the weights
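A sketch of the single-head, question-guided attention with softmax-normalized weights, reusing the GatedTanh module sketched above; all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Single attention head over K image region features, guided by the question."""
    def __init__(self, v_dim=2048, q_dim=512, hidden=512):
        super().__init__()
        self.layer = GatedTanh(v_dim + q_dim, hidden)   # gated tanh non-linearity (see above)
        self.score = nn.Linear(hidden, 1)

    def forward(self, v, q):
        # v: (B, K, v_dim) region features, q: (B, q_dim) question representation
        q_rep = q.unsqueeze(1).expand(-1, v.size(1), -1)
        s = self.score(self.layer(torch.cat([v, q_rep], dim=-1)))   # (B, K, 1)
        a = F.softmax(s, dim=1)                                      # softmax-normalized weights
        return (a * v).sum(dim=1)                                    # attended image feature
```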

Output
Chosen implementation:
- Sigmoid outputs (regression of answer scores): allows multiple correct answers per question
- Soft targets in [0, 1]: allows uncertain answers
- Classifiers initialized with pretrained representations of the answers:
  W_text initialized with GloVe word embeddings,
  W_img initialized with Google Images (global ResNet features)
Better than:
- Softmax classifier
- Binary targets {0, 1}
- Classifiers learned from scratch (W of dimensions n_answers x d)
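A sketch of the output stage: sigmoid scores trained with binary cross-entropy against soft targets, with the text classifier weights initialized from GloVe representations of the answers. The soft-score formula (the VQA accuracy convention) and all shapes are assumptions for illustration:

```python
import torch
import torch.nn as nn

def soft_score(num_humans_agreeing):
    """Soft target for one candidate answer, following the usual VQA accuracy
    convention (an assumption here): min(#annotators giving that answer / 3, 1)."""
    return min(num_humans_agreeing / 3.0, 1.0)

class AnswerScorer(nn.Module):
    """Sigmoid 'regression' of answer scores; the final linear layer is initialized
    from pretrained answer representations (GloVe word embeddings)."""
    def __init__(self, joint_dim, answer_glove):          # answer_glove: (n_answers, 300)
        super().__init__()
        self.proj = nn.Linear(joint_dim, answer_glove.size(1))
        self.classifier = nn.Linear(answer_glove.size(1), answer_glove.size(0))
        with torch.no_grad():
            self.classifier.weight.copy_(answer_glove)    # init with answer embeddings

    def forward(self, joint):
        return torch.sigmoid(self.classifier(self.proj(joint)))

# Binary cross-entropy against the soft targets (several answers may be partially correct):
criterion = nn.BCELoss()
```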

Training and implementation
- Additional training data from Visual Genome: questions with matching answers and matching images (about 30% of Visual Genome, i.e. ~485,000 questions)
- Keep all questions, even those with no answer among the candidates, and those with 0 < score < 1
- Shuffle the training data, but keep balanced pairs in the same mini-batches
- Large mini-batches of 512 QAs; the sweet spot in {64, 128, 256, 384, 512, 768, 1024}
- 30-network ensemble: different random seeds, sum the predicted scores (sketched below)
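The ensemble step is a simple sum of the predicted score vectors of independently seeded networks; a minimal sketch (function name and interface are assumptions):

```python
import torch

def ensemble_predict(models, v, q_tokens):
    """Sum the predicted answer scores of independently trained networks
    (different random seeds) and pick the highest-scoring answer."""
    with torch.no_grad():
        total = sum(m(v, q_tokens) for m in models)   # (B, n_answers)
    return total.argmax(dim=1)
```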

Image features from bottom-up attention
- Equally applicable to VQA and image captioning
- Significant relative improvements: 6-8% (VQA / CIDEr / SPICE)
- Intuitive and interpretable (a natural approach)

Bottom-up image attention
- Typically, attention models operate on the spatial output of a CNN
- We instead calculate attention at the level of objects and other salient image regions

Can be implemented with Faster R-CNN [1]
- Pre-trained on 1,600 objects and 400 attributes from Visual Genome [2]
- Select salient regions based on object detection confidence scores
- Take the mean-pooled ResNet-101 [3] feature from each region
[1] Faster R-CNN, NIPS 2015   [2] http://visualgenome.org   [3] ResNet, CVPR 2016
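A hedged sketch of turning detector output into bottom-up image features: keep the most confident regions and stack their mean-pooled ResNet features. The input format, region count, and confidence threshold are illustrative assumptions, not the actual pipeline:

```python
import numpy as np

def bottom_up_features(detections, k=36, min_conf=0.2):
    """Select salient regions from a Faster R-CNN detector and return their
    mean-pooled ResNet-101 features. `detections` is assumed to be a list of
    (confidence, pooled_feature) pairs for one image."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    feats = [f for conf, f in detections if conf >= min_conf][:k]
    return np.stack(feats) if feats else np.zeros((0, 2048))
```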

Qualitative differences in attention methods (examples comparing the ResNet baseline with Up-Down attention)
- Q: Is the person wearing a helmet?
- Q: What foot is in front of the other foot?

VQA failure cases: counting, reading
- Q: How many oranges are sitting on pedestals?
- Q: What is the name of the realtor?

Equally applicable to image captioning
- ResNet baseline: "A man sitting on a toilet in a bathroom."
- Up-Down attention: "A man sitting on a couch in a bathroom."

MS COCO Image Captioning Leaderboard
- Bottom-up attention adds a 6-8% improvement on the SPICE and CIDEr metrics (see arXiv:1707.07998, Bottom-Up and Top-Down Attention for Image Captioning and VQA)
- First place on almost all MS COCO leaderboard metrics

VQA experiments
Current best results: ensemble, trained on train + val + Visual Genome, evaluated on test-std
- Yes/no: 86.52
- Number: 48.48
- Other: 60.95
- Overall: 70.19
Bottom-up attention adds a 6% relative improvement, even though the baseline ResNet has twice as many layers.
(Ablation setting: single network, trained on train + Visual Genome, evaluated on val.)

Take-aways and conclusions
- Difficult to predict the effects of architecture and hyperparameter choices
- Engineering effort: good intuitions are valuable, then you need fast experiments; Performance ∝ (# ideas) × (# GPUs) / (training time)
- Beware of experiments with reduced training data
- Gains are non-cumulative and performance saturates
- Fancy tweaks may just add more capacity to the network, and may be redundant with other improvements
- Calculating attention at the level of objects and other salient image regions (bottom-up attention) significantly improves performance: replace pretrained CNN features with pretrained bottom-up attention features

Questions?
arXiv:1708.02711: Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge
arXiv:1707.07998: Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Damien Teney, Peter Anderson, David Golub, Po-Sen Huang, Lei Zhang, Xiaodong He, Anton van den Hengel