Knowledge Guided Attention and Inference for Describing Images Containing Unseen Objects


Aditya Mogadala 1, Umanga Bista 2, Lexing Xie 2, Achim Rettinger 1
1 Institute of Applied Informatics and Formal Description Methods (AIFB), Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany {aditya.mogadala,rettinger}@kit.edu
2 Computational Media Lab, Australian National University (ANU), Canberra, Australia {umanga.bista,lexing.xie}@anu.edu

Abstract. Images on the Web encapsulate diverse knowledge about varied abstract concepts. They cannot be sufficiently described with models learned from image-caption pairs that mention only a small number of visual object categories. In contrast, large-scale knowledge graphs contain many more concepts that can be detected by image recognition models. Hence, to assist description generation for those images which contain visual objects unseen in image-caption pairs, we propose a two-step process by leveraging large-scale knowledge graphs. In the first step, a multi-entity recognition model is built to annotate images with concepts not mentioned in any caption. In the second step, those annotations are leveraged as external semantic attention and constrained inference in the image description generation model. Evaluations show that our models outperform most of the prior work on out-of-domain MSCOCO image description generation and also scale better to broad domains with more unseen objects.

1 Introduction

Content on the Web is highly heterogeneous and consists mostly of visual and textual information. In most cases, these different modalities complement each other, which complicates the capturing of the full meaning by automated knowledge extraction techniques. An approach for making information in all modalities accessible to automated processing is linking the information represented in the different modalities (e.g., images and text) into a shared conceptualization, like entities in a Knowledge Graph (KG). However, obtaining an expressive formal representation of textual and visual content has remained a research challenge for many years.
Recently, a different approach has shown impressive results, namely the transformation of one unstructured representation into another. Specifically, the task of generating natural language descriptions of images or videos [16] has gained much attention. While such approaches do not rely on formal conceptualizations of the domain to cover, the systems that have been proposed so far are limited by the very small number of objects that they can describe (less than 100). Obviously, such methods, as they need to be trained on manually crafted image-caption parallel data, do not scale to real-world applications and cannot be applied to cross-domain web-scale content. In contrast, visual object classification techniques have improved considerably and are now scaling to thousands of objects, many more than those covered by caption training data [3]. Also, KGs have grown to cover all of those objects plus millions more, accompanied by billions of facts describing relations between those objects. Thus, it appears that those information sources are the missing link to make existing image captioning models scale to a larger number of objects without having to create additional image-caption training pairs for those missing objects.

In this paper, we investigate the hypothesis that conceptual relations of entities as represented in KGs can provide information to enable caption generation models to generalize to objects that they haven't seen during training in the image-caption parallel data. While there are existing methods tackling this task, none of them has exploited any form of conceptual knowledge so far. In our model, we use KG entity embeddings to guide the attention of the caption generator to the correct (unseen) object that is depicted in the image. Our main contributions presented in this paper are summarized as follows:

- We designed a novel approach, called Knowledge Guided Attention (KGA), to improve the task of generating captions for images which contain objects that are not in the training data. To achieve it, we created a multi-entity-label image classifier for linking the depicted visual objects to KG entities.
- Based on that, we introduce the first mechanism that exploits the relational structure of entities in KGs for guiding the attention of a caption generator towards picking the correct KG entity to mention in its descriptions.
- We conducted an extensive experimental evaluation showing the effectiveness of our KGA method.
This holds both in terms of generating effectual captions and in scaling to more than 600 visual objects. The contribution of this work on a broader scope is its progress towards the integration of the visual and textual information available on the Web with KGs.

2 Previous Work on Describing Images with Unseen Objects

Existing methods such as Deep Compositional Captioning (DCC) [4], Novel object Captioner (NOC) [15], Constrained Beam Search (CBS) [2] and LSTM-C [17] address the challenge by transferring information between seen and unseen objects either before inference (i.e. before testing) or by keeping constraints on the generation of caption words during inference (i.e. during testing). Figure 1 provides a broad overview of those approaches.

[Figure 1: an image containing the unseen object "pizza", captioned by each method. Base (no attention, no transfer): "A man is making a sandwich in a restaurant." DCC, NOC (no attention, transfer before inference) and CBS, LSTM-C (no attention, transfer during inference): "A man standing next to a table with a pizza in front of it." KGA (ours; knowledge-assisted attention, transfer before and during inference): "A man is holding a pizza in his hands."]

Fig. 1. KGA's goal is to describe images containing unseen objects by building on the existing methods, i.e. DCC [4], NOC [15], CBS [2] and LSTM-C [17], and going beyond them by adding semantic knowledge assistance. Base refers to our base description generation model built with CNN [13] - LSTM [5].

In DCC, an approach which performs information transfer only before inference, the training of the caption generation model is solely dependent on the corpus words which may appear in a similar context as the unseen objects. Hence, explicit transfer of learned parameters is required between seen and unseen object categories before inference, which limits DCC from scaling to a wide variety of unseen objects. NOC tries to overcome such issues by adopting an end-to-end trainable framework which incorporates auxiliary training objectives, removing the need for explicit transfer of parameters between seen and unseen objects before inference. However, NOC training can result in sub-optimal solutions, as the additional training attempts to optimize three different loss functions simultaneously. CBS leverages an approximate search algorithm to guarantee the inclusion of selected words during inference of a caption generation model. These words are, however, only constrained on the image tags produced by an image classifier, and the vocabulary used to find similar words as replacement candidates during inference is usually kept very large, adding extra computational complexity. LSTM-C avoids the limitation of finding similar words during inference by adding a copying mechanism into caption training.
This assists the model during inference in deciding whether a word should be generated or copied from a dictionary. However, LSTM-C suffers from confusion problems, since probabilities during word generation tend to get very low. In general, the aforementioned approaches also have the following limitations: (1) the image classifiers used cannot predict abstract meaning, like "hope", as observed in many web images; (2) visual features extracted from images are confined to the probability of occurrence of a fixed set of labels (i.e. nouns, verbs and adjectives) observed in a restricted dataset and cannot easily be extended to varied categories for large-scale experiments; (3) since an attention mechanism is missing, important regions in an image are never attended. In contrast, the attention mechanism in our model helps to scale down all possible identified concepts to the relevant concepts during caption generation. For large-scale applications, this plays a crucial role.

We introduce a new model called Knowledge Guided Assistance (KGA) that exploits conceptual knowledge provided by a knowledge graph (KG) [6] as external semantic attention throughout training, and also as a dynamic constraint before and during inference. Hence, it augments an auxiliary view as done in multi-view learning scenarios. Usage of KGs has already shown improvements in other tasks, such as question answering over structured data, language modeling [1], and generation of factoid questions [12].

3 Describing Images with Unseen Objects Using Knowledge Guided Assistance (KGA)

In this section, we present our caption generation model for generating captions for unseen visual object categories with knowledge assistance. KGA's core goal is to introduce external semantic attention (ESA) into the learning and also to work as a constraint before and during inference for transferring information between seen words and unseen visual object categories.

3.1 Caption Generation Model

Our image caption generation model (henceforth, KGA-CGM) combines three important components: a language model pre-trained on unpaired textual corpora, external semantic attention (ESA) and image features, with a textual (T), semantic (S) and visual (V) layer (i.e. TSV layer) for predicting the next word in the sequence when learned using image-caption pairs. In the following, we present each of these components separately, while Figure 2 presents the overall architecture of KGA-CGM.

Language Model. This component is crucial for transferring the sentence structure to unseen visual object categories. The language model is implemented with two long short-term memory (LSTM) [5] layers to predict the next word given the previous words in a sentence. Let w_{1:L} represent the input to the forward LSTM of layer-1 for capturing forward input sequences into hidden sequence vectors (h^1_{1:L} ∈ R^H), where L is the final time step.
The encoding of input word sequences into hidden layer-1 and then into layer-2 at each time step t is achieved as follows:

  h^1_t = L1-F(w_t; Θ)   (1)
  h^2_t = L2-F(h^1_t; Θ)   (2)

where Θ represents the hidden layer parameters. The encoded final hidden sequence (h^2_t ∈ R^H) at time step t is then used for predicting the probability distribution of the next word, given by p_{t+1} = softmax(h^2_t). The softmax layer is only used while training with unpaired textual corpora and is not used when learning with image captions.
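To make the two-layer forward pass concrete, here is a minimal numpy sketch of Equations 1-2 followed by the softmax prediction. All dimensions, weights and inputs are toy stand-ins, not the trained model; a vocabulary projection W_out is added here only to obtain a concrete distribution (the paper's softmax layer plays this role).

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the input/forget/output/candidate gates."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b                 # stacked pre-activations (4H,)
    i = 1.0 / (1.0 + np.exp(-z[0:H]))          # input gate
    f = 1.0 / (1.0 + np.exp(-z[H:2*H]))        # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2*H:3*H]))      # output gate
    g = np.tanh(z[3*H:4*H])                    # candidate cell update
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D, H, V = 8, 16, 50                            # toy embedding/hidden/vocab sizes
params = [(rng.normal(scale=0.1, size=(4*H, D if layer == 0 else H)),
           rng.normal(scale=0.1, size=(4*H, H)),
           np.zeros(4*H)) for layer in range(2)]
W_out = rng.normal(scale=0.1, size=(V, H))     # hypothetical softmax projection

# Run a 3-step sequence through layer-1 (L1-F) and layer-2 (L2-F), Eq. 1-2,
# then predict the next-word distribution over the vocabulary.
h1 = c1 = h2 = c2 = np.zeros(H)
for w_t in rng.normal(size=(3, D)):            # stand-ins for word embeddings w_t
    h1, c1 = lstm_step(w_t, h1, c1, *params[0])
    h2, c2 = lstm_step(h1, h2, c2, *params[1])
p_next = softmax(W_out @ h2)
```

Only the layer-2 hidden state feeds the prediction, mirroring how h^2_t alone enters the softmax above.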

Fig. 2. KGA-CGM is built with three components: a language model built with a two-layer forward LSTM (L1-F and L2-F), a multi-word-label classifier that generates image visual features, and a multi-entity-label classifier that generates entity-labels linked to a KG, serving as a partial image-specific scene graph (illustrated in the figure with entity nodes such as Restaurant, Pizza and Chef). This information is further leveraged to acquire entity vectors for supporting ESA. w_t represents the input caption word, c_t the semantic attention, p_t the output probability distribution over all words and y_t the predicted word at each time step t. BOS and EOS represent the special tokens.

External Semantic Attention (ESA). Our objective in ESA is to extract semantic attention from an image by leveraging semantic knowledge in the KG in the form of entity-labels, obtained using a multi-entity-label image classifier (discussed in Section 4.2). Here, entity-labels are analogous to patches or attributes of an image. In formal terms, if ea_i is an entity-label and e_i ∈ R^E its entity-label vector among a set of entity-label vectors (i = 1, .., L), and β_{ti} the attention weight of e_i at time step t, then β_{ti} is calculated using Equation 3:

  β_{ti} = exp(O_{ti}) / Σ_{j=1..L} exp(O_{tj})   (3)

where O_{ti} = f(e_i, h^2_t) represents a scoring function which conditions on the layer-2 hidden state (h^2_t) of the caption language model. It can be observed that the scoring function f(e_i, h^2_t) is crucial for deciding the attention weights. The relevance of the hidden state with each entity-label is calculated using Equation 4:

  f(e_i, h^2_t) = tanh((h^2_t)^T W_{he} e_i)   (4)

where W_{he} ∈ R^{H×E} is a bilinear parameter matrix. Once the attention weights are calculated, the soft attention weighted vector of the context c_t, which is a dynamic representation of the caption at time step t, is given by Equation 5:

  c_t = Σ_{i=1..L} β_{ti} e_i   (5)

Here, c_t ∈ R^E and L represents the cardinality of entity-labels per image-caption pair instance.

Image Features, TSV Layer & Next Word Prediction. Visual features for an image are extracted using the multi-word-label image classifier (discussed in Section 4.2). To be consistent with other approaches [4, 15] and for a fair comparison, our visual features (I) also cover the objects that we aim to describe outside of the caption datasets, besides the word-labels observed in the paired image-caption data. Once the output from all components is acquired, the TSV layer is employed to integrate their features, i.e. the textual (T), semantic (S) and visual (V) features yielded by the language model, ESA and the images respectively. Thus, TSV acts as a transformation layer for molding three different feature spaces into a single common space for prediction of the next word in the sequence. If h^2_t ∈ R^H, c_t ∈ R^E and I ∈ R^I represent the vectors acquired at each time step t from the language model, ESA and the images respectively, then the integration at the TSV layer of KGA-CGM is given by Equation 6:

  TSV_t = W_{h^2} h^2_t + W_c c_t + W_I I   (6)

where W_{h^2} ∈ R^{vs×H}, W_c ∈ R^{vs×E} and W_I ∈ R^{vs×I} are linear conversion matrices and vs is the vocabulary size of the image-caption pair training dataset. The output from the TSV layer at each time step t is further used for predicting the next word in the sequence using a softmax layer, given by p_{t+1} = softmax(TSV_t).

3.2 KGA-CGM Training

To learn the parameters of KGA-CGM, we first freeze the parameters of the language model trained using unpaired textual corpora, thus enabling only those parameters emerging from the ESA and TSV layers, namely W_{he}, W_{h^2}, W_c and W_I, to be learned with image-caption pairs. KGA-CGM is then trained to optimize the cost function that minimizes the sum of the negative log likelihood of the appropriate word at each time step, given by Equation 7:

  min_θ −(1/N) Σ_{n=1..N} Σ_{t=0..L^(n)} log p(y^(n)_t)   (7)

where L^(n) represents the length of the sentence (i.e. caption), including the beginning of sentence (BOS) and end of sentence (EOS) tokens, for the n-th training sample, and N is the number of samples used for training.
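Equations 3-6, plus one term of the negative log likelihood in Equation 7, can be sketched end-to-end as follows. This is a minimal numpy sketch with toy dimensions; all names, sizes and the target word index are illustrative, not the paper's implementation.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def esa_context(h2_t, E, W_he):
    """ESA (Eq. 3-5): score each entity-label vector e_i against the layer-2
    hidden state, softmax-normalize, and return the weighted context c_t."""
    scores = np.tanh(E @ (W_he.T @ h2_t))      # O_ti = tanh(h2_t^T W_he e_i), Eq. 4
    beta = softmax(scores)                     # attention weights, Eq. 3
    return beta, beta @ E                      # context vector c_t, Eq. 5

def tsv_next_word(h2_t, c_t, I, W_h2, W_c, W_I):
    """TSV layer (Eq. 6): map textual, semantic and visual features into one
    vocabulary-sized space and predict p_{t+1} via softmax."""
    return softmax(W_h2 @ h2_t + W_c @ c_t + W_I @ I)

rng = np.random.default_rng(0)
H, E_dim, I_dim, L, vs = 16, 12, 10, 5, 40     # toy sizes
h2_t = rng.normal(size=H)                      # layer-2 hidden state
E = rng.normal(size=(L, E_dim))                # entity-label vectors e_1..e_L
I = rng.normal(size=I_dim)                     # visual feature vector
W_he = rng.normal(scale=0.1, size=(H, E_dim))  # bilinear attention matrix
W_h2 = rng.normal(scale=0.1, size=(vs, H))
W_c = rng.normal(scale=0.1, size=(vs, E_dim))
W_I = rng.normal(scale=0.1, size=(vs, I_dim))

beta, c_t = esa_context(h2_t, E, W_he)
p_next = tsv_next_word(h2_t, c_t, I, W_h2, W_c, W_I)
nll = -np.log(p_next[7])                       # one Eq. 7 term, target word id 7
```

Training would sum such per-step terms over all captions and minimize them with respect to W_he, W_h2, W_c and W_I while the language model stays frozen.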
3.3 KGA-CGM Constrained Inference

Inference in KGA-CGM refers to the generation of descriptions for test images. Here, inference is not as straightforward as in standard image caption generation approaches [16], because unseen visual object categories have no parallel captions throughout training. Hence, they would never be generated in a caption. Thus, unseen visual object categories require guidance, either before or during inference, from similar seen words that appear in the paired image-caption dataset, and likely also from image labels. In our case, we achieve this guidance both before and during inference with varied techniques.

Guidance before Inference. We first identify the seen words in the paired image-caption dataset that are similar to the visual object categories unseen in the image-caption dataset, by estimating their semantic similarity using Glove embeddings [9] learned on unpaired textual corpora (more details in Section 4.1). We then utilize this information to perform a dynamic transfer between the visual feature (W_I), language model (W_{h^2}) and external semantic attention (W_c) weights of seen words and unseen visual object categories. To illustrate, let (v_unseen, i_unseen) and (v_closest, i_closest) denote the indexes of the unseen visual object category "zebra" and its semantically similar known word "giraffe" in the vocabulary (v_s) and the visual features (i_s) respectively. Then, to describe images with "zebra" in a similar manner as "giraffe", the transfer of weights is performed between them by assigning W_c[v_closest,:], W_{h^2}[v_closest,:] and W_I[v_closest,:] to W_c[v_unseen,:], W_{h^2}[v_unseen,:] and W_I[v_unseen,:] respectively. Furthermore, W_I[i_unseen, i_closest] and W_I[i_closest, i_unseen] are set to zero, to remove mutual dependencies between the presence of seen and unseen words in an image. The aforementioned procedure updates the trained KGA-CGM model before inference to assist the generation of unseen visual object categories during inference, as given by Algorithm 1.
Input: M = {W_{he}, W_{h^2}, W_c, W_I}
Output: M_new
1:  Initialize List(closest) = cosine_distance(List(unseen), vocabulary);
2:  Initialize W_c[v_unseen,:], W_{h^2}[v_unseen,:], W_I[v_unseen,:] = 0;
3:  Function Before_Inference
4:    forall items T in closest and Z in unseen do
5:      if T and Z in vocabulary then
6:        W_c[v_Z,:] = W_c[v_T,:];
7:        W_{h^2}[v_Z,:] = W_{h^2}[v_T,:];
8:        W_I[v_Z,:] = W_I[v_T,:];
9:      end
10:     if i_T and i_Z in visual features then
11:       W_I[i_Z, i_T] = 0;
12:       W_I[i_T, i_Z] = 0;
13:     end
14:   end
15:   M_new = M;
16:   return M_new;
17: end
Algorithm 1: Constrained Inference Overview (Before)
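A minimal Python sketch of Algorithm 1's weight transfer, together with the entity-label check that Algorithm 2 applies during beam search. The matrix sizes, index values and the word-to-category mapping are hypothetical, and the swap direction in `constrain_step` is one reading of the pseudocode, not the paper's exact implementation.

```python
import numpy as np

def transfer_before_inference(W_c, W_h2, W_I, v_unseen, v_closest,
                              i_unseen, i_closest):
    """Algorithm 1 sketch: copy the closest seen word's rows into the unseen
    word's rows, then zero the mutual visual dependencies. In-place."""
    for W in (W_c, W_h2, W_I):
        W[v_unseen, :] = W[v_closest, :]
    W_I[i_unseen, i_closest] = 0.0
    W_I[i_closest, i_unseen] = 0.0

def constrain_step(words, closest_to_unseen, entity_labels):
    """Algorithm 2 sketch: swap a seen word for its unseen counterpart only
    when that category appears among the image's predicted entity-labels."""
    return [closest_to_unseen[w]
            if w in closest_to_unseen and closest_to_unseen[w] in entity_labels
            else w
            for w in words]

rng = np.random.default_rng(3)
vs = 6                                      # toy vocabulary/feature size
W_c, W_h2, W_I = (rng.normal(size=(vs, vs)) for _ in range(3))
# hypothetical indices: unseen "zebra" at row 5, closest seen "giraffe" at row 2
transfer_before_inference(W_c, W_h2, W_I, v_unseen=5, v_closest=2,
                          i_unseen=5, i_closest=2)

beam = ["giraffe", "table", "sandwich"]     # hypothetical beam candidates
mapping = {"giraffe": "zebra", "sandwich": "pizza"}
labels = {"pizza", "restaurant"}            # image entity-labels; no "zebra"
constrained = constrain_step(beam, mapping, labels)
```

Here "sandwich" is replaced by "pizza" because "pizza" is among the image's entity-labels, while "giraffe" is kept since "zebra" is not.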

Guidance during Inference. The updated KGA-CGM model is used for generating descriptions of unseen visual object categories. However, in the before-inference procedure, the closest words to unseen visual object categories are identified using embeddings that are learned only from textual corpora and are never constrained on images. This obstructs the view from an image, leading to spurious results. We resolve such nuances during inference by constraining the beam search used for description generation with the image entity-labels (ea). In general, beam search considers the best k sentences at time t to identify the sentence at the next time step. Our modification to beam search adds an extra constraint which checks whether a generated unseen visual object category is part of the entity-labels; if it is not, unseen visual object categories are never substituted for their closest seen words. Algorithm 2 presents an overview of KGA-CGM guidance during inference.

Input: M_new, Im_labels, beam size k, word w
Output: best k successors
1:  Initialize Im_labels = Top-5(ea);
2:  Initialize beam size k;
3:  Initialize word w = null;
4:  Function During_Inference
5:    forall states s of the k words do
6:      w = s;
7:      if closest[w] in ea then
8:        s = closest[w];
9:      end
10:     else
11:       s = w;
12:     end
13:   end
14:   return best k successors;
15: end
Algorithm 2: Constrained Inference Overview (During)

4 Experimental Setup

4.1 Resources and Datasets

Our approach depends on several resources and datasets.

Knowledge Graphs (KGs) and Unpaired Textual Corpora. There are several openly available KGs, such as DBpedia, Wikidata, and YAGO, which provide semantic knowledge encapsulated in entities and their relationships. We

choose DBpedia as our KG for entity annotation, as it is one of the most extensively used resources for semantic annotation and disambiguation [6]. For learning the weights of the language model and also the Glove word embeddings, we have explored different unpaired textual corpora from out-of-domain sources (i.e. outside the image-caption parallel corpora), such as the British National Corpus (BNC)^3, Wikipedia (Wiki) and a subset of the SBU1M^4 caption text containing the 947 categories of the ILSVRC12 dataset [11]. The NLTK^5 sentence tokenizer is used for tokenization, and a vocabulary of around 70k+ words is extracted with Glove embeddings.

Unseen Objects Description (Out-of-Domain MSCOCO & ImageNet). To evaluate KGA-CGM, we use the subset of the MSCOCO dataset [7] proposed by Hendricks et al. [4]. The dataset is obtained by clustering the 80 image object category labels into 8 clusters and then selecting one object from each cluster to be held out from the training set. The training set thus does not contain the images and sentences of those 8 objects, represented by bottle, bus, couch, microwave, pizza, racket, suitcase and zebra, leaving an MSCOCO training dataset of 70,194 image-caption pairs, while the validation set of 40,504 image-caption pairs is divided into 20,252 each for testing and validation. The goal of KGA-CGM is then to generate captions for those test images which contain these 8 unseen object categories. Henceforth, we refer to this dataset as out-of-domain MSCOCO.

To evaluate KGA-CGM on a more challenging task, we attempt to describe images that contain the wide variety of objects observed on the web. To imitate such a scenario, we collected images from collections containing a wide variety of objects. First, we used the same set of images as earlier approaches [15, 17], a subset of ImageNet [3] constituting the 642 object categories used in Hendricks et al. [4] which do not occur in MSCOCO. However, 120 of those 642 object categories are part of ILSVRC12.

4.2 Multi-Label Image Classifiers

The important constituents that influence KGA-CGM are the image entity-labels and visual features. Identified objects/actions etc.
in an image are embodied in the visual features, while entity-labels capture the semantic knowledge in an image grounded in the KG. In this section, we present the approach for extracting both visual features and entity-labels.

Multi-Word-label Image Classifier. To extract visual features of out-of-domain MSCOCO images, emulating Hendricks et al. [4], a multi-word-label classifier is built using the captions aligned to an image, by extracting the part-of-speech (POS) tags such as nouns, verbs and adjectives attained for each word in the entire MSCOCO dataset. For example, the caption "A young child brushes his teeth at the sink" contains word-labels such as "young" (JJ), "child" (NN), "teeth" (NN) etc. that represent concepts in an image. An image classifier is now trained with 471 word-labels using a sigmoid cross-entropy loss by fine-tuning VGG-16 [13] pre-trained on the training part of ILSVRC12. The visual features extracted for a new image represent the probabilities of the 471 image labels observed in that image. For extracting visual features from ImageNet images, we replace the multi-word-label classifier with the lexical classifier [4] learned with the 642 ImageNet object categories.

3 http://www.natcorp.ox.ac.uk/
4 http://vision.cs.stonybrook.edu/~vicente/sbucaptions/
5 http://www.nltk.org/

Multi-Entity-label Image Classifier. To extract semantic knowledge for out-of-domain MSCOCO images, analogous to the word-labels, a multi-entity-label classifier is built with entity-labels attained from a knowledge graph annotation tool, DBpedia Spotlight^6, on the training set of MSCOCO constituting 82,783 training image-caption pairs. In total, around 812 unique labels are extracted, with an average of 3.2 labels annotated per image. To illustrate, for the caption presented in the previous section, the entity-labels extracted are Brush^7 and Tooth^8. An image classifier is now trained with multiple entity-labels using a sigmoid cross-entropy loss by fine-tuning VGG-16 [13] pre-trained on the training part of ILSVRC12. For extracting entity-labels from ImageNet images, we again leverage the lexical classifier [4] learned with the 642 ImageNet object categories. However, as all 642 categories denote WordNet synsets, we build a connection between these categories and DBpedia by leveraging BabelNet [8] for the multi-entity-label classifier. To illustrate, the visual object category "wombat" (wordnetid: n1883070) in ImageNet can be linked to the DBpedia entity Wombat^9. Hence, this makes our method very modular for building new image classifiers that incorporate semantic knowledge.

4.3 Entity-Label Embeddings

We noted earlier that the entity-labels for training the multi-entity-label classifier were obtained using the DBpedia Spotlight entity annotation and disambiguation tool.
Hence, entity-labels are expected to encapsulate semantic knowledge grounded in the KG. Further, entities in a KG can be represented with embeddings that capture their relational information. In our work, we examine the efficacy of these embeddings for caption generation. Thus, we leverage entity-label embeddings for computing the semantic attention observed in an image with respect to the caption, as observed from the KG. To obtain entity-label embeddings, we adopted the RDF2Vec [10] approach and generated 500-dimensional vector representations for the 812 and 642 entity-labels used to describe out-of-domain MSCOCO and ImageNet images respectively.

6 https://github.com/dbpedia-spotlight/
7 http://dbpedia.org/resource/Brush
8 http://dbpedia.org/resource/Tooth
9 http://dbpedia.org/page/Wombat
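Both image classifiers of Section 4.2 are fine-tuned with a sigmoid cross-entropy loss, which treats each word- or entity-label as an independent binary prediction. A minimal numpy sketch, using a numerically stable formulation and toy label counts and values (not the paper's implementation):

```python
import numpy as np

def sigmoid_cross_entropy(logits, targets):
    """Mean multi-label sigmoid cross-entropy. Uses the stable identity
    max(x, 0) - x*z + log(1 + exp(-|x|)) for -z*log(sigmoid(x))
    - (1-z)*log(1 - sigmoid(x))."""
    return np.mean(np.maximum(logits, 0.0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))

# Toy example: 4 labels, two of them present in the image (multi-hot targets).
logits = np.array([2.0, -1.0, 0.5, -3.0])   # hypothetical classifier outputs
targets = np.array([1.0, 0.0, 1.0, 0.0])    # multi-hot label vector
loss = sigmoid_cross_entropy(logits, targets)
```

Unlike a softmax loss, this allows several labels (e.g. Brush and Tooth) to be active for the same image, which is what the multi-label setup above requires.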

4.4 Evaluation Measures

To evaluate the generated descriptions for the unseen MSCOCO visual object categories, we use evaluation metrics similar to earlier approaches [4, 15, 17], namely METEOR and also SPICE [2]. However, the CIDEr [14] metric is not used, as it requires calculating the inverse document frequency across the entire test set and not just the unseen object subsets. The F1 score is also calculated, to measure the presence of unseen objects in the generated captions when compared against the reference captions. Furthermore, to evaluate ImageNet object category description generation, we leverage F1 and also other metrics such as the Unseen and Accuracy scores [15, 17]. The Unseen score measures the percentage of all novel objects mentioned in generated descriptions, while Accuracy measures the percentage of image descriptions that correctly address the unseen objects.

5 Experiments

The experiments are conducted to evaluate the efficacy of the KGA-CGM model for describing out-of-domain MSCOCO and ImageNet images.

5.1 Implementation

The KGA-CGM model constitutes three important components, i.e. the language model, visual features and entity-labels. Before learning the KGA-CGM model with image-caption pairs, we first learn the weights of the language model and keep them fixed during the training of the KGA-CGM model. To learn the language model, we leverage unpaired textual corpora (e.g. the entire MSCOCO set, Wiki, BNC etc.) and provide 256-dimensional input word embeddings pre-trained with the Glove [9] default settings on the same unpaired textual corpora. The hidden layer dimensions of the language model are set to 512. The KGA-CGM model is then trained using image-caption pairs with the Adam optimizer, with gradient clipping at a maximum norm of 1.0, for about 15-50 epochs. Validation data is used for model selection, and the experiments are implemented with Keras with a Theano backend^10.

5.2 Describing Out-of-Domain MSCOCO Images

In this section, we evaluate KGA-CGM using the out-of-domain MSCOCO dataset.
Quantitative Analysis

We compared our complete KGA-CGM model with other existing models that generate image descriptions on out-of-domain MSCOCO. For a fair comparison, only those results are compared which used VGG-16 to generate image features. Table 1 shows the comparison of individual and average scores based on METEOR, SPICE, and F1 on all 8 unseen visual object categories with beam size 1.

F1
Model       | Beam | microwave | racket | bottle | zebra | pizza | couch | bus  | suitcase | Average
DCC [4]     | 1    | 28.1      | 52.2   | 4.6    | 79.9  | 64.6  | 45.9  | 29.8 | 13.2     | 39.7
NOC [15]    | >1   | 24.7      | 55.3   | 17.7   | 89.0  | 69.3  | 25.5  | 68.7 | 39.8     | 48.8
CBS(T4) [2] | >1   | 29.7      | 57.1   | 16.3   | 85.7  | 77.2  | 48.2  | 67.8 | 49.9     | 54.0
LSTM-C [17] | >1   | 27.8      | 70.2   | 29.6   | 91.4  | 68.1  | 38.7  | 74.4 | 44.7     | 55.6
KGA-CGM     | 1    | 50.0      | 75.3   | 29.9   | 92.1  | 70.6  | 42.1  | 54.2 | 25.6     | 55.0

METEOR
DCC [4]     | 1    | 22.1      | 20.3   | 18.1   | 22.3  | 22.2  | 23.1  | 21.6 | 18.3     | 21.0
NOC [15]    | >1   | 21.5      | 24.6   | 21.2   | 21.8  | 21.8  | 21.4  | 20.4 | 18.0     | 21.3
LSTM-C [17] | >1   | -         | -      | -      | -     | -     | -     | -    | -        | 23.0
CBS(T4) [2] | >1   | -         | -      | -      | -     | -     | -     | -    | -        | 23.3
KGA-CGM     | 1    | 22.6      | 25.1   | 21.5   | 22.8  | 21.4  | 23.0  | 20.3 | 18.7     | 22.0

SPICE
DCC [4]     | >1   | -         | -      | -      | -     | -     | -     | -    | -        | 13.4
CBS(T4) [2] | >1   | -         | -      | -      | -     | -     | -     | -    | -        | 15.9
KGA-CGM     | 1    | 13.3      | 16.8   | 13.1   | 19.6  | 13.2  | 14.9  | 12.6 | 10.6     | 14.3

Table 1. Measures for all 8 unseen objects. Underline shows the second best.

It can be noticed that KGA-CGM with beam size 1 was comparable to other approaches even though it used a fixed vocabulary built from image-caption pairs. For example, CBS [2] used an expanded vocabulary of 21,689 words, compared to our 8,802. Also, our word-labels per image are fixed, while CBS uses a varying number of predicted image tags (T1-4). This makes it non-deterministic and can increase uncertainty, as varying tags will either increase or decrease performance. Furthermore, we also evaluated KGA-CGM on the remaining seen visual object categories in Table 2. It can be observed that our KGA-CGM outperforms existing approaches, as it did not undermine in-domain description generation even though it was tuned for out-of-domain description generation.

Model       | Beam | METEOR | SPICE | F1-score
DCC [4]     | 1    | 23.0   | 15.9  | -
CBS(T4) [2] | >1   | 24.5   | 18.0  | -
KGA-CGM     | 1    | 24.1   | 17.2  | -
KGA-CGM     | >1   | 25.1   | 18.2  | -

Table 2. Average measures of MSCOCO seen objects.

¹⁰ https://github.com/adityamogadala/kga
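Since several baselines in the tables above decode with beam sizes greater than 1 while KGA-CGM reports beam size 1 (i.e. greedy decoding), a minimal beam-search sketch may make the distinction concrete. The `step` function below is a hypothetical stand-in for a decoder's next-word distribution, not the paper's model:

```python
import math

def beam_search(step, start, width, length):
    """Keep the `width` highest log-probability sequences at each step.

    step(seq) -> dict mapping each candidate next token to its probability.
    """
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for tok, p in step(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:width]  # prune to the beam width
    return beams[0][0]  # best-scoring sequence
```

With `width=1` this reduces to greedy decoding, which is what "beam size 1" means in the tables; constrained variants such as CBS [2] additionally force required words to appear among the kept beams.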

Qualitative Analysis

In Figure 3, sample predictions of our best KGA-CGM model are presented. It can be observed that the entity-labels influence caption generation: since entities used as image labels are already disambiguated, high similarity is attained in the prediction of a word, thus adding useful semantics. Figure 3 presents example descriptions of unseen visual objects.

Unseen Object: Bottle
Predicted Entity-Labels (Top-3): Wine_glass, Wine_bottle, Bottle
Base: A vase with a flower in it sitting on a table
NOC: A wine bottle sitting on a table next to a wine bottle
KGA-CGM: A bottle of wine sitting on top of a table

Unseen Object: Couch
Predicted Entity-Labels (Top-3): Cake, Couch, Glass
Base: A person is laying down on a bed
NOC: A woman sitting on a chair with a large piece of cake on her arm
KGA-CGM: A woman sitting on a couch with a remote

Unseen Object: Pizza
Predicted Entity-Labels (Top-3): Pizza, Restaurant, Hat
Base: A man is making a sandwich in a restaurant
NOC: A man standing next to a table with a pizza in front of it.
KGA-CGM: A man is holding a pizza in his hands

Unseen Object: Suitcase
Predicted Entity-Labels (Top-3): Cat, Baggage, Black_Cat
Base: A cat laying on top of a pile of books
NOC: A cat laying on a suitcase on a bed
KGA-CGM: A cat laying inside of a suitcase on a bed

Unseen Object: Bus
Predicted Entity-Labels (Top-3): Bus, Public_Transport, Transit_Bus
Base: A car is parked on the side of the street
NOC: Bus driving down a street next to a bus stop.
KGA-CGM: A white bus is parked on the street

Unseen Object: Microwave
Predicted Entity-Labels (Top-3): Refrigerator, Oven, Microwave_Oven
Base: A wooden table with a refrigerator and a brown cabinet
NOC: A kitchen with a refrigerator, refrigerator, and refrigerator.
KGA-CGM: A kitchen with a microwave, oven and a refrigerator

Unseen Object: Racket
Predicted Entity-Labels (Top-3): Tennis, Racket_(sports_equipment), Court
Base: A tennis player getting ready to serve the ball
NOC: A woman court holding a tennis racket on a court.
KGA-CGM: A woman playing tennis on a tennis court with a racket.
Unseen Object: Zebra
Predicted Entity-Labels (Top-3): Zebra, Enclosure, Zoo
Base: A couple of animals that are standing in a field
NOC: Zebras standing together in a field with zebras
KGA-CGM: A group of zebras standing in a line

Fig. 3. Sample predictions of KGA-CGM on out-of-domain MSCOCO images with beam size 1, compared against the base model and NOC [15].

5.3 Describing ImageNet Images

ImageNet images do not contain any ground-truth captions, and each image contains exactly one unseen visual object category. Initially, we retrain different language models using the unpaired textual data (Section 4.1) and also the entire MSCOCO training set. The KGA-CGM model is then rebuilt for each of them separately. To describe ImageNet images, the image classifiers presented in Section 4.2 are leveraged. Table 3 summarizes the experimental results attained on 634 categories (i.e. not all 642), to allow a fair comparison with other approaches. By adopting only MSCOCO training data for the language model, our KGA-CGM achieves a relative improvement over NOC and LSTM-C in all measures, i.e. Unseen, F1, and Accuracy. Figure 4 shows a few sample descriptions.

Model       | Unpaired-Text  | Unseen | F1   | Accuracy
NOC [15]    | MSCOCO         | 69.1   | 15.6 | 10.0
NOC [15]    | BNC&Wiki       | 87.7   | 31.2 | 22.0
LSTM-C [17] | MSCOCO         | 72.1   | 16.4 | 11.8
LSTM-C [17] | BNC&Wiki       | 89.1   | 33.6 | 31.1
KGA-CGM     | MSCOCO         | 74.1   | 17.4 | 12.2
KGA-CGM     | BNC&Wiki       | 90.2   | 34.4 | 33.1
KGA-CGM     | BNC&Wiki&SBU1M | 90.8   | 35.8 | 34.2

Table 3. Describing ImageNet images with beam size 1. Results of NOC and LSTM-C (with GloVe) are adopted from Yao et al. [17].

Unseen Object: Truffle
Guidance Before Inference: food truffle
Base: A person holding a piece of paper.
KGA-CGM: A close up of a person holding truffle

Unseen Object: Papaya
Guidance Before Inference: banana papaya
Base: A woman standing in a garden.
KGA-CGM: These are ripe papaya hanging on a tree

Unseen Object: Mammoth
Guidance Before Inference: elephant mammoth
Base: A baby elephant standing in water
KGA-CGM: A herd of mammoth standing on top of a green field

Unseen Object: Blackbird
Guidance Before Inference: bird blackbird
Base: A bird standing in a field of green grass
KGA-CGM: A blackbird standing in the grass

Fig. 4. ImageNet images described with the best KGA-CGM model from Table 3. "Guidance Before Inference" shows which words are used for the transfer between seen and unseen objects.

6 Key Findings

The key observations of our research are: (1) The ablation study conducted to understand the influence of different components in KGA-CGM has shown that using external semantic attention and constrained inference together yields superior performance compared to using only either of them. Also, increasing the beam size during inference has shown a drop in all measures; this is mainly attributed to the influence of multiple words on unseen objects. (2) The performance advantage becomes clearer as the domain of unseen objects is broadened. In other words, KGA-CGM specifically improves over the state-of-the-art in settings that are larger and less controlled: it scales to an order of magnitude more unseen objects with only moderate performance decreases. (3) The influence of the closest seen words (i.e. those observed in image-caption pairs) on the unseen visual object categories played a prominent role in generating descriptions. For example, in out-of-domain MSCOCO, words such as suitcase/bag, bottle/glass and bus/truck are semantically similar and are used in a similar manner in a sentence, which added excellent value. However, some words that usually co-occur, such as racket/court and pizza/plate, play different roles in sentences, which led to a few grammatical errors.
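The role of the closest seen words noted in finding (3) can be made concrete with a small sketch: given word vectors, the closest seen word to an unseen object falls out of a cosine-similarity argmax. The three-dimensional vectors below are invented for illustration; the paper uses 256-dimensional GloVe embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embeddings, invented for illustration only.
emb = {
    "suitcase": [0.9, 0.1, 0.2],
    "bag":      [0.8, 0.2, 0.1],
    "pizza":    [0.1, 0.9, 0.3],
}

def closest_seen(unseen_word, seen_words):
    """Return the seen word whose embedding is most similar to the unseen one."""
    return max(seen_words, key=lambda w: cosine(emb[unseen_word], emb[w]))
```

Under these toy vectors, "bag" would be selected as the closest seen word for "suitcase"; as the finding notes, such a transfer helps when the pair is used similarly in sentences, but can misfire for merely co-occurring pairs like racket/court.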
(4) The decrease in performance correlates highly with the discrepancy between the domains from which the seen and unseen objects come.

7 Conclusion and Future Work

In this paper, we presented an approach to generate captions for images that lack parallel captions during training, with assistance from the semantic knowledge encapsulated in KGs. In the future, we plan to expand our models to build multimedia knowledge graphs along with image descriptions, which can be used for finding related images or searched with long textual queries.

8 Acknowledgements

The first author is grateful to KHYS at KIT for their research travel grant and to the Computational Media Lab at ANU for providing access to their K40x GPUs.

References

1. Ahn, S., Choi, H., Pärnamaa, T., Bengio, Y.: A neural knowledge language model. arXiv preprint arXiv:1608.00318 (2016)
2. Anderson, P., Fernando, B., Johnson, M., Gould, S.: Guided open vocabulary image captioning with constrained beam search. In: EMNLP (2017)
3. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR. pp. 248-255. IEEE (2009)
4. Hendricks, L.A., Venugopalan, S., Rohrbach, M., Mooney, R., Saenko, K., Darrell, T.: Deep compositional captioning: Describing novel object categories without paired training data. In: CVPR. pp. 1-10 (2016)
5. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735-1780 (1997)
6. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., et al.: DBpedia: A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web (2015)
7. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. pp. 740-755. Springer (2014)
8. Navigli, R., Ponzetto, S.P.: BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence 193, 217-250 (2012)
9. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: EMNLP. pp. 1532-1543 (2014)
10. Ristoski, P., Paulheim, H.: RDF2Vec: RDF graph embeddings for data mining. In: International Semantic Web Conference. pp. 498-514. Springer (2016)
11. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211-252 (2015)
12.
Serban, I.V., García-Durán, A., Gulcehre, C., Ahn, S., Chandar, S., Courville, A., Bengio, Y.: Generating factoid questions with recurrent neural networks: The 30M factoid question-answer corpus. arXiv preprint arXiv:1603.06807 (2016)
13. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
14. Vedantam, R., Zitnick, C.L., Parikh, D.: CIDEr: Consensus-based image description evaluation. In: CVPR. pp. 4566-4575 (2015)
15. Venugopalan, S., Hendricks, L.A., Rohrbach, M., Mooney, R., Darrell, T., Saenko, K.: Captioning images with diverse objects. In: CVPR (2017)
16. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4), 652-663 (2017)
17. Yao, T., Pan, Y., Li, Y., Mei, T.: Incorporating copying mechanism in image captioning for learning novel objects. In: CVPR (2017)