University of Edinburgh, Edinburgh NLP. Understanding Visual Scenes: Dependency Graphs, Word Senses, and Multimodal Embeddings


Understanding Visual Scenes: Dependency Graphs, Word Senses, and Multimodal Embeddings. Mirella Lapata, School of Informatics, University of Edinburgh, Edinburgh NLP. Mirella Lapata Understanding Visual Scenes 1

Joint Work with Carina Silberer, Spandana Gella, Frank Keller, Jasper Uijlings. Mirella Lapata Understanding Visual Scenes 2

Structure in Multimodal Processing. Lots of recent work on multimodal processing: image description generation; visual question answering; multimodal machine translation; video summarization. We need to understand the meaning of images and text: who does what to whom? Understanding requires structure, not just an unordered set of labels: linguistic structure; image structure. Mirella Lapata Understanding Visual Scenes 3

Structure in Multimodal Processing A man is playing a trumpet in front of a little boy. Mirella Lapata Understanding Visual Scenes 4

Linguistic Structure. Output of a dependency parser (with PoS labels): http://nlp.stanford.edu:8080/corenlp/process Mirella Lapata Understanding Visual Scenes 5

Linguistic Structure. Output of a semantic role labeler (with word senses): http://cogcomp.cs.illinois.edu/page/demo_view/srl Mirella Lapata Understanding Visual Scenes 6

Structure in Multimodal Processing. Linguistic structure: discrete base units (words), ordered in 1D; span-based labels (e.g., PoS, phrases); tree-based hierarchies; clear distinction between syntax and semantics; canonical representations defined by linguistic theory. Now let's compare this to image structure. Mirella Lapata Understanding Visual Scenes 7

Image Structure. Output of an image labeler: https://www.clarifai.com/demo We could also label: attributes, scene type, colors, textures, etc. Mirella Lapata Understanding Visual Scenes 8

Image Structure. Output of an object recognizer: a Fast R-CNN model with an AlexNet architecture trained on PASCAL VOC 2007. Mirella Lapata Understanding Visual Scenes 9

Image Structure. Hierarchical segmentation (indicates part-whole relationships): http://www.socher.org/index.php/main/parsingnaturalscenesandnaturallanguagewithrecursiveneuralnetworks Mirella Lapata Understanding Visual Scenes 10

Structure in Multimodal Processing. Linguistic structure: discrete base units (words), ordered in 1D; span-based labels (e.g., PoS, phrases); tree-based hierarchies; clear distinction between syntax and semantics; canonical representations defined by linguistic theory. Image structure: continuous base units (pixels), ordered in 2D; region-based labels (e.g., objects, attributes); part-whole structure; no clear distinction between syntax and semantics; no correct canonical representations. Mirella Lapata Understanding Visual Scenes 11

Representational Divergence Representational divergence: for multimodal processing, we need to fuse linguistic and image structures, but they are very different. Hypothesis: We need to align visual representations. Two examples in this talk: visual dependency representations; visual sense disambiguation. Mirella Lapata Understanding Visual Scenes 12

Outline: 1. Representing Visual Structure (Visual Dependency Representations; Visual Constituency Representations; Applications). 2. Task Definition; Dataset Construction. 3. Mirella Lapata Understanding Visual Scenes 13

Spatial Relations. We need a grammar that defines the relations between the objects in an image: Visual Dependency Grammar (Elliott & Keller 2013). It assumes eight relations that can hold between pairs of objects, based on three geometric properties: pixel overlap; angle between objects; distance between objects. Mirella Lapata Understanding Visual Scenes 15

Spatial Relations. X on Y; X surrounds Y; X beside Y; X opposite Y; X above Y; X below Y; X infront Y; X behind Y. Mirella Lapata Understanding Visual Scenes 16
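The three geometric properties can be turned into a simple relation predictor over bounding boxes. A minimal sketch, where the function names, box format, and thresholds are illustrative choices of mine, not the grammar's exact definitions:

```python
import math

def overlap(a, b):
    """Fraction of box a's area covered by box b; boxes are (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area = (a[2] - a[0]) * (a[3] - a[1])
    return iw * ih / area if area else 0.0

def relation(x, y):
    """Coarse VDG-style relation between boxes x and y (illustrative thresholds)."""
    if overlap(y, x) > 0.9:            # Y lies almost entirely inside X
        return "X surrounds Y"
    if overlap(x, y) > 0.5:            # substantial pixel overlap
        return "X on Y"
    xc = ((x[0] + x[2]) / 2, (x[1] + x[3]) / 2)
    yc = ((y[0] + y[2]) / 2, (y[1] + y[3]) / 2)
    # Image coordinates: y grows downward, so a smaller y means higher up.
    angle = math.degrees(math.atan2(xc[1] - yc[1], xc[0] - yc[0]))
    if -135 <= angle <= -45:
        return "X above Y"
    if 45 <= angle <= 135:
        return "X below Y"
    return "X beside Y"

print(relation((0, 0, 10, 10), (0, 20, 10, 30)))  # X above Y
print(relation((0, 0, 10, 10), (2, 2, 8, 8)))     # X surrounds Y
```

A real implementation would also use the distance between objects (e.g., to separate close variants such as above_close from above).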

Visual Tuples. An image is represented as a bag of VDR tuples (Ortiz et al., 2015): person close person; person on_beside d_table; d_table surrounds cake; person near cake; person close d_table; person above_close cake. Mirella Lapata Understanding Visual Scenes 17

Visual Dependency Representations. An image is represented as a dependency tree (Silberer et al., 2017). [Tree over root, person, person, d_table, cake with relations on_beside, close, surrounds.] Mirella Lapata Understanding Visual Scenes 18

Visual Constituency Representations. An image is represented as a constituency tree (Silberer et al., 2017). [Tree over person, person, d_table, cake with NP, SR, R nodes and relations close, on_beside, surrounds.] Mirella Lapata Understanding Visual Scenes 19

Tree Construction. [Example image with objects tv, person, bottle, pizza, table and pairwise spatial distances d_ij.] Build a fully connected graph with all objects as nodes; edge weights correspond to spatial distance; minimum spanning tree (MST): visual dependency representation; use grammar to generate visual constituency representation. Mirella Lapata Understanding Visual Scenes 20

Tree Construction. [Resulting visual dependency tree: root, pizza, d_table, person with relations on_beside, below_close.] Build a fully connected graph with all objects as nodes; edge weights correspond to spatial distance; minimum spanning tree (MST): visual dependency representation; use grammar to generate visual constituency representation. Mirella Lapata Understanding Visual Scenes 20

Tree Construction. [Resulting visual constituency tree over pizza, d_table, person with NP, SR, R nodes and relations on_beside, below_close.] Build a fully connected graph with all objects as nodes; edge weights correspond to spatial distance; minimum spanning tree (MST): visual dependency representation; use grammar to generate visual constituency representation. Mirella Lapata Understanding Visual Scenes 20
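The graph-plus-MST step above can be sketched with Prim's algorithm over object centroids. The function name and the toy coordinates are invented for illustration, and plain centroid distance stands in for the spatial distances used in the talk:

```python
import math

def mst_edges(objects):
    """Prim's algorithm over a fully connected graph of objects.
    objects: dict name -> (x, y) centroid; returns (distance, u, v) edges."""
    names = list(objects)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        best = None
        for u in in_tree:                 # scan the cut between tree and rest
            for v in names:
                if v in in_tree:
                    continue
                d = math.dist(objects[u], objects[v])
                if best is None or d < best[0]:
                    best = (d, u, v)
        edges.append(best)
        in_tree.add(best[2])
    return edges

scene = {"person": (50, 60), "d_table": (55, 90), "pizza": (58, 95)}
for d, u, v in mst_edges(scene):
    print(f"{u} --{d:.1f}-- {v}")
```

The MST edges then become the arcs of the visual dependency tree; the grammar labels each arc with a spatial relation.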

Image Description Generation via Machine Translation. Repurpose existing NLP technology to construct visual representations; use machine translation models: focus on tree-to-string translation. Trees are task-independent and do not take descriptions into account: create a parallel corpus of trees with multiple descriptions. Translation is loose: not all visual objects are verbalized; multiple descriptions can focus on different aspects of a scene: the generation model performs content selection. Mirella Lapata Understanding Visual Scenes 21

Parallel Corpus Creation. Step 1: Grounding objects to linguistic expressions. [Objects: person, d_table, person, cake, plate, cup.] Little kids sitting around a table that has a birthday cake on it. A group of young children standing around a cake. Mirella Lapata Understanding Visual Scenes 22

Parallel Corpus Creation. Step 1: Grounding objects to linguistic expressions. [Objects: person, d_table, person, cake, plate, cup.] [Little kids]_A1 sitting (sit.01) [around a table]_A2 that has (has.01) [a birthday cake]_A2 on it. [A group of young children]_A1 standing (stand.01) [around a cake]_A2. Mirella Lapata Understanding Visual Scenes 23

Parallel Corpus Creation. Step 2: Render scenes as trees and generate the corpus. [The same dependency tree (root, person, person, d_table, cake with relations on_beside, close, surrounds) is paired with each description:] Kids sitting around a table. A table that has a birthday cake. Children standing around a cake. Mirella Lapata Understanding Visual Scenes 25

MT Model: Surface Realization. We train a translation model on our parallel corpus using the MT framework implemented in Moses (Koehn et al., 2007): t̂ = argmax_t P(t|s), with P(t|s) ≈ max_{d ∈ D(s,t)} Σ_{k=1}^{K} λ_k h_k(d), where D(s,t) are derivations in a synchronous grammar; the h_k are feature functions (language model, translation table, word penalty model); the constants λ_k scale the different models and are tuned during training. Mirella Lapata Understanding Visual Scenes 26
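The log-linear scoring can be illustrated in a few lines. The feature functions, weights, and candidate derivations below are invented for illustration, not Moses' actual models:

```python
# Each candidate derivation d is scored by sum_k lambda_k * h_k(d); decoding
# keeps the highest-scoring candidate.
def score(d, features, weights):
    return sum(w * h(d) for h, w in zip(features, weights))

features = [
    lambda d: d["lm_logprob"],             # language model
    lambda d: d["tm_logprob"],             # translation table
    lambda d: -len(d["output"].split()),   # word penalty
]
weights = [1.0, 0.8, 0.2]  # the lambda_k, tuned during training

derivations = [
    {"output": "kids sitting around a table", "lm_logprob": -4.1, "tm_logprob": -3.0},
    {"output": "a table that has a cake", "lm_logprob": -5.0, "tm_logprob": -2.5},
]
best = max(derivations, key=lambda d: score(d, features, weights))
print(best["output"])  # kids sitting around a table
```

In Moses the candidate set is the derivations licensed by the synchronous grammar, and the weights are tuned on held-out data rather than set by hand.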

MT Model: Content Selection. At test time we must decide which objects to talk about: predict whether a detected object is relevant for the scene. We use logistic regression with l2 regularization, trained on positive and negative instances; positives: objects aligned to SRL arguments; negatives: unaligned objects; features: object detection score, relative size, relative distance between two objects, object occurrences, spatial features. Mirella Lapata Understanding Visual Scenes 27
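The content-selection classifier can be sketched with a minimal logistic regression. The feature values and the training helper below are invented for illustration; the talk's feature set is richer and the model would normally come from a library:

```python
import math

def train_logreg(X, y, l2=0.1, lr=0.5, epochs=2000):
    """Batch gradient descent for logistic regression with l2 regularization."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [l2 * wj for wj in w], 0.0  # l2 penalty gradient
        for xi, yi in zip(X, y):
            p = 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            gw = [g + (p - yi) * xj for g, xj in zip(gw, xi)]
            gb += p - yi
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def predict(w, b, x):
    return 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))

# invented features: [detection score, relative size, distance to nearest object, occurrences]
X = [[0.95, 0.30, 0.1, 3],   # object aligned to an SRL argument -> positive
     [0.90, 0.25, 0.2, 2],   # positive
     [0.40, 0.02, 0.8, 1],   # unaligned background object -> negative
     [0.35, 0.01, 0.9, 1]]   # negative
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
print(predict(w, b, [0.92, 0.28, 0.15, 2]) > 0.5)  # True: verbalize this object
```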

Query-by-Example Image Retrieval. Mirella Lapata Understanding Visual Scenes 28

Query-by-Example Image Retrieval. Let I denote an image collection; for every image q produce a ranking of I in order of similarity to q; subtree kernels measure similarity of constituency trees; partial tree kernels measure similarity of dependency trees. [Example constituency and dependency trees over pizza, d_table, person with relations on_beside, below_close.] Mirella Lapata Understanding Visual Scenes 29
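The ranking step can be illustrated with a much-simplified stand-in for the tree kernels: represent each visual dependency tree by its set of labeled edges and score a pair of images by edge overlap. The names and trees below are invented for illustration; real subtree and partial tree kernels compare many more tree fragments:

```python
def kernel(t1, t2):
    """Edge-overlap similarity; t1, t2: lists of (head, relation, dependent) triples."""
    return len(set(t1) & set(t2))

query = [("root", "-", "person"), ("person", "on_beside", "d_table"),
         ("d_table", "surrounds", "cake")]
collection = {
    "img1": [("root", "-", "person"), ("person", "on_beside", "d_table"),
             ("d_table", "surrounds", "pizza")],
    "img2": [("root", "-", "dog"), ("dog", "on", "couch")],
}
# rank the collection by structural similarity to the query image
ranking = sorted(collection, key=lambda name: kernel(query, collection[name]),
                 reverse=True)
print(ranking)  # ['img1', 'img2']
```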

Results: Image Description Generation. [Bar chart, CIDEr (%) on the COCO 2015 test set; systems: Template, Bag-of-Objects, Tuples, Constituency, Dependency, NeuralTalk; scores as plotted: 43.8, 44.1, 47.9, 52, 54.3, 58.8.] Mirella Lapata Understanding Visual Scenes 30

Results: Image Retrieval. [Bar chart, macro-averaged precision at P@1, P@5, and P@10 for Bag-of-Objects, Tuples, Constituency, Dependency, NeuralTalk; scores as plotted: 8.6, 11.7, 19.9, 15.2, 10.7, 10.5, 13.4, 14.2, 15.7, 29.6, 13.3, 13.6, 11.9, 13.9, 42.3.] Mirella Lapata Understanding Visual Scenes 31

Example Output. Couch image: Template 5) a couch has a couch; Tuples 4) the room has a couch; Dependency 1) a dog sitting on a couch; Constituency 2) dog laying on a couch; Human 3) a dog is looking at something. Airplane image: Template 2) an airplane is near a car; Tuples 5) a airplane sitting on a street; Dependency 3) a airplane parked next to a car; Constituency 4) a airplane parked next to a car; Human 1) a large plane with a red tail. Mirella Lapata Understanding Visual Scenes 32

Aligning Actions and Verbs. So far, we have looked at syntactic structure only: how the objects in an image relate to each other. To really understand the content of an image, we need semantics: represent the event depicted, its participants, and the roles they play. We can achieve this using verb senses: well established in linguistics (e.g., WordNet); more general than the action labels used in computer vision; can be aligned with both sentences and images. Mirella Lapata Understanding Visual Scenes 34

Word Sense Disambiguation. Word sense disambiguation is a standard NLP task: (1) A man is playing a guitar. play:1 perform music on musical instrument. (2) The children are playing across the street. play:2 engage in a fun or recreational (childlike) activity. (3) Two men playing doubles tennis on a grass court. play:3 engage in or make moves related to competition or sport. Mirella Lapata Understanding Visual Scenes 35

We can apply this task to an image/verb pair: play. play:1 perform music on musical instrument. New task: visual sense disambiguation (VSD, Gella et al. 2016). Mirella Lapata Understanding Visual Scenes 36

Existing Action Recognition Datasets. Dataset / Verbs / Actions / Sense: PPMI (Yao & Fei-Fei 2010): 2 / 24 / N; Stanford 40 (Yao et al. 2011): 33 / 40 / N; PASCAL 2012 (Everingham et al. 2015): 9 / 11 / N; TUHOI (Le et al. 2014): / 2974 / N. Actions: verb phrases or verb-object pairs; verb senses are more general than actions; no existing datasets with verb sense annotation. Mirella Lapata Understanding Visual Scenes 37

Dataset for Visual Verb Sense Disambiguation. Design a new dataset using images from: MSCOCO: 123k images with object labels and image descriptions; not designed for action recognition; use verbs in descriptions as labels. TUHOI: 10,805 images with object labels, labeled with actions (verb-object pairs); use verbs as labels. Mirella Lapata Understanding Visual Scenes 38

Dataset for Visual Verb Sense Disambiguation. We use the OntoNotes inventory of verb senses (less fine-grained than WordNet). But: not all verb senses are visual. [Example images of visual vs. non-visual senses.] Solution: annotate only the visual senses: annotators decide which senses are visual (about 50% in MSCOCO); new annotators select the correct visual sense for each image. Mirella Lapata Understanding Visual Scenes 39

Annotating Image and Verb with Visual Sense. Mirella Lapata Understanding Visual Scenes 40

VerSe Dataset. Comparison of VerSe with existing action recognition datasets. Dataset / Verbs / Actions / Sense: PPMI (Yao & Fei-Fei 2010): 2 / 24 / N; Stanford 40 (Yao et al. 2011): 33 / 40 / N; PASCAL 2012 (Everingham et al. 2015): 9 / 11 / N; TUHOI (Le et al. 2014): / 2974 / N; VerSe (our dataset): 90 / / Y (163). Mirella Lapata Understanding Visual Scenes 41

VerSe Dataset. The VerSe dataset is divided into motion and non-motion verbs. Verb type / Verbs / Images / Senses / Examples: Motion: 39 / 1812 / 5.79 / run, walk, jump, swing, hit, kick; Non-motion: 51 / 1698 / 4.86 / sleep, sit, lean, read, write, look. Mirella Lapata Understanding Visual Scenes 42

[Model overview: image representations (objects O: person, guitar, microphone; caption C: A man playing guitar.; CNN-fc7 features) and sense representations built from the sense inventory D for play (s1: engage in competition or sport; s2: perform or transmit music; s3: engage in a playful activity) are compared by a scoring function Φ, which selects s2.] Mirella Lapata Understanding Visual Scenes 43

[Diagram: object labels O (person, guitar, microphone) predicted by a VGG CNN and embedded with word2vec, alongside CNN-fc7 image features.] Object labels obtained using VGG (Simonyan & Zisserman 2014). Mirella Lapata Understanding Visual Scenes 43

[Diagram: caption C (A man playing a guitar) generated by a CNN-LSTM model and embedded with word2vec.] Image descriptions from Show and Tell (Vinyals et al. 2015). Mirella Lapata Understanding Visual Scenes 43

Visual Representation for Senses. [Diagram: for play, sense #1 (perform or transmit music) has queries q11 playing guitar, q12 playing music, q13 playing in a band; sense #2 (engage in competition or sport) has queries q21 playing tennis, q22 playing sport, q23 playing game; further senses (#3, ...) are analogous. Images retrieved per query are passed through CNN-fc7 and mean-pooled into sense vectors play#1, play#2, ...] Mirella Lapata Understanding Visual Scenes 44
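The retrieve-encode-pool pipeline for sense vectors can be sketched as follows. Random 4-d vectors stand in for the 4096-d fc7 features, and all names are illustrative:

```python
import random

random.seed(0)

def fc7(image_query):
    """Stand-in for CNN-fc7 features of an image retrieved for a query."""
    return [random.random() for _ in range(4)]

def mean_pool(vectors):
    """Element-wise mean over a list of equal-length vectors."""
    return [sum(v[i] for v in vectors) / len(vectors)
            for i in range(len(vectors[0]))]

sense_queries = {
    "play#1": ["playing guitar", "playing music", "playing in a band"],
    "play#2": ["playing tennis", "playing sport", "playing game"],
}
# one pooled visual vector per sense
sense_vectors = {s: mean_pool([fc7(q) for q in qs])
                 for s, qs in sense_queries.items()}
print({s: [round(x, 2) for x in v] for s, v in sense_vectors.items()})
```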

Scoring Function Representing Visual Structure Task Definition Dataset Construction Use vector similarity (cosine) as scoring function: Representations: textual: O, C embeddings; visual: CNN features; ŝ = arg max Φ(s, i, v, D) s S(v) multi-modal: fused textual and visual features using Canonical Correlation Analysis. Mirella Lapata Understanding Visual Scenes 46
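A minimal sketch of the argmax over senses, using cosine similarity as Φ; the vectors and sense labels here are toy illustrations, not the model's actual features:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(image_vec, sense_vecs):
    """Return s_hat = argmax over s in S(v) of cos(i, s), i.e. the sense
    whose representation is closest to the image representation."""
    return max(sense_vecs, key=lambda s: cosine(image_vec, sense_vecs[s]))

# toy 2-d stand-ins for the real fc7 / embedding vectors
senses = {"play#1": np.array([1.0, 0.0]),   # music sense
          "play#2": np.array([0.0, 1.0])}   # sport sense
best = disambiguate(np.array([0.9, 0.1]), senses)
```

The same function works for textual, visual, or CCA-fused representations, as long as image and sense vectors live in the same space.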

Results

Accuracy scores:

Model         Motion   Non-Motion
First-sense     80.6         70.8
Text            65.1         64.3
Visual          58.3         56.1
Multi-modal     72.6         66.3

Results: Gold Standard Image Descriptions

Accuracy scores:

Model         Motion   Non-Motion
First-sense     80.6         70.8
Text            75.6         75.4
Visual          58.3         56.1
Multi-modal     72.7         72.2

Verb Prediction

ConvNet classifier: fc7 feature map (2048 × 12 × 12) → linear layer → sigmoid for each verb v → multiple-instance learning with noisy-OR pooling, outputting verbs such as play, swing, throw.

- detect the verbs that are present in an image (250 classes);
- use multiple-instance learning (we do not know which bounding boxes correspond to which verbs).
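The noisy-OR pooling at the heart of the MIL setup can be sketched as below: the image-level probability that a verb is depicted is 1 − Π_j (1 − p_j), where p_j is the per-region sigmoid score. The example probabilities are illustrative:

```python
import numpy as np

def noisy_or(instance_probs):
    """Noisy-OR pooling for multiple-instance learning: the verb is present
    in the image if at least one region (bounding box) depicts it, so
    P(verb | image) = 1 - prod_j (1 - p_j) over per-region scores p_j."""
    p = np.asarray(instance_probs, dtype=float)
    return float(1.0 - np.prod(1.0 - p))

# one confident region is enough for a high image-level score
img_score = noisy_or([0.1, 0.9, 0.05])
```

This lets the classifier be trained with image-level verb labels only, since no box-to-verb correspondence is required.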

Examples: Verb Prediction

play, perform | hit, swing, play | hold, sit, use

Verb Prediction and Sense Disambiguation

1 Representing Visual Structure: Visual Dependency Representations; Visual Constituency Representations; Applications
2 Task Definition; Dataset Construction
3

Representing Visual Structure

- Image understanding (like text understanding) requires structured representations; for multimodal tasks, we need to align linguistic and image structure.
- Syntactic example: visual dependency representations align the geometric structure of an image with the syntactic structure of a sentence; applications in image description and image retrieval.
- Semantic example: visual word senses align the event depicted in an image with the event described in a sentence; unsupervised VSD model using multimodal embeddings.

Other Approaches to Image Structure

Other approaches that align linguistic structure and image structure:

Scene (description) graphs (Johnson et al. 2015; Aditya et al. 2015):
- triples of object, attribute, relation;
- aligned with image regions and region descriptions;
- no explicit alignment with linguistic structure (but could be derived).

Visual semantic roles (Yatskar et al. 2016):
- uses semantic frames from FrameNet;
- annotates images with frames, participants, and roles;
- not aligned with regions or image descriptions;
- no verb senses.
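A scene graph of the kind described above can be represented as a set of triples; this is an illustrative sketch in the spirit of Johnson et al. (2015), and all object and relation names here are made up for the example:

```python
# Scene graph as (subject, relation, object) triples, where attributes are
# encoded with the distinguished relation "attribute"; names illustrative.
scene_graph = [
    ("man", "attribute", "standing"),
    ("man", "holding", "guitar"),
    ("microphone", "in front of", "man"),
]

def objects(graph):
    """Object nodes: triple subjects plus non-attribute triple objects."""
    return ({s for s, _, _ in graph}
            | {o for _, r, o in graph if r != "attribute"})
```

In the full formulation the object nodes are additionally grounded in image regions; that alignment is omitted here.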

Scene Graphs

http://cs.stanford.edu/people/jcjohns/cvpr15_supp/


Visual Semantic Roles

http://imsitu.org/demo/

References I

Aditya, S., Yang, Y., Baral, C., Fermuller, C., & Aloimonos, Y. (2015). From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv preprint arXiv:1511.03292.

Elliott, D., & Keller, F. (2013). Image description using visual dependency representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 1292–1302), Seattle, WA.

Everingham, M., Eslami, S. M. A., Gool, L. V., Williams, C. K. I., Winn, J. M., & Zisserman, A. (2015). The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111, 98–136.

Gella, S., Lapata, M., & Keller, F. (2016). Unsupervised visual sense disambiguation for verbs using multimodal embeddings. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 182–192), San Diego, CA.

Johnson, J., Krishna, R., Stark, M., Li, L.-J., Shamma, D. A., Bernstein, M., & Fei-Fei, L. (2015). Image retrieval using scene graphs. In Proceedings of the Conference on Computer Vision and Pattern Recognition (pp. 3668–3678), Boston, MA.

Le, D.-T., Uijlings, J., & Bernardi, R. (2014). TUHOI: Trento Universal Human Object Interaction Dataset. In Proceedings of the Third Workshop on Vision and Language (pp. 17–24). Dublin City University and the Association for Computational Linguistics.

References II

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.

Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015) (pp. 3156–3164), Boston, MA.

Yao, B., & Fei-Fei, L. (2010). Grouplet: A structured image representation for recognizing human and object interactions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 9–16). IEEE.

Yao, B., Jiang, X., Khosla, A., Lin, A. L., Guibas, L., & Fei-Fei, L. (2011). Human action recognition by learning bases of action attributes and parts. In IEEE International Conference on Computer Vision (ICCV) (pp. 1331–1338). IEEE.

Yatskar, M., Zettlemoyer, L., & Farhadi, A. (2016). Situation recognition: Visual semantic role labeling for image understanding. In Computer Vision and Pattern Recognition.