Understanding Visual Scenes: Dependency Graphs, Word Senses, and Multimodal Embeddings (University of Edinburgh NLP)



1 Understanding Visual Scenes: Dependency Graphs, Word Senses, and Multimodal Embeddings
Mirella Lapata, School of Informatics, University of Edinburgh
University of Edinburgh NLP (Natural Language Processing)

2 Joint Work with
Carina Silberer, Spandana Gella, Frank Keller, Jasper Uijlings

5 Structure in Multimodal Processing
Lots of recent work on multimodal processing: image description generation; visual question answering; multimodal machine translation; video summarization.
We need to understand the meaning of images and text: Who does what to whom?
Understanding requires structure, not just an unordered set of labels: linguistic structure; image structure.

6 Structure in Multimodal Processing
A man is playing a trumpet in front of a little boy.

7 Linguistic Structure
[Figure: output of a dependency parser (with PoS labels).]

8 Linguistic Structure
[Figure: output of a semantic role labeler (with word senses).]

10 Structure in Multimodal Processing
Linguistic structure: discrete base units (words), ordered in 1D; span-based labels (e.g., PoS, phrases); tree-based hierarchies; clear distinction between syntax and semantics; canonical representations defined by linguistic theory.
Now let's compare this to image structure.

11 Image Structure
[Figure: output of an image labeler.]
We could also label: attributes, scene type, colors, textures, etc.

12 Image Structure
[Figure: output of an object recognizer.]
Output of a Fast R-CNN model with AlexNet architecture, trained on PASCAL VOC.

13 Image Structure
Hierarchical segmentation (indicates part-whole relationships).
[Figure: hierarchical segmentation example; source: http://www.socher.]

15 Structure in Multimodal Processing
Linguistic structure: discrete base units (words), ordered in 1D; span-based labels (e.g., PoS, phrases); tree-based hierarchies; clear distinction between syntax and semantics; canonical representations defined by linguistic theory.
Image structure: continuous base units (pixels), ordered in 2D; region-based labels (e.g., objects, attributes); part-whole structure; no clear distinction between syntax and semantics; no correct canonical representations.

17 Representational Divergence
Representational divergence: for multimodal processing, we need to fuse linguistic and image structures, but they are very different.
Hypothesis: we need to align visual representations with linguistic ones.
Two examples in this talk: visual dependency representations; visual sense disambiguation.

19 Outline
1 Representing Visual Structure: Visual Dependency Representations; Visual Constituency Representations; Applications
2 Task Definition; Dataset Construction
3

20 Spatial Relations
We need a grammar that defines the relations between the objects in an image: Visual Dependency Grammar (Elliott & Keller 2013).
It assumes eight relations that can hold between pairs of objects, based on three geometric properties: pixel overlap; angle between objects; distance between objects.

21 Spatial Relations
The eight relations: X on Y; X surrounds Y; X beside Y; X opposite Y; X above Y; X below Y; X infront Y; X behind Y.
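The three geometric properties named on the Spatial Relations slide (pixel overlap, angle, distance) can be sketched for a pair of bounding boxes as follows. This is a minimal illustration, not the grammar's actual definition: the `Box` type, the use of centroids, and the function names are assumptions, and the thresholds that map these quantities to the eight relations are not given in the talk, so they are left out.

```python
import math
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def area(self) -> float:
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)

    @property
    def centroid(self) -> Tuple[float, float]:
        return ((self.x_min + self.x_max) / 2.0, (self.y_min + self.y_max) / 2.0)


def overlap_fraction(a: Box, b: Box) -> float:
    """Fraction of box a's area that is also covered by box b."""
    w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if w <= 0 or h <= 0 or a.area == 0:
        return 0.0
    return (w * h) / a.area


def centroid_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the two box centroids."""
    (ax, ay), (bx, by) = a.centroid, b.centroid
    return math.hypot(bx - ax, by - ay)


def centroid_angle(a: Box, b: Box) -> float:
    """Angle in degrees, in [0, 360), of the vector from a's centroid to b's.

    Image y coordinates grow downwards, so the y difference is flipped to
    make 90 degrees mean "b is above a".
    """
    (ax, ay), (bx, by) = a.centroid, b.centroid
    return math.degrees(math.atan2(ay - by, bx - ax)) % 360.0
```

A relation classifier would then threshold these three values; for example, a large overlap fraction suggests "on" or "surrounds", while a small distance with no overlap suggests "beside".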

22 Visual Tuples
An image is represented as a bag of VDR tuples (Ortiz et al., 2015):
person close person
person on_beside d_table
d_table surrounds cake
person near cake
person close d_table
person above_close cake

23 Visual Dependency Representations
An image is represented as a dependency tree (Silberer et al., 2017).
[Figure: dependency tree over person, person, d_table, and cake, with arcs labelled root, on_beside, close, and surrounds.]

24 Visual Constituency Representations
An image is represented as a constituency tree (Silberer et al., 2017).
[Figure: constituency tree with nonterminals NP, SR, and R over close, on_beside, surrounds, person, person, d_table, and cake.]

27 Tree Construction
Build a fully connected graph with all objects as nodes; edge weights correspond to spatial distance; minimum spanning tree (MST): visual dependency representation; use grammar to generate visual constituency representation.
[Figure: fully connected graph over the objects tv, person, bottle, pizza, and table, with edges weighted by distances d.]
[Figure: resulting dependency tree over pizza, d_table, and person, with arcs labelled root, on_beside, and below_close.]
[Figure: resulting constituency tree with nonterminals NP, SR, and R over on_beside, below_close, pizza, d_table, and person.]
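The graph-then-MST step can be sketched as below. This is only an illustration under stated assumptions: objects are reduced to centroid points, "spatial distance" is taken to be Euclidean distance between centroids, and Prim's algorithm is used because the talk does not say which MST algorithm was applied. The function name and the coordinates are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Edge = Tuple[str, str, float]


def minimum_spanning_tree(centroids: Dict[str, Point]) -> List[Edge]:
    """Prim's algorithm on the fully connected graph over all objects.

    Every pair of objects is implicitly connected by an edge whose weight is
    the Euclidean distance between their centroids.
    """
    names = list(centroids)
    if not names:
        return []
    in_tree = [names[0]]          # insertion-ordered for determinism
    edges: List[Edge] = []
    while len(in_tree) < len(names):
        best = None
        for u in in_tree:
            for v in names:
                if v in in_tree:
                    continue
                d = math.dist(centroids[u], centroids[v])
                if best is None or d < best[2]:
                    best = (u, v, d)
        edges.append(best)
        in_tree.append(best[1])
    return edges


# Hypothetical centroids for three detected objects.
objects = {"pizza": (0.0, 0.0), "d_table": (0.0, 1.0), "person": (0.0, 5.0)}
tree = minimum_spanning_tree(objects)
```

Objects with the same label (for example two persons) need distinct keys such as `person_1` and `person_2`. Each tree edge would then be labelled with a spatial relation to obtain the dependency representation.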

32 Image Description Generation via Machine Translation
Repurpose existing NLP technology to construct visual representations; use machine translation models: focus on tree-to-string translation.
Trees are task-independent and do not take descriptions into account: create a parallel corpus of trees with multiple descriptions.
Translation is loose: not all visual objects are verbalized.
Multiple descriptions can focus on different aspects of a scene: the generation model performs content selection.

35 Parallel Corpus Creation
Step 1: Grounding objects to linguistic expressions.
Detected objects: person, d_table, person, cake, plate, cup.
Descriptions:
Little kids sitting around a table that has a birthday cake on it.
A group of young children standing around a cake.
Descriptions with semantic role labels:
[Little kids]_A1 sitting_sit.01 [around a table]_A2 that has_has.01 [a birthday cake]_A2 on it.
[A group of young children]_A1 standing_stand.01 [around a cake]_A2.

36 Parallel Corpus Creation
Step 2: Render scenes as trees and generate the corpus.
A tree over person, person, d_table, and cake (arcs root, on_beside, close, surrounds) is paired with each of the following descriptions:
Kids sitting around a table.
A table that has a birthday cake.
Children standing around a cake.

37 MT Model: Surface Realization
We train a translation model on our parallel corpus using the MT framework implemented in Moses (Koehn et al., 2007):
$\hat{t} = \arg\max_{t} P(t \mid s)$
$P(t \mid s) = \arg\max_{d} \left( \sum_{k=1}^{K} \lambda_k h_k(d) \right)$
$d \in D(s, t)$ are derivations in a synchronous grammar; $h_k$ are feature functions (language model, translation table, word penalty model); the constants $\lambda_k$ scale the different models and are tuned during training.
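The weighted sum of feature functions can be illustrated with a small sketch that picks the best of a few candidate derivations. This is not the Moses API: the candidate list, the feature values, and the weights are invented for illustration, and a real decoder searches the derivation space rather than enumerating it.

```python
from typing import Dict, List

# Hypothetical weights lambda_k for three feature functions h_k.
WEIGHTS: Dict[str, float] = {
    "language_model": 0.5,
    "translation_table": 0.3,
    "word_penalty": -0.2,
}


def derivation_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum over k of lambda_k * h_k(d) for one derivation d."""
    return sum(weights[name] * value for name, value in features.items())


def best_derivation(candidates: List[dict], weights: Dict[str, float]) -> dict:
    """Return the candidate derivation with the highest weighted score."""
    return max(candidates, key=lambda c: derivation_score(c["features"], weights))


# Two made-up derivations; the log-probabilities and word counts are illustrative.
candidates = [
    {"target": "a dog sitting on a couch",
     "features": {"language_model": -4.0, "translation_table": -2.0, "word_penalty": 6.0}},
    {"target": "a couch has a couch",
     "features": {"language_model": -9.0, "translation_table": -1.5, "word_penalty": 5.0}},
]
best = best_derivation(candidates, WEIGHTS)
```

Changing the weights changes which derivation wins, which is why they are tuned on held-out data.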

38 MT Model: Content Selection
At test time we must decide which objects to talk about: predict whether a detected object is relevant for the scene.
We use logistic regression with L2 regularization, trained on positive and negative instances; positives: objects aligned to SRL arguments; negatives: unaligned objects.
Features: object detection score, relative size, relative distance between two objects, object occurrences, spatial features.
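A content selection classifier of this kind can be sketched with a plain L2-regularized logistic regression trained by batch gradient descent. Everything concrete here is an assumption made for illustration: only two of the listed features are used (detection score and relative size), the toy data is invented, and the hyperparameters are arbitrary.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probability that the object described by x should be mentioned."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg(X: List[List[float]], y: List[int], l2: float = 0.01,
                 lr: float = 1.0, epochs: int = 3000) -> Tuple[List[float], float]:
    """Minimise mean log loss + (l2 / 2) * ||w||^2; the bias is not regularised."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [l2 * wj for wj in w]      # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy data: [detection score, relative size]; 1 = aligned to an SRL argument.
X = [[0.9, 0.4], [0.8, 0.3], [0.95, 0.5], [0.2, 0.05], [0.3, 0.02], [0.1, 0.1]]
y = [1, 1, 1, 0, 0, 0]
weights, bias = train_logreg(X, y)
```

At test time, objects whose predicted probability exceeds a chosen threshold would be kept in the tree that is passed to the translation model.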

40 Query-by-Example Image Retrieval
Let I denote an image collection; for every query image q, produce a ranking of the images in I in order of similarity to q.
Subtree kernels measure the similarity of constituency trees; partial tree kernels measure the similarity of dependency trees.
[Figure: example constituency and dependency tree fragments over pizza, d_table, and person, with relations on_beside and below_close.]
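The ranking idea can be sketched with a simple subtree kernel that counts the complete subtrees two trees share. This is a simplified stand-in, not the kernel implementation used in the talk: trees are encoded as nested tuples, leaves count as subtrees, the kernel is normalised to [0, 1], and all names and example trees are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterator, List, Tuple, Union

Tree = Union[str, Tuple]  # a leaf label, or (label, child, child, ...)


def subtrees(tree: Tree) -> Iterator[Tree]:
    """Yield every complete subtree: each node together with all its descendants."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from subtrees(child)


def subtree_kernel(t1: Tree, t2: Tree) -> int:
    """Number of pairs of identical complete subtrees in t1 and t2."""
    c1, c2 = Counter(subtrees(t1)), Counter(subtrees(t2))
    return sum(c1[s] * c2[s] for s in c1.keys() & c2.keys())


def normalized_kernel(t1: Tree, t2: Tree) -> float:
    """Kernel value scaled so that a tree has similarity 1.0 with itself."""
    return subtree_kernel(t1, t2) / math.sqrt(
        subtree_kernel(t1, t1) * subtree_kernel(t2, t2))


def rank(query: Tree, collection: Dict[str, Tree]) -> List[str]:
    """Image ids sorted by decreasing similarity to the query tree."""
    return sorted(collection,
                  key=lambda name: normalized_kernel(query, collection[name]),
                  reverse=True)


query = ("NP", ("NP", "pizza"), ("SR", ("R", "on_beside"), ("NP", "d_table")))
collection = {
    "img_b": ("NP", ("NP", "person"), ("SR", ("R", "below_close"), ("NP", "d_table"))),
    "img_a": ("NP", ("NP", "pizza"), ("SR", ("R", "on_beside"), ("NP", "d_table"))),
}
ranking = rank(query, collection)
```

Partial tree kernels additionally match fragments with some children missing, which suits dependency trees; that variant is not shown here.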

41 Results: Image Description Generation
[Figure: bar chart of CIDEr (%) on the COCO 2015 test set for Template, Bag-of-Objects, Tuples, Constituency, Dependency, and NeuralTalk.]

42 Results: Image Retrieval
[Figure: macro-averaged precision at P@1, P@5, and P@10 for Bag-of-Objects, Tuples, Constituency, Dependency, and NeuralTalk.]
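For readers unfamiliar with the metric, macro-averaged precision at rank k can be computed as in the sketch below. How relevance was defined in the retrieval experiments is not stated in the slides, so the relevant sets here are placeholders and the function names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k ranked items that are relevant (denominator is k)."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def macro_precision_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average P@k over queries, giving every query the same weight."""
    return sum(precision_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Two made-up queries: (ranked result ids, set of relevant ids).
runs = [
    (["a", "b", "c", "d", "e"], {"a", "c"}),
    (["x", "y", "z", "u", "v"], {"y"}),
]
p_at_1 = macro_precision_at_k(runs, 1)
p_at_5 = macro_precision_at_k(runs, 5)
```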

43 Example Output
System        Image 1                            Image 2
Template      5) a couch has a couch             2) an airplane is near a car
Tuples        4) the room has a couch            5) a airplane sitting on a street
Dependency    1) a dog sitting on a couch        3) a airplane parked next to a car
Constituency  2) dog laying on a couch           4) a airplane parked next to a car
Human         3) a dog is looking at something   1) a large plane with a red tail


45 Aligning Actions and Verbs
So far, we have looked at syntactic structure only: how do the objects in an image relate to each other?
To really understand the content of an image, we need semantics: represent the event depicted, its participants, and the roles they play.
We can achieve this using verb senses: well established in linguistics (e.g., WordNet); more general than the action labels used in computer vision; can be aligned with both sentences and images.

49 Word Sense Disambiguation
Word sense disambiguation is a standard NLP task:
(1) A man is playing a guitar. play:1 perform music on musical instrument
(2) The children are playing across the street. play:2 engage in a fun or recreational (childlike) activity
(3) Two men playing doubles tennis on a grass court. play:3 engage in or make moves related to competition or sport

51 We can apply this task to an image/verb pair:
[Image] play: play:1 perform music on musical instrument
New task: visual sense disambiguation (VSD, Gella et al. 2016).

53 Existing Action Recognition Datasets
Dataset                                 Verbs  Actions  Sense
PPMI (Yao & Fei-Fei 2010)               2      24       N
Stanford 40 (Yao et al. 2011)                  40       N
PASCAL 2012 (Everingham et al. 2015)    9      11       N
TUHOI (Le et al. 2014)                         2974     N
Actions: verb phrases or verb-object pairs; verb senses are more general than actions; no existing datasets with verb sense annotation.

54 Dataset for Visual Verb Sense Disambiguation
Design a new dataset using images from:
MSCOCO: 123k images with object labels and image descriptions; not designed for action recognition; use verbs in descriptions as labels.
TUHOI: 10,805 images with object labels; labeled with actions (verb-object pairs); use verbs as labels.

57 Dataset for Visual Verb Sense Disambiguation
We use the OntoNotes inventory of verb senses (less fine-grained than WordNet). But: not all verb senses are visual.
[Figure: examples of visual and non-visual senses.]
Solution: annotate only the visual senses: annotators decide which senses are visual (about 50% in MSCOCO); new annotators select the correct visual sense for each image.

59 Annotating Image and Verb with Visual Sense

60 VerSe Dataset
Comparison of VerSe with existing action recognition datasets:
Dataset                                 Verbs  Actions  Sense
PPMI (Yao & Fei-Fei 2010)               2      24       N
Stanford 40 (Yao et al. 2011)                  40       N
PASCAL 2012 (Everingham et al. 2015)    9      11       N
TUHOI (Le et al. 2014)                         2974     N
VerSe (our dataset)                     90              Y (163)

61 VerSe Dataset
VerSe dataset divided into motion and non-motion verbs:
Verb type   Verbs  Images  Senses  Examples
Motion                             run, walk, jump, swing, hit, kick
Non-motion                         sleep, sit, lean, read, write, look

68 [Figure: model overview. An image and the verb play are mapped to image representations: objects (O: person, guitar, microphone), captions (C: A man playing guitar.), and CNN-fc7 features. The sense inventory D lists s1 engage in competition or sport, s2 perform or transmit music, and s3 engage in a playful activity, each with a sense representation. The scoring function Φ compares image and sense representations and selects s2.]
Object labels are obtained using VGG (Simonyan & Zisserman 2014) and embedded with word2vec.
Image descriptions come from Show and Tell (Vinyals et al. 2015), a VGG CNN followed by an LSTM, and are embedded with word2vec.

72 Visual Representation for Senses
Each sense of the verb play is associated with a set of queries:
play #1 perform or transmit music: playing guitar, playing music, playing in a band (q11 to q13)
play #2 engage in competition or sport: playing tennis, playing sport, playing game (q21 to q23)
play #3 ...
CNN-fc7 features are extracted for each query and mean-pooled into one visual representation per sense (play #1, play #2, ...).
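The mean pooling step on this slide amounts to averaging the feature vectors that belong to one sense. The sketch below uses three-dimensional toy vectors in place of real CNN-fc7 features; the dictionary layout and the numbers are assumptions for illustration only.

```python
from typing import Dict, List


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector to pool")
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


# Toy stand-ins for fc7 features, grouped by the sense whose queries produced them.
features_by_sense: Dict[str, List[List[float]]] = {
    "play#1": [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]],   # e.g. playing guitar, playing music
    "play#2": [[0.0, 4.0, 0.0], [0.0, 2.0, 2.0]],   # e.g. playing tennis, playing sport
}
sense_vectors = {sense: mean_pool(feats) for sense, feats in features_by_sense.items()}
```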


77 Scoring Function
Use vector similarity (cosine) as the scoring function:
$\hat{s} = \arg\max_{s \in S(v)} \Phi(s, i, v, D)$
Representations: textual: O, C embeddings; visual: CNN features; multi-modal: textual and visual features fused using Canonical Correlation Analysis.
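With cosine similarity as the scoring function, disambiguation reduces to picking the sense whose vector is closest in angle to the image vector. The sketch below assumes both are already embedded in the same space; the three-dimensional vectors are invented, and the sense glosses are taken from the model overview figure.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(image_vec: Sequence[float], sense_vecs: Dict[str, Sequence[float]]) -> str:
    """Return the sense s in S(v) that maximises the cosine score with the image."""
    return max(sense_vecs, key=lambda s: cosine(image_vec, sense_vecs[s]))


# Toy vectors for the three senses of "play" and one image.
senses_of_play = {
    "s1 engage in competition or sport": [1.0, 0.0, 0.0],
    "s2 perform or transmit music": [0.0, 1.0, 0.0],
    "s3 engage in a playful activity": [0.0, 0.0, 1.0],
}
image = [0.1, 0.9, 0.2]
predicted = disambiguate(image, senses_of_play)
```

The same function applies unchanged to textual, visual, or fused multi-modal vectors, as long as image and sense representations have the same dimensionality.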

81 Results
[Figure: bar chart of accuracy scores for motion and non-motion verbs, comparing First-sense, Text, Visual, and Multi-modal; the only surviving value is 80.6.]

82 Results: Gold Standard Image Descriptions
[Figure: bar chart of accuracy scores for motion and non-motion verbs, comparing First-sense, Text, Visual, and Multi-modal.]

83 Verb Prediction
[Figure: ConvNet classifier pipeline: fc7 output (2048, 12, 12), then a linear layer, a sigmoid for each verb v, and MIL Noisy-OR, producing verbs such as play, swing, throw.]
Detect the verbs that are present in an image (250 classes); use multiple instance learning (we do not know which bounding boxes correspond to which verbs).
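The Noisy-OR step combines per-region verb probabilities into one image-level probability: a verb counts as present unless every region fails to show it. The sketch below shows that standard combination only; the region probabilities, the 0.5 threshold, and the function names are assumptions, and the ConvNet that would produce the per-region sigmoid outputs is not modelled.

```python
from typing import Dict, List, Tuple


def noisy_or(region_probs: List[float]) -> float:
    """P(verb in image) = 1 - product over regions of (1 - P(verb in region))."""
    p_absent = 1.0
    for p in region_probs:
        p_absent *= 1.0 - p
    return 1.0 - p_absent


def image_level_verbs(region_scores: Dict[str, List[float]],
                      threshold: float = 0.5) -> Tuple[Dict[str, float], List[str]]:
    """Image-level probability per verb, plus the verbs at or above the threshold."""
    probs = {verb: noisy_or(ps) for verb, ps in region_scores.items()}
    detected = sorted(verb for verb, p in probs.items() if p >= threshold)
    return probs, detected


# Made-up sigmoid outputs for three regions of one image.
region_scores = {
    "play": [0.1, 0.7, 0.2],
    "swing": [0.05, 0.1, 0.05],
}
probs, detected = image_level_verbs(region_scores)
```

Because only the image-level probability is supervised, no labels saying which region shows which verb are needed, which matches the multiple instance learning setting described on the slide.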

84 Examples: Verb Prediction
[Figure: three example images with predicted verbs.]
play, perform
hit, swing, play
hold, sit, use

85 Verb Prediction and Sense Disambiguation


87 Image understanding (like text understanding) requires structured representations; for multimodal tasks, we need to align linguistic and image structure.
Syntactic example: visual dependency representations align the geometric structure of an image with the syntactic structure of a sentence; application in image description and image retrieval.
Semantic example: visual word senses align the event depicted in an image with the event described in a sentence; unsupervised VSD model using multimodal embeddings.

88 Other Approaches to Image Structure
Other approaches that align linguistic structure and image structure:
Scene (description) graphs (Johnson et al. 2015; Aditya et al. 2015): triples of object, attribute, relation; aligned with image regions and region descriptions; no explicit alignment with linguistic structure (but could be derived).
Visual semantic roles (Yatskar et al. 2016): uses semantic frames from FrameNet; annotates images with frames, participants, and roles; not aligned with regions or image descriptions; no verb senses.

90 Scene Graphs
[Figure: scene graph example; source: http://cs.stanford.]

91 Visual Semantic Roles

93 References
Aditya, S., Yang, Y., Baral, C., Fermuller, C., & Aloimonos, Y. (2015). From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv preprint.
Elliott, D., & Keller, F. (2013). Image description using visual dependency representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Seattle, WA.
Everingham, M., Eslami, S. M. A., Gool, L. V., Williams, C. K. I., Winn, J. M., & Zisserman, A. (2015). The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111.
Gella, S., Lapata, M., & Keller, F. (2016). Unsupervised visual sense disambiguation for verbs using multimodal embedding. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, San Diego, CA.
Johnson, J., Krishna, R., Stark, M., Li, L.-J., Shamma, D. A., Bernstein, M., & Fei-Fei, L. (2015). Image retrieval using scene graphs. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Boston, MA.
Le, D.-T., Uijlings, J., & Bernardi, R. (2014). TUHOI: Trento Universal Human Object Interaction Dataset. In Proceedings of the Third Workshop on Vision and Language. Dublin City University and the Association for Computational Linguistics.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR.
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015.
Yao, B., & Fei-Fei, L. (2010). Grouplet: A structured image representation for recognizing human and object interactions. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 9-16). IEEE.
Yao, B., Jiang, X., Khosla, A., Lin, A. L., Guibas, L., & Fei-Fei, L. (2011). Human action recognition by learning bases of action attributes and parts. In Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE.
Yatskar, M., Zettlemoyer, L., & Farhadi, A. (2016). Situation recognition: Visual semantic role labeling for image understanding. In Computer Vision and Pattern Recognition.
