Using Images to Ground Machine Translation


Iacer Calixto
December 7, 2017
ADAPT Centre, School of Computing, Dublin City University, Dublin, Ireland
iacer.calixto@adaptcentre.ie

Outline
- Introduction
- NMT and IDG Architectures
- Multi-modal MT Shared Task(s)
- Our MMT Models
- Experiments

Introduction

Introduction [1/2]

Machine Translation (MT): the task in which we wish to learn a model to translate text from one natural language (e.g., English) into another (e.g., German).
- text-only task;
- model is trained on parallel source/target sentence pairs.

Image description generation (IDG): the task in which we wish to learn a model to describe an image using natural language (e.g., German).
- multi-modal task (text and vision);
- model is trained on image/target sentence pairs.


Introduction [2/2]

Multi-Modal Machine Translation (MMT): learn a model to translate text and an image that illustrates this text from one natural language (e.g., English) into another (e.g., German).
- multi-modal task (text and vision);
- model is trained on source/image/target triplets;
- can be seen as a form of augmented MT or augmented image description generation.


Use Cases

Multi-Modal Machine Translation (MMT) use cases:
- localisation of product information in e-commerce, e.g. eBay, Amazon;
- localisation of user posts and photos in social networks, e.g. Facebook, Instagram, Twitter;
- translation of image descriptions in general;
- translation of subtitles (video), etc.


Convolutional Neural Networks (CNN)

Virtually all MMT and IDG models use pre-trained CNNs for image feature extraction.
Figure 1: Illustration of the VGG19 network (Simonyan and Zisserman, 2014): https://goo.gl/y0so1l
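As a concrete illustration (not from the slides), the sketch below shows one common way to obtain both kinds of image features from a pre-trained CNN with torchvision; the choice of ResNet-50, the file path, and the layer slicing are assumptions made for the example.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained CNN used purely as a feature extractor (ResNet-50 here; VGG19 works analogously).
cnn = models.resnet50(pretrained=True).eval()

# Global ("bottleneck") features: drop the classifier, keep the 2048D pooled vector.
global_extractor = torch.nn.Sequential(*list(cnn.children())[:-1])
# Location-preserving features: also drop the global pooling to keep a 7x7 spatial grid.
spatial_extractor = torch.nn.Sequential(*list(cnn.children())[:-2])

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg")).unsqueeze(0)  # illustrative path
with torch.no_grad():
    pool5 = global_extractor(img).flatten(1)  # (1, 2048) global feature vector
    grid = spatial_extractor(img)             # (1, 2048, 7, 7) spatial feature map
```

The global vector corresponds to what later slides call "bottleneck" or pool5 features; the spatial grid is the kind of location-preserving representation that attention-based models attend over.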

Example CNNs

(a) https://goo.gl/jqqevg
(b) Illustration of a residual connection (He et al., 2015).

NMT and IDG Architectures

Neural Machine Translation

The attention mechanism lets the decoder search for the best source words to generate each target word, e.g. Bahdanau et al. (2015).
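For reference (not part of the original slides), the standard additive attention of Bahdanau et al. (2015) can be written as follows, where s_{t-1} is the previous decoder state, h_j the j-th source annotation, and W_a, U_a, v_a learned parameters:

```latex
e_{tj} = v_a^{\top} \tanh(W_a s_{t-1} + U_a h_j)          % alignment score for target step t, source word j
\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_k \exp(e_{tk})}    % attention weights (softmax over source positions)
c_t = \sum_j \alpha_{tj} h_j                              % source context vector fed to the decoder at step t
```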

Neural Image Description Generation

The attention mechanism lets the decoder look at, or attend to, specific parts of the image when generating each target word, e.g. Xu et al. (2015).
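A minimal sketch of the same idea applied to a grid of convolutional image features (the module name, dimensions, and additive scoring function are illustrative assumptions, not taken from the slides):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAttention(nn.Module):
    """Additive attention over location-preserving image features (a sketch)."""
    def __init__(self, feat_dim=512, state_dim=1000, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim, bias=False)   # projects each image region
        self.w_state = nn.Linear(state_dim, attn_dim, bias=False) # projects the decoder state
        self.v = nn.Linear(attn_dim, 1, bias=False)               # scores each region

    def forward(self, regions, dec_state):
        # regions: (batch, num_regions, feat_dim), e.g. a 14x14 conv grid flattened to 196 regions
        # dec_state: (batch, state_dim), the decoder hidden state at the current time step
        scores = self.v(torch.tanh(self.w_feat(regions) + self.w_state(dec_state).unsqueeze(1)))
        alphas = F.softmax(scores, dim=1)            # attention weights over image regions
        context = (alphas * regions).sum(dim=1)      # visual context vector for this time step
        return context, alphas.squeeze(-1)
```

Each decoder step calls this with its current hidden state to obtain a visual context vector, directly analogous to the source-word context in text-only NMT.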

Multi-modal MT Shared Task(s)

Multimodal MT Shared Tasks: overall ideas

Three types of submissions:
- Two attention mechanisms: compute context vectors over both the source-language hidden states and location-preserving image features;
- Encoder and/or decoder initialisation: initialise the encoder and/or decoder RNNs with bottleneck image features;
- Other alternatives:
  - element-wise multiplication of the target-language embeddings with bottleneck image features;
  - sum of source-language word embeddings with bottleneck image features;
  - use of visual features in a retrieval framework;
  - visually grounding encoder representations by learning to predict bottleneck image features from the source-language hidden states (a sketch of this last idea follows below).

http://mtm2017.unbabel.com/assets/images/slides/lucia_specia.pdf

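As referenced in the list above, here is a minimal sketch of the visual-grounding idea, assuming a mean-pooled encoder summary and a plain MSE auxiliary loss; the actual shared-task systems (e.g., Elliott and Kádár, 2017) use their own architectures and objectives.

```python
import torch
import torch.nn as nn

class VisualGrounding(nn.Module):
    """Predict the bottleneck image feature from source-language encoder states (a sketch)."""
    def __init__(self, enc_dim=1000, img_dim=2048):
        super().__init__()
        self.proj = nn.Linear(enc_dim, img_dim)

    def forward(self, enc_states, src_mask):
        # enc_states: (batch, src_len, enc_dim); src_mask: (batch, src_len), 1 for real tokens
        lengths = src_mask.sum(dim=1, keepdim=True)                       # (batch, 1)
        mean_state = (enc_states * src_mask.unsqueeze(-1)).sum(dim=1) / lengths
        return self.proj(mean_state)                                      # (batch, img_dim)

# Auxiliary grounding loss, shown here as plain MSE for brevity.
def grounding_loss(predicted_feats, image_feats):
    return nn.functional.mse_loss(predicted_feats, image_feats)
```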

Heidelberg University (Hitschler et al., 2016)

CMU (Huang et al., 2016) [1/3]

CMU (Huang et al., 2016) [2/3]

CMU (Huang et al., 2016) [3/3]

UvA-TiCC (Elliott and Kádár, 2017)

LIUM-CVC (Caglayan et al., 2017)

Global visual features, i.e. 2048D pool5 features from ResNet-50, are multiplicatively interacted with the target word embeddings.
With 128D embeddings and 256D recurrent layers, their resulting models have 5M parameters. (Elliott et al., 2017)

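A minimal sketch of that multiplicative interaction, assuming a learned projection of the pool5 vector into the embedding space (class and parameter names are illustrative; see Caglayan et al., 2017 for the exact model):

```python
import torch
import torch.nn as nn

class VisuallyGatedEmbedding(nn.Module):
    """Element-wise interaction of target embeddings with a global visual feature (a sketch)."""
    def __init__(self, vocab_size, emb_dim=128, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.img_proj = nn.Linear(img_dim, emb_dim)  # project pool5 features into embedding space

    def forward(self, target_tokens, pool5):
        # target_tokens: (batch, tgt_len); pool5: (batch, img_dim), e.g. ResNet-50 pool5 features
        emb = self.embed(target_tokens)                      # (batch, tgt_len, emb_dim)
        img = torch.tanh(self.img_proj(pool5)).unsqueeze(1)  # (batch, 1, emb_dim)
        return emb * img                                     # multiplicative interaction
```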

Our MMT Models

Doubly-Attentive Multi-Modal NMT (NMT SRC+IMG)

Figure 3: Doubly-attentive multi-modal NMT with image gating (Calixto et al., 2017a).
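A minimal sketch of the core idea: at every decoder step, one attention module produces a source-language context and a second one a visual context, and a scalar gate computed from the previous decoder state modulates how much of the image context enters the decoder. The gate form and layer names are illustrative, and the target-word embedding input of the conditional GRU is omitted for brevity; see Calixto et al. (2017a) for the exact formulation.

```python
import torch
import torch.nn as nn

class DoublyAttentiveStep(nn.Module):
    """One decoder step with source and image attention plus an image gate (a sketch)."""
    def __init__(self, src_ctx_dim, img_ctx_dim, state_dim):
        super().__init__()
        self.gate = nn.Linear(state_dim, 1)  # scalar gate computed from the previous decoder state
        self.cell = nn.GRUCell(src_ctx_dim + img_ctx_dim, state_dim)

    def forward(self, prev_state, src_context, img_context):
        # prev_state: (batch, state_dim); src_context / img_context come from two attention modules
        beta = torch.sigmoid(self.gate(prev_state))  # how much visual context to let in
        gated_img = beta * img_context
        return self.cell(torch.cat([src_context, gated_img], dim=-1), prev_state)
```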

Image as source-language words (IMG W)

Global visual features are projected into the source-language word-embedding space and used as the first/last word in the source sequence (Calixto et al., 2017b).
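A minimal sketch of this projection, assuming the projected image vector is simply prepended and appended to the sequence of source word embeddings (names and default dimensions are illustrative):

```python
import torch
import torch.nn as nn

class ImageAsSourceWord(nn.Module):
    """Project global visual features into the word-embedding space and use them as extra tokens (a sketch)."""
    def __init__(self, img_dim=2048, emb_dim=512):
        super().__init__()
        self.to_emb = nn.Linear(img_dim, emb_dim)

    def forward(self, src_embeddings, pool5):
        # src_embeddings: (batch, src_len, emb_dim); pool5: (batch, img_dim) bottleneck feature
        img_token = self.to_emb(pool5).unsqueeze(1)  # (batch, 1, emb_dim)
        # Use the "image word" as the first and last token of the source sequence.
        return torch.cat([img_token, src_embeddings, img_token], dim=1)
```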

Image for encoder initialisation (IMG E)

Global visual features are projected into the source-language RNN hidden-state space and used to compute the initial state of the source-language RNN (Calixto et al., 2017b).
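A minimal sketch of the encoder-initialisation variant, assuming a single projection followed by a tanh non-linearity (for a bidirectional encoder one such state would be computed per direction):

```python
import torch
import torch.nn as nn

class ImageEncoderInit(nn.Module):
    """Compute the source-language RNN's initial state from the global visual feature (a sketch)."""
    def __init__(self, img_dim=2048, hidden_dim=500):
        super().__init__()
        self.to_hidden = nn.Linear(img_dim, hidden_dim)

    def forward(self, pool5):
        # pool5: (batch, img_dim) -> (batch, hidden_dim), used as h_0 of the encoder RNN
        return torch.tanh(self.to_hidden(pool5))
```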

Image for decoder initialisation (IMG D)

Global visual features are projected into the target-language RNN hidden-state space and used as additional data to compute the initial state of the target-language RNN (Calixto et al., 2017b).
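A minimal sketch of the decoder-initialisation variant, assuming the initial decoder state combines a mean-pooled encoder summary with the projected image vector (the exact combination in the paper may differ):

```python
import torch
import torch.nn as nn

class ImageDecoderInit(nn.Module):
    """Compute the decoder's initial state from encoder states plus the visual feature (a sketch)."""
    def __init__(self, enc_dim=1000, img_dim=2048, dec_dim=500):
        super().__init__()
        self.from_enc = nn.Linear(enc_dim, dec_dim)
        self.from_img = nn.Linear(img_dim, dec_dim)

    def forward(self, enc_states, pool5):
        # enc_states: (batch, src_len, enc_dim); pool5: (batch, img_dim)
        mean_enc = enc_states.mean(dim=1)  # summary of the source sentence
        return torch.tanh(self.from_enc(mean_enc) + self.from_img(pool5))
```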

Experiments

English→German [1/2]

Training data: Multi30k data set (Elliott et al., 2016).

Model               Training data  BLEU4       METEOR      TER         chrF3
NMT                 M30k T         33.7        52.3        46.7        65.2
PBSMT               M30k T         32.9        54.3        45.1        67.4
Huang et al., 2016  M30k T         35.1 (1.4)  52.2 (2.1)  --          --
  + RCNN            M30k T         36.5 (2.8)  54.1 (0.2)  --          --
NMT SRC+IMG         M30k T         36.5 (2.8)  55.0 (0.9)  43.7 (1.4)  67.3 (0.1)
IMG W               M30k T         36.9 (3.2)  54.3 (0.2)  41.9 (3.2)  66.8 (0.6)
IMG E               M30k T         37.1 (3.4)  55.0 (0.9)  43.1 (2.0)  67.6 (0.2)
IMG D               M30k T         37.3 (3.6)  55.1 (1.0)  42.8 (2.3)  67.7 (0.3)

English→German [2/2]

Pre-training on back-translated comparable Multi30k data set (Elliott et al., 2016).

Model         Training data  BLEU4       METEOR      TER         chrF3
PBSMT (LM)    M30k T         34.0 (0.0)  55.0 (0.0)  44.7 (0.0)  68.0
NMT           M30k T         35.5 (0.0)  53.4 (0.0)  43.3 (0.0)  65.2
NMT SRC+IMG   M30k T         37.1 (1.6)  54.5 (0.5)  42.8 (0.5)  66.6 (1.4)
IMG W         M30k T         36.7 (1.2)  54.6 (0.4)  42.0 (1.3)  66.8 (1.2)
IMG E         M30k T         38.5 (3.0)  55.7 (0.9)  41.4 (1.9)  68.3 (0.3)
IMG D         M30k T         38.5 (3.0)  55.9 (1.1)  41.6 (1.7)  68.4 (0.4)

German→English [1/2]

Training data: Multi30k data set (Elliott et al., 2016).

Model         BLEU4       METEOR      TER         chrF3
PBSMT         32.8        34.8        43.9        61.8
NMT           38.2        35.8        40.2        62.8
NMT SRC+IMG   40.6 (2.4)  37.5 (1.7)  37.7 (2.5)  65.2 (2.4)
IMG W         39.5 (1.3)  37.1 (1.3)  37.1 (3.1)  63.8 (1.0)
IMG E         41.1 (2.9)  37.7 (1.9)  37.9 (2.3)  65.7 (2.9)
IMG D         41.3 (3.1)  37.8 (2.0)  37.9 (2.3)  65.7 (2.9)

German→English [2/2]

Pre-training on back-translated comparable Multi30k data set (Elliott et al., 2016).

Model         BLEU4       METEOR      TER         chrF3
PBSMT         36.8        36.4        40.8        64.5
NMT           42.6        38.9        36.1        67.6
NMT SRC+IMG   43.2 (0.6)  39.0 (0.1)  35.5 (0.6)  67.7 (0.1)
IMG 2W        42.4 (0.2)  39.0 (0.1)  34.7 (1.4)  67.6 (0.0)
IMG E         43.9 (1.3)  39.7 (0.8)  34.8 (1.3)  68.6 (1.0)
IMG D         43.4 (0.8)  39.3 (0.4)  35.2 (0.9)  67.8 (0.2)

NMT SRC+IMG: Visualisation of attention states

(a) Image-to-target word alignments.
(b) Source-to-target word alignments.

References

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR 2015).

Caglayan, O., Aransa, W., Bardet, A., García-Martínez, M., Bougares, F., Barrault, L., Masana, M., Herranz, L., and van de Weijer, J. (2017). LIUM-CVC Submissions for WMT17 Multimodal Translation Task. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 432-439.

Calixto, I., Liu, Q., and Campbell, N. (2017a). Doubly-Attentive Decoder for Multi-modal Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1913-1924, Vancouver, Canada.

Calixto, I. and Liu, Q. (2017b). Incorporating Global Visual Features into Attention-based Neural Machine Translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1003-1014, Copenhagen, Denmark.

Elliott, D., Frank, S., Sima'an, K., and Specia, L. (2016). Multi30K: Multilingual English-German Image Descriptions. In Proceedings of the 5th Workshop on Vision and Language, VL@ACL 2016, Berlin, Germany.

Elliott, D. and Kádár, Á. (2017). Imagination improves Multimodal Translation. arXiv preprint arXiv:1705.04350.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385.

Hitschler, J., Schamoni, S., and Riezler, S. (2016). Multimodal Pivots for Image Caption Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2399-2409, Berlin, Germany.

Huang, P.-Y., Liu, F., Shiang, S.-R., Oh, J., and Dyer, C. (2016). Attention-based multimodal neural machine translation. In Proceedings of the First Conference on Machine Translation, pages 639-645, Berlin, Germany.

Simonyan, K. and Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR, abs/1409.1556.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In Blei, D. and Bach, F., editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2048-2057. JMLR Workshop and Conference Proceedings.

Thank you! Questions?