CAP 6412 Advanced Computer Vision

CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong Feb 25, 2016

Next week: Vision and language
Tuesday (03/01): Javier Lores
Thursday (03/03): Aisha Urooji
[Relation Phrases] Sadeghi, Fereshteh, Santosh K. Divvala, and Ali Farhadi. "VisKE: Visual Knowledge Extraction and Question Answering by Visual Verification of Relation Phrases." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1456-1464. 2015. & Secondary papers
[OCR in the wild] Jaderberg, Max, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. "Reading text in the wild with convolutional neural networks." International Journal of Computer Vision 116, no. 1 (2016): 1-20. & Secondary papers

Project 1: Due on 02/28
If you have discussed option 2 with me: send me the meeting minutes / slides --- grading criteria
If you take option 1: in total, >6,000 validation images; test 3 images per class of the validation set

Travel plan In Washington DC on 03/01, Tuesday Guest lecture by Dr. Ulas Bagci

Today Administrivia Recurrent Neural Networks (RNNs) (I) VQA, by Nandakishore

Why RNN? (1)
Feed-forward networks: model a static input-output mapping; no time series; a single forward direction only (e.g., CNNs)
Recurrent neural networks: model dynamic state transitions; time & sequence data; feedback connections exist (e.g., LSTM, GRU)

Why RNN? (2)
Markov models: model dynamic state transitions; time & sequence data; but only Markov (short-range) dependency and moderately sized state spaces
Recurrent neural networks: model dynamic state transitions; time & sequence data; long-range dependency and exponentially expressive states

Why RNN? (3) Image caption generator Image credits: Karpathy, Andrej, and Li Fei-Fei

Why RNN? (3) Image caption generator Sentence embedding Kiros, Ryan, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. "Skip-thought vectors." In Advances in Neural Information Processing Systems, pp. 3276-3284. 2015.

Why RNN? (3) Image caption generator Sentence embedding Word embedding Image credits: Karpathy, Andrej, and Li Fei-Fei

Why RNN? (3) Image caption generator Sentence embedding Word embedding Activity recognition Donahue, Jeffrey, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term recurrent convolutional networks for visual recognition and description." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625-2634. 2015.

Why RNN? (3) Image caption generator Sentence embedding Word embedding Activity recognition Video representation Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arxiv preprint arxiv:1502.04681 (2015).

Why RNN? (3) Image caption generator Sentence embedding Word embedding Activity recognition Video representation VQA Early action detection Human dynamics Scene labeling Language modeling Machine translation

(Discrete-time) RNN The processing occurs in discrete steps Image credit: http://mesin-belajar.blogspot.com/2016/01/a-brief-history-of-neural-nets-and-deep_84.html

(Discrete-time) RNN The processing occurs in discrete steps Image credits: Richard Socher

(Discrete-time) RNN At a single time step t: Image credits: Richard Socher

(Discrete-time) RNN Three time steps and beyond Image credits: Richard Socher

(Discrete-time) RNN Three time steps and beyond In math Image credits: Richard Socher
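
The equation that "In math" points to is only an image in the original slides. A reconstruction of the standard discrete-time RNN update, in the notation common to Socher's lectures (the slide's exact symbols may differ):

    h_t = \sigma\left( W^{(hh)} h_{t-1} + W^{(hx)} x_t \right), \qquad \hat{y}_t = \operatorname{softmax}\left( W^{(S)} h_t \right)

Here x_t is the input at step t, h_t the hidden state, and \hat{y}_t the prediction; the same weight matrices are reused at every time step.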

(Discrete-time) RNN Three time steps and beyond A layered feedforward net Tied weights for different time steps Conditioning (memorizing?) on all previous input Memory is cheap to save in RAM Image credits: Richard Socher
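
As a minimal sketch of the tied-weights unrolling described above (illustrative Python/NumPy, not the lecture's code; all dimensions and matrix names are hypothetical):

    import numpy as np

    # Hypothetical dimensions: input size 4, hidden size 8, T = 5 time steps.
    rng = np.random.default_rng(0)
    W_hh = rng.normal(scale=0.1, size=(8, 8))  # hidden-to-hidden weights
    W_hx = rng.normal(scale=0.1, size=(8, 4))  # input-to-hidden weights
    xs = rng.normal(size=(5, 4))               # one input sequence

    h = np.zeros(8)                            # initial hidden state
    for x_t in xs:
        # The SAME W_hh and W_hx are applied at every time step (tied weights),
        # so the unrolled RNN is a layered feedforward net with shared weights.
        h = np.tanh(W_hh @ h + W_hx @ x_t)
    print(h)  # state after conditioning on all previous input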

(Discrete-time) RNN Compare with the holistically nested network Differences: tied weights for the RCNN; #layers not fixed for the RCNN; input at every layer for the RCNN; less RAM for the RCNN Image credits: Richard Socher, Saining Xie, and Zhuowen Tu

Today Administrivia Recurrent Neural Networks (RNNs) (I) VQA, by Nandakishore

Upload slides before or after class See Paper Presentation on UCF webcourse Sharing your slides: refer to the original sources of images, figures, etc. in your slides; convert them to a PDF file; upload the PDF file to Paper Presentation after your presentation

A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input - NIPS 2014 Mateusz Malinowski and Mario Fritz Max Planck Institute for Informatics Saarbrücken, Germany Presented by: Nandakishore Puttashamachar Nandakishore@knights.ucf.edu

Presentation Outline Motivation Goal Background Approach Experiments Quantitative results Contribution Summary

Motivation Can machines answer questions about images? Evaluating the chain of perception, representation, and deduction A holistic and open-ended test that resembles the famous Turing Test

Goal To answer natural-language queries about images Question: What is on the desk and behind the black cup? Answer: bottle

Background

Background How does the system interpret the question?

Approach

Background What is a world? Facts - chair(segment, color, X_{min,mean,max}, Y_{min,mean,max}, Z_{min,mean,max}) Relations - above(A, B), infront(A, B)

Background Single-world Generate a single perceived world W based on segmentation Given a question Q and a world W, predict an answer A Compute the posterior, which marginalizes over the latent logical forms T The denotation of a logical form T on the world W is obtained by evaluating T on W
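
The posterior on this slide appears only as an image; reconstructed from the paper's formulation, it marginalizes over the latent logical forms T:

    P(A \mid Q, W) = \sum_{T} P(A \mid T, W) \, P(T \mid Q)

where P(A \mid T, W) is obtained by evaluating the logical form T on the world W.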

Background Single-world where a log-linear distribution is placed over the logical forms The feature vector φ measures the compatibility between Q and T Model parameters are learnt by searching over a restricted space of valid trees and with gradient-descent updates of the parameters Datalog inference produces answers from the latent logical forms
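
The "where" clause refers to an equation shown as an image; the log-linear form, reconstructed from the underlying semantic-parsing model, is:

    P(T \mid Q; \theta) \propto \exp\left( \theta^{\top} \phi(Q, T) \right)

with feature vector \phi(Q, T) and model parameters \theta.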

Approach Perceived-world The world W now consists of facts derived from automatic semantic image segmentation S. Semantic segmentation is used to collect information about objects, such as object class, 3D position, and color. An object hypothesis is represented as an n-tuple predicate(instance_id, image_id, color, spatial_loc), where predicate ∈ { table, bag, books, window, ... } instance_id - the object's id image_id - id of the image containing the object spatial_loc - min, max, and mean location of the object along the X, Y, Z axes

Approach Perceived-world Predicates defining spatial relations between two objects -
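
The predicate definitions on this slide are an image. A minimal Python sketch of how such spatial predicates could be computed from the object tuples above (function names and the tolerance are hypothetical, not the paper's exact definitions):

    from typing import NamedTuple

    class Obj(NamedTuple):
        """Mean 3D position of an object hypothesis (one slice of the n-tuple)."""
        x_mean: float
        y_mean: float
        z_mean: float

    def above(a: Obj, b: Obj, tol: float = 0.05) -> bool:
        # A is above B if its mean Y coordinate is higher by some tolerance.
        return a.y_mean > b.y_mean + tol

    def infront(a: Obj, b: Obj, tol: float = 0.05) -> bool:
        # A is in front of B if it is closer to the camera along Z.
        return a.z_mean < b.z_mean - tol

    cup, desk = Obj(0.1, 0.9, 1.2), Obj(0.1, 0.7, 1.3)
    print(above(cup, desk), infront(cup, desk))  # True True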

Approach Multi-world The previous approach did not consider the uncertainty in class labeling Multiple interpretations of the scene / different perceived worlds Computationally efficient: the probability of an answer can be estimated on the different worlds independently of each other
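
The multi-world posterior (shown as an image on the following "Approach" slide) additionally marginalizes over worlds W derived from the segmentation S; reconstructed from the paper:

    P(A \mid Q, S) = \sum_{W} \sum_{T} P(A \mid T, W) \, P(W \mid S) \, P(T \mid Q)

Each world contributes independently, which is what makes the estimate computationally efficient.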

Approach

Approach Scalability and Implementation For worlds containing many facts and spatial relations, the implementation is computationally expensive. Batch-based approximation: every image induces a set of facts called a batch of facts. For every test image, find the k nearest neighbors in the space of training batches, using TF-IDF to measure similarity. In short, build a training world from the k images whose content is most similar to the perceived world of the test image.
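
A minimal sketch of this batch-based approximation (the scikit-learn calls are standard; the fact-to-text serialization is a hypothetical simplification):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    # Each training image induces a "batch of facts", serialized here as text.
    train_batches = [
        "chair table above(chair, floor)",
        "cup desk infront(cup, monitor)",
        "bed lamp above(lamp, bed)",
    ]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(train_batches)

    # For a test image, retrieve the k most similar training batches by TF-IDF.
    knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
    test_batch = ["cup desk bottle infront(bottle, cup)"]
    _, idx = knn.kneighbors(vectorizer.transform(test_batch))

    # Build the training world from these k similar images.
    print([train_batches[i] for i in idx[0]])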

Experiments DAtaset for QUestion Answering on Real-world images (DAQUAR) Images and semantic segmentation: NYU-Depth V2 dataset, 1449 RGBD images with annotated semantic segmentations; every pixel is mapped into 40 classes, out of which 37 informative object classes are used New dataset of questions and answers with two types of annotation, synthetic and human; synthetic QA pairs are generated automatically from templates

Experiments

Experiments DAtaset for QUestion Answering on Real-world images (DAQUAR) Performance measures: standard accuracy, where A^i and T^i are the sets of objects in the i-th answer and ground truth, respectively, and WUPS, a soft measure of the similarity between the generated answer and the ground-truth label
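
The WUPS definition on this slide is an image; as reconstructed from the paper, with A^i and T^i the answer and ground-truth sets for the i-th of N questions:

    \mathrm{WUPS}(A, T) = \frac{1}{N} \sum_{i=1}^{N} \min\left\{ \prod_{a \in A^i} \max_{t \in T^i} \mathrm{WUP}(a, t), \; \prod_{t \in T^i} \max_{a \in A^i} \mathrm{WUP}(a, t) \right\}

where WUP is the Wu-Palmer similarity; thresholding it yields the WUPS scores at different thresholds reported below.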

Quantitative Results How does the multi-world approach perform under uncertain segmentation and unknown logical forms? There is a severe drop in performance when switching from human to automatic segmentation.

Quantitative Results Performance of the system with human QA pairs (HumanQA) Human annotations exhibit more variation than the synthetic approach: questions are longer and include more spatially related objects.

Quantitative Results WUPS score for different thresholds

Contributions An approach and a dataset of question-answer pairs Combines language with perception in a multi-world Bayesian framework

The Big Picture

Thank you