CAP 6412 Advanced Computer Vision


CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong Feb 23, 2016

Today Administrivia Neural networks & Backpropagation (IX) Pose estimation, by Amar

This week: Vision and language Tuesday (02/23) Suhas Nithyanandappa Thursday (02/25) Nandakishore Puttashamachar [VQA-1] Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual Question Answering." arXiv preprint arXiv:1505.00468 (2015). & Secondary papers [VQA-2] Malinowski, Mateusz, and Mario Fritz. "A multi-world approach to question answering about real-world scenes based on uncertain input." In Advances in Neural Information Processing Systems, pp. 1682-1690. 2014. & Secondary papers

Next week: Vision and language Tuesday (03/01) Javier Lores Thursday (03/03) Aisha Urooji [Relation Phrases] Sadeghi, Fereshteh, Santosh K. Divvala, and Ali Farhadi. VisKE: Visual Knowledge Extraction and Question Answering by Visual Verification of Relation Phrases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1456-1464. 2015. & Secondary papers [OCR in the wild] Jaderberg, Max, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. "Reading text in the wild with convolutional neural networks." International Journal of Computer Vision 116, no. 1 (2016): 1-20. & Secondary papers

Project 1: Due on 02/28 If you have discussed option 2 with me Send me the meeting minutes / slides --- grading criteria If you take option 1 In total, >6,000 validation images Test 3 images per class of the validation set

Travel plan At Washington DC on 03/01, Tuesday Guest lecture by Dr. Ulas Bagci

Today Administrivia Neural networks & Backpropagation (IX) VQA-1, by Suhas

Recap
Data: (x_i, y_i) ∈ X × Y, i = 1, 2, …, n
Goal: find the labeling function c : X → Y, c(x) = y
Hypotheses: net(x; θ)
Expected risk: R(θ) = E_(x,y)[1(net(x; θ) ≠ y)]
Empirical risk: R̂(θ) = (1/n) Σ_{i=1}^{n} L(x_i, y_i; θ)

Recap
Empirical risk: R̂(θ) = (1/n) Σ_{i=1}^{n} L(x_i, y_i; θ)
Parameter estimation: θ̂ ← argmin_θ R̂(θ)
Optimization by stochastic gradient descent (SGD), with learning rate η:
w_t ← w_{t−1} − η ∇L(x_i, y_i; θ_{t−1})
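The SGD update above can be sketched on a toy problem (the one-parameter model y = w·x and the data below are illustrative assumptions, not from the lecture):

```python
import numpy as np

# Toy one-parameter model y = w * x with squared loss L(x, y; w) = (w*x - y)^2.
# Each step applies the slide's update w_t <- w_{t-1} - eta * grad L(x_i, y_i; w_{t-1})
# using the gradient of a single randomly chosen data point.
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=200)
ys = 3.0 * xs                                     # true parameter w* = 3

w, eta = 0.0, 0.1                                 # initial weight, learning rate
for epoch in range(20):
    for i in rng.permutation(len(xs)):            # visit samples in random order
        grad = 2.0 * (w * xs[i] - ys[i]) * xs[i]  # dL/dw at one data point
        w -= eta * grad                           # SGD step

print(round(w, 3))                                # converges near 3.0
```

Each step uses a single noisy per-sample gradient rather than the full-batch gradient, which is exactly why the empirical-risk curve oscillates in the convergence plots that follow.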

Checking for convergence 1
[Plot: R̂ stays flat over iterations t.] What happened?
1. Bug in the program
2. Learning rate too high
3. What if c(x) = y is totally random?
4. No appropriate pre-processing of the input data → dummy gradients

Checking for convergence 2
[Plot: R̂ oscillates over iterations t.] What happened? Different learning rate → different local optimum; decrease the rate.
Why the oscillation:
1. Skipped the optimum because of a large learning rate
2. The gradient of a single data point is noisy
3. We are not computing the real R; instead we approximate it (see p. 12)

Checking for convergence 3
Turning down the learning rate! After some iterations:
R̂ ≈ const₁ / t + const₂

Overfitting
[Plot: error vs. iteration t for training, validation, and test — training error keeps falling while validation error turns back up.]
What shall we do?
- Early stopping
- Data augmentation
- Regularization (dropout)
- Reduce the network complexity
At which point should we (early) stop?
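The early-stopping option above can be sketched as follows; the helper names `train_step` and `val_error` (one optimizer step; current validation error) are illustrative assumptions, not code from the lecture:

```python
# Minimal early-stopping sketch: remember the iteration with the best
# validation error and stop once it has not improved for `patience` checks.
def train_with_early_stopping(train_step, val_error, max_iters=10_000, patience=5):
    best_err, best_iter, stale = float("inf"), 0, 0
    for t in range(max_iters):
        train_step()                    # one SGD step on the training set
        err = val_error()               # error on the held-out validation set
        if err < best_err:
            best_err, best_iter, stale = err, t, 0  # new best: remember it
        else:
            stale += 1                  # no improvement at this check
            if stale >= patience:
                break                   # validation error stopped improving
    return best_iter, best_err          # best_iter answers "where should we stop?"

# Demo with a synthetic validation curve that bottoms out and then rises:
errors = iter([5.0, 4.0, 3.0, 2.0, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0])
best_iter, best_err = train_with_early_stopping(lambda: None, lambda: next(errors),
                                                patience=5)
print(best_iter, best_err)   # best validation error was at iteration 3
```

Keeping the parameters from the best validation iteration, rather than the last one, is what makes early stopping act as a regularizer.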

Under-fitting
[Plot: error vs. iteration t; training and validation error both remain high.] Possible reasons?

Recap Neuron Two basic neural network (NN) structures Convolutional NN (CNN): a special feedforward network Using NN to approximate concepts (underlying labeling function) Training a NN (how to determine the weights of neurons)? Gradient descent, stochastic gradient descent (SGD) Backpropagation (for efficient gradient computation) Debugging tricks: learning rate, momentum, early stopping, dropout, weight decay, etc.

Next: Recurrent neural networks (RNN) Feed-forward networks Recurrent neural networks Image credit: http://mesin-belajar.blogspot.com/2016/01/a-brief-history-of-neural-nets-and-deep_84.html

Why RNN?
Feed-forward networks (e.g., CNN): model a static input-output concept; no time series; a single forward direction exists.
Recurrent neural networks (e.g., LSTM): model dynamic state transitions; time & sequence data; feedback connections exist.

Why RNN? (cont'd)
Markov models: model dynamic state transitions; time & sequence data; Markov (short-range) dependency; moderately sized state spaces.
Recurrent neural networks: model dynamic state transitions; time & sequence data; long-range dependency; exponentially expressive states.

Next: Recurrent neural networks (RNN) RNN Vanishing and exploding gradients Long short-term memory (LSTM) Bi-directional RNN, GRU, Training algorithms, Applications (tentative)

Today Administrivia Neural networks & Backpropagation (IX) VQA-1, by Suhas

Upload slides before or after class
See "Paper Presentation" on the UCF webcourse.
Sharing your slides: refer to the original sources of images, figures, etc. in your slides; convert them to a PDF file; upload the PDF file to "Paper Presentation" after your presentation.

VQA: Visual Question Answering Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh; The IEEE International Conference on Computer Vision (ICCV), 2015 Presented by Suhas Nithyanandappa suhasn@knights.ucf.edu February 23, 2016

Presentation Outline Motivation Problem Statement VQA Dataset Collection VQA Dataset Analysis VQA Baselines and Methods Results Related Work

Motivation
Using word n-gram statistics to generate reasonable image captions isn't Artificial Intelligence. The current state of the art doesn't capture the common sense reasoning present in the image.
Image credits: (c) Dhruv Batra

Why is AI hard? Slide credits:(c) Dhruv Batra

Is it useful? VizWiz
Slide credits: (c) Dhruv Batra

Problem Statement
To capture such common sense knowledge we need a big enough dataset. Open-ended questions require a potentially vast set of AI capabilities to answer: fine-grained recognition, object detection, activity recognition, and knowledge base reasoning. To build such a system we need building blocks from various fields: CV, NLP, and KR. VQA is a good place to start.

VQA Dataset Collection
It was collected using the Amazon Mechanical Turk service: >10,000 Turkers, >41,000 human hours ≈ 4.7 human years ≈ 20.61 person-job-years!
Real Images: 123,287 training and validation images and 81,434 test images from the newly released Microsoft Common Objects in Context (MS COCO) dataset. The images contain multiple objects and rich contextual information; as they are visually complex, they are well suited for the VQA task. The dataset contains five single-sentence captions for each image.
Image credits: (c) MS COCO

VQA Dataset Collection
Abstract Images: to focus on reasoning instead of low-level vision tasks, a new abstract scenes dataset containing 50K scenes was created. It contains 20 paper-doll human models spanning genders, races, and ages, with 8 different expressions. The limbs are adjustable to allow continuous pose variations. The clipart can depict both indoor and outdoor scenes, and the set contains over 100 objects and 31 animals in various poses.
Image credits: appendix of the paper

VQA Dataset Collection Image credits : Appendix of the paper

VQA Dataset Collection
Questions: collecting interesting, diverse, and well-posed questions is a significant challenge. Many questions require low-level computer vision knowledge, such as "What color is the cat?" or "How many chairs are present in the scene?" However, importance was given to collecting questions that require common sense knowledge about the scene, such as "What sound does the pictured animal make?"
Goal: with a wide variety of question types and difficulties, we may be able to measure the continual progress of both visual understanding and common sense reasoning.
Image credits: appendix of the paper

AMT Data Collection Interface Slide credits:(c) Dhruv Batra

Slide credits:(c) Dhruv Batra

Slide credits:(c) Dhruv Batra

VQA Dataset Collection
Answers: open-ended questions result in a diverse set of possible answers, and human subjects may disagree on the correct answer (e.g., some saying "yes" while others say "no"). To handle these discrepancies, 10 answers are gathered from unique workers.
Two types of questions: open-answer and multiple-choice. For the open-answer task's accuracy metric, an answer is deemed 100% accurate if at least 3 workers provided that exact answer.
Confidence level in answering is collected via the responses "yes", "no", and "maybe".
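The open-answer accuracy rule above (full credit once at least 3 of the 10 workers match, i.e. min(#matches / 3, 1)) can be sketched as follows; the function name and the toy answer list are illustrative:

```python
# Sketch of the open-answer accuracy metric: a predicted answer earns
# min(#workers who gave that exact answer / 3, 1), so 3+ matches = 100%.
def vqa_accuracy(predicted, human_answers):
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)

humans = ["yes"] * 7 + ["no"] * 2 + ["maybe"]   # the 10 collected answers
print(vqa_accuracy("yes", humans))   # 1.0 -- at least 3 workers said "yes"
print(vqa_accuracy("no", humans))    # ~0.667 -- only 2 workers said "no"
```

Averaging this per-question score over the dataset gives partial credit to plausible answers that a minority of workers chose.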

AMT Interface for Answer Collection
Slide credits: (c) Dhruv Batra

Slide credits:(c) Dhruv Batra

VQA Dataset Analysis
The dataset includes:
- MS COCO dataset: 614,163 questions and 7,984,119 answers for 204,721 images
- Abstract scenes dataset: 150,000 questions with 1,950,000 answers for 50,000 images
Questions: we can cluster questions into different types based on the words that start the question. Interestingly, the distribution of questions is quite similar for real images and abstract scenes. Length: most questions range from four to ten words.
Answers: many questions have yes or no answers. Questions such as "What is..." and "What type..." have a rich diversity of responses, while question types such as "What color..." or "Which..." have more specialized responses. Length: the proportions of answers containing one, two, or three words are 89.32%, 6.91%, and 2.74% for real images and 90.51%, 5.89%, and 2.49% for abstract scenes. This is in contrast with image captions, which generically describe the entire image and hence tend to be longer.

VQA Dataset Analysis
Yes/No and Number Answers: among the yes/no questions, there is a bias towards "yes": 58.83% and 55.86% of yes/no answers are "yes" for real images and abstract scenes, respectively.
Subject Confidence: a majority of the answers were labelled as confident for both real images and abstract scenes.
Inter-Human Agreement: does the self-judgment of confidence correspond to the answer agreement between subjects?
Image credits: (c) Dhruv Batra

VQA Dataset Analysis
Is the Image Necessary? Questions can sometimes be answered correctly using common sense knowledge alone. This was tested by asking three subjects to answer the questions without seeing the image.

VQA Dataset Analysis
Captions vs. Questions: do generic image captions provide enough information to answer the questions? Answer: the words mentioned in the captions are statistically different from those mentioned in the questions + answers (Kolmogorov-Smirnov test, p < .001) for both real images and abstract scenes.
Table credits: (c) Dhruv Batra

VQA Dataset Analysis
Which Questions Require Common Sense? To capture the knowledge required external to the image, subjects judged the age needed to answer each question: toddler (3-4), younger child (5-8), older child (9-12), teenager (13-17), adult (18+).
Statistics: for 47.43% of questions, 3 or more subjects voted "yes" to common sense (18.14% with 6 or more).
Image credits: (c) Dhruv Batra

Least commonsense questions
Slide credits: (c) Dhruv Batra

Most commonsense questions
Slide credits: (c) Dhruv Batra

VQA Baseline and Analysis
To establish baselines, the difficulty of the VQA dataset for the MS COCO images is explored:
- Randomly choosing an answer from the top 1K answers of the VQA train/validation dataset gives 0.12% test-standard accuracy.
- Always selecting the most popular answer ("yes") gives 29.72%.
- Picking the most popular answer per question type gives 36.18%.
- A nearest-neighbour approach gives 40.61% on the validation dataset.
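The "most popular answer" baseline above can be sketched on toy data (the answer lists are made up for illustration, not the real VQA annotations):

```python
from collections import Counter

# Prior-only baseline: always predict the single most frequent training
# answer, ignoring both the question and the image.
train_answers = ["yes", "yes", "no", "2", "yes", "red", "no", "yes"]
test_answers  = ["yes", "no", "yes", "blue", "yes"]

prior = Counter(train_answers).most_common(1)[0][0]            # most frequent answer
acc = sum(a == prior for a in test_answers) / len(test_answers)
print(prior, acc)   # yes 0.6
```

That such a blind prior reaches 29.72% on the real dataset is a direct consequence of the "yes" bias noted in the dataset analysis.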

2-Channel VQA Model (BoW question embedding)
Image channel: [convolution + non-linearity → pooling] × 2 → fully-connected MLP → image embedding.
Question channel: bag-of-words (BoW) question embedding built from the top question words plus the beginning-of-question words; e.g., for "How many horses are in this image?" the entries for "how", "are", "horses", and "image" are 1, while "what", "where", "is", and "could" are 0.
Both embeddings feed a neural network with 1K output units and a softmax over the top-K answers.
Slide credits: (c) Dhruv Batra
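The BoW question channel can be sketched as a binary vector over a word list; the small vocabulary below is an illustrative assumption (the real model uses the top question words):

```python
# Toy bag-of-words question embedding: one binary entry per vocabulary word,
# set to 1 if that word appears in the question.
vocab = ["what", "where", "how", "is", "could", "are", "horses", "image"]

def bow_embed(question):
    words = set(question.lower().rstrip("?").split())
    return [1 if w in words else 0 for w in vocab]

emb = bow_embed("How many horses are in this image?")
print(emb)   # [0, 0, 1, 0, 0, 1, 1, 1]
```

Unlike the LSTM channel on the next slide, this representation discards word order entirely, which is why it serves as the simpler of the two question encoders.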

2-Channel VQA Model (LSTM question embedding)
Same architecture, but the question ("How many horses are in this image?") is embedded with an LSTM instead of BoW.
Slide credits: (c) Dhruv Batra

VQA Baseline and Analysis Results Slide credits:(c) Dhruv Batra

VQA Baseline and Analysis Results Image credits:(c) Dhruv Batra

Future Work
Training on task-specific datasets may help enable practical VQA applications. Creating video datasets of basic actions might help in capturing common sense.

Related Work: Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, Wei Xu

Introduction
It contains over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations. The quality of the answers generated by the mQA model on this dataset is evaluated by human judges through a Turing test. The judges also provide a score (0, 1, or 2; the larger the better) indicating the quality of each answer.

Model Image credits: Paper -Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering

Fusing: the activation of the fusing layer f(t) for the t-th word in the answer is calculated as:
f(t) = g(V_rQ · rQ + V_I · I + V_rA · rA(t) + V_w · w(t))
where + denotes element-wise addition; rQ stands for the activation of the LSTM(Q) memory cells at the last word of the question; I denotes the image representation; rA(t) and w(t) denote the activation of the LSTM(A) memory cells and the word embedding of the t-th word in the answer, respectively; V_rQ, V_I, V_rA, and V_w are the weight matrices to be learned; and g(·) is an element-wise non-linear function.
Image credits: paper "Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering"
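The fusing layer can be sketched in numpy as below; the vector dimensions and the choice g = tanh are illustrative assumptions, and only the structure of the equation follows the slide:

```python
import numpy as np

# Each V_* matrix maps one input (question state, image, answer state, word
# embedding) into a common d-dimensional space; the results are summed
# element-wise and passed through the non-linearity g.
d, dq, di, da, dw = 4, 6, 5, 6, 3       # fused / question / image / answer / word dims
rng = np.random.default_rng(0)
V_rq = rng.normal(size=(d, dq))         # weights on the question LSTM state rQ
V_i  = rng.normal(size=(d, di))         # weights on the image representation I
V_ra = rng.normal(size=(d, da))         # weights on the answer LSTM state rA(t)
V_w  = rng.normal(size=(d, dw))         # weights on the word embedding w(t)

def fuse(r_q, img, r_a_t, w_t, g=np.tanh):
    # f(t) = g(V_rq rQ + V_i I + V_ra rA(t) + V_w w(t))
    return g(V_rq @ r_q + V_i @ img + V_ra @ r_a_t + V_w @ w_t)

f_t = fuse(rng.normal(size=dq), rng.normal(size=di),
           rng.normal(size=da), rng.normal(size=dw))
print(f_t.shape)   # (4,)
```

In the full model this fused vector is computed once per answer word t and feeds the layer that predicts the next answer word.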

Results Table credits: Paper -Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering