A deep learning strategy for wide-area surveillance


A deep learning strategy for wide-area surveillance
17/05/2016
Mr Alessandro Borgia
Supervisor: Prof Neil Robertson
Heriot-Watt University, EPS/ISSS VisionLab
In partnership with Roke Manor Research

Outline
The proposed re-identification system:
- A bootstrap process for tracking: unifying tracking and deep learning-based re-identification
- Intra-camera tracking scheme
- Inter-camera tracking: time transitions over the network
Cross-Input Neighborhood Differences (CIND) CNN
2nd CNN:
- Going deeper by residual learning
- Triplet training scheme
- Batch normalization
Visualizing deep features
References

Motivation
Context: people tracking across multiple non-overlapping cameras
Problem: dealing with targets that disappear for extended periods of time (long occlusions)
Challenges arising across different camera views: complex variations in lighting, pose, viewpoint and occlusion
Traditional approaches: engineering hand-crafted features
Approach taken here: a deep learning-based (DL) re-identification strategy
Why? A deep architecture makes it possible to model effectively the mixture of complex multimodal photometric and geometric transforms that targets undergo
Novelty: the DL-based re-identification scheme acts as a bootstrap process for the inter-camera tracking task, defining a unified framework

The proposed system
Iterative, adaptive interaction between the re-identification and tracking tasks
Effect: the two tasks boost each other, yielding more powerful tracking in the presence of disappearing targets and more reliable re-identifications
The re-id stage feeds the process of automatic refinement of the logical topology and temporal interdependences of the network (learned automatically from observations)
The temporal priors, by feeding the CNN classifier (and back-tuning its weights accordingly), enable the CNN to take more reliable, context-aware re-id decisions

Intra-camera tracking scheme
Investigated context: a wide-area surveillance network with unknown, unconstrained topology and non-calibrated static CCTV cameras
Tracking based only on re-identifications produced by a CNN
Entry and exit points of all the built trajectories are gathered
Entry/exit regions are estimated with a Gaussian Mixture Model fitted by the Expectation-Maximization algorithm (a sketch of this step follows below)
Entry/exit points represent the network nodes from which the logical topology of the network is built
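The slides give no code for this step; as a minimal sketch, assuming the entry/exit points are collected as 2D image coordinates, the region estimation could be done with scikit-learn's GaussianMixture, which fits the mixture by EM. The number of regions and the input file are illustrative assumptions:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical input: one (x, y) image coordinate per observed
    # trajectory endpoint (entry or exit point) of a given camera.
    points = np.loadtxt("camera1_entry_exit_points.txt")  # shape (N, 2)

    # Fit the GMM by Expectation-Maximization; 4 regions per view is
    # an illustrative choice, not a value taken from the slides.
    gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
    gmm.fit(points)

    # Each mixture component is one entry/exit region: its mean gives the
    # node location, its covariance the spread of the region.
    for mean, cov in zip(gmm.means_, gmm.covariances_):
        print("region centre:", mean, "covariance:", cov, sep="\n")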

Time transitions over all links
(Figure: transition-time relationships between pairs of cameras C_a and C_b over the network links.)

Advantages
Context-aware decisions that boost the tracking of people going out of view
More accurate intra-view tracks, thanks to the strong discrimination capabilities of a deep architecture in re-id
Re-identifications based on posterior probabilities built from the spatio-temporal priors over the network
Automatic and adaptive learning of the logical topology and the time-transition relationships of the network
Robustness against camera breakdowns

1st CNN implemented

1st CNN: Cross-Input Neighborhood Differences CNN
Each output a_j of the softmax function can be interpreted as the predicted probability p_j = P(y = j | x) of the j-th class given a sample vector x:
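The standard softmax expression referred to here, assuming a_j in the formula denotes the j-th pre-softmax activation, is:

    p_j = exp(a_j) / Σ_k exp(a_k)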

Data augmentation and data balancing (mini-batches)
Label-preserving operations applied: random 2D translational transforms on each pedestrian image
The stripes of the bounding box uncovered by the shift are filled with pixels randomly selected from the original image
Why mini-batches? First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases. Second, computation over a batch can be much more efficient than m separate computations for individual examples, thanks to the parallelism afforded by modern computing platforms.
Mini-batch size: 256 images
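A minimal sketch of the translation-based augmentation described above, assuming images are HxWx3 numpy arrays; the maximum shift is an illustrative value, not one given on the slides:

    import numpy as np

    def random_translate(img, max_shift=8, rng=np.random.default_rng()):
        """Shift a pedestrian image by a random 2D offset, filling the
        stripes uncovered by the shift with pixels from the image."""
        h, w, c = img.shape
        dy = int(rng.integers(-max_shift, max_shift + 1))
        dx = int(rng.integers(-max_shift, max_shift + 1))

        # Background of pixels drawn at random from the original image,
        # so the uncovered stripes are filled as the slide describes.
        flat = img.reshape(-1, c)
        out = flat[rng.integers(0, flat.shape[0], size=h * w)].reshape(h, w, c)

        # Paste the shifted crop over the random background.
        ys, yd = max(0, -dy), max(0, dy)
        xs, xd = max(0, -dx), max(0, dx)
        out[yd:h - ys, xd:w - xs] = img[ys:h - yd, xs:w - xd]
        return out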

CIND-CNN limitations
Issue: a huge peak in the loss (~1e20) within the first epoch, after some mini-batch iterations
Backpropagation with SGD makes it very sensitive to the initialization values and to the initial learning rate
Not very deep
Deep learning paradigm violation: the function being approximated is constrained at the level of the neighborhood-difference layer
This CNN performs feature extraction and classification through a fully connected layer, which prevents making sense of how the features are distributed in their space

2nd CNN implemented

A more flexible approach
The end-to-end neural network can learn an optimal metric for discriminating targets automatically.
This scheme provides a clear objective function and treats the feature maps as multidimensional points in a geometric (Euclidean) space, which makes it possible to learn useful representations through distance comparisons.
Advantage: any clustering algorithm can easily be applied to associate these points when exploring the feature space (see the sketch below)
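To illustrate that advantage, a minimal sketch: once each detection is mapped to an embedding vector by the network, an off-the-shelf clustering algorithm can group the points by identity. DBSCAN, its thresholds and the input file are illustrative assumptions, not choices stated on the slides:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Hypothetical embeddings: one feature vector per detection, as
    # produced by the trained network (shape: n_detections x emb_dim).
    embeddings = np.load("detection_embeddings.npy")

    # Cluster in the Euclidean embedding space: points closer than eps
    # are associated with the same identity candidate.
    labels = DBSCAN(eps=0.6, min_samples=2, metric="euclidean").fit_predict(embeddings)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print("identity clusters found:", n_clusters)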

Going deeper by deep residual learning [6]
Does a deep CNN learn more as more layers are stacked?
Problem: vanishing/exploding gradients. This can be addressed by intermediate normalization layers and Rectified Linear Units.
Problem: accuracy degradation, which is not caused by overfitting, because the training error also increases.
Deep residual learning framework:
- Layers learn residual functions with reference to their inputs, instead of learning unreferenced functions
- Residual networks are easier to optimize
- They can gain accuracy from increased depth (3.57% error on ImageNet with 152-layer residual nets)
- Lower complexity at equal depth: identity shortcuts are parameter-free, which helps training
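A minimal PyTorch sketch of the residual idea from [6]: the stacked layers learn a residual F(x), and the parameter-free identity shortcut adds x back. The layer sizes are illustrative, not the exact configuration used in this work:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """y = F(x) + x: the layers learn the residual F with reference
        to the input x instead of an unreferenced mapping."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + x)  # parameter-free identity shortcut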

Siamese vs triplet networks
(Diagram: a siamese network feeds two inputs x_1, x_2 through the shared Net and compares them with a pairwise similarity function; a triplet network compares ||Net(x) - Net(x+)||^2 against ||Net(x) - Net(x-)||^2 for an anchor x, a positive x+ and a negative x-.)
Siamese networks are sensitive to calibration, in the sense that the notion of similarity vs dissimilarity requires context. For example, a person might be deemed similar to another person when a dataset of random objects is provided, but dissimilar to that same person when we wish to distinguish between two individuals within a set of individuals only. With the triplet model, no such calibration is required.
Triplet networks learn a better representation than siamese networks, improving classification accuracy in several problems.

2nd CNN: network structure
(Diagram of the network, from input to output:)
- Normalized input: 3x288x96
- Convolutional layer + batch normalization: 16x288x96
- Residual block: 16x288x96
- Residual block (increase dim): 32x144x48
- Residual block: 32x144x48
- Residual block (increase dim): 64x72x24
- Global pool layer: 64-dimensional output (Net)
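Expanding the block sketch above, this is one way the diagram's stack could be assembled in PyTorch. The dimension-increasing blocks are assumed to halve the spatial size with stride-2 convolutions and a 1x1 projection on the shortcut, a common choice in [6] that the slide does not spell out:

    import torch
    import torch.nn as nn

    def conv_bn(c_in, c_out, stride=1):
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out))

    class Block(nn.Module):
        """Residual block; when c_out != c_in the spatial size is halved
        and a 1x1 projection adapts the shortcut (assumed design)."""
        def __init__(self, c_in, c_out):
            super().__init__()
            stride = 2 if c_out != c_in else 1
            self.body = nn.Sequential(conv_bn(c_in, c_out, stride),
                                      nn.ReLU(inplace=True),
                                      conv_bn(c_out, c_out))
            self.shortcut = (nn.Identity() if c_out == c_in else
                             nn.Conv2d(c_in, c_out, 1, stride=2, bias=False))

        def forward(self, x):
            return torch.relu(self.body(x) + self.shortcut(x))

    # Stack following the diagram: 3x288x96 -> 16 -> 32 -> 64 channels.
    net = nn.Sequential(
        conv_bn(3, 16), nn.ReLU(inplace=True),   # 16 x 288 x 96
        Block(16, 16), Block(16, 32),            # 32 x 144 x 48
        Block(32, 32), Block(32, 64),            # 64 x 72 x 24
        nn.AdaptiveAvgPool2d(1), nn.Flatten())   # 64-dim embedding

    print(net(torch.randn(1, 3, 288, 96)).shape)  # torch.Size([1, 64])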

Training by the triplet network scheme
Learns a mapping into a Euclidean space for identity verification, where distances directly correspond to a measure of the similarity of two pedestrians.
The triplet loss enforces a margin between each pair of images of one person and all other people.
The triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity. The loss to minimize is given below.
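For the FaceNet-style triplet loss of [2], with f the network embedding, (x_a, x_p, x_n) the anchor/positive/negative triplet and α the enforced margin, the loss reads:

    L = Σ_i max(0, ||f(x_a^i) - f(x_p^i)||^2 - ||f(x_a^i) - f(x_n^i)||^2 + α)

PyTorch ships a ready-made version of this loss; a minimal usage sketch (note that nn.TripletMarginLoss uses the plain, non-squared L2 distance by default, and the margin value here is an assumption, not taken from the slides):

    import torch
    import torch.nn as nn

    criterion = nn.TripletMarginLoss(margin=0.2, p=2)
    # Dummy 64-dim embeddings for a batch of 8 triplets.
    anchor, positive, negative = (torch.randn(8, 64) for _ in range(3))
    loss = criterion(anchor, positive, negative)
    print(loss.item())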

Batch normalization (BN)
Internal covariate shift: the change in the distribution of network activations due to the change in network parameters during training. The layers need to continuously adapt to the new distributions.
Small changes to the network parameters amplify as the network becomes deeper.
Impact: it slows down the training by requiring lower learning rates and careful parameter initialization.
BN normalizes each scalar feature independently and adds two learned scale and shift parameters, so that the transform is able to represent the identity.
It allows much higher learning rates and less care about initialization.
It acts as a regularizer, often eliminating the need for Dropout.
It achieves the same accuracy with fewer training steps (even for non-decorrelated features).
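Concretely, the BN transform of [8] normalizes each activation over the mini-batch and then re-scales it with the two learned parameters γ and β; a minimal numpy sketch:

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        """BN transform from [8]: x_hat = (x - mu) / sqrt(var + eps),
        then y = gamma * x_hat + beta, per feature over the mini-batch."""
        mu = x.mean(axis=0)                   # per-feature batch mean
        var = x.var(axis=0)                   # per-feature batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)
        return gamma * x_hat + beta           # identity recoverable via gamma, beta

    x = np.random.randn(256, 64)              # mini-batch of 256, 64 features
    y = batch_norm(x, gamma=np.ones(64), beta=np.zeros(64))
    print(y.mean(axis=0)[:3], y.std(axis=0)[:3])  # approx. 0 and 1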

From simulations

From simulations
Augmentation factor: 3; number of images after augmentation: 42086
11 conv layers, ~80000 parameters
Dataset split into three partitions:
- Training set: 554223 positive (triplet) samples
- Test set: 43500 (triplet) samples (100 identities)
- Validation set: 43500 (triplet) samples (100 identities)
Depending on the number of parameters of the CNN, the training time for each epoch is ~1h 30min
For each epoch a validation step is also performed, to stop the training when the validation accuracy curve starts decreasing
Training loss decreasing; validation and test accuracy still equal to zero (under investigation)

Appearance of features at each layer
Feature maps extracted at the 1st layer by the different filters to be trained (figure: Filter 1, Filter 2, Filter 3)

Appearance of features at each layer
Features of the same input image extracted at different layers of the CNN for the first filter (figure: layers 1 to 6)

Next steps
- Set a suitable number of layers/parameters to achieve state-of-the-art performance in training/testing on the CUHK-03 dataset
- Test the performance of the trained CNN against the SAIVT-BIO video dataset
- Explore the feature space and apply clustering in the metric space of the representation

References
[1] Ahmed, E., Jones, M., & Marks, T. K. (2015). An Improved Deep Learning Architecture for Person Re-Identification. CVPR 2015.
[2] Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A Unified Embedding for Face Recognition and Clustering. Retrieved from http://arxiv.org/abs/1503.03832
[5] Yi, D., Lei, Z., Liao, S., & Li, S. Z. (2014). Deep Metric Learning for Person Re-identification. 22nd International Conference on Pattern Recognition, 34-39. http://doi.org/10.1109/icpr.2014.16
[6] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. Retrieved from http://arxiv.org/abs/1512.03385
[7] Hoffer, E., & Ailon, N. (2014). Deep Metric Learning Using Triplet Network. Retrieved from http://arxiv.org/abs/1412.6622
[8] Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Retrieved from http://arxiv.org/abs/1502.03167
[9] Kingma, D., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. Retrieved from http://arxiv.org/abs/1412.6980

Thank you! Questions?