UNSUPERVISED LEARNING OF INVARIANT REPRESENTATIONS OF FACES THROUGH TEMPORAL ASSOCIATION


In: J. M. Bower (Ed.), Computational Neuroscience: Trends in Research 1995. San Diego, CA: Academic Press, 317-322 (1996).

Marian Stewart Bartlett, Terrence J. Sejnowski
Departments of Cognitive Science and Psychology, UCSD; Howard Hughes Medical Institute; The Salk Institute, La Jolla, CA 92037

The appearance of an object or a face changes continuously as the observer moves through the environment or as a face changes expression or pose. Recognizing an object or a face despite these image changes is a challenging problem for computer vision systems, yet we perform the task quickly and easily. This simulation investigates the ability of an unsupervised learning mechanism to acquire representations that are tolerant to such changes in the image. The learning mechanism finds these representations by capturing temporal relationships between 2-D patterns. Previous models of temporal association learning have used idealized input representations. The input to this model consists of graylevel images of faces. A two-layer network learned face representations that incorporated changes of pose up to ±30°. A second network learned representations that were independent of facial expression.

One of the greatest challenges in visual recognition of objects or faces is that the projected image can vary substantially with changes in viewing conditions. In normal visual experience, however, these different views tend to appear in close temporal proximity. Unsupervised learning can find invariant representations by capitalizing on this dynamic information. Capturing the temporal relationships among patterns is a way to automatically associate different views of an object without requiring complex geometrical transformations or three-dimensional structural descriptions [1]. Temporal association may be a fundamental component of visual processing in the temporal lobe.
Cells in the anterior inferior temporal lobe will adjust their receptive fields so that they respond to temporally contiguous inputs [2]. A temporal window for Hebbian learning could be provided by the long open-time of

the NMDA channel [3], a hysteresis in neural activity caused by reciprocal connections between cortical regions [4], or the release of a chemical signal following activity, such as nitric oxide [5]. This simulation investigates the capability of such Hebbian learning mechanisms to acquire transformation-invariant representations of complex objects such as faces. These mechanisms have previously been tested with idealized input representations with little or no crosstalk on the connections [6, 7, 4]. In order to understand the capabilities of temporal association learning, it is important to evaluate it using complex, realistic stimuli.

We tested the temporal association learning mechanism on a very simple architecture (Figure 1). We used a feed-forward network with two layers of units. There were 400 input units and ten output (representation) units. The input layer was fully connected to the output layer, and there was winner-take-all competition in the output layer. We used a linear transfer function, and the total weight coming into each output unit was constrained to sum to one. At each time step t, the network took one 20 x 20 graylevel image as input.

Figure 1: Network architecture. Images are shown at the resolution used in the simulations.

The weight update rule is based on the Competitive Learning Rule [8, 9]. Let α be the learning rate, x_ik be the value of input unit i for pattern k, and t_k be the total amount of input activation for pattern k. The weight update rule is

    Δw_ij = α (x_ik / t_k − w_ij)   if output unit j wins on pattern k,
    Δw_ij = 0                       otherwise.
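For concreteness, one winner-take-all step with this update rule could look like the following NumPy sketch. The function name, learning rate, and random initialization are our own assumptions, not taken from the paper.

```python
import numpy as np

def competitive_update(W, x, alpha=0.1):
    """One step of the Competitive Learning Rule [8, 9]: the most
    active output unit wins and moves its weights toward the
    normalized input pattern. W has shape (n_outputs, n_inputs);
    each row is constrained to sum to one, as in the paper."""
    y = W @ x                      # linear activations
    winner = int(np.argmax(y))     # winner-take-all competition
    t_k = x.sum()                  # total input activation t_k
    # Delta w_ij = alpha * (x_ik / t_k - w_ij), winning unit only
    W[winner] += alpha * (x / t_k - W[winner])
    return winner

# Toy usage: 400 inputs (a 20 x 20 image) and ten output units.
rng = np.random.default_rng(0)
W = rng.random((10, 400))
W /= W.sum(axis=1, keepdims=True)  # start with rows summing to one
x = rng.random(400)
competitive_update(W, x)           # rows still sum to one afterwards
```

Note that the update preserves the row-sum constraint exactly: the winner's new row sums to (1 − α)·1 + α·(t_k / t_k) = 1.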

We introduce a temporal manipulation into the competition phase. Let y_j be the activation of output unit j, computed as a weighted sum of the inputs. Following Foldiak [6], the winning unit at time t is determined by the trace of the activation:

    winner = max_j [ȳ_j(t)],  where  ȳ_j(t) = (1 − λ) y_j(t) + λ ȳ_j(t−1).

The Competitive Learning Rule alone, without the temporal manipulation, partitions the set of inputs into roughly equal groups by spatial similarity. The resulting weights to each output unit are proportional to the probability that a given input unit is active when that unit wins [8]. The temporal manipulation allows temporal association to influence these partitions: the winning unit in the current time step has a competitive advantage for recruiting the pattern in the next time step. This learning rule therefore partitions the input by a combination of spatial similarity and temporal proximity, where λ determines the relative influence of the two factors.

We first tested the ability of this learning algorithm to develop representations of faces that were independent of pose. The inputs were graylevel face images provided by David Beymer at the MIT Media Lab [10]. We used images of ten individuals at each of five angles of view (0°, ±15°, and ±30°), for a total of fifty stimuli (Figure 2). A single window based on the eye and mouth positions in the frontal view was used for cropping and scaling the other images in each sequence. The faces were reduced to 20 x 20 pixels, producing a 400-dimensional input vector, and each image was normalized for luminance. The learning was performed in two stages. In the first stage, the network was exposed only to the ten frontal-view faces in order to associate each face with a different output unit. Once this initial correspondence was established, the training set was slowly expanded to include variations in pose. Images were presented in sequence, beginning with the frontal view. The trace function was reset between sequences.
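The trace-based competition and the per-sequence training loop might be sketched as follows. This is a minimal NumPy sketch under our own naming and parameter choices (e.g. λ = 0.5, α = 0.1); it omits the receptive-field adjustment for units that fail to win.

```python
import numpy as np

def trace_winner(W, x, y_trace, lam=0.5):
    """Select the winner on the trace of the activation, after
    Foldiak [6]: y_trace(t) = (1 - lam)*y(t) + lam*y_trace(t-1).
    The unit that won on the previous frame therefore carries a
    competitive advantage into the current frame."""
    y = W @ x                                 # weighted sum of inputs
    y_trace = (1.0 - lam) * y + lam * y_trace
    return int(np.argmax(y_trace)), y_trace

def train_sequence(W, seq, alpha=0.1, lam=0.5):
    """Train on one temporal sequence of image vectors. The trace
    is reset (here, to zero) between sequences, as in the paper."""
    y_trace = np.zeros(W.shape[0])
    for x in seq:
        winner, y_trace = trace_winner(W, x, y_trace, lam)
        # Competitive Learning update for the winning unit only;
        # it keeps each row of W summing to one.
        W[winner] += alpha * (x / x.sum() - W[winner])
    return W

# Toy usage: one five-frame sequence of random "images".
rng = np.random.default_rng(0)
W = np.full((10, 400), 1.0 / 400)     # uniform rows summing to one
seq = [rng.random(400) for _ in range(5)]
W = train_sequence(W, seq)
```

Because λ weighs the previous trace against the current activation, it directly controls the relative influence of temporal proximity versus spatial similarity in the resulting partition.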
The network stabilized after 100 training epochs. Network classification was assessed by presenting each image individually, without the activation trace, and recording the output unit with the highest activation. The network response was considered "correct" if the winning output unit was the one corresponding to the frontal view of that subject.¹

Figure 2: Sample pose sequences. The example set contained ten subjects.

Figure 3a compares invariance to pose after temporal association learning (dashed line) to baseline performance (solid line), in which the network was trained on the frontal views only. Mean correct classification at each pose is shown, collapsed over the ten subjects in the data set. The graph is analogous to a mean tuning curve for pose. Temporal association (TA) learning improved the mean classification accuracy of the ±15° views from 65% to 90% and increased invariance to the ±30° views from 55% to 90%. Performance for the ±15° views initially reached 100%, but fell to 95% when the ±30° views were added, indicating the beginnings of interference between the patterns. To test for interpolation between and extrapolation beyond the set of training views, we retrained the network, this time reserving some of the poses as test images. Figure 3b shows an increase in accuracy for the ±30° views following training on the 0° and ±15° views only, and an increase in accuracy for the ±15° views when the network was trained on the 0° and ±30° views only.

¹ Representation units that failed to win for two iterations were given a competitive advantage by slightly increasing the value of their trace. This adjustment is equivalent to enlarging the receptive field of that unit [7].

Figure 3: a. Mean tuning curves for pose at baseline ("No TA Learning", solid line) and after temporal association learning ("After TA Learning", dashed line). b. Interpolation and extrapolation.
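The classification test described above (present each image alone, with no activation trace, and score the response against the unit assigned to that subject's frontal view) might be sketched as follows; the label convention is our own assumption.

```python
import numpy as np

def classification_accuracy(W, images, labels):
    """Present each image individually, without the activation
    trace, and count the response correct when the most active
    output unit is the one corresponding to that subject's
    frontal view. labels[k] holds that unit's index (our own
    convention for recording the identity/unit correspondence)."""
    correct = sum(
        int(np.argmax(W @ x)) == label
        for x, label in zip(images, labels)
    )
    return correct / len(images)

# Toy check: with one-hot weights and one-hot inputs, every
# image is classified correctly.
W = np.eye(10, 400)
images = [np.eye(400)[k] for k in range(10)]
labels = list(range(10))
classification_accuracy(W, images, labels)  # -> 1.0
```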

The learning rule was also tested for learning representations invariant to facial expression. Changes in facial expression pose a particular challenge to recognition systems, as they produce a non-rigid deformation of the image. The input to the network consisted of ten faces in six sequential stages of an expression, from low to full muscle-contraction intensity (Figure 4). If the images included hair, Competitive Learning alone correctly classified all of the images; the task was therefore made more difficult by cropping out the hairline. Training was performed in two phases, as above.

Figure 4: Sample facial expression sequences. Images provided by Paul Ekman and Joe Hager at the Human Interaction Laboratory, UCSF.

Figure 5a shows the increase in invariance to facial expression due to temporal association learning. These results follow exposure to sequences of length 3. The addition of frame 4 to the training sequence caused the network to destabilize, revealing the limits of the range of invariance that this learning method can achieve on this kind of dataset. The network also showed interpolation and extrapolation between trained expression intensities following training on frames 1 and 3 alone (Figure 5b).

Figure 5: a. Invariance to changes in facial expression before ("No TA Learning") and after temporal association learning on frames 1-3. b. Interpolation and extrapolation (legend: No TA Learning; Training Set: 1,2; Training Set: 1,3).

SUMMARY AND CONCLUSIONS

By associating patterns by temporal proximity, our system developed representations of faces with a degree of invariance to changes in pose or facial expression. This simulation demonstrates that unsupervised learning can solve a challenging problem in object recognition, and provides another example of how problems in image understanding can be simplified by taking advantage of dynamic information, an idea that has been espoused by the "active vision" approach to computer vision. The extent of invariance and the number of subjects that this system can tolerate is limited by the redundancy in the input representation. If there were no redundancy in the input, there would be no limit to the amount of invariance that the system could learn. This points to the importance of intermediate representations with reduced input redundancy, such as principal components or sparse distributed representations [11]. Larger invariances can also be obtained in a hierarchical system that learns new invariances at each level of the hierarchy.

Acknowledgments

This research was supported by Lawrence Livermore National Laboratory, Intra-University Agreement B291436, and the Howard Hughes Medical Institute.

REFERENCES

1. Stryker, M. 1991. Temporal associations. Nature 354(14):108-109.
2. Miyashita, Y. 1988. Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature 335(27):817-820.
3. Rhodes, P. 1992. The long open time of the NMDA channel facilitates the self-organization of invariant object responses in cortex. Society for Neuroscience Abstracts 18:740.
4. O'Reilly, R. & Johnson, M. 1994. Object recognition and sensitive periods: A computational analysis of visual imprinting. Neural Computation 6:357-389.
5. Montague, R., Gally, J., & Edelman, G. 1991. Spatial signaling in the development and function of neural connections. Cerebral Cortex 1:199-220.
6. Foldiak, P. 1991. Learning invariance from transformation sequences. Neural Computation 3:194-200.
7. Weinshall, D., Edelman, S., & Bulthoff, H. 1990. A self-organizing multiple view representation of 3D objects. In Advances in Neural Information Processing Systems 2, D. Touretzky, ed. Cambridge, MA: MIT Press, 274-281.
8. Rumelhart, D. & Zipser, D. 1985. Feature discovery by competitive learning. Cognitive Science 9:75-112.
9. Grossberg, S. 1976. Adaptive pattern classification and universal recoding: Part 1. Parallel development and coding of neural feature detectors. Biological Cybernetics 23:121-134.
10. Beymer, D. 1994. Face recognition under varying pose. In Proceedings of the 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA: IEEE Computer Society Press, 756-761.
11. Field, D. 1994. What is the goal of sensory coding? Neural Computation 6(4):559-601.