Unsupervised Learning

Peter Dayan, MIT

Appeared in Wilson, RA & Keil, F, editors, The MIT Encyclopedia of the Cognitive Sciences.

Unsupervised learning studies how systems can learn to represent particular input patterns in a way that reflects the statistical structure of the overall collection of input patterns. By contrast with SUPERVISED LEARNING or REINFORCEMENT LEARNING, there are no explicit target outputs or environmental evaluations associated with each input; rather, the unsupervised learner brings to bear prior biases as to what aspects of the structure of the input should be captured in the output.

Unsupervised learning is important since it is likely to be much more common in the brain than supervised learning. For instance, there are around 10^6 photoreceptors in each eye whose activities are constantly changing with the visual world and which provide all the information that is available to indicate what objects there are in the world, how they are presented, what the lighting conditions are, etc. Developmental and adult plasticity are critical in animal vision (see VISION AND LEARNING); indeed, structural and physiological properties of synapses in the neocortex are known to be substantially influenced by the patterns of activity that occur in sensory neurons. However, essentially none of the information about the contents of scenes is available during learning. This makes unsupervised methods essential and, equally, allows them to be used as computational models for synaptic adaptation.

The only things that unsupervised learning methods have to work with are the observed input patterns x_i, which are often assumed to be independent samples from an underlying unknown probability distribution P_I[x], and some explicit or implicit a priori information as to what is important. One key notion is that input, such as the image of a scene, has distal independent causes, such as objects at given locations illuminated by particular lighting. Since it is on those independent causes that we normally must act, the best representation for an input is in their terms.

Two classes of method have been suggested for unsupervised learning. Density estimation techniques explicitly build statistical models (such as BAYESIAN NETWORKS) of how underlying causes could create the input. Feature extraction techniques try to extract statistical regularities (or sometimes irregularities) directly from the inputs.

Unsupervised learning in general has a long and distinguished history. Some early influences were Horace Barlow (see Barlow, 1989), who sought ways of characterising neural codes; Donald MacKay (1956), who adopted a cybernetic-theoretic approach; and David Marr (1970), who made an early unsupervised learning postulate about the goal of learning in his model of the neocortex. The Hebb rule (Hebb, 1949), which links statistical methods to neurophysiological experiments on plasticity, has also cast a long shadow. In inventing the Boltzmann machine, a model of learning, Geoffrey Hinton and Terrence Sejnowski (1986) imported many of the concepts from statistics that now dominate the density estimation methods (Grenander, 1976-1981). Feature extraction methods have generally been less extensively explored.

Clustering provides a convenient example. Consider the case in which the inputs are the photoreceptor activities created by various images of an apple or an orange. In the space of all possible activities, these particular inputs form two clusters, with many fewer degrees of variation than 10^6, i.e., lower dimension. One natural task for unsupervised learning is to find and characterise these separate, low-dimensional clusters.

The larger class of unsupervised learning methods consists of maximum likelihood (ML) density estimation methods. All of these are based on building parameterised models P[x; G] (with parameters G) of the probability distribution P_I[x], where the forms of the models (and possibly prior distributions over the parameters G) are constrained by a priori information in the form of the representational goals. These are called synthetic or generative models, since, given a particular value of G, they specify how to synthesise or generate samples x from P[x; G], whose statistics should match P_I[x]. A typical model has the structure

P[x; G] = Σ_y P[y; G] P[x|y; G]

where y represents all the potential causes of the input x. The typical measure of the degree of mismatch is called the Kullback-Leibler divergence

KL[P_I[x], P[x; G]] = Σ_x P_I[x] log(P_I[x] / P[x; G]) ≥ 0,

with equality if and only if P_I[x] = P[x; G]. Given an input pattern x, the most general output of this model is the posterior, analytical, or recognition distribution P[y|x; G], which recognises which particular causes might underlie x. This analytical distribution is the statistical inverse of the synthetic distribution.

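For concreteness, the following Python sketch builds a toy generative model with this structure, using a binary cause y and a one-dimensional Gaussian P[x|y; G]: it synthesises samples by first drawing the cause and then the observation, and approximates the Kullback-Leibler divergence to a stand-in for P_I[x] on a grid. All distributions and parameter values here are illustrative assumptions rather than anything specified in the entry.

import numpy as np
rng = np.random.default_rng(0)
# Toy generative model P[x; G] = sum_y P[y; G] P[x|y; G]: a binary cause y and a
# one-dimensional Gaussian P[x|y; G]. Parameter values are illustrative assumptions.
pi = 0.5                                            # mixing proportion of the first cause
mu, sigma = np.array([-2.0, 2.0]), np.array([1.0, 1.0])
def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
def model_density(x):
    # Marginalise the hidden cause: P[x; G] = sum_y P[y; G] P[x|y; G].
    return pi * normal_pdf(x, mu[0], sigma[0]) + (1 - pi) * normal_pdf(x, mu[1], sigma[1])
def generate(n):
    # Synthesise samples: first draw the cause y, then x given y.
    y = rng.choice(2, size=n, p=[pi, 1 - pi])
    return rng.normal(mu[y], sigma[y])
def true_density(x):
    # A stand-in for the unknown P_I[x] that the model is meant to match.
    return 0.4 * normal_pdf(x, -2.5, 1.2) + 0.6 * normal_pdf(x, 1.5, 0.8)
# Grid approximation of KL[P_I[x], P[x; G]] = sum_x P_I[x] log(P_I[x] / P[x; G]) >= 0.
grid = np.linspace(-10.0, 10.0, 4001)
dx = grid[1] - grid[0]
p_true, p_model = true_density(grid), model_density(grid)
kl = float(np.sum(p_true * np.log(p_true / p_model)) * dx)
print("KL divergence (nats):", round(kl, 3))        # zero only if the densities match
print("five generated samples:", generate(5).round(2))
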
A very simple model can be used in the example of clustering (Nowlan, 1990). Consider the case in which there are two values for y (1 and 2), with P[y = 1] = π and P[y = 2] = 1 - π, where π is called a mixing proportion, and two different Gaussian distributions for the activities x of the photoreceptors depending on which y is chosen: P[x|y = 1] ~ N[μ_1; Σ_1] and P[x|y = 2] ~ N[μ_2; Σ_2], where the μ_* are means and the Σ_* are covariance matrices. Unsupervised learning of the means determines the clusters. Unsupervised learning of the mixing proportions and the covariances characterises the size and (rather coarsely) the shape of the clusters. The posterior distribution P[y = 1 | x; π, μ_*, Σ_*] reports how likely it is that a new image x was generated from the first cluster, i.e., that y = 1 is the true hidden cause. Clustering can occur with or without any supervision information about the different classes. This model is called a mixture of Gaussians.

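The mixture of Gaussians can be made concrete with a short sketch. The entry does not prescribe a fitting procedure, so the Python code below uses the standard expectation-maximisation approach to maximum likelihood: the E-step computes the posterior P[y = 1 | x; π, μ, Σ] for every input, and the M-step re-estimates the mixing proportion, means, and covariance matrices. The two-dimensional synthetic data, the initialisation, and the iteration count are illustrative assumptions.

import numpy as np
rng = np.random.default_rng(0)
# Synthetic stand-in for the two-cluster example: 2-D data in place of
# photoreceptor activities (dimensions and values are illustrative assumptions).
n = 500
x = np.vstack([rng.multivariate_normal([0.0, 0.0], 0.5 * np.eye(2), n),
               rng.multivariate_normal([3.0, 3.0], 0.5 * np.eye(2), n)])
def gauss(data, mean, c):
    # Multivariate normal density N[mean; c] evaluated at the rows of data.
    d = data.shape[1]
    diff = data - mean
    quad = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(c), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(c))
# Parameters G: mixing proportion pi, means mu, covariance matrices cov.
pi = 0.5
mu = x[rng.choice(len(x), 2, replace=False)]
cov = [np.eye(2), np.eye(2)]
for _ in range(50):
    # E-step: posterior responsibility P[y = 1 | x; G] for every input pattern.
    p1 = pi * gauss(x, mu[0], cov[0])
    p2 = (1 - pi) * gauss(x, mu[1], cov[1])
    r = p1 / (p1 + p2)
    # M-step: re-estimate the parameters from the responsibility-weighted data.
    pi = r.mean()
    mu = np.array([(r[:, None] * x).sum(0) / r.sum(),
                   ((1 - r)[:, None] * x).sum(0) / (1 - r).sum()])
    cov = [(r[:, None] * (x - mu[0])).T @ (x - mu[0]) / r.sum(),
           ((1 - r)[:, None] * (x - mu[1])).T @ (x - mu[1]) / (1 - r).sum()]
print("mixing proportion:", round(float(pi), 3))
print("cluster means:\n", mu.round(2))
# Posterior for a new input: how likely it was generated from the first cluster.
x_new = np.array([[2.8, 3.1]])
num = pi * gauss(x_new, mu[0], cov[0])
den = num + (1 - pi) * gauss(x_new, mu[1], cov[1])
print("P[y = 1 | x_new]:", round(float(num[0] / den[0]), 3))

With clusters as well separated as these, the posterior for an input near one cluster centre is close to 0 or 1, which is what makes the recognition distribution a useful representation of the hidden cause.
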
Maximum likelihood density estimation, and approximations to it, cover a very wide spectrum of the principles that have been suggested for unsupervised learning. This includes versions of the notion that the outputs should convey most of the information in the input; that they should be able to reconstruct the inputs well, perhaps subject to constraints such as being independent or sparse; and that they should report on the underlying causes of the input. Many different mechanisms apart from clustering have been suggested for each of these, including forms of Hebbian learning, the Boltzmann and Helmholtz machines, sparse coding, various other mixture models, and independent components analysis.

Density estimation is just a heuristic for learning good representations. It can be too stringent, making it necessary to build a model of all the irrelevant richness in sensory input. It can also be too lax: a look-up table that reported P_I[x] for each x might be an excellent way of modeling the distribution, but it provides no way to represent particular examples x.

The smaller class of unsupervised learning methods seeks to discover how to represent the inputs x by defining some quality that good features have, and then searching for those features in the inputs. For instance, consider the case in which the output y(x) = w · x is a linear projection of the input onto a weight vector w. The central limit theorem implies that most such linear projections will have Gaussian statistics. Therefore, if one can find weights w such that the projection has a highly non-Gaussian (for instance, multi-modal) distribution, then the output is likely to reflect some interesting aspect of the input. This is the intuition behind a statistical method called projection pursuit. It has been shown that projection pursuit can be implemented using a modified form of Hebbian learning (Intrator & Cooper, 1992). Arranging that different outputs should represent different aspects of the input turns out to be surprisingly tricky.

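This intuition can be checked numerically. The sketch below scores candidate weight vectors w by the absolute excess kurtosis of the projection y(x) = w · x, which is near zero for Gaussian-looking projections and large for the bimodal projection along the axis separating two clusters, so the highest-scoring direction should roughly align with that axis. The brute-force random search and the kurtosis index are illustrative assumptions; this is not the Hebbian implementation of Intrator & Cooper (1992).

import numpy as np
rng = np.random.default_rng(0)
# Two clusters in a 5-dimensional input space, standing in for the photoreceptor
# example (data, dimensions, and separation are illustrative assumptions).
d, n = 5, 1000
offset = np.zeros(d)
offset[0] = 2.0                                  # clusters separated along axis 0
x = np.vstack([rng.normal(-offset, 1.0, size=(n, d)),
               rng.normal(+offset, 1.0, size=(n, d))])
def excess_kurtosis(z):
    z = (z - z.mean()) / z.std()
    return np.mean(z ** 4) - 3.0                 # roughly zero for a Gaussian projection
# Crude projection-pursuit search: try many random unit weight vectors w and keep
# the projection y(x) = w . x whose distribution looks least Gaussian.
best_w, best_score = None, -np.inf
for _ in range(2000):
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    score = abs(excess_kurtosis(x @ w))          # the bimodal projection scores highest
    if score > best_score:
        best_w, best_score = w, score
print("non-Gaussianity of the best projection:", round(float(best_score), 3))
print("alignment with the between-cluster axis:", round(float(abs(best_w[0])), 3))

Because the search only examines a finite set of random directions, the recovered axis is approximate; the Hebbian formulation cited above instead adjusts w incrementally so as to optimise a related objective.
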
Projection pursuit can also execute a form of clustering in the example. Consider projecting the photoreceptor activities onto the line joining the centers of the clusters. The distribution of all activities will be bimodal, with one mode for each cluster, and therefore highly non-Gaussian. Note that this single projection does not characterise well the nature or shape of the clusters.

Another example of a heuristic underlying good features is that causes are often somewhat global. For instance, consider the visual input from an object observed in depth. Different parts of the object may share few features, except that they are at the same depth, i.e., one aspect of the disparity in the information from the two eyes at the separate locations is similar. This is the global underlying feature. By maximising the mutual information between outputs y_a and y'_a that are calculated on the basis of the separate inputs, one can find this disparity. This technique was invented by Becker & Hinton (1992) and is called IMAX.

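A heavily simplified sketch can illustrate the flavour of IMAX. For jointly Gaussian outputs, the mutual information is I = -1/2 log(1 - ρ²), so maximising the correlation ρ between the two outputs maximises their mutual information; the code below therefore finds, in closed form via whitening and a singular value decomposition (i.e., canonical correlation), the pair of linear projections of two synthetic "eye" patches whose outputs are most correlated, and checks that they pick up the shared underlying signal. The patch construction and the correlation-based shortcut are illustrative assumptions, not Becker & Hinton's (1992) actual algorithm.

import numpy as np
rng = np.random.default_rng(1)
# Two small input "patches" x_a and x_b, each a noisy linear mixture of a shared
# underlying signal s (a stand-in for disparity) plus patch-specific noise. The
# construction and dimensions are illustrative assumptions.
n, d = 5000, 8
s = rng.normal(size=n)
x_a = np.outer(s, rng.normal(size=d)) + 0.5 * rng.normal(size=(n, d))
x_b = np.outer(s, rng.normal(size=d)) + 0.5 * rng.normal(size=(n, d))
def whiten(x):
    x = x - x.mean(axis=0)
    vals, vecs = np.linalg.eigh(x.T @ x / len(x))
    w = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return x @ w
za = whiten(x_a)
zb = whiten(x_b)
# First canonical pair: the linear outputs y_a, y_b of the two patches that are
# maximally correlated. Under a Gaussian assumption this also maximises their
# mutual information, I = -0.5 * log(1 - rho^2).
u, _, vt = np.linalg.svd(za.T @ zb / n)
y_a = za @ u[:, 0]
y_b = zb @ vt[0, :]
rho = np.corrcoef(y_a, y_b)[0, 1]
print("correlation between the two outputs:", round(float(rho), 3))
print("Gaussian mutual-information estimate (nats):", round(float(-0.5 * np.log(1 - rho ** 2)), 3))
print("|correlation of y_a with the shared signal|:", round(float(abs(np.corrcoef(y_a, s)[0, 1])), 3))
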
References

Barlow, HB (1989). Unsupervised learning. Neural Computation, 1, 295-311.

Becker, S & Hinton, GE (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355, 161-163.

Grenander, U (1976-1981). Lectures in Pattern Theory I, II and III: Pattern Analysis, Pattern Synthesis and Regular Structures. Berlin: Springer-Verlag.

Hebb, DO (1949). The Organization of Behavior. New York: Wiley.

Hinton, GE & Sejnowski, TJ (1986). Learning and relearning in Boltzmann machines. In DE Rumelhart, JL McClelland and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. Cambridge, MA: MIT Press, 282-317.

Intrator, N & Cooper, LN (1992). Objective function formulation of the BCM theory of visual cortical plasticity: statistical connections, stability conditions. Neural Networks, 5, 3-17.

MacKay, DM (1956). The epistemological problem for automata. In CE Shannon & J McCarthy, editors, Automata Studies. Princeton, NJ: Princeton University Press, 235-251.

Marr, D (1970). A theory for cerebral neocortex. Proceedings of the Royal Society of London, Series B, 176, 161-234.

Nowlan, SJ (1990). Maximum likelihood competitive learning. In DS Touretzky, editor, Advances in Neural Information Processing Systems, 2. San Mateo, CA: Morgan Kaufmann.

Further Readings

Becker, S & Plumbley, M (1996). Unsupervised neural network learning procedures for feature extraction and classification. International Journal of Applied Intelligence, 6, 185-203.

Dayan, P, Hinton, GE, Neal, RM & Zemel, RS (1995). The Helmholtz machine. Neural Computation, 7, 889-904.

Hinton, GE (1989a). Connectionist learning procedures. Artificial Intelligence, 40, 185-234.

Linsker, R (1988). Self-organization in a perceptual network. Computer, 21, 105-128.

Mumford, D (1994). Neuronal architectures for pattern-theoretic problems. In C Koch and J Davis, editors, Large-Scale Theories of the Cortex. Cambridge, MA: MIT Press, 125-152.

Olshausen, BA & Field, DJ (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607-609.