EE-002 Computational Learning & Pattern Recognition

Where or how to find me?
Turgay IBRIKCI, Çukurova University, Electrical-Electronics Engineering Department
Associate Prof. Dr. Turgay IBRIKCI, Room # 305, Thursdays 9:30-12:00
(322) 338 6868 / 139, turgayibrikci@hotmail.com

Course Outline
The course is divided into two parts: theory and practice.
1. Theory covers basic topics in pattern recognition theory and applications with computational learning.
2. Practice deals with the basics of MATLAB and the implementation of pattern recognition algorithms. We assume that you already know MATLAB or will learn it yourself.

Grading the Class:
Project 40% (report and presentation; Week 14, 20 mins)
Final Exam 20% (Week 15; we decide together)
Homeworks 40% (at least 4 homeworks)
Full attendance 10% (bonus)

In This Course
What is pattern recognition? How should objects to be classified be represented? What algorithms can be used for recognition (or matching)? How should learning (training) be done?
Much of the course concerns statistical classification methods. They include generative methods such as those based on Bayes decision theory and the related techniques of parameter estimation and density estimation. We apply the algorithms in MATLAB.

"The assignment of a physical object or event to one of several prespecified categories." -- Duda & Hart
A pattern is an object, process or event that can be given a name. A pattern class (or category) is a set of patterns sharing common attributes and usually originating from the same source. During recognition (or classification), given objects are assigned to prescribed classes. A classifier is a machine which performs classification.
What are Patterns? Examples of applications

Optical Character Recognition (OCR). Handwritten: sorting letters by postal code, input devices for PDAs. Printed texts: reading machines for blind people, digitization of text documents.
Biometrics. Face recognition, verification, retrieval. Fingerprint recognition.
Speech recognition.
Diagnostic systems. Medical diagnosis: X-ray, EKG analysis. Machine diagnostics, waste detection.
Military applications. Automated Target Recognition (ATR). Image segmentation and analysis (recognition from aerial or satellite photographs).

The laws of physics and chemistry generate patterns. Patterns in astronomy. Humans tend to see patterns everywhere. Patterns in biology. Applications: biometrics, computational anatomy, brain mapping. Patterns of brain activity: relations between brain activity, emotion, cognition, and behaviour. Variations of patterns: patterns vary with expression, lighting, occlusions.
Speech Patterns
Acoustic signals.

Goal of Pattern Recognition
Recognize patterns and make decisions about them.
Visual example: is this person happy or sad?
Speech example: did the speaker say "yes" or "no"?
Physics example: is this an atom or a molecule?
Approaches
Statistical PR: based on an underlying statistical model of patterns and pattern classes.
Structural (or syntactic) PR: pattern classes are represented by formal structures such as grammars, automata, strings, etc.
Neural networks: the classifier is represented as a network of cells modeling the neurons of the human brain (connectionist approach).

Basic concepts
Feature vector x = (x1, ..., xn): a vector of observations (measurements); x is a point in the feature space X.
Hidden state y ∈ Y: cannot be directly measured; patterns with equal hidden state belong to the same class.
Task: to design a classifier (decision rule) q: X → Y which decides about the hidden state based on an observation.

Example: jockey vs. basketball-player ("hoopster") recognition from height and weight.
The set of hidden states is Y = {H, J}; the feature space is X = R^2.
Training examples: {(x1, y1), ..., (xl, yl)}.
Linear classifier:
  q(x) = H if <w, x> + b >= 0,
         J if <w, x> + b < 0.
The decision boundary is the line <w, x> + b = 0.

Example: Salmon versus Sea Bass
Generative methods attempt to model the full appearance of salmon and sea bass. Discriminative methods extract features sufficient to make the decision (e.g., length and brightness).
Fish features: Length -- salmon are usually shorter than sea bass. Lightness -- sea bass are usually brighter than salmon.
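The linear classifier above can be sketched in a few lines. This is a minimal illustration of the rule q(x) = H if <w, x> + b >= 0, else J, for the jockey-vs-hoopster example; the weights w and bias b below are invented for demonstration, not fitted from data (the course itself uses MATLAB, but the sketch is given in Python here).

```python
def linear_classify(x, w, b):
    """Return 'H' (hoopster) if <w, x> + b >= 0, else 'J' (jockey)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "H" if score >= 0 else "J"

# Hypothetical decision rule: tall and heavy -> basketball player.
w = (1.0, 1.0)   # weights for (height in m, weight in 100 kg) -- illustrative
b = -2.5         # threshold -- illustrative

print(linear_classify((2.0, 1.0), w, b))  # tall, heavy  -> H
print(linear_classify((1.6, 0.5), w, b))  # short, light -> J
```

In practice w and b would be set by a learning algorithm from the training examples {(x1, y1), ..., (xl, yl)}.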
Components of a PR system
Pattern → Sensors and preprocessing → Feature extraction → Classifier → Class assignment, with a teacher and a learning algorithm feeding the classifier.
Sensors and preprocessing. Feature extraction aims to create discriminative features that are good for classification. A classifier. A teacher provides information about the hidden state -- supervised learning. A learning algorithm sets up the PR system from training examples.

Feature extraction
Task: to extract features which are good for classification.
Good features: objects from the same class have similar feature values; objects from different classes have different feature values.

Feature extraction methods
Feature extraction: measurements m1, ..., mk are mapped by functions φ1, ..., φn to features x1, ..., xn.
Feature selection: a subset of the measurements m1, ..., mk is chosen directly as the features x1, ..., xn.
The problem can be expressed as optimization of the parameters of a feature extractor φ(θ).
Supervised methods: the objective function is a criterion of separability (discriminability) of labeled examples, e.g., linear discriminant analysis (LDA).
Unsupervised methods: a lower-dimensional representation which preserves important characteristics of the input data is sought, e.g., principal component analysis (PCA).

Classifier
A classifier partitions the feature space X into class-labeled regions X1, ..., X|Y| such that
  X = X1 ∪ X2 ∪ ... ∪ X|Y|  and  Xi ∩ Xj = ∅ for i ≠ j.
Classification consists of determining to which region a feature vector x belongs. The borders between decision regions are called decision boundaries.

Representation of a classifier
A classifier is typically represented as a set of discriminant functions fi(x): X → R, i = 1, ..., |Y|.
The classifier assigns a feature vector x to the i-th class if fi(x) > fj(x) for all j ≠ i, i.e., y = argmax_i fi(x).

An Introduction
Bayesian Decision Theory came long before version spaces, decision tree learning and neural networks. It was studied in the field of statistical theory and, more specifically, in the field of pattern recognition.
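The discriminant-function representation y = argmax_i fi(x) can be sketched directly. The two linear discriminants below are hypothetical, chosen only to show the argmax mechanics.

```python
def classify(x, discriminants):
    """Return the index i maximizing f_i(x)."""
    scores = [f(x) for f in discriminants]
    return max(range(len(scores)), key=lambda i: scores[i])

# Two illustrative linear discriminant functions f_1, f_2.
fs = [
    lambda x: 2.0 * x[0] + x[1],   # f_1
    lambda x: x[0] + 3.0 * x[1],   # f_2
]
print(classify((1.0, 0.0), fs))  # f_1 = 2 > f_2 = 1 -> class 0
print(classify((0.0, 1.0), fs))  # f_2 = 3 > f_1 = 1 -> class 1
```

The decision boundary between the two classes is the set where f_1(x) = f_2(x).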
Bayesian Decision Theory is at the basis of important learning schemes such as the Naïve Bayes Classifier, Learning Bayesian Belief Networks and the EM Algorithm.
Bayesian decision making
Bayesian decision making is a fundamental statistical approach which allows one to design the optimal classifier if the complete statistical model is known.
Definition:
  observations X, hidden states Y, decisions D,
  a loss function W: Y × D → R,
  a decision rule q: X → D,
  a joint probability p(x, y).
Task: to design a decision rule q which minimizes the Bayesian risk
  R(q) = Σ_{x∈X} Σ_{y∈Y} p(x, y) W(q(x), y).

Bayes Theorem
Goal: to determine the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H.
Prior probability of h, P(h): reflects any background knowledge we have about the chance that h is a correct hypothesis (before having observed the data).
Prior probability of D, P(D): reflects the probability that training data D will be observed given no knowledge about which hypothesis h holds.
Conditional probability of observation D, P(D|h): denotes the probability of observing data D given some world in which hypothesis h holds.

Bayes Theorem (Cont'd)
Posterior probability of h, P(h|D): represents the probability that h holds given the observed training data D. It reflects our confidence that h holds after we have seen the training data D, and it is the quantity that machine learning researchers are interested in. Bayes Theorem allows us to compute P(h|D):

Bayesian Belief Networks
The Bayes optimal classifier is often too costly to apply. The Naïve Bayes classifier uses the conditional independence assumption to defray these costs. However, in many cases such an assumption is overly restrictive. Bayesian belief networks provide an intermediate approach which allows stating conditional independence assumptions that apply to subsets of the variables.
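The definition of the Bayesian risk can be made concrete by evaluating R(q) = Σ p(x, y) W(q(x), y) on a toy finite problem. The joint distribution, loss, and decision rule below are invented purely to illustrate the formula.

```python
def bayes_risk(p_xy, q, W):
    """R(q) = sum over (x, y) of p(x, y) * W(q(x), y)."""
    return sum(p * W(q(x), y) for (x, y), p in p_xy.items())

# Hypothetical joint distribution over X = {0, 1}, Y = {'a', 'b'}.
p_xy = {(0, "a"): 0.4, (0, "b"): 0.1, (1, "a"): 0.1, (1, "b"): 0.4}
W = lambda d, y: 0.0 if d == y else 1.0   # 0/1 loss
q = lambda x: "a" if x == 0 else "b"      # a simple decision rule

print(bayes_risk(p_xy, q, W))  # only the two off-diagonal cells incur loss
```

With the 0/1 loss, the risk is exactly the probability of misclassification, here 0.1 + 0.1 = 0.2.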
P(h|D) = P(D|h) P(h) / P(D)

Representation in Bayesian Belief Networks
Example network: Storm, Lightning, Thunder, BusTourGroup, Campfire, ForestFire.
Associated with each node is a conditional probability table, which specifies the conditional distribution of the variable given its immediate parents in the graph. Each node is asserted to be conditionally independent of its non-descendants, given its immediate parents.

Inference in Bayesian Belief Networks
A Bayesian network can be used to compute the probability distribution of any subset of network variables given the values or distributions of any subset of the remaining variables. Unfortunately, exact inference of probabilities for an arbitrary Bayesian network is known to be NP-hard. In theory, approximate techniques (such as Monte Carlo methods) can also be NP-hard, though in practice many such methods have been shown to be useful.
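Bayes' theorem can be applied numerically by computing P(D|h) P(h) for each hypothesis and normalizing by P(D) = Σ_h P(D|h) P(h). The priors and likelihoods below are hypothetical numbers for two competing hypotheses.

```python
def posterior(prior, likelihood):
    """Return P(h|D) for each h, normalizing by P(D) = sum_h P(D|h) P(h)."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    p_d = sum(joint.values())
    return {h: j / p_d for h, j in joint.items()}

prior = {"h1": 0.8, "h2": 0.2}        # P(h): background knowledge (illustrative)
likelihood = {"h1": 0.1, "h2": 0.9}   # P(D|h): how well h explains D (illustrative)

post = posterior(prior, likelihood)
print(post)  # observing D shifts belief toward h2 despite its low prior
```

Note how the data can overturn a strong prior: h2 starts at 0.2 but ends up the more probable hypothesis.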
Example of a Bayesian task
Task: minimization of the classification error. The set of decisions D is the same as the set of hidden states Y.
The 0/1 loss function is used:
  W(q(x), y) = 0 if q(x) = y,
               1 if q(x) ≠ y.
The Bayesian risk R(q) then corresponds to the probability of misclassification. The solution of the Bayesian task is
  q* = argmin_q R(q),  q*(x) = argmax_y p(y|x) = argmax_y p(x|y) p(y) / p(x).

Limitations of the Bayesian approach
The statistical model p(x, y) is mostly not known, therefore learning must be employed to estimate p(x, y) from training examples {(x1, y1), ..., (xl, yl)} -- plug-in Bayes.
Non-Bayesian methods offer further task formulations:
  Only a partial statistical model is available: p(y) is not known or does not exist, or p(x|y) is influenced by a non-random intervention.
  The loss function is not defined.
Examples: Neyman-Pearson's task, the minimax task, etc.

Discriminative approaches
Given a class of classification rules q(x; θ) parametrized by θ, the task is to find the best parameter θ* based on a set of training examples {(x1, y1), ..., (xl, yl)} -- supervised learning.
The task of learning: to recognize which classification rule is to be used. How the learning is performed is determined by a selected inductive principle.

Learning Theory
Both generative and discriminative methods require training data to learn the models/features/decision rules. Machine learning concentrates on learning discrimination rules. Key issue: do we have enough training data to learn?

Empirical risk minimization principle
The true expected risk R(q) is approximated by the empirical risk
  R_emp(q(x; θ)) = (1/l) Σ_{i=1}^{l} W(q(xi; θ), yi)
with respect to a given labeled training set {(x1, y1), ..., (xl, yl)}. Learning based on the empirical risk minimization principle is defined as
  θ* = argmin_θ R_emp(q(x; θ)).
Examples of algorithms: Perceptron, back-propagation, etc.
Problem: how rich a class of classifications q(x; θ) to use.
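The Bayes-optimal rule under 0/1 loss, q*(x) = argmax_y p(x|y) p(y) (the denominator p(x) is the same for every y and cancels), can be sketched for the salmon-vs-sea-bass example. The priors and class-conditional probabilities over a discretized length feature are invented for illustration.

```python
def bayes_classify(x, prior, cond):
    """Return argmax_y p(x|y) * p(y); the p(x) denominator cancels."""
    return max(prior, key=lambda y: cond[y].get(x, 0.0) * prior[y])

# Hypothetical model: p(y) and p(x|y) for a 'length' feature (short / long).
prior = {"salmon": 0.6, "seabass": 0.4}
cond = {
    "salmon":  {"short": 0.7, "long": 0.3},
    "seabass": {"short": 0.2, "long": 0.8},
}
print(bayes_classify("short", prior, cond))  # 0.42 vs 0.08 -> salmon
print(bayes_classify("long", prior, cond))   # 0.18 vs 0.32 -> seabass
```

This is exactly the plug-in Bayes idea: once p(x|y) and p(y) are estimated from data, classification reduces to this argmax.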
Overfitting and underfitting
Underfitting, good fit, overfitting.
Problem of generalization: a small empirical risk R_emp does not imply a small true expected risk R.
Structural risk minimization principle
Statistical learning theory -- Vapnik & Chervonenkis.
An upper bound on the expected risk of a classification rule q ∈ Q:
  R(q) <= R_emp(q) + R_str(l, h, log(1/δ)),
where l is the number of training examples, h is the VC dimension of the class of functions Q, and 1 − δ is the confidence of the upper bound.
SRM principle: from given nested function classes Q1, Q2, ..., Qm such that h1 <= h2 <= ... <= hm, select a rule q* which minimizes the upper bound on the expected risk.

Machine Learning is ...
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
"Machine learning is programming computers to optimize a performance criterion using example data or past experience." -- Ethem Alpaydin
"The goal of machine learning is to develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest." -- Kevin P. Murphy
"Machine learning is about predicting the future based on the past." -- Hal Daume III
"The field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions." -- Christopher M. Bishop

Supervised learning examples
Labeled examples (the past) are used in a training phase to build a model/predictor; in the testing phase the model predicts labels for new (future) examples.
Supervised learning: given labeled examples.
Supervised learning
Given labeled examples, learn a model/predictor that predicts the label of a new example.

Supervised learning: classification
Classification: the label comes from a finite set (e.g., apple vs. banana).
Example: differentiate between low-risk and high-risk customers from their income and savings.

Supervised learning: regression
Regression: the label is real-valued.
Example: the price of a used car. x: car attributes (e.g., mileage); y: price; model y = wx + w0.
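The regression model y = wx + w0 can be fitted by ordinary least squares, which has a simple closed form in one dimension. The mileage/price data points below are hypothetical.

```python
def fit_line(xs, ys):
    """Return (w, w0) minimizing sum_i (y_i - w*x_i - w0)^2 (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return w, my - w * mx

xs = [10, 20, 30, 40]        # mileage (thousands of km) -- illustrative
ys = [9.0, 8.1, 6.9, 6.0]    # price (thousands) -- illustrative
w, w0 = fit_line(xs, ys)
print(round(w, 3), round(w0, 3))  # negative slope: more mileage, lower price
```

The same closed form generalizes to multiple attributes via the normal equations, which MATLAB solves directly with the backslash operator.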
Regression applications
Economics/finance: predict the value of a stock. Epidemiology. Car/plane navigation: angle of the steering wheel, acceleration. Temporal trends: weather over time.

Supervised learning: ranking
Ranking: the label is a ranking.
Example: given a query and a set of web pages, rank them according to relevance.

Unsupervised learning
Unsupervised learning: given data, i.e. examples, but no labels.
Applications: learn clusters/groups without any labels; customer segmentation (i.e. grouping); image compression; bioinformatics: learning motifs.
Input: training examples {x1, ..., xl} without information about the hidden state.
Clustering: the goal is to find clusters of data sharing similar properties.
A broad class of unsupervised learning algorithms: the data {x1, ..., xl} are passed to a classifier q: X × Θ → Y which outputs {y1, ..., yl}, while a learning algorithm L sets the parameter θ (compare the supervised case, where L: (X × Y)^l → Θ learns from labeled pairs).
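Clustering can be illustrated with a minimal k-means sketch: alternately assign each point to its nearest mean and recompute each mean over its cluster. The 1-D data and initial means below are hypothetical, chosen to keep the example short.

```python
def kmeans(xs, means, iters=10):
    """Lloyd's iteration for k-means on 1-D points."""
    for _ in range(iters):
        # Assignment step: q(x) = argmin_i |x - m_i|.
        clusters = [[] for _ in means]
        for x in xs:
            i = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[i].append(x)
        # Update step: m_i = mean of cluster i (keep old mean if empty).
        means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
    return means

xs = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]   # two well-separated groups
print(kmeans(xs, [0.0, 6.0]))         # converges near the two group centers
```

Each iteration can only decrease the within-cluster sum of squared distances, so the procedure converges, though possibly to a local minimum depending on the initial means.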
Example of an unsupervised learning algorithm
k-means clustering: the classifier is
  y = q(x) = argmin_{i=1,...,k} ||x − mi||,
with parameters θ = {m1, ..., mk}. The learning algorithm updates the means as
  mi = (1/|Ii|) Σ_{j∈Ii} xj,  where Ii = {j : q(xj) = i}.
The goal is to minimize Σ_{i=1}^{l} ||xi − m_{q(xi)}||^2.

References
Books
Theodoridis, Koutroumbas: Pattern Recognition. 4th edition, 2004.
Duda, Hart: Pattern Classification and Scene Analysis. J. Wiley & Sons, New York, 1982 (2nd edition 2000).
Fukunaga: Introduction to Statistical Pattern Recognition. Academic Press, 1990.
Bishop: Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1997.
Schlesinger, Hlaváč: Ten Lectures on Statistical and Structural Pattern Recognition. Kluwer Academic Publishers, 2002.
Slides: Vojtěch Franc.
Journals
Pattern Recognition (Journal of the Pattern Recognition Society).
IEEE Transactions on Neural Networks.
Pattern Recognition and Machine Learning.