Programming Social Robots for Human Interaction
Lecture 4: Machine Learning and Pattern Recognition
Zheng-Hua Tan
Dept. of Electronic Systems, Aalborg Univ., Denmark
zt@es.aau.dk, http://kom.aau.dk/~zt
Many of the figures are provided by Chris Bishop.
Programming Social Robots 4, Zheng-Hua Tan

Course outline
1. Introduction to Robot Operating System (ROS)
2. Introduction to isociobot and NAO robot, and demos
3. Social Robots and Applications
4. Machine Learning and Pattern Recognition
5. Speech Processing I: Acquisition of Speech, Feature Extraction and Speaker Localization
6. Speech Processing II: Speaker Identification and Speech Recognition
7. Image Processing I: Image Acquisition, Pre-processing and Feature Extraction
8. Image Processing II: Face Detection and Face Recognition
9. User Modelling
10. Multimodal Human-Robot Interaction
Classification examples
- Handwritten digit recognition
- Speech recognition: "It's not easy to recognize speech." vs. "It's not easy to wreck a nice beach."

Regression example
- Polynomial curve fitting (from Bishop)
Density estimation examples (from Bishop)

General information
References:
- Pattern Recognition and Machine Learning by Bishop
- Introduction to Machine Learning by Alpaydin
For more resources, refer to http://kom.aau.dk/~zt/cources/machine_learning/Machine_learning_resources.htm
Lecture outline
- Introduction
- Machine learning: concepts, supervised learning, unsupervised learning
- Memory-based learning
- Model-based learning

What is machine learning?
- Machine: a computing device.
- Learning is acquiring and improving performance through experience; it is the acquisition and development of memories and behaviors, including skills, knowledge, understanding, values, and wisdom.
- Machine learning is programming computers to optimize a performance criterion using example data or past experience; it is concerned with the design and development of algorithms and techniques that allow computers to learn from examples or experience.
Machine learning's WHs
- When: We want computers to learn when it is too difficult or too expensive to program them directly to perform a task (e.g., spam filtering), when human expertise does not exist (e.g., navigating on Mars), when humans are unable to explain their expertise (e.g., speech recognition), or when the solution changes over time (e.g., routing on a computer network).
- What: Get the computer to learn density, discriminant, or regression functions by showing examples of inputs (and outputs).
- How: Write a parameterized program, and let the learning algorithm find the set of parameters that best approximates the desired function or behavior.

Why study machine learning?
- Build intelligent computer systems that acquire or improve knowledge from examples, adapt to users (customization, context-awareness), and discover patterns in large databases (data mining).
- Timing is good: ubiquitous computing (computers are cheap, powerful, and everywhere), progress in algorithms and theory, and abundant data.
- Study is needed to develop new algorithms and to understand which algorithms should be applied in which circumstances, primarily aiming at good generalization performance on unseen test data.
Related subjects and applications
- Statistics: statistical estimation targets the same problem as machine learning, and most learning algorithms are statistical in nature.
- Pattern recognition is machine learning in which the output of the learning machine is a set of discrete categories.
- Data mining is machine learning applied to large databases.
- Applications: speech recognition, handwriting recognition, bioinformatics, adaptive control, natural language processing, web search and text classification, fraud detection, time-series prediction, etc.

Types of machine learning
- Supervised learning: given inputs along with corresponding outputs, find the correct outputs for test inputs.
  - Classification: 1-of-N discrete output (pattern recognition)
  - Regression: real-valued output (prediction)
- Unsupervised learning: given only inputs as training data, find structure in the input space.
  - Density estimation
  - Clustering
  - Dimensionality reduction
- Reinforcement learning: given inputs from the environment, take actions that affect the environment, and produce action sequences that maximize the expected scalar reward (or minimize punishment). This is similar to animal learning.
Supervised learning: {input, output}
- Classification: assign each input to one of a finite number of discrete categories by learning a decision boundary that separates one class from the others. Two separate stages:
  - Inference stage: use training data to learn a model for the posterior p(C_k | x), via either a probabilistic generative or a discriminative model.
  - Decision stage: use these posterior probabilities to make optimal class assignments.
  Alternatively, we can solve both problems together and simply learn a discriminant function that maps inputs x directly into decisions.
- Regression: the desired output consists of one or more continuous variables; learn a continuous input-output mapping from a limited number of examples. Regression is also known as curve fitting or function approximation.

Supervised learning: {input, output} (cont.)
- How to represent the inputs and outputs?
- How to select a hypothesis space, both powerful and searchable, to represent the relationship between inputs and outputs?
Unsupervised learning: {input}
Discover the unknown structure of the inputs.
- Density estimation: determine the probability distribution of data within the input space, e.g., k-NN, histogram, kernel methods.
- Clustering: discover groups of similar examples (clumps) within the data, e.g., k-means, EM.
- Dimensionality reduction: project the data from a high-dimensional space down to low dimensions.
- Compression/quantization: discover a function that for each input computes a compact code from which the input can be reconstructed (clustering).
- Association: e.g., in retail, from customer transactions to consumer behavior: people who bought A also bought B.

Lecture outline
- Introduction
- Machine learning
- Memory-based learning
- Model-based learning
Learning is more than memorization
- Constructing a lookup table is easy: simply store all the inputs and their corresponding outputs in the training data. For a new input, compare it to all the stored samples and produce the output associated with the matching prototype.
- Problem: in general, new inputs differ from the training prototypes. The key to learning is generalization: the ability to produce correct outputs or behavior on previously unseen inputs.

Memory-based learning: a simple trick
- Compute the distances between the input and all the stored prototypes, instead of requiring an exact match.
  - 1-nearest-neighbor search: choose the class of the nearest prototype.
  - K-nearest-neighbor search: choose the class that has the majority among the K nearest prototypes.
- This is so-called lazy learning, memory-based learning, or instance-based learning; it is similar to case-based reasoning.
- Challenges:
  - What is the right similarity measure?
  - High computational cost for a large number of prototypes.
  - The curse of dimensionality and data sparsity.
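The K-nearest-neighbor rule above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance as the similarity measure; the toy data and function name are made up for the example:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training prototypes."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every prototype
    nearest = np.argsort(dists)[:k]               # indices of the k closest prototypes
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data: class 0 clustered near the origin, class 1 near (3, 3)
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [3.0, 3.1], [2.9, 3.0], [3.2, 2.8]])
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([0.1, 0.1]), k=3))  # → 0
print(knn_predict(X, y, np.array([3.0, 3.0]), k=3))  # → 1
```

Note that there is no training phase at all: the computation is deferred to query time, which is exactly why this family is called lazy learning and why it becomes expensive for large prototype sets.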
Lecture outline
- Introduction
- Machine learning
- Memory-based learning
- Model-based learning: over-fitting, bias-variance trade-off

Model-based learning
- Build a model that is a good, useful approximation to the data, or construct a general, explicit description of the target function:
  - linear vs. nonlinear
  - parametric vs. nonparametric
- Learning examples can be discarded once they are processed, so the approach is computationally efficient and efficient in memory use.
- It is limited by the chosen learning bias: the model is a coarse approximation of the target function.
Linear classifier: two classes

g(x) = w^T x + w_0

Choose C1 if g(x) > 0, otherwise choose C2. (from Alpaydin)

Regression: polynomial curve fitting, again! (from Bishop)
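The two-class decision rule is trivial to implement once w and w_0 are known. A minimal sketch, with hypothetical weights chosen so the decision boundary is the line x1 + x2 = 1:

```python
import numpy as np

def linear_classify(x, w, w0):
    """Two-class linear discriminant: choose C1 if g(x) = w.x + w0 > 0, else C2."""
    g = np.dot(w, x) + w0
    return "C1" if g > 0 else "C2"

# Illustrative (hand-picked) parameters: boundary is x1 + x2 = 1
w = np.array([1.0, 1.0])
w0 = -1.0

print(linear_classify(np.array([2.0, 0.5]), w, w0))  # → C1 (g = 1.5 > 0)
print(linear_classify(np.array([0.2, 0.3]), w, w0))  # → C2 (g = -0.5 < 0)
```

In practice w and w_0 are of course learned from data (e.g., by least squares or logistic regression) rather than hand-picked.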
Sum-of-squares error function (from Bishop)

E(w) = (1/2) Σ_{n=1}^{N} [y(x_n, w) − t_n]²

Polynomial model selection
- A high-order polynomial becomes excessively tuned to the random noise in the data!
Over-fitting (from Bishop)
- The need for a separate validation (or hold-out) set for model selection.
- Root-mean-square (RMS) error: E_RMS = sqrt(2 E(w*) / N)

Polynomial coefficients for various orders (table from Bishop)
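The curve-fitting experiment behind these slides can be reproduced in a few lines. A minimal sketch, assuming Bishop's synthetic setup: noisy samples of sin(2πx), least-squares polynomial fits of different order M, and the RMS error defined above:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(2*pi*x), the toy target in Bishop's example."""
    x = np.linspace(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w) / N), where E(w) is the sum-of-squares error."""
    E = 0.5 * np.sum((np.polyval(w, x) - t) ** 2)
    return np.sqrt(2 * E / len(x))

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

w1 = np.polyfit(x_train, t_train, 1)   # order-1 (rigid) polynomial
w9 = np.polyfit(x_train, t_train, 9)   # order-9: interpolates all 10 points

# The order-9 fit drives the training error to (near) zero,
# yet its test error is far larger: classic over-fitting.
print(rms_error(w9, x_train, t_train), rms_error(w9, x_test, t_test))
```

The held-out set is exactly the "separate validation set" the slide calls for: training error alone would always favor the largest M.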
Regularization (from Bishop)
- Penalize large coefficient values by adding a penalty term to the error function:

  Ẽ(w) = (1/2) Σ_{n=1}^{N} [y(x_n, w) − t_n]² + (λ/2) ||w||²

Regularization: comparison of different values of λ (from Bishop)
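The regularized sum-of-squares error has a closed-form minimizer, which makes the effect of λ easy to demonstrate. A minimal sketch (the function name and data are made up; λ = e^(−18) follows the value used in Bishop's figures):

```python
import numpy as np

def ridge_polyfit(x, t, M, lam):
    """Closed-form minimiser of 0.5*||Phi w - t||^2 + 0.5*lam*||w||^2,
    i.e. w = (Phi^T Phi + lam*I)^(-1) Phi^T t."""
    Phi = np.vander(x, M + 1)                 # polynomial design matrix
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 10)

w_unreg = ridge_polyfit(x, t, M=9, lam=0.0)        # plain least squares
w_reg = ridge_polyfit(x, t, M=9, lam=np.exp(-18))  # ln(lambda) = -18

# Even a tiny penalty drastically shrinks the wildly oscillating coefficients.
print(np.linalg.norm(w_unreg), np.linalg.norm(w_reg))
```

This mirrors the coefficient tables in the slides: the unregularized 9th-order fit has huge coefficients of alternating sign, while the penalty pulls them back toward zero without changing the model class.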
Polynomial coefficients for various λ: 9th-order polynomial

Data set size: 9th-order polynomial, N = 10
- For a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases.
Number of data sets
- With 100 data sets, training multiple polynomials and then averaging them, the contribution from the variance term tends to cancel, leading to improved predictions.
- The dependence of bias and variance on model complexity: variance grows with flexibility, bias with rigidity.

Bias-variance trade-off
- There is a trade-off between bias and variance: very flexible models have low bias and high variance, while relatively rigid models have high bias and low variance.
- Mean square error of the estimator d for an unknown parameter θ:

  r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance
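The decomposition r(d, θ) = Bias² + Variance can be checked numerically. A minimal sketch, using a deliberately biased estimator (a shrunken sample mean) of a made-up parameter θ = 2:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 2.0                          # true (unknown) parameter
samples = rng.normal(theta, 1.0, size=(100_000, 5))

# A deliberately biased estimator: a shrunken sample mean over N = 5 points.
# Shrinking introduces bias but reduces variance.
d = 0.8 * samples.mean(axis=1)

mse = np.mean((d - theta) ** 2)      # r(d, theta) = E[(d - theta)^2]
bias2 = (d.mean() - theta) ** 2      # (E[d] - theta)^2
var = d.var()                        # E[(d - E[d])^2]
print(mse, bias2 + var)              # equal up to floating-point rounding
```

With the empirical mean and (population-style) variance, mse = bias² + var holds as an exact algebraic identity, not just in expectation, so the two printed numbers agree to machine precision.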
Beating the bias-variance trade-off
- We can reduce the variance by averaging lots of models trained on different datasets.
- This seems silly: if we had lots of different datasets, it would be better to combine them into one big training set. (With more training data there will be much less variance.)
- Weird idea: we can create different datasets by bootstrap sampling of our single training dataset. This is called bagging, and it works surprisingly well.
- But if we have enough computation, it's better to do the right Bayesian thing: combine the predictions of many models, using the posterior probability of each parameter vector as the combination weight.

Over-fitting: a still unsolved problem! (from Bishop)
- The least-squares approach to finding the model parameters resorts to intuition; it is a specific case of maximum likelihood, and the over-fitting problem can be understood as a general property of maximum likelihood.
- A more principled approach is probability theory, the foundation for machine learning.
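The bagging idea above can be sketched directly on the polynomial-fitting example. This is a minimal illustration (the function name, data, and settings are made up); since a polynomial is linear in its coefficients, averaging the fitted coefficient vectors is equivalent to averaging the models' predictions:

```python
import numpy as np

def bagged_polyfit(x, t, M, n_models=20, rng=None):
    """Bagging: fit an order-M polynomial on each bootstrap resample of the
    training set, then average the coefficient vectors."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(x)
    ws = []
    for _ in range(n_models):
        idx = rng.integers(0, n, n)          # bootstrap sample (with replacement)
        ws.append(np.polyfit(x[idx], t[idx], M))
    return np.mean(ws, axis=0)               # averaged model

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)

w_bag = bagged_polyfit(x, t, M=3, rng=rng)
print(np.polyval(w_bag, 0.25))   # prediction near the true value sin(pi/2) = 1
```

Each bootstrap model sees a slightly different dataset, so their individual fluctuations partially cancel in the average, which is exactly the variance reduction the slide describes.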
Lecture outline (summary)
- Introduction
- Machine learning: concepts, supervised learning, unsupervised learning
- Memory-based learning
- Model-based learning: over-fitting, bias-variance trade-off