Welcome to CMPS 142 and 242: Machine Learning
- Instructor: David Helmbold, dph@soe.ucsc.edu
- Office hours: Monday 1:30-2:30, Thursday 4:15-5:00
- TA: Aaron Michelony, amichelo@soe.ucsc.edu
- Web page: www.soe.ucsc.edu/classes/cmps242/fall13/01
- Text: Pattern Recognition and Machine Learning, by Bishop
Administrivia
- Sign up sheet (enrollment)
- Evaluation:
  - Group homework 20%
  - Late midterm exam 40%
  - Projects (group) 40%
  - Must pass both exam and project
- Expectations/Style
- Reading assignments
- Attendance/participation
- My hearing/writing
- Academic honesty
- Topics:
  - Introduction
  - Bayesian learning and parameter estimation
  - Instance based methods
  - Linear Regression
  - Linear Classification
  - Decision Trees and Neural networks
  - Graphical Models
  - Support Vector Machines
  - Clustering, EM Algorithm
  - Boosting (AdaBoost)
  - On-line prediction
  - Reinforcement learning
Lecture Slides for
INTRODUCTION TO Machine Learning
ETHEM ALPAYDIN
The MIT Press, 2004 (modified by DPH 2006-2011)
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml
CHAPTER 1: Introduction
Why Learn?
- Machine learning is programming computers to optimize a performance criterion using example data or past experience (inference in statistics)
- There is no need to learn to calculate payroll
- Learning is used when:
  - Human expertise does not exist (navigating on Mars)
  - Humans are unable to explain their expertise (speech recognition, object detection)
  - Solution changes in time (routing on a computer network)
  - Solution needs to be adapted or customized to particular cases (or users)
What We Talk About When We Talk About Learning
- Learning general models from a set of particular examples
- Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
- Example in retail: from customer transactions to consumer behavior:
  People who bought Da Vinci Code also bought The Five People You Meet in Heaven (www.amazon.com)
- Build a model that is a good and useful approximation to the data.
What is Machine Learning?
- Optimize a performance criterion using example data or past experience.
- Role of Statistics: inference from a sample
- Role of Computer science: efficient algorithms to
  - Solve the optimization problem
  - Represent and evaluate the model for inference
Statistical machine learning is not:
- Cognitive science (how people think/learn)
- Teaching computers to think
But it is related to:
- Statistics
- Data Mining - KDD
- Control theory
- It is part of AI, but not traditional AI
Supervised Batch Learning
- Assume (unknown) distribution over things
- Things have measurable attributes or features
- Get instances x by drawing things from the distribution and recording observations.
- Teacher labels instances, making examples (x, y) or (x, t) (Bishop)
- Set of labeled examples is the training set or sample
- Create hypothesis (rule) from sample
- Hypothesis predicts on new random instances, evaluated using a loss function
Supervised Learning Framework
[Figure: learning and prediction]
Supervised Learning (cont.)
- Classification: labels are nominal (unordered set, e.g. {ham, spam}, {democrat, republican, indep.})
  - Binary Classification
- Regression: labels are numeric (e.g. price of used car)
- Sometimes predictions are probabilities
Examples

Thing            Observations               Prediction
Written digit    Pixel array                Which digit?
Email message    Words, subject, sender     Ham or spam?
Customer         Recent purchases           Interest level in a new product
Used car         Year, make, mpg, options   Price or value
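Regression
- Example: price of a used car
  - x: car attributes
  - t: price
- Assume $t = g(x \mid \theta)$
  - $g(\cdot)$: model (e.g. linear)
  - $\theta$: parameters $(w, w_0)$
- Linear model: $y(x) = wx + w_0$
[Figure: t versus x, with the line y(x) = wx + w_0]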
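As a minimal sketch of the linear model y(x) = wx + w_0, the following fits w and w_0 by ordinary least squares. The least-squares criterion, the function name `fit_line`, and the car ages and prices are illustrative assumptions; the slide does not say how the parameters are chosen.

```python
def fit_line(xs, ts):
    """Return (w, w0) minimizing the sum of squared errors of y(x) = w*x + w0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    # Slope is the covariance of (x, t) divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    w = sxt / sxx
    # Intercept makes the line pass through the point of means.
    w0 = mean_t - w * mean_x
    return w, w0


# Hypothetical data: car age in years -> price in thousands of dollars.
ages = [1, 2, 3, 4, 5]
prices = [18.0, 16.0, 14.0, 12.0, 10.0]
w, w0 = fit_line(ages, prices)
print(w, w0)
```

With this made-up data the points lie exactly on a line, so the fit recovers a slope of -2 and an intercept of 20.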
Batch Assumption: iid Examples
- Distribution of things and measurements defines some unknown (but fixed) $P(x,t)$ over domain-label pairs
- Find a hypothesis h that is close to the truth
- A loss function $L(t, t')$ measures error of predictions; often $L(t, t') = 0$ if $t = t'$ and $L(t, t') = 1$ otherwise (classification)
- Want to minimize $\sum_{(x,t)} P(x,t)\, L(t, h(x))$ (the probability of error for 0-1 loss)
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)
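Since P(x,t) is unknown, the expected loss can only be estimated from a sample. The sketch below computes the average 0-1 loss of a hypothesis on a labeled sample; the hypothesis `h`, the link-count feature, and the sample itself are hypothetical and serve only as illustration.

```python
def zero_one_loss(t, t_pred):
    """0-1 loss: 0 when the prediction matches the label, 1 otherwise."""
    return 0 if t == t_pred else 1


def empirical_error(h, examples, loss=zero_one_loss):
    """Average loss of hypothesis h over a list of (x, t) examples."""
    return sum(loss(t, h(x)) for x, t in examples) / len(examples)


def h(x):
    """Hypothetical rule: call a message spam if it contains more than 2 links."""
    return "spam" if x > 2 else "ham"


# Hypothetical sample of (number of links, label) pairs.
sample = [(0, "ham"), (1, "ham"), (3, "spam"), (5, "spam"), (4, "ham")]
err = empirical_error(h, sample)
print(err)
```

Here the rule misclassifies one of the five examples, so the estimated probability of error is 0.2.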
Supervised Learning: Uses
- Prediction of future cases: use the rule to predict the output for future inputs
- Knowledge extraction: the rule is easy to understand
- Compression: the rule is simpler than the data it explains
- Outlier detection: exceptions that are not covered by the rule, e.g., fraud and data entry errors
Can we Generalize?
- Learning is an ill-posed problem: if we assume nothing else, any label t could be likely for an unseen x
- Need an inductive bias limiting possible $P(x,t)$
- Often assume some kind of simplicity (e.g. linearity) based on domain knowledge
- Bayesian approach: put a prior on rules, and balance the prior with evidence (data)
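The Bayesian balance between prior and evidence can be sketched for a finite set of rules using Bayes' rule, where the posterior is proportional to prior times likelihood. The two rule names and all probabilities below are made-up numbers chosen only for illustration.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of rules: P(h | data) is proportional to P(h) * P(data | h)."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    z = sum(unnormalized.values())  # normalizing constant P(data)
    return {h: p / z for h, p in unnormalized.items()}


# Hypothetical: the prior favors the simple rule, the data favor the complex one.
priors = {"simple": 0.8, "complex": 0.2}
likelihoods = {"simple": 0.1, "complex": 0.4}  # P(data | rule)
post = posterior(priors, likelihoods)
print(post)
```

In this example the prior preference for the simple rule and the data's preference for the complex rule cancel, leaving each rule with posterior probability 0.5.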
Noise
- Data not always perfect
- Unmeasured features
- Attribute noise (random or systemic)
- Label noise (random or systemic)
- Noise associated with inductive bias errors
Overfitting and Underfitting
- Overfitting happens when the hypothesis is too complex for the truth.
- Underfitting happens when the hypothesis is too simple.
[Figure: Bishop fig 1.4]
Sup. Learning as parameter estimation
- Model (hypothesis set or class): $h_\theta(x)$
- Error/Loss function (empirical error plus regularization):
  $E_\theta = \sum_{n=1}^{N} L(t_n, h_\theta(x_n)) + \mathrm{regularization}(\theta)$
- Optimization procedure: $\hat{\theta} = \arg\min_\theta E_\theta$
- Regularization penalizes complex $\theta$
- Model choice + regularization = inductive bias!
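One concrete instance of the regularized error, as a sketch under stated assumptions: a one-parameter model h_w(x) = w*x, squared loss, and an L2 penalty lam * w^2. None of these specific choices come from the slide. For this particular case the argmin has a closed form, obtained by setting the derivative of E_w to zero.

```python
def regularized_error(w, xs, ts, lam):
    """E_w = sum_n (t_n - w*x_n)^2 + lam * w^2 (squared loss plus L2 penalty)."""
    data_term = sum((t - w * x) ** 2 for x, t in zip(xs, ts))
    return data_term + lam * w ** 2


def argmin_w(xs, ts, lam):
    """Closed-form minimizer of E_w: w = sum(x*t) / (sum(x^2) + lam)."""
    return sum(x * t for x, t in zip(xs, ts)) / (sum(x * x for x in xs) + lam)


# Hypothetical data generated by t = 2x with no noise.
xs = [1.0, 2.0, 3.0]
ts = [2.0, 4.0, 6.0]
w_unreg = argmin_w(xs, ts, lam=0.0)
w_reg = argmin_w(xs, ts, lam=14.0)
print(w_unreg, w_reg)
```

Without regularization the fit recovers w = 2; with lam = 14 the penalty shrinks the estimate to w = 1, illustrating how regularization pulls the parameter toward a "simpler" value.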
Don't rely on training error!
- To estimate generalization error, we need data unseen during training.
- Often data is split into:
  - Training set (50%)
  - Validation set (25%) (did training work? Use for parameter selection/model complexity)
  - Final test (publication) set (25%)
- Resampling when there are few examples: cross validation (describe)
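A minimal sketch of the 50/25/25 split and of k-fold cross validation, in which the data are partitioned into k folds, each fold is held out once while the rest are used for training, and the held-out errors are averaged. The function names, the fixed seed, and the interleaved fold assignment are assumptions made for this illustration.

```python
import random


def split_50_25_25(examples, seed=0):
    """Shuffle, then split into training (50%), validation (25%), and test (25%) sets."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = len(items) // 2
    n_val = len(items) // 4
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder, about 25%
    return train, val, test


def k_fold_indices(n, k):
    """Yield (train_idx, held_out_idx) pairs; each index is held out exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, held_out


train, val, test = split_50_25_25(range(20))
print(len(train), len(val), len(test))
for train_idx, held_out_idx in k_fold_indices(6, 3):
    print(train_idx, held_out_idx)
```

In practice the indices would usually be shuffled before being assigned to folds; that step is omitted here to keep the sketch short.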
Other kinds of supervised learning
- Reinforcement learning: learning a policy for influencing or reacting to the environment
  - No supervised output, but delayed rewards
  - Credit assignment problem
  - Game playing/robot in a maze, etc.
- On-line learning: predict on each instance in turn
- Semi-supervised learning: uses both labeled and unlabeled data
- Active learning: request labels for particular instances
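The on-line protocol (predict on each instance in turn, then see its label) can be sketched as a loop that counts mistakes. The learner below is a deliberately trivial, hypothetical one that ignores x and predicts the most common label seen so far; it stands in for whatever on-line algorithm is actually used.

```python
from collections import Counter


def online_majority(stream, default):
    """Run the on-line protocol and return the number of prediction mistakes."""
    counts = Counter()
    mistakes = 0
    for x, t in stream:
        # 1. Predict before the label is revealed (this toy learner ignores x).
        prediction = counts.most_common(1)[0][0] if counts else default
        # 2. The true label t is revealed; record whether the prediction was wrong.
        if prediction != t:
            mistakes += 1
        # 3. Update the learner's state with the revealed label.
        counts[t] += 1
    return mistakes


# Hypothetical stream of (instance, label) pairs.
stream = [(0, "ham"), (1, "ham"), (2, "spam"), (3, "ham"), (4, "spam")]
mistakes = online_majority(stream, default="ham")
print(mistakes)
```

Unlike the batch setting, there is no separate training phase: every example is first used to test the current hypothesis and then to update it.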
Unsupervised Learning
- Learning what normally happens
- No labels
- Clustering: grouping similar instances
- Example applications:
  - Segmentation in customer relationship mgmt
  - Image compression: color quantization
  - Bioinformatics: learning motifs
  - Identifying unusual airplane landings
- Deep learning: learn the features
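As one illustration of clustering, here is a sketch of k-means on one-dimensional values, in the spirit of quantizing gray levels to a small palette. The slide does not name a clustering algorithm; k-means, the function name, the initial centers, and the gray levels are all assumptions made for this example.

```python
def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and re-averaging."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        # Assignment step: each value joins the cluster of its nearest center.
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers


# Hypothetical gray levels from an image with a dark and a bright region.
gray_levels = [10, 12, 14, 200, 202, 204]
palette = k_means_1d(gray_levels, centers=[0, 255])
print(palette)
```

Replacing every pixel by its nearest palette entry would then compress the image to two gray levels, with no labels involved at any point.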
Resources: Datasets
- UCI Repository: http://www.ics.uci.edu/~mlearn/mlrepository.html
- UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html
- Statlib: http://lib.stat.cmu.edu/
- Delve: http://www.cs.utoronto.ca/~delve/
- MLcomp: http://mlcomp.org
Resources: Journals
- Journal of Machine Learning Research (www.jmlr.org)
- Machine Learning
- Neural Computation
- Neural Networks
- IEEE Transactions on Neural Networks
- IEEE Transactions on Pattern Analysis and Machine Intelligence
- Annals of Statistics
- Journal of the American Statistical Association
- ...
Resources: Conferences
- International Conference on Machine Learning (ICML)
- Neural Information Processing Systems (NIPS)
- Uncertainty in Artificial Intelligence (UAI)
- Computational Learning Theory (COLT)
- European Conference on Machine Learning (ECML)
- Knowledge Discovery and Data Mining (KDD)
- International Joint Conference on Artificial Intelligence (IJCAI)
- International Conference on Artificial Neural Networks (ICANN)
- ...