Lecture Slides for INTRODUCTION TO Machine Learning 2nd Edition, ETHEM ALPAYDIN, The MIT Press, 2010; modified by Leonardo Bobadilla, with some parts from http://www.cs.tau.ac.il/~apartzin/machinelearning/
alpaydin@boun.edu.tr http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
CHAPTER 2: Supervised Learning
Outline
Last class: Ch 2 Supervised Learning (Sec 2.1-2.4): Learning a class from examples, VC dimension, PAC learning, Noise
This class: Learning multiple classes, Regression, Model selection and generalization, Dimensions of a supervised learning algorithm
Multiple Classes
General case: K classes, e.g. family, sport, and luxury cars
Classes can overlap
Can use a different or the same hypothesis class for each class
What if an input falls into two classes, or into none? Sometimes it is worth rejecting (doubt)
Multiple Classes, C_i, i = 1,...,K
Training set: X = {x^t, r^t}, t = 1,...,N, where
  r_i^t = 1 if x^t ∈ C_i, and r_i^t = 0 if x^t ∈ C_j, j ≠ i
Train K hypotheses h_i(x), i = 1,...,K, such that
  h_i(x^t) = 1 if x^t ∈ C_i, and h_i(x^t) = 0 if x^t ∈ C_j, j ≠ i
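As a concrete illustration (not part of the original slides), a minimal Python sketch of the one-per-class target encoding above; the class labels and data are hypothetical, and any binary classifier could stand in for each h_i:

```python
import numpy as np

def one_per_class_targets(y, K):
    """Turn class labels y^t in {0,...,K-1} into K binary targets:
    r_i^t = 1 if x^t belongs to C_i, else 0."""
    R = np.zeros((len(y), K), dtype=int)
    R[np.arange(len(y)), y] = 1
    return R

y = np.array([0, 2, 1, 0])          # e.g. family, luxury, sport, family
print(one_per_class_targets(y, K=3))
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]
#  [1 0 0]]
```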
Regression
Output is not a Boolean (yes/no) value or a label but a numeric value
Training set of examples X = {x^t, r^t}, with r^t ∈ ℝ
Interpolation: fit a function (e.g. a polynomial) through noise-free points
Extrapolation: predict the output for an x outside the range of the training data
Regression: noise is added, r^t = f(x^t) + ε
Assumption: the noise models hidden variables we cannot observe
Approximate the output by the model g(x)
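A small sketch of this assumption (not from the slides): an assumed hidden function f, with Gaussian noise standing in for the unobserved variables:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2.0 * x + 1.0              # hidden "true" function (assumed here)
x = rng.uniform(0.0, 5.0, size=20)       # inputs x^t
r = f(x) + rng.normal(0.0, 0.5, size=20) # observed outputs r^t = f(x^t) + eps
```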
Examples
(figures: interpolation and extrapolation, from http://en.wikipedia.org)
Regression
Empirical error on the training set: E(g | X) = (1/N) Σ_t [r^t − g(x^t)]²
If the hypothesis space is linear functions, g(x) = w₁x + w₀
Calculate the best parameters w₁, w₀ that minimize the error by setting the partial derivatives ∂E/∂w₁ and ∂E/∂w₀ to zero
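A minimal sketch (not from the slides) of the resulting closed-form least-squares solution for a line, obtained by setting both partial derivatives to zero; the data here is illustrative:

```python
import numpy as np

def fit_line(x, r):
    """Minimize E(w1, w0 | X) = (1/N) * sum_t (r^t - (w1*x^t + w0))^2."""
    x_bar, r_bar = x.mean(), r.mean()
    w1 = ((x * r).sum() - len(x) * x_bar * r_bar) / \
         ((x ** 2).sum() - len(x) * x_bar ** 2)
    w0 = r_bar - w1 * x_bar
    return w1, w0

x = np.array([0.0, 1.0, 2.0, 3.0])
r = np.array([1.1, 2.9, 5.2, 6.8])
w1, w0 = fit_line(x, r)
g = lambda x: w1 * x + w0    # the fitted linear model g(x)
```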
Example
(figure: a model fit to sample data)
Example: a more complex model
(figure: a more complex fit to the same data)
Higher-order polynomials
(figure: polynomial fits of increasing order)
Model Selection & Generalization Consider learning boolean functions If d inputs, examples at most Each example can be labeled 0 or 1 Therefore possible functions of d variables
Model Selection & Generalization Each training example removes half the hypothesis Learning as a way to remove hypothesis inconsistent with data But we need to see examples to
Model Selection & Generalization Learning is an ill-posed problem; data is not sufficient to find a unique solution Each sample remove irrelevant hypothesis The need for inductive bias, assumptions about H E.g. rectangles in our example But each hypothesis can only learn some functions Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)
Model Selection & Generalization Learning needs an inductive bias Model selection: How to choose the right bias? Each sample remove irrelevant hypothesis Want the model to be able to generalize Predict new data even more than fitting the training dataset Generalization: How well a model performs on new data Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)
Model Selection & Generalization Best generalization requires mathing the complexity of the hypothesis with the complexity of the function underlying the data Overfitting: H more complex than C or f e.g Fitting two rectangles to data sampled from one rectangle e.g Fitting a sixth-order polynomal to noisy data from a third-order polynomial Underfitting: H less complex than C or f e.g Fit a line to data sample from a third-order polynomial
Triple Trade-Off
There is a trade-off between three factors (Dietterich, 2003):
1. the complexity of H, c(H)
2. the training set size, N
3. the generalization error, E, on new data
As N increases, E decreases
As c(H) increases, E first decreases and then increases (why?)
Cross-Validation
To estimate generalization error, we need data unseen during training. We split the data into:
Training set (50%): to train a model
Validation set (25%): to select a model (e.g. the degree of the polynomial)
Test (publication) set (25%): to estimate the error and evaluate performance
Resampling is used when there is little data
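A minimal sketch (not from the slides) of the 50/25/25 split; the array names are illustrative, and shuffling first keeps each subset representative:

```python
import numpy as np

def split_data(X, r, seed=0):
    """Split (X, r) into 50% train, 25% validation, 25% test."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n1, n2 = len(X) // 2, 3 * len(X) // 4
    train, val, test = idx[:n1], idx[n1:n2], idx[n2:]
    return (X[train], r[train]), (X[val], r[val]), (X[test], r[test])
```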
Dimensions of a Supervised Learner
Let us now recapitulate and generalize. We have a sample X = {x^t, r^t}, t = 1,...,N
The sample is independent and identically distributed (i.i.d.), drawn from the same joint distribution p(x, r)
r^t is 0/1 for two-class classification, a K-dimensional binary vector for multiclass classification, and a real value in regression
Goal: build a good and useful approximation to r^t using the model g(x^t | θ)
Dimensions of a Supervised Learner
We must make three decisions:
1. Model: g(x | θ), with input x and parameters θ
Defines the hypothesis class H; a particular value of θ defines one hypothesis h ∈ H
E.g. in classification, the rectangle is the model and the parameters are its four coordinates; in regression, the model is a linear function of the input, and the slope and intercept are the parameters
Dimensions of a Supervised Learner
2. Loss function: L(·), the difference between the desired output r^t and the approximation g(x^t | θ). The total error, given the parameters θ, is
  E(θ | X) = Σ_t L(r^t, g(x^t | θ))
Classification: 0/1 loss
Regression: squared numerical difference
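For concreteness, a minimal sketch (added, not from the slides) of the two loss functions named above:

```python
import numpy as np

def zero_one_loss(r, g_out):
    """Classification: L = 1 per misclassified example, 0 otherwise."""
    return np.sum(r != g_out)

def squared_loss(r, g_out):
    """Regression: L = squared difference between r^t and g(x^t | theta)."""
    return np.sum((r - g_out) ** 2)

# E(theta | X) = sum_t L(r^t, g(x^t | theta)) with either loss:
r, g_out = np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0])
print(zero_one_loss(r, g_out))   # 1 misclassification
```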
Dimensions of a Supervised Learner
3. Optimization procedure: find θ* = arg min_θ E(θ | X), the value of the parameters that minimizes the total error
It can be found analytically, as in linear regression, or through more complex optimization methods for more complicated models
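Where no analytical solution exists, an iterative procedure such as gradient descent can be used. A minimal sketch (added for illustration) on the linear model, which does have an analytical solution and serves here only as a simple example:

```python
import numpy as np

def gradient_descent(x, r, lr=0.01, steps=2000):
    """Iteratively minimize E(w1, w0 | X) = mean((r - (w1*x + w0))**2)."""
    w1, w0 = 0.0, 0.0
    for _ in range(steps):
        err = r - (w1 * x + w0)           # residuals r^t - g(x^t | theta)
        w1 += lr * 2 * np.mean(err * x)   # step along -dE/dw1
        w0 += lr * 2 * np.mean(err)       # step along -dE/dw0
    return w1, w0

x = np.array([0.0, 1.0, 2.0, 3.0])
r = np.array([1.1, 2.9, 5.2, 6.8])
print(gradient_descent(x, r))             # close to the analytical solution
```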
Dimensions of a Supervised Learner
The following conditions should be satisfied:
1) The hypothesis class of g(·) must be large enough
2) There must be enough training data to find the best hypothesis
3) We need a good optimization procedure
Different machine learning algorithms differ in the model, the loss function, or the optimization procedure