Introduction to Machine Learning
Some content courtesy of Professor Andrew Ng of Stanford University
IQS2: Spring 2013
Machine Learning Definition
Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
Samuel's claim to fame: checkers. He had his program play against itself tens of thousands of times, noting board positions that tended to lead to wins and those that tended to lead to losses. In time the program became much better at checkers than Samuel ever was!
Machine Learning Definition
Problem with Samuel's definition: too informal. How do we know when the definition has been satisfied?
Tom Mitchell (1998). Well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What are T, P, and E?
- T: classifying emails as spam or not spam.
- E: watching you label emails as spam or not spam.
- P: the number (or fraction) of emails correctly classified as spam/not spam.
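As a concrete illustration (not from the slides), P here is just classification accuracy, which is easy to compute. A minimal Python sketch with made-up labels:

```python
# Hypothetical illustration of the performance measure P:
# the fraction of emails correctly classified as spam/not spam.
true_labels      = [1, 0, 0, 1, 0, 1, 0, 0]  # 1 = spam, 0 = not spam (made up)
predicted_labels = [1, 0, 1, 1, 0, 0, 0, 0]

correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
accuracy = correct / len(true_labels)
print(f"P = {correct}/{len(true_labels)} = {accuracy:.2f}")  # P = 6/8 = 0.75
```

As the program gains experience E, we would expect this number to rise.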
Machine Learning Algorithms
- Supervised learning (what we'll do: learning is supervised in the sense that we provide training data). Currently the most common type of machine learning.
- Unsupervised learning. Examples: clustering algorithms, some data mining.
- Others: recommender systems (think Netflix).
Terminology: Feature Vector
A feature vector (or simply features): the characteristics of the studied phenomenon that provide input to the machine learning algorithm.
Ex. In the housing prices example, there is only one feature: the size in square feet of the house. We could use more features, such as whether the house has a garage, whether it has central air conditioning, the number of bathrooms, etc. Then we would have an entire vector of features.
NOTE: These are not parameters. We do not change their values to optimize our model. We do, however, try to select sufficient features to allow us to meet our goals.
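For instance, a single house's feature vector might look like this in Python (the values are made up for illustration):

```python
import numpy as np

# Hypothetical feature vector for one house:
# [size in square feet, has garage (0/1), has central A/C (0/1), number of bathrooms]
x = np.array([2104.0, 1.0, 0.0, 2.5])
print(x.shape)  # (4,) -- one feature vector with four features
```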
Supervised Learning
Supervised learning: correct answers are given.
- Ex. Linear regression: correct housing prices for some square-foot values. Note: nothing prevents two houses of the same size from having two different prices.
- Ex. Digit classification: sample images are given along with the digit they represent.
  - Note that in classification problems, though the answer is a class, it is often coded as a discrete numerical value. E.g., if code is classified as malicious, code it 1; else code it 0.
  - In our digit classifier, the coded value is actually NOT the digit: it's a 10 x 1 column vector, each of whose entries is between 0 and 1 (typical of multi-class classification).
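A minimal Python sketch of that coding, often called "one-hot" encoding (the helper name is mine, not from the course):

```python
import numpy as np

# The label for digit d is a 10 x 1 column vector with a 1 in
# position d and 0s everywhere else.
def one_hot(digit, num_classes=10):
    y = np.zeros((num_classes, 1))
    y[digit] = 1.0
    return y

print(one_hot(3).ravel())  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```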
Supervised Learning Terminology
- Training set: the data that is used to teach your program.
- Test set: the data that will be used to test your program (this should definitely NOT be the same as your training data). See also: cross-validation.
- VERY IMPORTANT: Just because your classifier works well (or even perfectly) on your training data, this does NOT mean that it will perform well on other data! In fact, classifiers that work perfectly on the training data are often overfitted (more on this later).
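Here is one way a train/test split might look in Python; the 80/20 fraction and the helper itself are illustrative choices, not something the slides prescribe:

```python
import numpy as np

# A minimal train/test split sketch. X holds the feature vectors
# (one row per example), y holds the corresponding labels.
def train_test_split(X, y, test_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))              # shuffle so the split is random
    n_test = int(len(y) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```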
Cost Function
Recall that machine learning, according to our formal definition, requires measured performance. In virtually all cases, this is provided by a cost function. Recall from linear regression, where the hypothesis is h_\theta(x) = \theta_0 + \theta_1 x:

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

In this example (as in most), lower cost means a better solution. So improving performance means finding values of the parameters θ0 and θ1 that minimize the cost function (or at least create a lower cost than our earlier choices).
Note that θ0 and θ1 are the only variables in this example, in the sense that the values of the x^{(i)} and y^{(i)} are provided to us by the training data.
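A direct Python translation of this cost function, with made-up data:

```python
import numpy as np

# J(theta0, theta1) for linear regression. x and y are the training data;
# theta0 and theta1 are the parameters we are trying to choose.
def cost(theta0, theta1, x, y):
    m = y.size
    predictions = theta0 + theta1 * x          # h_theta(x) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy data: a lower J means the line fits the training data better.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(cost(0.0, 1.0, x, y))  # 0.0 -- a perfect fit
print(cost(0.0, 0.5, x, y))  # 0.583... -- a worse fit
```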
Your Cost Function

J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\left( h_\Theta(x^{(i)}) \right)_k + \left( 1 - y_k^{(i)} \right) \log\left( 1 - \left( h_\Theta(x^{(i)}) \right)_k \right) \right] + \frac{\lambda}{2m} \sum_{l} \sum_{i} \sum_{j} \left( \Theta_{ji}^{(l)} \right)^2
Where:
- m = number of training examples
- K = output dimension (10 for us)
- h is the neural network output, which we'll discuss later
- Θ^(l) is a family of matrices (actually just 2 matrices):
  - Θ^(1) is a 257 x 256 matrix of weights
  - Θ^(2) is a 257 x 10 matrix of weights
The matrix entries are the parameters. So I lied. There are not 400 parameters. There are 68,362 (257 x 256 + 257 x 10 = 65,792 + 2,570 = 68,362).
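A sketch of how this cost might be computed in Python, given the network's outputs. One caveat: this version penalizes every weight, whereas many treatments exclude the bias weights from the regularization term (the slides don't say which convention applies):

```python
import numpy as np

# H is the (m, K) matrix of network outputs h (entries strictly in (0, 1)),
# Y is the (m, K) matrix of one-hot labels, thetas is the list of weight
# matrices Theta^(l), lam is the regularization strength lambda.
# Caveat: this penalizes ALL weights; many treatments skip bias weights.
def nn_cost(H, Y, thetas, lam):
    m = Y.shape[0]
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    reg_term = (lam / (2 * m)) * sum(np.sum(T ** 2) for T in thetas)
    return data_term + reg_term

# Parameter count check for the matrices above:
print(257 * 256 + 257 * 10)  # 68362
```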
Don't Panic!
I showed you that to try to condition you to it. Sort of like shock therapy. I won't show it again for at least a few days. And when I do, you'll have a better understanding of what it all means. BUT, your head will probably still hurt a bit when you work with it.
Regularization
Deals with underfitting and overfitting, a.k.a. bias and variance.
Underfitting (a.k.a. high bias): too few parameters to accurately capture the real phenomenon.
Ex: (figure omitted)
Solution?
Add more parameters to the model (in this case, allow for higher-degree polynomial fits).
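A quick Python illustration of this (the data is synthetic, not from the course): as the polynomial degree grows, the training error drops, which also sets up the overfitting problem on the next slide.

```python
import numpy as np

# Fit polynomials of increasing degree to the same 1-D data and watch
# the training residual shrink as we add parameters.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)
y = np.sin(x) + 0.2 * rng.standard_normal(x.size)  # noisy synthetic data

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)                     # least-squares fit
    residual = np.sum((np.polyval(coeffs, x) - y) ** 2)   # training error
    print(degree, residual)  # residual shrinks as the degree grows
```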
Problem: Overfitting
Overfitting (a.k.a. high variance): fitting the model too closely to the data can sometimes create poor model outcomes.
What if this is the true value of the data point you want to test? (figure omitted)
Example: Nearest-Neighbor
The k-nearest neighbor algorithm classifies a given point by looking at the k training data values nearest to the point. The majority class wins.
1-nearest neighbor will always classify the training set perfectly! But that will rarely be what you want to use to classify new values (it's classic overfitting).
k-nearest neighbor, for k bigger than 1, will not always classify the training set correctly, but it will do a much better job at classifying unknown data.
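A minimal Python sketch of the classifier just described (the function name and the choice of Euclidean distance are mine):

```python
import numpy as np
from collections import Counter

# Classify point x by majority vote among its k nearest training points.
def knn_classify(X_train, y_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each example
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[nearest])             # count class labels among them
    return votes.most_common(1)[0][0]             # majority class wins
```

Note that with k=1, any point taken from the training set is its own nearest neighbor, so it is always classified by its own label: that is the "perfect on training data" behavior described above.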
Example: 1-nearest neighbor (figure omitted)
Example: 15-nearest neighbor (figure omitted)
For Emphasis
The issue with overfitting is that if your model has too many parameters (or too many features), you may fit the training set so well that your model fails to generalize to new examples. Since it's new examples you want to classify, this is a problem!
In General
Machine learning experts use various statistical tools to determine the likelihood that their model is experiencing overfitting or underfitting (and which of the two is the case). When identified, there are methods for remedying the situation. In general, this is beyond the scope of our foray into machine learning.
For overfitting, I'll mention one obvious remedy: get rid of some features. But which ones?
Regularization
A method for dealing with overfitting.
Basic idea: keep all the features and parameters, but reduce the parameters' magnitude. As an intuitive example, think in general about what reducing coefficients does to a polynomial (see the sketch below).
Works well when we have a lot of features, each of which contributes a little bit to predicting the class of the data.
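A toy Python illustration of that intuition: shrinking a cubic's high-order coefficients pushes it toward a straight line, i.e., a simpler model.

```python
import numpy as np

# Shrinking the high-order coefficients of a cubic damps its curvature.
x = np.linspace(-1.0, 1.0, 5)
cubic = np.array([3.0, -2.0, 0.5, 4.0])    # 3x^3 - 2x^2 + 0.5x + 4 (highest degree first)
shrunk = cubic.copy()
shrunk[:2] *= 0.01                          # damp the x^3 and x^2 terms

print(np.polyval(cubic, x))                 # visibly nonlinear values
print(np.polyval(shrunk, x))                # values now close to the line 0.5x + 4
```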
So...
If we have the following cost function and want to force θ3 and θ4 to be small, how can we do this?

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

How about like this?

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2

Minimizing this modified cost now also pushes θ3 and θ4 toward zero, since any large value for either is heavily penalized.
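A sketch of this modified cost in Python. The design matrix layout and the penalty weight of 1000 are illustrative choices matching the idea above, not fixed values:

```python
import numpy as np

# Penalized squared-error cost. theta holds [theta0, ..., theta4]; X is the
# design matrix with one row per example and columns [1, x, x^2, x^3, x^4].
def penalized_cost(theta, X, y):
    m = y.size
    residuals = X @ theta - y                   # h_theta(x) - y for all examples
    base = np.sum(residuals ** 2) / (2 * m)     # ordinary squared-error cost
    penalty = 1000 * theta[3] ** 2 + 1000 * theta[4] ** 2  # punish big theta3, theta4
    return base + penalty
```

Any optimizer minimizing this function will trade a slightly worse data fit for much smaller values of θ3 and θ4.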
Intuitively
It's a bit easier to see why shrinking θ3 and θ4 simplifies the model and reduces the effect of overfitting than it is to see why shrinking the values of ALL the parameters has a similar effect. But shrinking ALL the parameters does do what we want:
- Creates a simpler model
- Helps avoid the bad effects of overfitting
The easiest way to see this is to play around and observe the effects for yourself. But that's not necessary... unless you want to someday work in the machine learning arena.
Our Cost Function (Once Again)
The regularization term is the final piece of the cost function we saw earlier:

\frac{\lambda}{2m} \sum_{l} \sum_{i} \sum_{j} \left( \Theta_{ji}^{(l)} \right)^2
Summary
- We've seen the basic concepts involved in machine learning
- We've discussed the problems of underfitting and overfitting
- We've discussed regularization
- We've looked at the cost function we'll be using
- And have possibly been traumatized by it