Lecture 3: Transcripts - Basic Concepts (1) and Decision Trees (1)

Basic concepts

1. Welcome to Lecture 3. We will start Lecture 3 by introducing some basic notions and basic terminology.

2. These are the references for this presentation.

3. Let's start with the concept of induction. In simple words, we can say that induction is the process of reaching a general conclusion from specific examples. Induction is the process that allows us to generalize from specific examples. We said previously that generalization is a key concept of machine learning. More specifically:

4. The goal of inductive ML is to use data to induce, to work out, a model. The goodness of this model will be evaluated on unseen data that has not been used during the model construction. If the model has a high performance on unseen data, we can say that the model generalizes well.

5. Let's see a pictorial representation of induction: we have input data that have been annotated with a label, say by humans. From this data, we try to figure out which are the best features for our classification problem. Our goal is to predict the label of unlabelled data. We work out a feature representation and extract the features. We feed a machine learning algorithm, which induces a model from the input features. This induced model should be capable of predicting the right label of unseen data, that is, data that does not belong to the input data.

6. This is another way to visualize the input data; in this case our problem is to predict the class of an iris flower. So we have a dataset containing features, feature values and class labels. Based on these training examples, we want to induce a model that can give us a reliable prediction for the flower instance at the top of the slide.

7. In order to measure the performance of our classifier, we can use several techniques. One technique is to divide the data that we have into 2 parts: a training set and a test set. Suppose we have 1000 examples of iris flowers; we might select 800 of these as training data, and we set aside 200 as test data. We induce our model only from the 800 examples and then we run the induced model on the 200 examples separately and compute our test error on these 200 examples. Note that when we split up our data, the examples in the test set have a class label, but these labels are not used to learn. They are used only to see if the predictions made by the induced model are correct. The performance of the model on these 200 examples is indicative of how well the model will do in the future on unseen, non-labelled examples. Statistics tells us that if the sample is large enough, our data is representative and we can get a reliable solution to our problem. Commonly used splits are 80% of the data used as training data and 20% as test data, or 90% training and 10% test data. We can also see other proportions, for example 50% and 50%, which is usually not recommended. There is no mathematical rule to decide about the split; it depends on your data. NEVER EVER MANIPULATE test data: if you do this, your results will be invalid. Remember also that the test data must come from the same class distribution as the training data. You cannot just use data that have a different class distribution, because the induced model can be confused.
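As a rough illustration of the 80/20 split just described (this sketch is not part of the lecture; it assumes Python with scikit-learn and uses its bundled iris data):

# Illustrative sketch (not from the lecture): an 80/20 train/test split of the
# iris data with scikit-learn. Numbers and names are assumptions chosen to
# mirror the example in the transcript.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the examples as test data; stratify so the test set keeps
# the same class distribution as the training data, as the lecture requires.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(len(X_train), "training examples,", len(X_test), "test examples")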
8. ML uses formal models that might perform well on our data. In this course we are going to study some of these models. The choice of using one model rather than another is our choice. A model tells us what sort of things we can learn. A model tells us what our inductive bias is. We said that the inductive bias is the a-priori assumption that governs a model: the inductive bias is the set of assumptions the learning algorithm makes that allow it to learn. For example, the inductive bias of decision trees is that we can split data into branches and nodes and that the root node is the most similar to the things we want to learn. The inductive bias of the perceptron is that data must be linearly separable, and so on.

9. Learning algorithms have parameters associated with them. A parameter is a kind of setting. For example, in a decision tree a parameter can regulate the order in which questions are asked. Models can have many parameters, and finding the best combination of parameters is not trivial. Parameters are usually adjusted based on the training data.

10. Learning algorithms also have additional settings called hyperparameters. A hyperparameter is a parameter that controls other parameters of the model. Hyperparameters cannot be directly adjusted using training data. The process is more complex: split your data into 70% training data, 10% development data and 20% test data.

11. For each possible setting of the hyperparameters: train a model using that setting on the training data and compute the model error rate on the development data. From the resulting collection of models, choose the one that achieves the lowest error rate on the development data. Evaluate that model on the test data to estimate future test performance.

12. Accuracy measures the percentage of correct results that a classifier has achieved. Accuracy is the proportion or percentage of correctly predicted labels over all predictions. Accuracy alone is sometimes quite misleading: you may have a model with relatively 'high' accuracy, with the model predicting the most frequent class labels fairly accurately, but the model may be making all sorts of mistakes on the classes that are actually critical to the application. However, we can always compute precision and recall for each class label and analyze the individual performance on class labels, or average the values to get the overall precision and recall.

13. Machine Learning has borrowed some terminology from IR. On the screen you can see definitions used in IR. We have 4 labels to categorize the results. Let's take a binary classification problem, the spam filter: is an email a spam email, yes or no? So our classifier must predict if an email is spam or not, and the class labels that we use are respectively yes and no. The categories that we can use to categorize the results are: TP = True Positive, i.e. the number of positive (spam) examples that have been given the yes label. TN = True Negative, i.e. the number of negative examples that have been given the no label. FP = False Positive, i.e. the number of negative examples that have been labelled as positive. FN = False Negative, i.e. the number of positive examples that have been labelled as negative by our model.

14. Given these four numbers, we can define the following metrics: precision, recall and F-measure. In a classification task, the precision for a class is the number of true positives (i.e. the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (i.e. the sum of true positives and false positives, which are items incorrectly labeled as belonging to the class). Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labeled as belonging to the positive class but should have been).
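A minimal sketch (not from the lecture) of how these four counts turn into the metrics defined here and restated just below; the counts are invented for illustration:

# Minimal sketch (not part of the lecture): precision, recall and F-measure
# computed from TP/FP/FN counts for the positive ("spam") class.
# The counts below are invented for illustration only.
tp, fp, fn = 40, 10, 5

precision = tp / (tp + fp)   # correct positive predictions / all positive predictions
recall = tp / (tp + fn)      # correct positive predictions / all actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")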
Precision: Given all the predicted labels (for a given class X), how many instances were correctly predicted? Recall: For all instances that should have the label X, how many of these were correctly captured? The F-Measure (or F-Score), which combines precision and recall to give a single score, is defined to be the harmonic mean of the precision and recall.

15. This is a list of the metrics.

16. The confusion matrix is a handy tool to see what kind of mistakes a classifier makes and how often it makes them. It is a useful table that presents both the class distribution in the data and the classifier's predicted class distribution, with a breakdown of error types. Usually, the rows are the observed/actual class labels and the columns the predicted class labels. Each cell contains the number of predictions made by the classifier that fall into that cell.

17. If a classification system has been trained to distinguish between cats, dogs and rabbits, a confusion matrix will summarize the results of testing the algorithm for further inspection. Assuming a sample of 27 animals (8 cats, 6 dogs, and 13 rabbits), the resulting confusion matrix could look like the table on the screen. In this confusion matrix, of the 8 actual cats, the system predicted that three were dogs, and of the six dogs, it predicted that one was a rabbit and two were cats. We can see from the matrix that the system in question has trouble distinguishing between cats and dogs, but can make the distinction between rabbits and other types of animals pretty well. All correct guesses are located in the diagonal of the table, so it's easy to visually inspect the table for errors, as they will be represented by values outside the diagonal.

18. We said we could use development data to set hyperparameters. The main disadvantage is that we use up some training data just for hyperparameter estimation. An alternative is to use cross-validation. In 10-fold cross-validation you break your training data up into 10 equally-sized partitions. You train a learning algorithm on 9 of them and test it on the remaining 1. You do this 10 times, each time holding out a different partition as the test data. Typical choices for n-fold are 2, 5 and 10; 10-fold cross-validation is the most common. After running cross-validation, you can use the hyperparameters selected by cross-validation; a short cross-validation sketch is shown below, after point 23.

19. Leave One Out (or LOO) is a simple cross-validation. Each learning set is created by taking all the samples except one, the test set being the sample left out.

20. Suppose we are training a classifier to predict which of 2 classes, C1 and C2, examples belong to. Suppose we have one sample randomly drawn from the original population. We divide the sample up into a training set and a test set. Suppose it turns out that most of the examples in the training set belong to C1 and most of those in the test set to C2. This is not good. We must ensure that the proportion of each class in the sets is the same as the proportion in the original sample. This is called stratification.

21. This is a screenshot from the Weka package where you can choose the different testing options.

22. This screenshot shows the kind of output you get from Weka. In this example a classifier called ZeroR has classified the iris dataset (on the left hand side); on the right hand side, you can see the results. In this video clip we have learned about accuracy (correctly classified instances), precision, recall, F-measures and the confusion matrix. We will learn about other metrics in the next lectures.

23. Underfitting: the model has not learned enough from the data and is unable to generalize. Overfitting: the model has learned too many idiosyncrasies (noise) and is unable to generalize.
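As a rough illustration of points 18 and 20 (not part of the lecture materials), the sketch below runs stratified 10-fold cross-validation with scikit-learn. DummyClassifier with the "most_frequent" strategy is used only because it behaves like the ZeroR baseline mentioned in point 22; all other choices are assumptions.

# Illustrative sketch (not from the lecture): stratified 10-fold cross-validation
# with scikit-learn, using a majority-class baseline similar to Weka's ZeroR.
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Stratified folds keep each class's proportion the same in every partition.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=cv)

print("per-fold accuracy:", scores.round(2))
print("mean accuracy:", round(scores.mean(), 3))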
24. Our goal when we choose a machine learning model is that it does well on future, unseen data. The way in which we measure performance should depend on the problem we are trying to solve. There should be a strong relationship between the data that our algorithm sees at training time and the data it sees at test time.

25. Not everything is learnable: noise at the feature level; noise at the class label level; features that are insufficiently representative; labels that are controversial; an inductive bias that is not appropriate for the kind of problem we try to learn.

26. Now some simple quizzes.

Decision Trees 1

1. Part 1a: In this video clip we are going to talk about a simple and intuitive learning model: the decision tree.

2. There will be 2 lectures on decision trees. In today's lecture, I will explain how a decision tree works and I will cover some basic characteristics of this model, such as greediness, the divide-and-conquer notion, the inductive bias, the loss function, the expected loss and the empirical error, and at the end we will summarize induction.

3. We said previously that we could simplify the concept of learning in the field of ML by saying that we want to make informed guesses about the future. The past is represented by the examples stored in the training set, and the future is represented by the unseen examples. We can evaluate the generalization ability of our learner by using a test set. In the figure we have classified examples of iris flowers divided into three classes (setosa, versicolor, virginica); each example is represented by measurements. The purpose of a machine learning model would then be to guess correctly the class of a previously unseen iris flower based only on its measurements, which might differ in some respects from the measurements stored in the training set. So we want to make a good prediction based on our previous experience of irises. Our experience is formalized in the dataset.

4. Now, let's make a more specific example by using the same problem that is presented in Daume's book. Our problem is now to predict if a student will like a course or not based on his/her ratings of previous courses. We could make predictions by asking yes-no questions to the student. For instance, does the new course belong to the Systems program? Has the student liked most previous Systems courses, etc.? And we could build a diagram, a tree-like diagram like the one you see on the slide.

5. When we build our supervised decision tree learning model, we do not ask questions directly to the students. Instead, we use training data in order to answer the questions. Essentially, we have a dataset similar to that on the screen, where each row is an example paired with the correct answer. In the dataset on the screen, the column Rating is the class. Interpret the classes as Like (meaning that the student liked the course) if the rating is 0, +1 or +2, and Hate if the student did not like the course and ranked it using -2 or -1.

6. So the ratings in this specific dataset are the class labels, the column names are questions and they are our features, and the responses, yes and no, are the feature values. So we have here a feature representation that we assume is useful to solve our classification problem. With this data, we could build many possible trees. Since we do not want to spend months deciding which of these possible trees is the best tree, we proceed greedily.

7. Being greedy, in this context, means: if you could ask only one question, which question would you ask? Which is the most useful question? One can start depicting the usefulness of questions in histograms. Look at the histograms on the screen. Each histogram shows the frequency of like/hate labels for each possible value of a feature. From these histograms, you can see that asking the first question (that is, is it easy or not?) is not useful because there is no clear divide between yes and no. On the contrary, asking the question "is this a Systems course?" (the fourth histogram on the screen) is useful because if the value is no, you can be sure that students liked the course, and if the value is yes, students hated the course. Now, pick a random example from this dataset and ask this question. If you get the answer no, you would possibly be inclined to say that the class label of the example is Like. On the contrary, if you got the answer yes to this question, you would be inclined to think the class label is Hate. Try to use this feature and our assumptions to make informed guesses on the examples of the dataset. You will see that you will guess right many times. So, if you choose this feature you can make reliable informed guesses. Repeat the computation for each of the available features, and score them. When you have to choose which feature to consider first, you choose the one with the highest score. In this way you choose the ROOT node of the decision tree.

8. How do we choose subsequent features? Here is where the notion of divide and conquer is applied. When you ask the first question, "Is the course a Systems course?", you can partition the data into 2 sets: the no set and the yes set. This is the divide step: you get 2 partitions. In the conquer step, you repeat the same process you applied to choose the first feature on the examples listed under the no branch and the yes branch of the tree.

9. At some point, we realize that asking additional questions becomes redundant, or that we have run out of questions. In both cases, we create a LEAF NODE and we guess the most prevalent answer based on the training data we are looking at.

10. The goal of the decision tree learning model is to figure out what questions to ask, in what order, and what answer to predict once you have asked enough questions. The inductive bias of decision trees assumes that the things that we want to learn to predict are more like the root node and less like the other branch nodes.

11. We will talk more about the basic characteristics of decision trees in the next video clip.

12. Part 1b: Welcome back to decision trees, part 1.

13. Let's now start with an informal definition of the decision tree model. A decision tree is a flow-chart-like structure, where each internal (non-terminal) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node.
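To make the informal definition concrete, here is a small sketch (not from the lecture or from Daume's book) that fits a decision tree with scikit-learn on an invented yes/no dataset modelled loosely on the course-rating example; all feature names, rows and labels are made up for illustration.

# Illustrative sketch: fitting and inspecting a small decision tree on toy data.
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["easy?", "ai?", "systems?", "morning?"]
# 1 = yes, 0 = no; the rows and ratings are invented.
X = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
]
y = ["Like", "Like", "Like", "Hate", "Hate", "Hate"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))  # shows the questions asked at each node

# Predict the rating for a new, unseen course (an AI course that is not a Systems course).
print(tree.predict([[1, 1, 0, 1]]))  # -> ['Like'] on this toy data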
14. Let's now formalize the definition. We know that the performance of a learning algorithm should be measured on unseen data. We can use a function to measure the performance, and we call it the loss function. The loss function is the price paid for inaccuracy of predictions in classification problems; loss in this case means misclassifications or wrong predictions. How bad are our system's predictions in comparison to the truth? In particular, if y is the truth and y-hat is the system's prediction, then the function l(y, y-hat) is a measure of error. Note that the loss function is something that we must decide on based on the goals of learning. There are many loss functions that we could use. Let's use the simplest here: the zero-one loss. If y is equal to y-hat, the system's classification is correct, so we count 0 errors. If y is not equal to y-hat, the classification is incorrect, so we count one error.
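Written out (this rendering is added for readability; it simply restates the zero-one loss just described):

\ell(y, \hat{y}) =
\begin{cases}
0 & \text{if } y = \hat{y} \quad \text{(correct prediction)} \\
1 & \text{if } y \neq \hat{y} \quad \text{(wrong prediction)}
\end{cases}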

15. Distribution: Now that we have defined our loss function, we need to consider where the data (training and test) comes from. We talked about distributions before and we focussed on the normal distribution. We now know that the normal distribution is a bell-shaped distribution of data. If we knew a priori what our data generating distribution is, our learning problem would become easier. In this case, we are not making any assumptions about what the distribution D looks like; we are assuming that we do not know what D is. Perhaps the hardest thing about machine learning is that we don't know what D is: all we get is a random sample from it. This random sample is our training data. We can say that the data generating distribution is a probability distribution D over input/output pairs. If we write x for the input (examples/instances) and y for the output (the rating), then D is a distribution over (x, y) pairs. Remember that our problem is to guess the rating of an unseen example. A useful way to think about D (the data generating distribution) is that it gives high probability to reasonable (x, y) pairs, and low probability to unreasonable (x, y) pairs.

16. Expected loss: We are given access to training data, which is a random sample of input/output pairs drawn from D. Based on this training data, we need to induce a function f that maps new inputs to corresponding predictions. The key property that f should obey is that it should do well on future examples that are also drawn from D. Formally, its expected loss (epsilon) over the distribution D with respect to l should be as small as possible; that is, we should minimize the expected loss, meaning that we should make as few errors as possible.

17. Now let's read and analyse the formula: epsilon is equal by definition to blackboard-bold E, subscript the pair (x, y) drawn from script D, of l of the pair y and f of x. All this corresponds to: the sum (big sigma means sum) over all the pairs (x, y) of script D of (x, y) times l of y and f of x. This is exactly the weighted average loss over all the pairs x and y in D, weighted by their probability under the distribution D. In practical terms, this formula accounts for the average loss if we draw a bunch of (x, y) pairs from the distribution D.
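Rendered in notation (this LaTeX transcription is added for readability and writes out the formula read aloud in point 17, using Daume-style notation):

\epsilon \triangleq \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(y, f(x))\big] = \sum_{(x,y)} \mathcal{D}(x, y)\, \ell(y, f(x))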

18. Training error: The difficulty in minimizing our expected loss formula is that we don't know anything about the distribution D; but we know that in our training data we have a certain number of (x, y) pairs. So in order to compute our training error epsilon-hat (which is an average; the hat indicates that it is an estimate), we average the loss over the training examples: we sum the loss l of y and f of x from n = 1 to capital N and multiply by 1 over capital N. And we get the formula that you see on the screen: the training error epsilon-hat is equal by definition to 1 over N times the sum from n = 1 to capital N of l of y and f of x, in symbols \hat{\epsilon} \triangleq \frac{1}{N}\sum_{n=1}^{N} \ell(y_n, f(x_n)). That is, our training error is simply our average error over the training data. The challenge is that our learned function needs to generalize beyond the training data to some future data that it might not have seen yet.

19. The training error is sometimes called the empirical error. Remember that terminology can be confusing sometimes. The empirical error can be called the training error, test error, or observed error depending on whether it is the error on a training set, a test set, or a more general set. Watch out! Formulae can be written using different notation styles. For example, the formula on this slide is the formula for the empirical error given by Alpaydin: the empirical error is the proportion of training instances where the predictions of h (the hypothesis = the informed guess) do not match the required values given in big X (the training set). The formula should be read in this way: the empirical error of the hypothesis h given the training set X is the sum over the training instances (small x) of the cases where the hypothesis on the class label r fails.

20. Induction: So, putting it all together, we get a formal definition of inductive machine learning: given a loss function l and a sample D from some unknown distribution (script) D, you must compute a function f that has low expected error epsilon over that distribution with respect to l.

21. OK. We stop here today. Thank you for your attention.

Terminology

DEFINITION OF 'DISCRETE DISTRIBUTION'
The statistical or probabilistic properties of observable (either finite or countably infinite) pre-defined values. Unlike a continuous distribution, which has an infinite number of outcomes, a discrete distribution is characterized by a limited number of possible observations. Discrete distribution is frequently used in statistical modeling and computer programming. Also known as a "discrete probability distribution".

BREAKING DOWN 'DISCRETE DISTRIBUTION'

Examples of discrete probability distributions include the binomial distribution (with a finite set of values) and the Poisson distribution (with a countably infinite set of values). The concept of probability distributions and the random variables they describe are the underpinnings of probability theory and statistical analysis.

Terminology: Ordered Pairs
And here is another way to think about functions: write the input and output of a function as an "ordered pair", such as (4, 16). They are called ordered pairs because the input always comes first and the output second: (input, output). So it looks like this: (x, f(x)).