Lecture 9: Classification and algorithmic methods
Måns Thulin, Department of Mathematics, Uppsala University
thulin@math.uu.se
Multivariate Methods, 17/5 2011
Outline
- What are algorithmic methods?
- Algorithmic methods for classification: knn classification, decision trees
- Algorithmic versus probabilistic methods
Probabilistic methods
Previously, we have looked at probabilistic methods for classification: methods based on statistical theory and model assumptions. In a statistical problem, the basic situation is the following:

Nature -> {Black box} -> Data

The probabilistic approach is to assume a model for what happens in the black box (normal distribution, ARIMA time series, linear model, Markov chains...). We hope that the models describe the black box accurately enough.

"All models are wrong, but some are useful." - George Box

Some statisticians, and indeed people from other fields as well, argue that it is time to think outside the box.
Algorithmic methods
Suppose that we have a set of data with known classes. Without any model assumptions, we can use heuristics and good ideas to come up with new methods. We can create algorithms that create rules for classifying new points using the given training data. By splitting the given data into a training set and a test set, we can evaluate the performance of our algorithmic method (a sketch of this follows below).

"All models are wrong, and increasingly you can succeed without them." - Peter Norvig, research director at Google

As a motivating example, we'll look at a situation where it is more or less clear that we don't need fancy methods or model assumptions to classify new observations.
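Before the toy example, here is a minimal sketch of the train/test evaluation idea just described, in Python with NumPy. Everything here (the function names, the 30% test fraction, the seed) is an illustrative assumption, not something from the lecture:

```python
# A sketch of train/test evaluation for a classifier (assumed setup).
import numpy as np

rng = np.random.default_rng(seed=1)

def train_test_split(X, y, test_fraction=0.3):
    """Randomly partition (X, y) into a training part and a test part."""
    n = len(y)
    idx = rng.permutation(n)          # random reordering of 0..n-1
    n_test = int(test_fraction * n)   # size of the test set
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]

def misclassification_rate(y_true, y_pred):
    """Fraction of test points assigned to the wrong class."""
    return float(np.mean(y_true != y_pred))
```

Any classification rule built on the training part can then be scored by its misclassification rate on the test part.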
A toy example
Example: consider a data set with two groups: red and blue. [Figure: "knn classification" scatter plot]
How should we classify the new black point? [Figure]
It seems reasonable to classify the point as being blue! [Figure]
How should we classify the new black point? [Figure]
It seems reasonable to classify the point as being red! [Figure]

A less nice example
But what about this point? [Figure]
knn: basic idea
In the first two examples, we could easily classify the point since all points in its neighbourhood had the same colour. What should we do when there is more than one colour in the neighbourhood? The knn algorithm classifies the new point by letting the k nearest neighbours (the k points that are closest to the new point) vote about the class of the new point.
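As a rough illustration, a bare-bones version of this voting rule might look as follows in Python (a sketch assuming NumPy arrays and Euclidean distance; the name knn_classify is ours):

```python
# A minimal kNN sketch: Euclidean distance, unweighted majority vote.
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Classify the point x by a vote among its k nearest training points."""
    # Distance from x to every training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbours' classes
    return Counter(y_train[nearest]).most_common(1)[0][0]
```

Note that Counter.most_common breaks ties arbitrarily; the no-majority problem is discussed below.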
knn: basic idea
Look at the k = 1 closest neighbour. The point is classified as being blue, since the nearest neighbour is blue. [Figure: k = 1]
Look at the k = 2 closest neighbours. It is not clear how to classify the point (no colour has a majority). [Figure: k = 2]
Look at the k = 3 closest neighbours. The point is classified as being blue (2 votes against 1). [Figure: k = 3]
knn: choosing k
Clearly, the choice of k is very important. If k is too small, the algorithm becomes sensitive to noise points and outliers. If k is too large, the neighbourhood will probably include points from other classes. How should we choose k? This is a difficult question! There is no right answer. Often a test data set is used to investigate the performance for different k. Typically, we choose the k that has the lowest misclassification rate for the test data.
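A sketch of this selection procedure, reusing knn_classify from the earlier sketch (the candidate range 1 to 15 is an arbitrary assumption):

```python
# Sketch: choose k by misclassification rate on a held-out test set.
import numpy as np

def choose_k(X_train, y_train, X_test, y_test, k_values=range(1, 16)):
    errors = {}
    for k in k_values:
        preds = np.array([knn_classify(X_train, y_train, x, k=k)
                          for x in X_test])
        errors[k] = float(np.mean(preds != y_test))
    return min(errors, key=errors.get)  # k with the lowest test error
```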
knn: no majority
In our example, we encountered a problem when k = 2: no colour had a majority. What should we do in such cases?
- Flip a coin? This ignores some of the information that we have gathered!
- Let the closest neighbour decide? Or the k - 1 closest?
- A better solution is probably to use weighted votes, so that the votes from closer neighbours are seen as more important. This idea could be used in all cases, and not just when there is no majority (see the sketch below).
- Look at k + 1 neighbours instead? Essentially, this means that when we don't have enough information to make a decision, we gather more information.
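A sketch of the weighted-vote idea, with the common (but by no means only) choice of weighting each vote by one over the distance:

```python
# Sketch: distance-weighted kNN voting; closer neighbours count for more.
import numpy as np
from collections import defaultdict

def knn_weighted(X_train, y_train, x, k=3, eps=1e-12):
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    weights = defaultdict(float)
    for i in nearest:
        # eps guards against division by zero for an exact duplicate of x
        weights[y_train[i]] += 1.0 / (dists[i] + eps)
    return max(weights, key=weights.get)  # class with the largest total weight
```

With weights that are strictly decreasing in distance, exact ties become rare, so the no-majority problem largely disappears.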
knn: some last comments
knn is essentially a rank method: we measure the distance to all points in the data set and rank them accordingly. The k points with the lowest ranks are used to classify the new point.
An important question is what we mean by "close". Which distance measure should we use? Euclidean distance? Statistical distance? Mahalanobis? Should we look at standardized data? Is it meaningful to use distance measures if the data is binary or categorical? If some of the variables are categorical and some are continuous measurements? Are more general similarity measures useful?
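For instance, replacing Euclidean distance by the Mahalanobis (statistical) distance only changes the distance function. A sketch, assuming the covariance matrix is estimated from the training data and is invertible:

```python
# Sketch: Mahalanobis distance as a drop-in replacement for Euclidean.
import numpy as np

def mahalanobis(x, y, S_inv):
    """Distance between x and y given an inverted covariance matrix."""
    d = x - y
    return float(np.sqrt(d @ S_inv @ d))

# Typical use: S_inv = np.linalg.inv(np.cov(X_train, rowvar=False))
```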
Decision trees: basic idea
Another popular algorithmic classification method is decision trees. Have you ever played the game 20 questions? Decision trees are more or less that game! The idea is to classify the new observation by asking a series of questions. Depending on what the answer to the first question is, different second questions are asked, and so on. Questions are asked until a conclusion is reached.
Decision trees: basic idea
Consider the following data set with vertebrate data:

Name           Body temp     Gives birth  Has legs  Class
Human          warm-blooded  yes          yes       mammal
Whale          warm-blooded  yes          no        mammal
Cat            warm-blooded  yes          yes       mammal
Cow            warm-blooded  yes          yes       mammal
Python         cold-blooded  no           no        reptile
Komodo dragon  cold-blooded  no           yes       reptile
Turtle         cold-blooded  no           yes       reptile
Salmon         cold-blooded  no           no        fish
Eel            cold-blooded  no           no        fish
Pigeon         warm-blooded  no           yes       bird
Penguin        warm-blooded  no           yes       bird

Decision tree example: see blackboard!
Decision trees: building the tree
Given training data, how can we build the decision tree? There are many algorithms for building the tree. One of the earliest is Hunt's algorithm. Let D_t be the set of observations belonging to a node t.
1. If all observations in D_t are of the same class i, then t is a leaf node labeled as i.
2. Otherwise, use some condition to partition the observations into two smaller subsets. A child node is created for each outcome of the condition and the observations are distributed to the children based on the outcomes.
When should the splitting stop? Other criteria are sometimes used, but a simple and reasonable stopping criterion is to stop splitting when all remaining nodes are leaf nodes. How, then, do we choose the condition for partitioning? (A sketch of the recursion follows below.)
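Setting that question aside for a moment, the recursion itself is short. A simplified sketch (the dictionary-based tree representation and the idea of consuming a fixed list of candidate conditions are our assumptions; a real implementation would choose the best split at each node, as discussed next):

```python
# Sketch of Hunt's algorithm. `conditions` is a list of (name, function)
# pairs; each function maps an observation to an outcome such as "yes".
def hunt(observations, classes, conditions):
    # 1. If all observations share one class, this node is a leaf.
    if len(set(classes)) == 1:
        return {"leaf": classes[0]}
    # If no conditions remain, label with the most common class.
    if not conditions:
        return {"leaf": max(set(classes), key=classes.count)}
    # 2. Otherwise split on a condition and recurse on each child.
    name, cond = conditions[0]
    children = {}
    for outcome in set(cond(obs) for obs in observations):
        subset = [(o, c) for o, c in zip(observations, classes)
                  if cond(o) == outcome]
        obs_sub, cls_sub = zip(*subset)
        children[outcome] = hunt(list(obs_sub), list(cls_sub),
                                 conditions[1:])
    return {"split": name, "children": children}
```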
Decision trees: the best split
Let p(i|t) be the fraction of observations in class i at the node t and let c be the number of classes. The Gini for node t is defined as

$$\mathrm{Gini}(t) = 1 - \sum_{i=1}^{c} p(i \mid t)^2.$$

Gini is a measure of impurity. If all observations belong to the same class, then

$$\mathrm{Gini}(t) = 1 - 1^2 - 0^2 - \dots - 0^2 = 0.$$

The Gini is maximized when all classes have the same number of observations at t. One criterion for splitting could be to minimize the Gini in the next level of the tree. That way we will get purer nodes.
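In code, the definition is essentially a one-liner; a sketch:

```python
# Sketch: Gini impurity of the class labels at a node.
from collections import Counter

def gini(labels):
    """1 minus the sum of squared class proportions; 0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# gini(["mammal"] * 4)      -> 0.0  (pure node)
# gini(["mammal", "bird"])  -> 0.5  (maximally impure for two classes)
```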
Decision trees: the best split
The situation becomes a bit more complicated if we take into account that the children can have different numbers of observations. To account for this, we try to maximize the gain:

$$\mathrm{Gain} = \mathrm{Gini}(t) - \sum_{j=1}^{k} \frac{n_{v_j}}{n_t}\,\mathrm{Gini}(v_j),$$

where the v_j are the children and n_i is the number of observations at node i. This is equivalent to minimizing $\sum_{j=1}^{k} \frac{n_{v_j}}{n_t}\,\mathrm{Gini}(v_j)$.

Vertebrate example: see blackboard!

Sometimes other impurity measures than Gini are used. One example is the entropy:

$$\mathrm{Entropy}(t) = -\sum_{i=1}^{c} p(i \mid t) \log_2 p(i \mid t).$$
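A sketch of both quantities, building on the gini function above (`parent` is the list of labels at node t, `children` a list of label lists, one per child):

```python
# Sketch: gain of a candidate split, and entropy as an alternative measure.
import math
from collections import Counter

def gain(parent, children):
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())
```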
Decision trees: extensions
Some further remarks: In our example, we only used binary splits, where each internal node has two children. It is also possible to use non-binary splits, where each internal node can have more than two children.
When the data is continuous, it is perhaps not as easy to choose the split criteria. Example: animal weight. A node question could be: "is weight < 10 kg?" Is this a better question than "is weight < 11 kg?" or "is weight < 9 kg?" (A threshold-scanning sketch follows below.)
Having looked at two algorithmic methods, we will now compare the merits of algorithmic and probabilistic methods.
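Before the comparison, here is a sketch of the threshold search just mentioned: scan the midpoints between consecutive sorted values and keep the threshold with the largest gain (reusing gain and gini from the earlier sketches):

```python
# Sketch: pick a split threshold for a continuous variable by gain.
def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [c for _, c in pairs]
    best_t, best_g = None, -1.0
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold separates equal values
        t = (xs[i] + xs[i - 1]) / 2
        g = gain(ys, [ys[:i], ys[i:]])  # split into "< t" and ">= t"
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g
```

Under this scheme, "is weight < 10 kg?" beats "is weight < 11 kg?" exactly when its gain on the training data is larger.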
Algorithmic versus probabilistic methods: pros
Probabilistic methods:
- Mathematical/probabilistic foundation.
- Possible to derive optimal methods.
- Often give nice interpretations of the results.
- Possible to control error rates by choosing significance levels.
Algorithmic methods:
- No need for model assumptions.
- Can be optimized using the test data.
- Often have a good heuristic foundation.
- Some methods work well when p > n.
Algorithmic versus probabilistic methods: cons
Probabilistic methods:
- May be based on asymptotic results that do not work well when the sample size is small.
- The model may be a poor description of nature. The conclusions are only about the model's mechanism and not about the true mechanism.
- Evaluating the model fit can be difficult, especially in higher dimensions.
Algorithmic methods:
- Rely heavily on the training data, which may not be representative.
- Difficult or impossible to find optimal methods.
- Likely not as good as probabilistic methods when the model is accurate.
- Some methods lack solid theoretical support.
Algorithmic versus probabilistic methods: discussion
A paper by Leo Breiman from 2001 ("Statistical modeling: the two cultures", Statistical Science, Vol. 16) discusses the use of algorithmic methods in modern statistics. Breiman argues that:
- The data and the problem at hand should lead to the solution, not prior ideas about what kind of methods are good. The statistician should focus on finding a good solution regardless of whether that solution uses algorithmic or probabilistic methods.
- How good a method is should be judged by the predictive accuracy of the method on the test data.
This last point is perhaps controversial; we often judge probabilistic methods by theoretical properties.
Algorithmic versus probabilistic methods: discussion
Some further comments and questions:
- Are algorithmic methods simply even more non-parametric "non-parametric methods"?
- Today it is not uncommon for new probabilistic methods to be published with nothing but simulation results to back them up (as the underlying mathematics can be quite complicated). Is this any different from the support for algorithmic methods?
- There are some very interesting research problems in trying to provide probabilistic support for algorithmic methods.
- Regardless of how we feel about algorithmic methods, we should not be afraid to introduce new tools to our statistical toolbox!