CS 4510/9010 Applied Machine Learning
Evaluation
Paula Matuszek
Fall 2016
Evaluating Classifiers
With a decision tree, or with any classifier, we need to know how well our trained model performs on other data.
Train on sample data, evaluate on test data (why?)
Some things to look at:
- Classification accuracy: percent correctly classified
- Confusion matrix; Type 1 and Type 2 (alpha and beta) errors
- Precision and recall
- Other measures
Evaluating Classifiers
Standard methodology:
1. Collect a large set of examples (all with correct classifications)
2. Determine training and test sets
3. Apply learning algorithm to training set
4. Measure performance with respect to test set
This applies to any classification method.
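As a rough illustration of those four steps, here is a minimal sketch in Python using scikit-learn rather than Weka; the Iris data and the decision-tree classifier are assumptions chosen for the example, not part of the course exercise.

```python
# Steps 1-4 of the standard evaluation methodology, sketched with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)           # 1. collect labeled examples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=0)   # 2. determine training and test sets
clf = DecisionTreeClassifier().fit(X_train, y_train)  # 3. learn from the training set
print(accuracy_score(y_test, clf.predict(X_test)))    # 4. measure on the test set
```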
Kinds of Test Sets
Weka provides for four kinds of Test Options:
- Use the training set
- Supply a separate test set
- Cross-validation, with n folds
- Percentage split
There are more options under Test Options, but they are not kinds of test sets.
Use Training Set
If you choose to use the training set, evaluation will be on exactly the same data you learned on.
Good: uses all of the data.
Bad: gives you no measure of generalization or overfitting.
Minimal test! With consistent data, this should give extremely good accuracy.
If you're not getting better accuracy than a random guess, it's time to rethink your approach.
Use Separate Test Set
This assumes that you actually have two sets of data; you give Weka both, train with one, test with the other.
If the two sets are comparable, this gives you a good measure of generalizability.
Significant effort goes into creating two comparable sets of data, and you don't use as much data to train as you could.
This is actually unusual. Mostly occurs when:
- replicating other research which used both sets
- competitions
- assessing different methods
Split Test Sets
Percentage split: randomly choose a subset to be test cases.
- Easiest, and gives a good measure of generalizability.
- Does not use all data to estimate accuracy; will underestimate if the split doesn't cover the most important combinations.
- Best with a large number of cases and few features.
- Want as many training cases as possible; 90%? Weka default is 66%.
Stratified split: identify subclasses and choose splits within each subclass. (Can't do this in Weka with the Explorer.)
- Useful if classes are unbalanced.
- Less important with a large number of instances.
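The difference between a plain percentage split and a stratified split can be sketched as follows; scikit-learn is used only for illustration (Weka's Percentage split option does this for you), and the 66%/34% proportion mirrors the Weka default mentioned above.

```python
# Percentage split vs. stratified split, sketched with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Plain percentage split: test cases are chosen at random.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.34, random_state=0)

# Stratified split: class proportions are preserved in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.34, random_state=0, stratify=y)
```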
Cross-Validation
Split instances multiple times, run the classifier multiple times, average the results.
In Weka, folds are the splits.
10-fold means: divide the data into 10 sets, stratified; run the classifier 10 times, using one set as the test set each time.
All instances are used each time, and each instance is used as a test instance once.
Computationally expensive.
Makes good use of smaller data sets.
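A minimal sketch of stratified 10-fold cross-validation, again using scikit-learn purely for illustration; Weka's "Cross-validation, 10 folds" Test Option does the equivalent work.

```python
# Stratified 10-fold cross-validation sketch.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# With an integer cv and a classifier, cross_val_score uses stratified folds.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(scores.mean())   # average accuracy over the 10 runs
```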
Summary
Important: keep the training and test sets disjoint! Otherwise we don't get a measure of overfitting.
Typical is to use stratified cross-validation.
For large datasets a percentage split will work well and be much less resource-intensive.
Note that in a split or cross-validated evaluation, the actual model output is the one learned from all the data. Only if there is a separate test set will the model not include all the data.
Confusion Matrix
Now we have tested our data; we want to look at how we did.
We care about how many mistakes we make.
We also care about what kind of mistakes we make.
We can discuss several measures in terms of the confusion matrix.
For two classes, an error can be:
- Calling something a positive instance when it is negative
- Calling something a negative instance when it is positive
For multiple classes, an instance can be mis-classified more than one way.
Confusion Matrix
Decision:
                   Classified/predicted as Yes       Classified/predicted as No
Actually Yes       A: True Positives (correct)       B: False Negatives
Actually No        C: False Positives                D: True Negatives (also correct)
Confusion Matrix Example
Confusion matrix for weather.nominal, with J48 defaults. Should we play outside?
                     Actually Yes               Actually No
Classified as Yes    A: True Positives (5)      C: False Positives (3)
Classified as No     B: False Negatives (4)     D: True Negatives (2)
Accuracy
The simplest measure: percent of correctly classified instances.
All the instances correctly predicted:
(True Positives + True Negatives) / All instances = (A + D) / (A + B + C + D)
For weather.nominal, (5 + 2) / (5 + 2 + 4 + 3) = 50% accuracy.
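The same arithmetic as a tiny Python snippet, with the A, B, C, D counts taken directly from the weather.nominal confusion matrix above:

```python
# Accuracy from the weather.nominal confusion matrix.
A, B, C, D = 5, 4, 3, 2                  # TP, FN, FP, TN
accuracy = (A + D) / (A + B + C + D)
print(accuracy)                          # 0.5, i.e. 50%
```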
Concept Check
For binary classifiers A and B, on balanced data:
- Which is better: A is 80% accurate, B is 60% accurate?
- Which is better: A has 90% precision, B has 70% precision?
- Would you use a spam filter that was 80% accurate?
- Would you use a classifier for who needs major surgery that was 80% accurate?
- Would you ever use a two-class classifier that is 50% accurate?
Delving More Deeply
We may want to look at the individual cells of a confusion matrix, and the kinds of mistakes being made.
Depending on the problem domain, we may care a lot more about false positives or about false negatives than about overall accuracy.
- Spam
- Zika-free test for blood donations
- Hurricane warning
Check
Would you choose a higher false positive rate or a higher false negative rate?
- Is this food spoiled?
- Does this software download contain a virus?
- Will this person succeed in this program?
  - ... and I can accept everyone I think will make it
  - ... and I can accept 10% of the applicants
- Should this loan application be approved?
Evaluation: Precision and Recall
Sometimes we want more detailed measures of our classifier.
Consider searching medical records for who has tested positive for Zika. Ideally:
- We want to find all the cases that tested positive: recall.
- We want to find only the cases that tested positive: precision.
Precision and Recall
Recall: % of instances in a class which are correctly classified as that class.
- Correctly classified as i / total which are i, or A/(A+B)
Precision: % of instances classified in a class which are actually in that class.
- Correctly classified as i / total classified as i, or A/(A+C)
Note that these are defined in terms of A, or what we consider positive. Who has tested negative for Zika? In this case the matrix is flipped, A becomes D, etc. The values change.
Evaluation: Precision and Recall
Recall: A/(A+B). Precision: A/(A+C).
Recall for Play: 5/(5+4) = 5/9 = .556
Precision for Play: 5/(5+3) = 5/8 = .625

Should we play outside?
                     Actually Yes               Actually No
Classified as Yes    A: True Positives (5)      C: False Positives (3)
Classified as No     B: False Negatives (4)     D: True Negatives (2)

Weka gives us results with either answer taken as the "yes" class, and the average of both.
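The same two calculations in Python, using the counts from the confusion matrix above:

```python
# Precision and recall for the "Play = yes" class of weather.nominal.
A, B, C, D = 5, 4, 3, 2            # TP, FN, FP, TN
recall = A / (A + B)               # 5/9 ≈ 0.556
precision = A / (A + C)            # 5/8 = 0.625
print(round(recall, 3), round(precision, 3))
```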
Check
Patients with rash: do they have measles?
                                  Test Says Measles    Test Says Not Measles
Have measles (positives)          20                   5
Don't have measles (negatives)    25                   50

False positives? False negatives? Accuracy? Precision? Recall?
Non-Binary Outcomes
You can also define a confusion matrix for multiple outcomes (e.g., the three-class Iris data).
Precision and recall are then computed one class vs. all others.
Evaluation: Overfitting
Training a model = predicting classification for our training set, given the data in the set.
The model may capture chance variations in the set.
This leads to overfitting -- the model is too closely matched to the exact data set it's been given.
More likely with:
- a large number of features
- small training sets
Combined Effectiveness
Ideally, we want a measure that combines precision and recall, in addition to accuracy: the F measure.
F = 2pr / (p + r)
- For perfect precision and recall, F = 1.
- If either precision or recall drops, so does F.
- If either precision or recall reaches 0, so does F.
Typically important if we want to compare different classifiers or options.
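Applying the formula to the precision and recall values computed earlier for weather.nominal gives a quick worked example:

```python
# F measure for the "Play = yes" class, from the precision/recall slide.
precision, recall = 0.625, 0.556
f = 2 * precision * recall / (precision + recall)
print(round(f, 3))   # ≈ 0.588
```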
Another Combined Score
AUC ROC: Area Under the Curve for the Receiver Operating Characteristic.
- How well the test separates the group being tested into positive and negative instances.
- Likelihood that our classifier will rank a randomly chosen positive example as more likely to be positive than a randomly chosen negative example.
- TP rate vs. FP rate at various thresholds: http://gim.unmc.edu/dxtests/roc3.htm
There is a good video explanation of ROC and AUC at http://www.dataschool.io/roc-curves-and-auc-explained/
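The ranking interpretation above can be sketched directly: count how often a positive example outscores a negative one. The scores and labels below are made up purely for illustration.

```python
# AUC as the probability that a random positive is ranked above a random negative.
def auc_by_ranking(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_by_ranking([1, 1, 0, 0, 1], [0.9, 0.7, 0.6, 0.2, 0.4]))   # ≈ 0.833
```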
And Yet Another
MCC: Matthews correlation coefficient. https://en.wikipedia.org/wiki/matthews_correlation_coefficient
- The correlation ranges from +1 (perfect predictions) to -1 (exactly wrong predictions). Completely random predictions give a value of 0.
- Less common than F or AUC, but useful for comparisons when you have very unbalanced groups.
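The slide doesn't show the formula; the standard definition (from the Wikipedia article linked above) applied to the weather.nominal confusion matrix looks like this:

```python
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), the standard
# definition, applied to the weather.nominal counts for illustration.
from math import sqrt

TP, FN, FP, TN = 5, 4, 3, 2
mcc = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
print(round(mcc, 3))   # ≈ -0.043: slightly worse than random on this tiny data set
```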
ETC
There are many more potential evaluation statistics.
If you are evaluating a specific classifier:
- compare to the majority classifier
- think about how you will use the classifier
- look at the confusion matrix, accuracy, precision, recall
If you are comparing classifier methods:
- F measure, AUC and MCC can all be useful for comparisons
- still need to consider whether any of them are adequate
And use separate test cases. Stratified 10-fold cross-validation is usually the best choice.
One More Point on Evaluating Classifiers
We are training a classifier because there is some task we want to carry out. Is the classifier actually useful?
Majority classifier: assign all cases to the most common class. In Weka, this is the ZeroR classifier. Compare your trained classifier to this.
Especially relevant for very unbalanced classes:
- Consider classifying x-rays into cancer/non-cancer, with a cancer rate of 5%.
- We train a classifier, and get 95% accuracy. Is this valuable?
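A minimal sketch of the majority-class baseline (what Weka's ZeroR does): predict the most common training class for every instance. The labels below are made up to match the 5% cancer-rate example above; the point is that the baseline already reaches 95% accuracy, so a trained classifier at 95% has demonstrated no added value.

```python
# ZeroR-style majority classifier as a baseline for comparison.
from collections import Counter

def zero_r_predict(train_labels, n_test):
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test

train = ["non-cancer"] * 95 + ["cancer"] * 5   # made-up labels, 5% positive rate
test = train                                    # illustration only
preds = zero_r_predict(train, len(test))
accuracy = sum(p == t for p, t in zip(preds, test)) / len(test)
print(accuracy)   # 0.95
```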