CS 4510/9010 Applied Machine Learning 1: Evaluation. Paula Matuszek, Fall 2016. Copyright Paula Matuszek 2016.

Evaluating Classifiers 2
With a decision tree, or with any classifier, we need to know how well our trained model performs on other data. Train on sample data, evaluate on test data (why?). Some things to look at:
- classification accuracy: percent correctly classified
- confusion matrix; Type 1 and Type 2 (or alpha and beta) errors
- precision and recall
- other measures

Evaluating Classifiers 3
Standard methodology:
1. Collect a large set of examples (all with correct classifications)
2. Determine training and test sets
3. Apply the learning algorithm to the training set
4. Measure performance with respect to the test set
This applies to any classification method.
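As a concrete illustration of these four steps, here is a minimal sketch in Python with scikit-learn (a stand-in for the Weka workflow used in class); the synthetic dataset and the decision tree are placeholders, not part of the course materials.

```python
# Minimal sketch of the four-step evaluation methodology (scikit-learn stand-in).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Collect a large set of examples (here: synthetic, already labeled)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 2. Determine training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 3. Apply the learning algorithm to the training set
model = DecisionTreeClassifier().fit(X_train, y_train)

# 4. Measure performance with respect to the test set
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```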

Kinds of Test Sets 4
Weka provides for four kinds of Test Options:
- Use the training set
- Supply a separate test set
- Cross-validation, with n folds
- Percentage split
There are more options under Test Options, but they are not kinds of test sets.

Use Training Set 5
If you choose to use the training set, evaluation will be on exactly the same data you learned on.
- Good: uses all of the data.
- Bad: gives you no measure of generalization or overfitting.
This is a minimal test! With consistent data, it should give extremely good accuracy. If you're not getting better accuracy than a random guess, it's time to rethink your approach.

Use Separate Test Set 6
This assumes that you actually have two sets of data; you give Weka both, train with one, and test with the other.
- If the two sets are comparable, this gives you a good measure of generalizability.
- Significant effort goes into creating two comparable sets of data, and you don't use as much data to train as you could.
This is actually unusual. It mostly occurs when:
- replicating other research which used both sets
- competitions
- assessing different methods

Split Test Sets 7
Percentage split: randomly choose a subset to be the test cases.
- Easiest, and gives a good measure of generalizability.
- Does not use all data to estimate accuracy; will underestimate it if the split doesn't cover the most important combinations.
- Best with a large number of cases and few features.
- You want as many training cases as possible; 90%? The Weka default is 66%.
Stratified split: identify subclasses and choose splits within each subclass. (You can't do this in Weka with the Explorer.)
- Useful if classes are unbalanced.
- Less important with a large number of instances.
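A sketch of a stratified percentage split, again using scikit-learn as a stand-in for Weka; the dataset, the 90/10 class balance, and the 66/34 split are illustrative assumptions.

```python
# Stratified percentage split sketch: passing the labels to `stratify`
# keeps class proportions the same in train and test, which matters
# most when classes are unbalanced.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# ~66% train / 34% test, mirroring the Weka default percentage split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, stratify=y, random_state=0)

print("positive rate in train:", y_train.mean(),
      "positive rate in test:", y_test.mean())
```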

Cross-Validation 8
Split the instances multiple times, run the classifier multiple times, and average the results. In Weka, folds are the splits. 10-fold means:
- divide the data into 10 sets, stratified
- run the classifier 10 times, using one set as the test set each time
All instances are used each time, and each instance is used as a test instance once. Computationally expensive, but a good use of smaller data sets.
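A minimal sketch of stratified 10-fold cross-validation with scikit-learn; the synthetic dataset and the decision tree are placeholders for whatever data and learner you actually use.

```python
# Stratified 10-fold cross-validation: each instance is used for
# testing exactly once, and the ten accuracies are averaged.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

scores = cross_val_score(
    DecisionTreeClassifier(), X, y,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))

print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```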

Summary 9
Important: keep the training and test sets disjoint! Otherwise we don't get a measure of overfitting.
Typical practice is to use stratified cross-validation. For large datasets a percentage split will work well and be much less resource-intensive.
Note that in a split or cross-validated evaluation, the actual model output is the one learned from all the data. Only if there is a separate test set will the model not include all data.

Confusion Matrix 10
Now we have tested our data; we want to look at how we did.
- We care about how many mistakes we make.
- We also care about what kind of mistakes we make.
We can discuss several measures in terms of the confusion matrix. For two classes, an error can be:
- calling something a positive instance when it is negative
- calling something a negative instance when it is positive
For multiple classes, an instance can be mis-classified in more than one way.

Confusion Matrix 11

                    Classified/predicted as Yes       Classified/predicted as No
Actually Yes        A: True Positives (correct)       B: False Negatives
Actually No         C: False Positives                D: True Negatives (also correct)

Confusion Matrix Example 12
Confusion matrix for weather.nominal, with J48 defaults: should we play outside?

                      Actually Yes               Actually No
Classified as Yes     A: True Positives (5)      C: False Positives (3)
Classified as No      B: False Negatives (4)     D: True Negatives (2)

Accuracy 13
The simplest measure: percent of correctly classified instances, i.e. all the instances correctly predicted.
(True Positives + True Negatives) / All instances = (A + D) / (A + B + C + D)
For weather.nominal: (5 + 2) / (5 + 2 + 4 + 3) = 50% accuracy
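Checking the arithmetic, using the weather.nominal counts from the confusion matrix above:

```python
# Accuracy from the confusion-matrix cells: (TP + TN) / all instances.
tp, fp, fn, tn = 5, 3, 4, 2
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy)   # 0.5, i.e. 50%
```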

Concept Check 14
For binary classifiers A and B, for balanced data:
- Which is better: A is 80% accurate, or B is 60% accurate?
- Which is better: A has 90% precision, or B has 70% precision?
- Would you use a spam filter that was 80% accurate?
- Would you use a classifier for who needs major surgery that was 80% accurate?
- Would you ever use a two-class classifier that is 50% accurate?

Delving More Deeply 15
We may want to look at the individual cells of a confusion matrix, and the kind of mistakes being made. Depending on the problem domain, we may care a lot more about false positives or about false negatives than about overall accuracy.
Examples: spam filtering; a Zika-free test for blood donations; a hurricane warning.

Check 16
Would you choose a higher false positive rate or a higher false negative rate?
- Is this food spoiled?
- Does this software download contain a virus?
- Will this person succeed in this program? ...and I can accept everyone I think will make it. ...and I can accept 10% of the applicants.
- Should this loan application be approved?

Evaluation: Precision and Recall 17
Sometimes we want more detailed measures of our classifier. Consider searching medical records for who has tested positive for Zika. Ideally:
- We want to find all the cases that tested positive. Recall.
- We want to find only the cases that tested positive. Precision.

Precision and Recall 18
Recall: the % of instances in a class which are correctly classified as that class: correctly classified as i / total which are i, or A/(A+B).
Precision: the % of instances classified into a class which are actually in that class: correctly classified as i / total classified as i, or A/(A+C).
Note that these are defined in terms of A, i.e. what we consider positive. Who has tested negative for Zika? In that case the matrix is flipped, A becomes D, etc., and the values change.

Evaluation: Precision and Recall 19
Recall: A/(A+B)    Precision: A/(A+C)
Recall for Play: 5/(5+4) = 5/9 = .556
Precision for Play: 5/(5+3) = 5/8 = .625

                      Actually Yes               Actually No
Classified as Yes     A: True Positives (5)      C: False Positives (3)
Classified as No      B: False Negatives (4)     D: True Negatives (2)

Weka gives us results for either answer being the "yes", and the average of both.
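The same counts reproduce the recall and precision figures on this slide:

```python
# Recall and precision for "Play = yes" from the weather.nominal matrix.
tp, fp, fn, tn = 5, 3, 4, 2
recall = tp / (tp + fn)       # A / (A + B)
precision = tp / (tp + fp)    # A / (A + C)
print(round(recall, 3), round(precision, 3))   # 0.556 0.625
```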

Check 20
Patients with a rash: do they have measles?

                                   Test says measles    Test says not measles
Have measles (positives)           20                   5
Don't have measles (negatives)     25                   50

False positives? False negatives? Accuracy? Precision? Recall?

Non-Binary Outcomes 21
You can also define a confusion matrix for multiple outcomes (e.g., the three classes of the Iris dataset). Precision and recall for each class are computed as that class vs. all others.
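For the multi-class case, here is a hedged scikit-learn sketch using its built-in iris data and a decision tree as stand-ins for Weka's iris.arff and J48; classification_report prints one precision/recall line per class, each computed one-vs-all-others.

```python
# Per-class ("one vs. all others") precision and recall for a three-class problem.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, stratify=y, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test),
                            target_names=load_iris().target_names))
```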

Evaluation: Overfitting 22
Training a model = predicting the classification for our training set, given the data in the set. The model may capture chance variations in that set. This leads to overfitting: the model is too closely matched to the exact data set it's been given. Overfitting is more likely with:
- a large number of features
- small training sets

Combined Effectiveness 23
Ideally, we want a measure that combines precision and recall, in addition to accuracy: the F measure.
F = 2pr / (p + r)
- For perfect precision and recall, F = 1.
- If either precision or recall drops, so does F.
- If either precision or recall reaches 0, so does F.
Typically important if we want to compare different classifiers or options.
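Plugging the weather.nominal precision and recall from the previous slides into F = 2pr / (p + r):

```python
# F measure from the precision and recall computed earlier.
precision, recall = 0.625, 0.556
f = 2 * precision * recall / (precision + recall)
print(round(f, 3))   # about 0.588
```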

Another Combined Score 24
AUC ROC: Area Under the Curve for the Receiver Operating Characteristic.
- Measures how well the test separates the group being tested into positive and negative instances.
- The likelihood that our classifier will rank a randomly chosen positive example as more likely than a randomly chosen negative example.
- Plots the TP rate against the FP rate for various thresholds: http://gim.unmc.edu/dxtests/roc3.htm
There is a good video explanation of ROC and AUC at http://www.dataschool.io/roc-curves-and-auc-explained/
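A small scikit-learn sketch of the AUC idea; the synthetic dataset and the logistic regression model are illustrative assumptions. The score is computed from the classifier's ranking of instances (predicted probabilities), not from a single yes/no threshold.

```python
# ROC AUC from predicted probabilities: 1.0 = perfect ranking, 0.5 = random.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]   # probability of the positive class
print("AUC:", roc_auc_score(y_test, probs))
```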

And Yet Another 25
MCC: Matthews correlation coefficient. https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
- The correlation ranges from +1 (perfect predictions) to -1 (exactly wrong predictions). Completely random predictions give a value of 0.
- Less common than F or AUC, but useful for comparisons when you have very unbalanced groups.
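A sketch of MCC on a deliberately unbalanced two-class problem, assuming scikit-learn and a synthetic 95/5 dataset as placeholders:

```python
# Matthews correlation coefficient: +1 perfect, 0 no better than chance, -1 exactly wrong.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import matthews_corrcoef

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("MCC:", matthews_corrcoef(y_test, model.predict(X_test)))
```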

ETC 26
There are many more potential evaluation statistics.
If you are evaluating a specific classifier:
- compare it to the majority classifier
- think about how you will use the classifier
- look at the confusion matrix, accuracy, precision, and recall
If you are comparing classifier methods:
- the F measure, AUC and MCC can all be useful for comparisons
- you still need to consider whether any of them are adequate
And use separate test cases. Stratified 10-fold cross-validation is usually the best choice.

One More Point on Evaluating Classifiers 27
We are training a classifier because there is some task we want to carry out. Is the classifier actually useful?
Majority classifier: assign all cases to the most common class. In Weka, this is the ZeroR classifier. Compare your trained classifier to this.
- Especially relevant for very unbalanced classes.
- Consider classifying x-rays into cancer/non-cancer, with a cancer rate of 5%. We train a classifier and get 95% accuracy. Is this valuable?
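A sketch of that comparison, using scikit-learn's DummyClassifier as the ZeroR-style majority baseline; the synthetic 95/5 dataset and the decision tree are illustrative assumptions. With a 5% positive rate, always predicting the majority class already scores about 95% accuracy but finds none of the positives, so a trained model must do better than that to be worth anything.

```python
# Majority-class baseline (ZeroR-style) vs. a trained classifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = DecisionTreeClassifier().fit(X_train, y_train)

for name, clf in [("majority baseline", baseline), ("decision tree", model)]:
    pred = clf.predict(X_test)
    print(name,
          "accuracy:", accuracy_score(y_test, pred),
          "recall on positives:", recall_score(y_test, pred))
```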