Evaluation Metrics & Methodology


Why evaluation? When a learning system is deployed in the real world, we need to be able to quantify the performance of the classifier. How accurate will the classifier be? When it is wrong, why is it wrong? This is very important, as it helps us decide which classifier to use in which situations.

Evaluating ML Algorithms: Empirical Studies
- Correctness on novel examples (inductive learning)
- Time spent learning
- Time needed to apply the learned result
- Speedup after learning (explanation-based learning)
- Space required
Basic idea: repeatedly use train/test sets to estimate future accuracy.

Proper Experimental Methodology Can Have a Huge Impact! A 2002 paper in Nature (a major, major journal) needed to be corrected due to training on the testing set. Original report: 95% accuracy (5% error rate). Corrected report (which still is buggy): 73% accuracy (27% error rate). The error rate increased over 400%! This is the most important "thou shalt not": do not train on the test set.

Training and Test Sets. Split the available data into a training set and a test set. Train the classifier on the training set and evaluate it on the test set.
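A minimal sketch of such a split in Python, assuming scikit-learn is available; the synthetic dataset and the decision-tree learner are illustrative placeholders, not anything prescribed by the slides:

    # Minimal train/test split sketch (assumes scikit-learn; data and learner are illustrative).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)   # placeholder labeled data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # train on the training set only
    print("Test-set accuracy:", clf.score(X_test, y_test))              # evaluate on the held-out test set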

Classifier Accuracy. The accuracy of a classifier on a given test set is the percentage of test set examples that are correctly classified by the classifier.
Accuracy = (# correct classifications) / (total # of examples)
Error rate is the complement of accuracy: Error rate = 1 - Accuracy

Some Typical ML Experiments: Empirical Learning. [Figure: a learning curve comparing Algorithm1 and Algorithm2; the y-axis is test-set accuracy with confidence bars from multiple runs, and the x-axis is the number of training examples (or the amount of noise, or the amount of missing features).]

Some Typical ML Experiments: Lesion Studies

    Configuration        Test-set performance
    Full system          80%
    Without Module A     75%
    Without Module B     62%

Learning from Examples: Standard Methodology for Evaluation
1) Start with a dataset of labeled examples
2) Randomly partition it into N groups
3a) N times, combine N - 1 groups into a train set
3b) Provide the train set to the learning system
3c) Measure accuracy on the left-out group (the test set)
This is called N-fold cross validation (typically N = 10); a code sketch follows below.
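A minimal sketch of N-fold cross validation, assuming scikit-learn; the learner and synthetic data are placeholders:

    # Sketch of N-fold cross validation (assumes scikit-learn; N = 10 here).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)   # placeholder labeled dataset

    fold_accuracies = []
    kf = KFold(n_splits=10, shuffle=True, random_state=0)       # randomly partition into N groups
    for train_idx, test_idx in kf.split(X):
        clf = DecisionTreeClassifier(random_state=0)
        clf.fit(X[train_idx], y[train_idx])                     # train on the N - 1 combined groups
        fold_accuracies.append(clf.score(X[test_idx], y[test_idx]))  # test on the left-out group

    print("Estimated future accuracy: %.3f +/- %.3f"
          % (np.mean(fold_accuracies), np.std(fold_accuracies)))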

Using Tuning Sets. Often, an ML system has to choose when to stop learning, select among alternative answers, etc. One wants the model that produces the highest accuracy on future examples (overfitting avoidance). It is a cheat to look at the test set while still learning. A better method: set aside part of the training set; measure performance on this tuning data to estimate future performance for a given set of parameters; then use the best parameter settings and train with all the training data (except the test set) to estimate future performance on new examples.
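A minimal sketch of this tune-then-retrain pattern, assuming scikit-learn; the max_depth grid and the data are illustrative assumptions:

    # Sketch of a single tuning (validation) set used for parameter selection.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # Set aside part of the training set as a tuning set; the test set is never touched here.
    X_fit, X_tune, y_fit, y_tune = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

    best_depth, best_tune_acc = None, -1.0
    for depth in [1, 2, 4, 8, None]:                         # candidate parameter settings
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_fit, y_fit)
        acc = clf.score(X_tune, y_tune)                      # score on the tuning data only
        if acc > best_tune_acc:
            best_depth, best_tune_acc = depth, acc

    # Retrain with the best setting on ALL training data (train + tune),
    # then estimate future performance once on the untouched test set.
    final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
    print("Best max_depth:", best_depth, " test accuracy:", final.score(X_test, y_test))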

Experimental Methodology: A Pictorial Overview. [Diagram: the collection of classified examples is split into training examples and testing examples; the training examples are further divided into a train set, given to the LEARNER to generate solutions, and a tune set, used to select the best classifier; the testing examples then give the expected accuracy on future examples. Statistical techniques such as 10-fold cross validation and t-tests are used to get meaningful results.]

Parameter Setting. Notice that each train/test fold may get different parameter settings! That's fine (and proper); i.e., a "parameterless"* algorithm internally sets its parameters for each data set it gets. (*Usually, though, some parameters have to be externally fixed, e.g., knowledge of the data, the range of parameter settings to try, etc.)

Using Multiple Tuning Sets. A single tuning set can be an unreliable predictor, and some data is wasted. Hence, the following is often done (see the sketch after this list):
1) For each possible set of parameters:
   a) Divide the training data into train and tune sets, using N-fold cross validation
   b) Score this set of parameter values by the average tune-set accuracy over the N folds
2) Use the best set of parameter settings and all (train + tune) examples
3) Apply the resulting model to the test set
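A minimal sketch of this cross-validated parameter search, assuming scikit-learn; the max_depth grid is again an illustrative assumption:

    # Parameter selection with multiple tuning sets: N-fold cross validation
    # over the training data only (assumes scikit-learn).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = {}
    for depth in [1, 2, 4, 8, None]:                         # 1) each candidate parameter setting
        fold_accs = []
        for fit_idx, tune_idx in kf.split(X_train):          # a) divide training data into train/tune folds
            clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
            clf.fit(X_train[fit_idx], y_train[fit_idx])
            fold_accs.append(clf.score(X_train[tune_idx], y_train[tune_idx]))
        scores[depth] = np.mean(fold_accs)                   # b) average tune-set accuracy over the folds

    best_depth = max(scores, key=scores.get)                 # 2) best setting; retrain on all training data
    final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
    print("Best max_depth:", best_depth, " test accuracy:", final.score(X_test, y_test))  # 3) test set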

False Positives & False Negatives. Sometimes accuracy is not sufficient: if 98% of examples are negative (say, for a disease), then classifying everyone as negative achieves an accuracy of 98%. When is the model wrong? False positives and false negatives. Often there is a cost associated with false positives and false negatives, as in the diagnosis of diseases; sometimes it is better to be safe than sorry.

Confusion Matrix. A confusion matrix is a device used to illustrate how a model is performing in terms of false positives and false negatives. It gives us more information than a single accuracy figure, it allows us to think about the cost of mistakes, and it can be extended to any number of classes.
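The layout for two classes looks as follows; the counts are hypothetical, included only to show the structure:

                   Predicted +    Predicted -
    Actually +     TP = 40        FN = 10
    Actually -     FP =  5        TN = 45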

Accuracy Measures
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Misclassification Rate = (FP + FN) / (TP + FP + TN + FN)
True Positive Rate (sensitivity) = TP / (TP + FN)
True Negative Rate (specificity) = TN / (TN + FP)
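A tiny sketch computing these measures directly from the four confusion-matrix counts (plain Python; the counts reuse the hypothetical example above):

    # Accuracy measures from confusion-matrix counts (hypothetical counts from the example above).
    tp, fp, tn, fn = 40, 5, 45, 10

    accuracy          = (tp + tn) / (tp + fp + tn + fn)
    misclassification = (fp + fn) / (tp + fp + tn + fn)
    sensitivity       = tp / (tp + fn)   # true positive rate
    specificity       = tn / (tn + fp)   # true negative rate

    print(accuracy, misclassification, sensitivity, specificity)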

ROC Curves. ROC stands for Receiver Operating Characteristics; the technique started in radar research during WWII. Judging algorithms on accuracy alone may not be good enough when getting a positive wrong costs more than getting a negative wrong (or vice versa), e.g., medical tests for serious diseases, or a movie-recommender system (à la Netflix).

ROC Curves Graphically. [Plot: ROC space, with the true positive rate on the y-axis, i.e., Prob(alg outputs + | example is +), and the false positive rate on the x-axis, i.e., Prob(alg outputs + | example is -); both axes run from 0 to 1.0. The ideal spot is the upper-left corner; curves for Alg 1 and Alg 2 are shown.] Different algorithms can work better in different parts of ROC space; which is better depends on the cost of false positives vs. false negatives.

Algorithm for Creating ROC Curves
Step 1: Sort the predictions on the test set
Step 2: Locate a threshold between examples with opposite categories
Step 3: Compute TPR & FPR for each threshold of Step 2
Step 4: Connect the dots

Plotting ROC Curves - Example

    Example   ML Algo Output (sorted)   Correct Category   TPR, FPR at threshold just below this score
    Ex 9      .99                       +
    Ex 7      .98                       +                  TPR = 2/5, FPR = 0/5
    Ex 1      .72                       -                  TPR = 2/5, FPR = 1/5
    Ex 2      .70                       +
    Ex 6      .65                       +                  TPR = 4/5, FPR = 1/5
    Ex 10     .51                       -
    Ex 3      .39                       -                  TPR = 4/5, FPR = 3/5
    Ex 5      .24                       +                  TPR = 5/5, FPR = 3/5
    Ex 4      .11                       -
    Ex 8      .01                       -                  TPR = 5/5, FPR = 5/5

[Plot: the resulting ROC curve, P(alg outputs + | example is +) vs. P(alg outputs + | example is -), both axes 0 to 1.0.]
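A plain-Python sketch of the four construction steps, with the scores and labels copied from the table above; this is an illustrative implementation, not code from the slides:

    # ROC-curve construction on the worked example (scores are already sorted, Step 1).
    scores = [.99, .98, .72, .70, .65, .51, .39, .24, .11, .01]
    labels = ['+', '+', '-', '+', '+', '-', '-', '+', '-', '-']

    n_pos = labels.count('+')
    n_neg = labels.count('-')

    roc_points = [(0.0, 0.0)]                  # the curve starts at (FPR, TPR) = (0, 0)
    tp = fp = 0
    for i, label in enumerate(labels):
        if label == '+':
            tp += 1
        else:
            fp += 1
        # Step 2: thresholds sit between adjacent examples with opposite categories
        # (plus one final threshold below the lowest score).
        if i == len(labels) - 1 or labels[i + 1] != label:
            roc_points.append((fp / n_neg, tp / n_pos))   # Step 3: record (FPR, TPR)

    print(roc_points)                          # Step 4: connect these dots to draw the curve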

Area Under the ROC Curve. A common metric for experiments is to numerically integrate the ROC curve. [Plot: the area under an ROC curve, true positive rate vs. false positive rate, both axes 0 to 1.0.]
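Continuing the sketch above, the area can be approximated with the trapezoid rule over the recorded (FPR, TPR) points:

    # Trapezoid-rule integration of the roc_points from the previous sketch.
    def area_under_curve(points):
        area = 0.0
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid between consecutive points
        return area

    print(area_under_curve(roc_points))           # 0.8 on the worked example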

Asymmetric Error Costs. Assume that cost(FP) ≠ cost(FN). You would like to pick a threshold that minimizes
E(total cost) = cost(FP) x prob(FP) x (# of neg examples) + cost(FN) x prob(FN) x (# of pos examples)
You could also have (possibly negative) costs for TP and TN (assumed zero above).
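A minimal sketch of choosing the cost-minimizing threshold, reusing the scores and labels from the ROC example; the cost values are illustrative assumptions:

    # Pick the threshold that minimizes total cost on the worked example.
    cost_fp, cost_fn = 1.0, 5.0   # assumption: a false negative is 5x as costly as a false positive

    candidate_thresholds = [1.0] + [(a + b) / 2 for a, b in zip(scores, scores[1:])] + [0.0]
    best = None
    for t in candidate_thresholds:
        fp = sum(1 for s, l in zip(scores, labels) if s > t and l == '-')   # false positives at threshold t
        fn = sum(1 for s, l in zip(scores, labels) if s <= t and l == '+')  # false negatives at threshold t
        total_cost = cost_fp * fp + cost_fn * fn
        if best is None or total_cost < best[0]:
            best = (total_cost, t)

    print("Minimum total cost %.1f at threshold %.3f" % best)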

Precision vs. Recall (think about search engines)
Precision = (# of relevant items retrieved) / (total # of items retrieved) = TP / (TP + FP) = P(is pos | called pos)
Recall = (# of relevant items retrieved) / (# of relevant items that exist) = TP / (TP + FN) = TPR = P(called pos | is pos)
Notice that n(0,0), the true-negative count, is not used in either formula; therefore you get no credit for filtering out irrelevant items.
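A tiny sketch, reusing the hypothetical confusion-matrix counts from earlier (note that TN never appears):

    # Precision and recall from the earlier hypothetical counts; TN is not needed.
    tp, fp, fn = 40, 5, 10

    precision = tp / (tp + fp)   # of the items called positive, the fraction that really are
    recall    = tp / (tp + fn)   # of the truly positive items, the fraction retrieved (= TPR)

    print("precision %.2f, recall %.2f" % (precision, recall))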