Machine Learning with Weka (5 Sessions of 1.5 Hours Each) SLIDES BY ANJALI GOYAL & ASHISH SUREKA (www.ashish-sureka.in) CS 309 INFORMATION RETRIEVAL COURSE, ASHOKA UNIVERSITY NOTE: Slides created and edited using existing teaching resources on the Internet
WEKA: the software Machine learning/data mining software written in Java (distributed under the GNU General Public License) Used for research, education, and applications Main features: a comprehensive set of data pre-processing tools, learning algorithms, and evaluation methods; graphical user interfaces (incl. data visualization); an environment for comparing learning algorithms 2
WEKA: download and install Go to website: https://www.cs.waikato.ac.nz/ml/weka/ 3
WEKA only deals with flat files (ARFF format):
@relation heart-disease-simplified
@attribute age numeric
@attribute sex {female, male}
@attribute chest_pain_type {typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina {no, yes}
@attribute class {present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
... 5
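The ARFF layout above can be illustrated with a minimal parser. This is a sketch only, not Weka's loader (which also handles quoting, sparse instances, dates, etc.); the `parse_arff` helper is hypothetical.

```python
# Minimal sketch of reading a simple ARFF file: collect attribute
# names from @attribute lines, then parse comma-separated rows after
# @data, treating '?' as a missing value (as ARFF defines it).

def parse_arff(text):
    attributes, rows = [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):
            continue                       # skip blanks and comments
        low = line.lower()
        if low.startswith('@attribute'):
            attributes.append(line.split(None, 2)[1])
        elif low.startswith('@data'):
            in_data = True
        elif in_data:
            values = [None if v.strip() == '?' else v.strip()
                      for v in line.split(',')]
            rows.append(dict(zip(attributes, values)))
    return attributes, rows

arff = """@relation heart-disease-simplified
@attribute age numeric
@attribute sex {female, male}
@attribute class {present, not_present}
@data
63,male,not_present
38,?,present
"""
attrs, rows = parse_arff(arff)
```

Note how the second instance keeps its missing `sex` value as `None` rather than the literal `?`.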
Explorer: pre-processing the data Data can be imported from a file in various formats: ARFF, CSV Data can also be read from a URL or from an SQL database (using JDBC) Pre-processing tools in WEKA are called filters WEKA contains filters for discretization, normalization, resampling, attribute selection, and transforming and combining attributes 8
12/27/2017 University of Waikato 9
Iris Dataset 11
Iris Dataset - ARFF 13
Distinct is the number of distinct values, i.e. the total number of values that would remain if all duplicates were removed. Unique is the number of values that appear only once. What do you observe from this graph? The range is 4.3-7.9. What do the colors mean? What do the bar counts (5, 6, ...) add up to? Is sepallength a good predictor?
Check whether sepalwidth is a good predictor.
Which of the 4 attributes is the best predictor?
Data Processing 20
Discretization Discretization is the process of putting continuous values into buckets (bins) so that there is a limited number of possible states, i.e. converting a continuous attribute into a categorical one. Many classification algorithms produce better results on discretized data. 21
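The idea behind unsupervised equal-width discretization (the default behavior of Weka's Discretize filter) can be sketched in a few lines. This is a concept sketch in Python, not Weka's implementation; the function name and sample values are made up, with the values drawn from the Iris sepallength range (4.3-7.9) mentioned above.

```python
# Equal-width discretization: split the observed range [min, max]
# into n_bins intervals of equal width and map each value to the
# index of the interval it falls in.

def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # clamp the maximum value into the last bin
        idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(idx)
    return labels

# sepallength-like values spanning the Iris range 4.3 - 7.9
vals = [4.3, 5.0, 5.8, 6.4, 7.9]
bins = equal_width_bins(vals, 3)   # 3 bins, each of width 1.2
```

With 3 bins the cut points fall at 5.5 and 6.7, so 4.3 and 5.0 land in the first bin, 5.8 and 6.4 in the second, and 7.9 in the last.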
What should be the best number of bins?
Explorer: data visualization Visualization is very useful in practice: e.g. it helps to determine the difficulty of the learning problem WEKA can visualize single attributes (1-d) and pairs of attributes (2-d) To do: rotating 3-d visualizations (Xgobi-style) Color-coded class values Jitter option to deal with nominal attributes (and to detect hidden data points) Zoom-in function 36
Which two attributes are linearly correlated?
Explorer: attribute selection Panel that can be used to investigate which (subsets of) attributes are the most predictive ones Attribute selection methods contain two parts: A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking An evaluation method: correlation-based, wrapper, information gain, chi-squared Very flexible: WEKA allows (almost) arbitrary combinations of these two 47
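One of the evaluation methods listed above, information gain, can be sketched directly from its definition: gain(A) = H(class) - H(class | A). This is a toy illustration on made-up nominal data, not Weka's InfoGainAttributeEval; the function names are hypothetical.

```python
import math
from collections import Counter

# Information gain of a nominal attribute with respect to the class:
# the reduction in class entropy achieved by splitting on the attribute.

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    n = len(labels)
    cond = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        cond += len(subset) / n * entropy(subset)   # weighted H(class | A=v)
    return entropy(labels) - cond

labels  = ['yes', 'yes', 'no', 'no']
good    = ['a',   'a',   'b',  'b']   # perfectly predictive attribute
useless = ['x',   'y',   'x',  'y']   # independent of the class
```

A perfectly predictive attribute gets the full class entropy (here 1 bit) as its gain; an attribute independent of the class gets gain 0, which is exactly the "most beneficial" vs "least beneficial" feature contrast discussed later in these slides.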
How would you add a new feature to the existing dataset such that the new feature is most beneficial? Add a feature which has distinct values for each class. How would you add a new feature such that it is least beneficial? Add a feature which has the same values for all classes. 57
Let's try with the Iris dataset!
Explorer: building classifiers Classifiers in WEKA are models for predicting nominal or numeric quantities Implemented learning schemes include: decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes nets Meta-classifiers include: bagging, boosting, stacking, error-correcting output codes, locally weighted learning 62
Test options: Use training set - the training data is used again for testing the model. Supplied test set - the training data is used for model development and an unseen set of data is used for testing the model; this is the hold-out scheme. Percentage split - train on a certain percentage of the data and then test on the rest.
Cross Validation Cross-validation is a method for estimating the accuracy of an inducer by dividing the data into K mutually exclusive subsets (folds) of approximately equal size. It is the simplest and most widely used method for estimating prediction error. 77
We use cross-validation as follows: divide the data into K folds; hold out one fold and fit using the remaining data (compute the error rate on the held-out fold); repeat K times. CV error rate: the average over the K errors we have computed. (Let us suppose K = 5.) [Figure: the original data split into K = 5 folds; in each of the 5 rounds one fold serves as the testing data and the remaining four as the training data.]
How many folds are needed (K = ?) Large K: small bias, but large variance and high computational time. Small K: reduced computational time and small variance, but large bias. A common choice for K is 5-10. 79
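The K-fold procedure above can be sketched end to end. This is a concept sketch, not Weka's evaluation code: the folds are assigned round-robin, and the "classifier" is a trivial majority-class predictor just to keep the example self-contained (both choices are illustrative assumptions).

```python
from collections import Counter

# K-fold cross-validation: hold each fold out in turn, "train" on the
# rest, measure the error on the held-out fold, and average the K errors.

def kfold_error(labels, k):
    n = len(labels)
    folds = [list(range(i, n, k)) for i in range(k)]    # round-robin split
    errors = []
    for test_idx in folds:
        train = [labels[i] for i in range(n) if i not in test_idx]
        majority = Counter(train).most_common(1)[0][0]  # fit: majority class
        wrong = sum(1 for i in test_idx if labels[i] != majority)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k                              # CV error rate

labels = ['a'] * 8 + ['b'] * 2
cv_err = kfold_error(labels, 5)
```

With 8 'a' and 2 'b' instances the majority predictor always says 'a', so the two folds containing a 'b' each score 50% error and the rest 0%, giving a CV error rate of 0.2 - the same 20% error the predictor makes on the full data.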
How would you add a new feature to the existing dataset such that the new feature is most beneficial? Add a feature which has distinct values for each class. How would you add a new feature such that it is least beneficial? Add a feature which has the same values for all classes. 93
Let's try with the Iris dataset!
Attribute Selection + Classification (Weather.arff) 104
Discretization Discretization is the process of putting continuous values into buckets (bins) so that there is a limited number of possible states, i.e. converting a continuous attribute into a categorical one. Many classification algorithms produce better results on discretized data. 109
Naïve Bayes Classifier Consider each attribute and the class label as random variables Given a record with attributes (A1, A2, ..., An), the goal is to predict class C Specifically, we want to find the value of C that maximizes P(C | A1, A2, ..., An) 123
Shape Dataset: 124
Priors: P(Triangle) = 5/14 ≈ 0.36, P(Square) = 9/14 ≈ 0.64.
Laplace-smoothed estimates, P(Ai | C) = (Nic + 1) / (Nc + c), where c is the number of classes (here 2, so denominators are 5+2 = 7 for Triangle and 9+2 = 11 for Square):
Color: P(green|Triangle) = (3+1)/7 = 4/7, P(green|Square) = (2+1)/11 = 3/11; P(yellow|Triangle) = (0+1)/7 = 1/7, P(yellow|Square) = (4+1)/11 = 5/11; P(red|Triangle) = (2+1)/7 = 3/7, P(red|Square) = (3+1)/11 = 4/11.
Outline: P(dashed|Triangle) = (4+1)/7 = 5/7, P(dashed|Square) = (3+1)/11 = 4/11; P(solid|Triangle) = (1+1)/7 = 2/7, P(solid|Square) = (6+1)/11 = 7/11.
Dot: P(yes|Triangle) = (3+1)/7 = 4/7, P(yes|Square) = (3+1)/11 = 4/11; P(no|Triangle) = (2+1)/7 = 3/7, P(no|Square) = (6+1)/11 = 7/11.
Test instance: COLOR = green, OUTLINE = dashed, DOT = no, SHAPE = ?
P(Triangle | x) ∝ 4/7 · 5/7 · 3/7 · 5/14 ≈ 0.062
P(Square | x) ∝ 3/11 · 4/11 · 7/11 · 9/14 ≈ 0.041
Since 0.062 > 0.041, the instance is classified as Triangle. 127
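The worked shape example can be reproduced in a few lines. This follows the slide's smoothing convention, (count + 1) / (N_C + 2), with the priors left unsmoothed in the final product; the counts are read off the slide's table, and the `score` helper is a hypothetical name for illustration.

```python
from fractions import Fraction as F

# Naive Bayes score for the test instance (green, dashed, dot=no),
# using exact fractions so the arithmetic matches the slide.

counts = {
    'Triangle': {'N': 5, 'green': 3, 'dashed': 4, 'dot_no': 2},
    'Square':   {'N': 9, 'green': 2, 'dashed': 3, 'dot_no': 6},
}

def score(cls):
    c = counts[cls]
    p = F(c['N'], 14)                      # unsmoothed prior, e.g. 5/14
    for attr in ('green', 'dashed', 'dot_no'):
        p *= F(c[attr] + 1, c['N'] + 2)    # Laplace as used on the slide
    return p
```

`float(score('Triangle'))` comes out near 0.062 and `float(score('Square'))` near 0.041, so Triangle wins, matching the hand computation.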
Shapetest.csv - test instance: COLOR = GREEN, OUTLINE = DASHED, DOT = NO, SHAPE = ? 128
Confusion matrix:
tp fn
fp tn
True positive rate (TPR) / sensitivity = no. of true positives / (no. of true positives + no. of false negatives). True negative rate (TNR) / specificity = no. of true negatives / (no. of true negatives + no. of false positives); the false positive rate (FPR) equals 1 - specificity.
Confusion matrix:
tp fn
fp tn
MCC (Matthews Correlation Coefficient): a measure of the quality of binary classification, MCC = (tp·tn − fp·fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)), ranging from −1 (total disagreement) through 0 (no better than random) to +1 (perfect prediction).
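The MCC can be computed directly from the four confusion-matrix cells using the standard formula MCC = (tp·tn − fp·fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)). The function name and the sample counts below are made up for illustration.

```python
import math

# Matthews Correlation Coefficient from a 2x2 confusion matrix.
# +1 = perfect prediction, 0 = random, -1 = total disagreement.

def mcc(tp, fn, fp, tn):
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0       # convention: 0 when undefined

perfect = mcc(tp=10, fn=0, fp=0, tn=10)    # = 1.0
decent  = mcc(tp=40, fn=10, fp=5, tn=45)   # good but imperfect classifier
```

Unlike plain accuracy, MCC stays informative on imbalanced classes because every cell of the matrix enters the formula.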
Kappa Statistic: Cohen's kappa statistic measures interrater reliability (sometimes called inter-observer agreement): the extent to which raters (or data collectors) give the same score to the same item. Step 1: Calculate Po (observed agreement): Po = (1+6)/14 = 0.5. Step 2: Calculate Pe (expected agreement): P(Triangle) = (5/14)·(4/14) = 20/196; P(Square) = (9/14)·(10/14) = 90/196; Pe = (20/196) + (90/196) ≈ 0.561. K = (0.5 − 0.561)/(1 − 0.561) ≈ −0.14. 134
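The kappa computation above can be checked mechanically. The 2x2 matrix below is one reconstruction consistent with the slide's numbers: 1 correctly predicted Triangle, 6 correctly predicted Squares, actual totals 5/9 and predicted totals 4/10; the `kappa` helper is a hypothetical name. Note the exact value is −12/86 ≈ −0.14.

```python
# Cohen's kappa from a square confusion matrix:
# kappa = (Po - Pe) / (1 - Pe), where Po is observed agreement
# (diagonal mass) and Pe is agreement expected by chance
# (product of matching row and column totals).

def kappa(matrix):
    n = sum(sum(row) for row in matrix)
    po = sum(matrix[i][i] for i in range(len(matrix))) / n
    pe = sum(sum(row) * sum(col)
             for row, col in zip(matrix, zip(*matrix))) / n ** 2
    return (po - pe) / (1 - pe)

cm = [[1, 4],   # actual Triangle: 1 predicted Triangle, 4 predicted Square
      [3, 6]]   # actual Square:   3 predicted Triangle, 6 predicted Square
k = kappa(cm)
```

Here Po = 7/14 = 0.5 and Pe = (5·4 + 9·10)/196 ≈ 0.561, so kappa is slightly negative: the classifier agrees with the truth less often than chance alone would.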
STATUS   FLOOR  DEPT.  OFFICE-SIZE  RECYCLING-BIN?
faculty  four   CS     medium       yes
student  four   EE     large        yes
staff    five   CS     medium       no
student  three  EE     small        yes
staff    four   CS     medium       no
Query: STATUS = student, FLOOR = four, DEPT. = CS, OFFICE-SIZE = small. Recycling Bin = ? 135
Let's try with the Iris dataset!
ROC Curve ROC: Receiver Operating Characteristic. Developed by the British in World War II as part of the Chain Home radar system, where it was used to analyze radar data and differentiate enemy aircraft from signal noise. It is a performance graphing method: a plot of the true positive rate against the false positive rate, used for evaluating data mining schemes. 139
ROC Curve 140
Example ROC Curve 141
Why do we need an ROC curve? Consider a scenario: design an ML tool that decides whether a doctor should order a cancer test. Training data: family history, age, weight, etc. Training data class: whether the patient ended up having cancer or not. Create the model: the tool assigns the patient a score between 0 and 1. High score: the tool is confident that the patient is at risk of having cancer. Low score: the tool is confident that the patient is not at risk. Test the model: which evaluation measure? True positive rate: how many ill people were recommended the test? False positive rate: how many healthy people were recommended the test? False negative rate: how many ill people were not recommended the test? True negative rate: how many healthy people were not recommended the test? Goal: maximize the TP and TN rates and minimize the FP and FN rates. Before you measure anything, make a choice: what threshold score do you use to decide whether or not a patient needs the test? Everyone with a non-zero score has some risk. Low threshold: a lot of tests. High threshold: only people very likely to have cancer get tested, but there will be false negatives as well (a lot of people with cancer would not be tested). 143
[Figures: distributions of the test result value (or subjective judgement of likelihood that a case is diseased) for non-diseased and diseased cases, separated by a decision threshold. Moving the threshold from a less aggressive to a more aggressive mindset shifts the operating point; plotting TPF (sensitivity) against FPF (1 − specificity) for every possible threshold traces out the entire ROC curve. Greater reader skill and/or level of technology pushes the curve toward the upper left.]
Sensitivity: Refers to the test's ability to correctly detect ill patients who have cancer. Sensitivity = no. of true positives / (no. of true positives + no. of false negatives) = probability of a positive test given that the patient is ill. Specificity: Refers to the test's ability to correctly reject healthy patients who do not have cancer. Specificity = no. of true negatives / (no. of true negatives + no. of false positives) = probability of a negative test given that the patient is not ill. 152
True positive rate (TPR) = no. of true positives / (no. of true positives + no. of false negatives). False positive rate (FPR) = no. of false positives / (no. of true negatives + no. of false positives). Move the threshold from high to low: the true positive rate increases (you test a higher proportion of those who actually have cancer), and the false positive rate increases (you incorrectly tell more people to get tested when they don't need to). 153
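The threshold sweep just described is exactly how an ROC curve is built: step through the scores from high to low, and each positive moves the curve up by 1/P while each negative moves it right by 1/N. The sketch below does that and adds a trapezoidal area-under-curve; the scores and labels are made up for illustration, and the helpers are hypothetical names, not any library's API.

```python
# Build ROC points by sweeping the score threshold from high to low,
# then compute the area under the curve by the trapezoid rule.

def roc_points(scores, labels):
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for s, y in sorted(zip(scores, labels), reverse=True):
        fpr, tpr = points[-1]
        if y:                                   # a diseased case: TPR rises
            points.append((fpr, tpr + 1 / pos))
        else:                                   # a healthy case: FPR rises
            points.append((fpr + 1 / neg, tpr))
    return points

def auc(points):
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]           # 1 = diseased
pts = roc_points(scores, labels)
area = auc(pts)
```

For this toy ranking (two positives scored highest, one positive mixed in lower) the curve rises quickly and the AUC comes out at 8/9, well above the 0.5 of a random ranking.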
As you step through the threshold values from high to low, you put dots on the graph from left to right; joining up the dots gives the ROC curve.
Comparing different classifiers: ROC curves provide a better look at where different learners minimize cost. Which curve is better? The area under the ROC curve (AUC) summarizes how good a classifier is. 158
Precision-Recall Curve 159