Machine Learning with Weka (5 Sessions of 1.5 Hours Each) SLIDES BY ANJALI GOYAL & ASHISH SUREKA (www.ashish-sureka.in) CS 309 INFORMATION RETRIEVAL COURSE, ASHOKA UNIVERSITY NOTE: Slides created and edited using existing teaching resources on the Internet
WEKA: the software Machine learning/data mining software written in Java (distributed under the GNU General Public License) Used for research, education, and applications Main features: a comprehensive set of data pre-processing tools, learning algorithms, and evaluation methods; graphical user interfaces (incl. data visualization); an environment for comparing learning algorithms 2
WEKA: download and install Go to website: https://www.cs.waikato.ac.nz/ml/weka/ 3
WEKA only deals with flat files (ARFF format):
@relation heart-disease-simplified
@attribute age numeric
@attribute sex {female, male}
@attribute chest_pain_type {typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina {no, yes}
@attribute class {present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
... 5
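The ARFF layout above can be illustrated with a minimal parser. This is a sketch only, not Weka's loader (which also handles quoting, sparse instances, dates, etc.); the `parse_arff` helper is hypothetical.

```python
# Minimal sketch of reading a simple ARFF file: collect attribute
# names from @attribute lines, then parse comma-separated rows after
# @data, treating '?' as a missing value (as ARFF defines it).

def parse_arff(text):
    attributes, rows = [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):
            continue                       # skip blanks and comments
        low = line.lower()
        if low.startswith('@attribute'):
            attributes.append(line.split(None, 2)[1])
        elif low.startswith('@data'):
            in_data = True
        elif in_data:
            values = [None if v.strip() == '?' else v.strip()
                      for v in line.split(',')]
            rows.append(dict(zip(attributes, values)))
    return attributes, rows

arff = """@relation heart-disease-simplified
@attribute age numeric
@attribute sex {female, male}
@attribute class {present, not_present}
@data
63,male,not_present
38,?,present
"""
attrs, rows = parse_arff(arff)
```

Note how the second instance keeps its missing `sex` value as `None` rather than the literal `?`.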
Explorer: pre-processing the data Data can be imported from a file in various formats: ARFF, CSV Data can also be read from a URL or from an SQL database (using JDBC) Pre-processing tools in WEKA are called filters WEKA contains filters for discretization, normalization, resampling, attribute selection, and transforming and combining attributes 8
12/27/2017 University of Waikato 9
Iris Dataset 11
Iris Dataset - ARFF 13
Distinct is the number of distinct values, i.e. the total number of values that would remain if all duplicates were removed. Unique is the number of values that appear only once. What do you observe from this graph? The range is 4.3-7.9. What do the colors mean? What do the bar counts (5, 6, ...) add up to? Is sepallength a good predictor?
Check whether sepalwidth is a good predictor.
Which of the 4 attributes is the best predictor?
Data Processing 20
Discretization Discretization is the process of putting continuous values into buckets (bins) so that there is a limited number of possible states, i.e. converting a continuous attribute into a categorical one. Many classification algorithms produce better results on discretized data. 21
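The idea behind unsupervised equal-width discretization (the default behavior of Weka's Discretize filter) can be sketched in a few lines. This is a concept sketch in Python, not Weka's implementation; the function name and sample values are made up, with the values drawn from the Iris sepallength range (4.3-7.9) mentioned above.

```python
# Equal-width discretization: split the observed range [min, max]
# into n_bins intervals of equal width and map each value to the
# index of the interval it falls in.

def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # clamp the maximum value into the last bin
        idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(idx)
    return labels

# sepallength-like values spanning the Iris range 4.3 - 7.9
vals = [4.3, 5.0, 5.8, 6.4, 7.9]
bins = equal_width_bins(vals, 3)   # 3 bins, each of width 1.2
```

With 3 bins the cut points fall at 5.5 and 6.7, so 4.3 and 5.0 land in the first bin, 5.8 and 6.4 in the second, and 7.9 in the last.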
What should be the best number of bins?
Explorer: data visualization Visualization is very useful in practice: e.g. it helps to determine the difficulty of the learning problem WEKA can visualize single attributes (1-d) and pairs of attributes (2-d) To do: rotating 3-d visualizations (Xgobi-style) Color-coded class values Jitter option to deal with nominal attributes (and to detect hidden data points) Zoom-in function 36
Which two attributes are linearly correlated?
Explorer: attribute selection Panel that can be used to investigate which (subsets of) attributes are the most predictive ones Attribute selection methods contain two parts: A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking An evaluation method: correlation-based, wrapper, information gain, chi-squared Very flexible: WEKA allows (almost) arbitrary combinations of these two 47
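One of the evaluation methods listed above, information gain, can be sketched directly from its definition: gain(A) = H(class) - H(class | A). This is a toy illustration on made-up nominal data, not Weka's InfoGainAttributeEval; the function names are hypothetical.

```python
import math
from collections import Counter

# Information gain of a nominal attribute with respect to the class:
# the reduction in class entropy achieved by splitting on the attribute.

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    n = len(labels)
    cond = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        cond += len(subset) / n * entropy(subset)   # weighted H(class | A=v)
    return entropy(labels) - cond

labels  = ['yes', 'yes', 'no', 'no']
good    = ['a',   'a',   'b',  'b']   # perfectly predictive attribute
useless = ['x',   'y',   'x',  'y']   # independent of the class
```

A perfectly predictive attribute gets the full class entropy (here 1 bit) as its gain; an attribute independent of the class gets gain 0, which is exactly the "most beneficial" vs "least beneficial" feature contrast discussed later in these slides.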
How would you add a new feature to the existing dataset such that the new feature is most beneficial? Add a feature which has distinct values for each class. How would you add a new feature such that it is least beneficial? Add a feature which has the same values for all classes. 57
Let's try with the Iris dataset!
Explorer: building classifiers Classifiers in WEKA are models for predicting nominal or numeric quantities Implemented learning schemes include: decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes nets Meta-classifiers include: bagging, boosting, stacking, error-correcting output codes, locally weighted learning 62
Test options: Use training set - the training data is used again for testing the model. Supplied test set - the training data is used for model development and an unseen set of data is used for testing the model; this is the hold-out scheme. Percentage split - train on a certain percentage of the data and then test on the rest.
Cross Validation Cross-validation is a method for estimating the accuracy of an inducer by dividing the data into K mutually exclusive subsets (folds) of approximately equal size. It is the simplest and most widely used method for estimating prediction error. 77
We use cross-validation as follows: divide the data into K folds; hold out one fold and fit using the remaining data (compute the error rate on the held-out fold); repeat K times. CV error rate: the average over the K errors we have computed. (Let us suppose K = 5.) [Figure: the original data split into K = 5 folds; in each of the 5 rounds one fold serves as the testing data and the remaining four as the training data.]
How many folds are needed (K = ?) Large K: small bias, but large variance and high computational time. Small K: reduced computational time and small variance, but large bias. A common choice for K is 5-10. 79
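The K-fold procedure above can be sketched end to end. This is a concept sketch, not Weka's evaluation code: the folds are assigned round-robin, and the "classifier" is a trivial majority-class predictor just to keep the example self-contained (both choices are illustrative assumptions).

```python
from collections import Counter

# K-fold cross-validation: hold each fold out in turn, "train" on the
# rest, measure the error on the held-out fold, and average the K errors.

def kfold_error(labels, k):
    n = len(labels)
    folds = [list(range(i, n, k)) for i in range(k)]    # round-robin split
    errors = []
    for test_idx in folds:
        train = [labels[i] for i in range(n) if i not in test_idx]
        majority = Counter(train).most_common(1)[0][0]  # fit: majority class
        wrong = sum(1 for i in test_idx if labels[i] != majority)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k                              # CV error rate

labels = ['a'] * 8 + ['b'] * 2
cv_err = kfold_error(labels, 5)
```

With 8 'a' and 2 'b' instances the majority predictor always says 'a', so the two folds containing a 'b' each score 50% error and the rest 0%, giving a CV error rate of 0.2 - the same 20% error the predictor makes on the full data.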
How would you add a new feature to the existing dataset such that the new feature is most beneficial? Add a feature which has distinct values for each class. How would you add a new feature such that it is least beneficial? Add a feature which has the same values for all classes. 93
Let's try with the Iris dataset!
Attribute Selection + Classification (Weather.arff) 104
Discretization Discretization is the process of putting continuous values into buckets (bins) so that there is a limited number of possible states, i.e. converting a continuous attribute into a categorical one. Many classification algorithms produce better results on discretized data. 109
Naïve Bayes Classifier Consider each attribute and the class label as random variables Given a record with attributes (A1, A2, ..., An), the goal is to predict class C Specifically, we want to find the value of C that maximizes P(C | A1, A2, ..., An) 123
Shape Dataset: 124
Priors: P(Triangle) = 5/14 ≈ 0.36, P(Square) = 9/14 ≈ 0.64.
Laplace-smoothed estimates, P(Ai | C) = (Nic + 1) / (Nc + c), where c is the number of classes (here 2, so denominators are 5+2 = 7 for Triangle and 9+2 = 11 for Square):
Color: P(green|Triangle) = (3+1)/7 = 4/7, P(green|Square) = (2+1)/11 = 3/11; P(yellow|Triangle) = (0+1)/7 = 1/7, P(yellow|Square) = (4+1)/11 = 5/11; P(red|Triangle) = (2+1)/7 = 3/7, P(red|Square) = (3+1)/11 = 4/11.
Outline: P(dashed|Triangle) = (4+1)/7 = 5/7, P(dashed|Square) = (3+1)/11 = 4/11; P(solid|Triangle) = (1+1)/7 = 2/7, P(solid|Square) = (6+1)/11 = 7/11.
Dot: P(yes|Triangle) = (3+1)/7 = 4/7, P(yes|Square) = (3+1)/11 = 4/11; P(no|Triangle) = (2+1)/7 = 3/7, P(no|Square) = (6+1)/11 = 7/11.
Test instance: COLOR = green, OUTLINE = dashed, DOT = no, SHAPE = ?
P(Triangle | x) ∝ 4/7 · 5/7 · 3/7 · 5/14 ≈ 0.062
P(Square | x) ∝ 3/11 · 4/11 · 7/11 · 9/14 ≈ 0.041
Since 0.062 > 0.041, the instance is classified as Triangle. 127
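The worked shape example can be reproduced in a few lines. This follows the slide's smoothing convention, (count + 1) / (N_C + 2), with the priors left unsmoothed in the final product; the counts are read off the slide's table, and the `score` helper is a hypothetical name for illustration.

```python
from fractions import Fraction as F

# Naive Bayes score for the test instance (green, dashed, dot=no),
# using exact fractions so the arithmetic matches the slide.

counts = {
    'Triangle': {'N': 5, 'green': 3, 'dashed': 4, 'dot_no': 2},
    'Square':   {'N': 9, 'green': 2, 'dashed': 3, 'dot_no': 6},
}

def score(cls):
    c = counts[cls]
    p = F(c['N'], 14)                      # unsmoothed prior, e.g. 5/14
    for attr in ('green', 'dashed', 'dot_no'):
        p *= F(c[attr] + 1, c['N'] + 2)    # Laplace as used on the slide
    return p
```

`float(score('Triangle'))` comes out near 0.062 and `float(score('Square'))` near 0.041, so Triangle wins, matching the hand computation.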
Shapetest.csv - test instance: COLOR = GREEN, OUTLINE = DASHED, DOT = NO, SHAPE = ? 128
Confusion matrix:
tp fn
fp tn
True positive rate (TPR) / sensitivity = no. of true positives / (no. of true positives + no. of false negatives). True negative rate (TNR) / specificity = no. of true negatives / (no. of true negatives + no. of false positives); the false positive rate (FPR) equals 1 - specificity.
Confusion matrix:
tp fn
fp tn
MCC (Matthews Correlation Coefficient): a measure of the quality of binary classification, MCC = (tp·tn − fp·fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)), ranging from −1 (total disagreement) through 0 (no better than random) to +1 (perfect prediction).
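The MCC can be computed directly from the four confusion-matrix cells using the standard formula MCC = (tp·tn − fp·fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)). The function name and the sample counts below are made up for illustration.

```python
import math

# Matthews Correlation Coefficient from a 2x2 confusion matrix.
# +1 = perfect prediction, 0 = random, -1 = total disagreement.

def mcc(tp, fn, fp, tn):
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0       # convention: 0 when undefined

perfect = mcc(tp=10, fn=0, fp=0, tn=10)    # = 1.0
decent  = mcc(tp=40, fn=10, fp=5, tn=45)   # good but imperfect classifier
```

Unlike plain accuracy, MCC stays informative on imbalanced classes because every cell of the matrix enters the formula.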
Kappa Statistic: Cohen's kappa statistic measures interrater reliability (sometimes called inter-observer agreement): the extent to which raters (or data collectors) give the same score to the same item. Step 1: Calculate Po (observed agreement): Po = (1+6)/14 = 0.5. Step 2: Calculate Pe (expected agreement): P(Triangle) = (5/14)·(4/14) = 20/196; P(Square) = (9/14)·(10/14) = 90/196; Pe = (20/196) + (90/196) ≈ 0.561. K = (0.5 − 0.561)/(1 − 0.561) ≈ −0.14. 134
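The kappa computation above can be checked mechanically. The 2x2 matrix below is one reconstruction consistent with the slide's numbers: 1 correctly predicted Triangle, 6 correctly predicted Squares, actual totals 5/9 and predicted totals 4/10; the `kappa` helper is a hypothetical name. Note the exact value is −12/86 ≈ −0.14.

```python
# Cohen's kappa from a square confusion matrix:
# kappa = (Po - Pe) / (1 - Pe), where Po is observed agreement
# (diagonal mass) and Pe is agreement expected by chance
# (product of matching row and column totals).

def kappa(matrix):
    n = sum(sum(row) for row in matrix)
    po = sum(matrix[i][i] for i in range(len(matrix))) / n
    pe = sum(sum(row) * sum(col)
             for row, col in zip(matrix, zip(*matrix))) / n ** 2
    return (po - pe) / (1 - pe)

cm = [[1, 4],   # actual Triangle: 1 predicted Triangle, 4 predicted Square
      [3, 6]]   # actual Square:   3 predicted Triangle, 6 predicted Square
k = kappa(cm)
```

Here Po = 7/14 = 0.5 and Pe = (5·4 + 9·10)/196 ≈ 0.561, so kappa is slightly negative: the classifier agrees with the truth less often than chance alone would.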
STATUS   FLOOR  DEPT.  OFFICE-SIZE  RECYCLING-BIN?
faculty  four   CS     medium       yes
student  four   EE     large        yes
staff    five   CS     medium       no
student  three  EE     small        yes
staff    four   CS     medium       no
Query: STATUS = student, FLOOR = four, DEPT. = CS, OFFICE-SIZE = small. Recycling Bin = ? 135
Let's try with the Iris dataset!
ROC Curve ROC: Receiver Operating Characteristic. Developed by the British in World War II as part of the Chain Home radar system, where it was used to analyze radar data and differentiate enemy aircraft from signal noise. It is a performance graphing method: a plot of the true positive rate against the false positive rate, used for evaluating data mining schemes. 139
ROC Curve 140
Example ROC Curve 141
Why do we need an ROC curve? Consider a scenario: design an ML tool that decides whether a doctor should order a cancer test. Training data: family history, age, weight, etc. Training data class: whether the patient ended up having cancer or not. Create the model: the tool assigns the patient a score between 0 and 1. High score: the tool is confident that the patient is at risk of having cancer. Low score: the tool is confident that the patient is not at risk. Test the model: which evaluation measure? True positive rate: how many ill people were recommended the test? False positive rate: how many healthy people were recommended the test? False negative rate: how many ill people were not recommended the test? True negative rate: how many healthy people were not recommended the test? Goal: maximize the TP and TN rates and minimize the FP and FN rates. Before you measure anything, make a choice: what threshold score do you use to decide whether or not a patient needs the test? Everyone with a non-zero score has some risk. Low threshold: a lot of tests. High threshold: only people very likely to have cancer get tested, but there will be false negatives as well (a lot of people with cancer would not be tested). 143
[Figures: distributions of the test result value (or subjective judgement of likelihood that a case is diseased) for non-diseased and diseased cases, separated by a decision threshold. Moving the threshold from a less aggressive to a more aggressive mindset shifts the operating point; plotting TPF (sensitivity) against FPF (1 − specificity) for every possible threshold traces out the entire ROC curve. Greater reader skill and/or level of technology pushes the curve toward the upper left.]
Sensitivity: Refers to the test's ability to correctly detect ill patients who have cancer. Sensitivity = no. of true positives / (no. of true positives + no. of false negatives) = probability of a positive test given that the patient is ill. Specificity: Refers to the test's ability to correctly reject healthy patients who do not have cancer. Specificity = no. of true negatives / (no. of true negatives + no. of false positives) = probability of a negative test given that the patient is not ill. 152
True positive rate (TPR) = no. of true positives / (no. of true positives + no. of false negatives). False positive rate (FPR) = no. of false positives / (no. of true negatives + no. of false positives). Move the threshold from high to low: the true positive rate increases (you test a higher proportion of those who actually have cancer), and the false positive rate increases (you incorrectly tell more people to get tested when they don't need to). 153
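The threshold sweep just described is exactly how an ROC curve is built: step through the scores from high to low, and each positive moves the curve up by 1/P while each negative moves it right by 1/N. The sketch below does that and adds a trapezoidal area-under-curve; the scores and labels are made up for illustration, and the helpers are hypothetical names, not any library's API.

```python
# Build ROC points by sweeping the score threshold from high to low,
# then compute the area under the curve by the trapezoid rule.

def roc_points(scores, labels):
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for s, y in sorted(zip(scores, labels), reverse=True):
        fpr, tpr = points[-1]
        if y:                                   # a diseased case: TPR rises
            points.append((fpr, tpr + 1 / pos))
        else:                                   # a healthy case: FPR rises
            points.append((fpr + 1 / neg, tpr))
    return points

def auc(points):
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]           # 1 = diseased
pts = roc_points(scores, labels)
area = auc(pts)
```

For this toy ranking (two positives scored highest, one positive mixed in lower) the curve rises quickly and the AUC comes out at 8/9, well above the 0.5 of a random ranking.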
As you step through the threshold values from high to low, you put dots on the graph from left to right; joining up the dots gives the ROC curve.
Comparing different classifiers: ROC curves provide a better look at where different learners minimize cost. Which curve is better? The area under the ROC curve (AUC) summarizes how good a classifier is. 158
Precision-Recall Curve 159