Machine Learning with Weka


Anne Grant
6 years ago

1 Machine Learning with Weka. Slides by Anjali Goyal & Ashish Sureka (5 sessions in total, 1.5 hours each). CS 309 Information Retrieval course, Ashoka University. Note: slides created and edited using existing teaching resources on the Internet.

2 WEKA: the software. Machine learning/data mining software written in Java (distributed under the GNU General Public License). Used for research, education, and applications. Main features: a comprehensive set of data pre-processing tools, learning algorithms and evaluation methods; graphical user interfaces (incl. data visualization); an environment for comparing learning algorithms.

3 WEKA: download and install Go to website: 3


5 WEKA only deals with "flat" files
@attribute age numeric
@attribute sex { female, male }
@attribute chest_pain_type { typ_angina, asympt, non_anginal, ... }
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes }
@attribute class { present, not_present }
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...


8 Explorer: pre-processing the data. Data can be imported from a file in various formats, such as ARFF and CSV. Data can also be read from a URL or from an SQL database (using JDBC). Pre-processing tools in WEKA are called "filters". WEKA contains filters for discretization, normalization, resampling, attribute selection, transforming and combining attributes, ...

9 12/27/2017 University of Waikato 9


11 Iris Dataset 11


13 Iris Dataset- Arff 13

14 Distinct is the number of distinct values, i.e. the number of values left after removing all duplicates. Unique is the number of values that appear only once. What do you observe from this graph? What do the colors mean? What do the counts on the bars (5, 6, ...) add up to? Is sepallength a good predictor?
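The difference between the two counts is easy to check by hand. The following is a minimal sketch in Python (not WEKA code); the function name and the sample values are made up for illustration and are not taken from iris.arff.

```python
from collections import Counter

def distinct_and_unique(values):
    """Return (distinct, unique) counts for one attribute column."""
    counts = Counter(values)
    distinct = len(counts)                              # values left after removing duplicates
    unique = sum(1 for n in counts.values() if n == 1)  # values that occur exactly once
    return distinct, unique

# Hypothetical attribute values, for illustration only.
sample = [5.1, 4.9, 4.7, 5.1, 5.0, 4.9, 5.4]
print(distinct_and_unique(sample))
```

Here 5.1 and 4.9 each occur twice, so they count towards Distinct but not towards Unique.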

15 Check whether sepalwidth is a good predictor.


19 Which of the 4 attributes is the best predictor?

20 Data Processing

21 Discretization. Discretization is the process of putting values into buckets so that there is a limited number of possible states (continuous to categorical). Many classification algorithms produce better results on discretized data.
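As an illustration of "putting values into buckets", here is a minimal sketch of equal-width binning in plain Python. It is one common discretization scheme, shown under the assumption that bins of equal width are wanted; it is not WEKA's filter implementation, and the function name and sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]           # all values identical: a single bucket
    labels = []
    for v in values:
        idx = int((v - lo) / width)
        labels.append(min(idx, n_bins - 1))  # the maximum value falls into the last bin
    return labels

# Hypothetical continuous values, for illustration only.
lengths = [4.3, 5.0, 5.8, 6.4, 7.9]
print(equal_width_bins(lengths, 3))
```

Changing `n_bins` shows the trade-off directly: too few bins hide structure, too many leave very few instances per bin.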


35 What is the best number of bins?

36 Explorer: data visualization. Visualization is very useful in practice, e.g. it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-d) and pairs of attributes (2-d). To do: rotating 3-d visualizations (Xgobi-style). Class values are color-coded. A "jitter" option deals with nominal attributes (and helps to detect "hidden" data points). A zoom-in function is available.


38 Which two attributes are linearly correlated?
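A visual impression from the scatter plots can be checked numerically with the Pearson correlation coefficient. The sketch below is plain Python rather than a WEKA feature, and the two value lists are hypothetical placeholders for two attribute columns.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical attribute columns, for illustration only.
attr_a = [1.0, 2.0, 3.0, 4.0]
attr_b = [2.1, 3.9, 6.2, 7.8]
print(round(pearson(attr_a, attr_b), 3))
```

Values close to +1 or -1 indicate a strong linear relationship; values near 0 indicate none.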


47 Explorer: attribute selection. This panel can be used to investigate which (subsets of) attributes are the most predictive ones. Attribute selection methods contain two parts: (1) a search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking; (2) an evaluation method: correlation-based, wrapper, information gain, chi-squared, ... Very flexible: WEKA allows (almost) arbitrary combinations of these two.


57 Q: How would you add a new feature to an existing dataset so that the new feature is most beneficial? A: Add a feature that takes a different value for each class. Q: How would you add a new feature so that it is least beneficial? A: Add a feature that has the same value for all classes.
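The two answers can be checked with information gain, one of the evaluation methods listed for attribute selection. The following is a minimal sketch in plain Python, not WEKA's evaluator; the tiny dataset and the feature names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature, labels):
    """Class entropy minus the weighted class entropy within each feature value."""
    groups = defaultdict(list)
    for value, label in zip(feature, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical three-class dataset with two candidate features.
labels = ["a", "a", "b", "b", "c", "c"]
one_value_per_class = ["x", "x", "y", "y", "z", "z"]  # most beneficial
same_value_everywhere = ["k"] * 6                     # least beneficial

print(information_gain(one_value_per_class, labels))
print(information_gain(same_value_everywhere, labels))
```

The first feature recovers the full class entropy, while the constant feature has zero gain.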

58 Let's try with the Iris dataset!


62 Explorer: building "classifiers". Classifiers in WEKA are models for predicting nominal or numeric quantities. Implemented learning schemes include decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes nets, ... "Meta"-classifiers include bagging, boosting, stacking, error-correcting output codes, locally weighted learning, ...


75 Test options. Use training set: the training data is used again to test the model. Supplied test set: the training data is used to build the model and a separate, unseen set of data is used to test it. Cross-validation: a hold-one-out scheme in which each fold is held out in turn. Percentage split: train on a given percentage of the data and test on the rest.


77 Cross Validation. Cross-validation is a method for estimating the accuracy of an inducer by dividing the data into K mutually exclusive subsets ("folds") of approximately equal size. It is a simple and widely used method for estimating prediction error.


78 We use cross-validation as follows: divide the data into K folds; hold out one fold and fit using the remaining data (compute the error rate on the held-out fold); repeat K times, holding out a different fold each time. CV error rate: the average of the K error rates computed. [Figure: the original data split into testing and training parts for K = 5, one row per iteration (1 to 5).]
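The procedure can be sketched in a few lines of Python. This is an illustration only, not WEKA's implementation: the "learner" is a hypothetical majority-class predictor, the folds are built by simple striding rather than stratification, and the label list is made up. It assumes there are at least K instances.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k mutually exclusive folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]

def majority_class(train_labels):
    """Stand-in learner: always predict the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]

def cv_error_rate(labels, k):
    """Average of the k per-fold error rates."""
    errors = []
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        train = [y for i, y in enumerate(labels) if i not in held]
        prediction = majority_class(train)              # fit on the remaining data
        wrong = sum(1 for i in held_out if labels[i] != prediction)
        errors.append(wrong / len(held_out))            # error rate on the held-out fold
    return sum(errors) / k

# Hypothetical class labels, for illustration only.
labels = ["yes"] * 7 + ["no"] * 3
print(cv_error_rate(labels, k=5))
```

Every instance is used for testing exactly once and for training K - 1 times.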

79 How many folds are needed (K = ?). Large K: small bias, but large variance and high computation time. Small K: reduced computation time and small variance, but large bias. A common choice for K is


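89 Confusion matrix layout with the positive class listed first (rows: actual class, columns: predicted class):
tp fn
fp tn

90 Confusion matrix layout with the positive class listed second:
tn fp
fn tp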


104 Attribute Selection + Classification (Weather.arff)


123 Naïve Bayes Classifier. Consider each attribute and the class label as random variables. Given a record with attributes $(A_1, A_2, \dots, A_n)$, the goal is to predict the class $C$. Specifically, we want to find the value of $C$ that maximizes $P(C \mid A_1, A_2, \dots, A_n)$.

124 Shape Dataset: 124


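127 Priors: P(Triangle) = 5/14 = 0.36, P(Square) = 9/14 = 0.64

Conditional probability estimates:
Original: $P(A_i \mid C) = \frac{N_{ic}}{N_c}$
Laplace: $P(A_i \mid C) = \frac{N_{ic} + 1}{N_c + c}$, where $c$ is the number of classes

Test instance: COLOR = GREEN, OUTLINE = DASHED, DOT = NO, SHAPE = ?

Attribute  Value    Triangle count  P(value | Triangle)  P(value | Square)
Color      Green    3               4/7                  3/11
Color      Yellow   0               1/7                  .../11
Color      Red      2               3/7                  .../11
Outline    Dashed   4               5/7                  4/11
Outline    Solid    1               2/7                  .../11
Dot        Yes      3               4/7                  .../11
Dot        No       2               3/7                  7/11

Triangle: 4/7 * 5/7 * 3/7 * 5/14 ≈ 0.0625
Square: 3/11 * 4/11 * 7/11 * 9/14 ≈ 0.0406
The Triangle score is larger, so the test instance is classified as Triangle.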
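The arithmetic on this slide can be reproduced with exact fractions. The sketch below assumes the Laplace estimate (N_ic + 1) / (N_c + c) with c = 2 classes, which matches the denominators 7 and 11 on the slide. The Square counts are not given directly; they are inferred here from the fractions 3/11, 4/11 and 7/11 in the Square product. The variable and function names are illustrative.

```python
from fractions import Fraction

N_CLASSES = 2                      # c in the Laplace estimate
TOTAL = 14                         # number of training instances
class_counts = {"Triangle": 5, "Square": 9}

# Raw counts N_ic for the test instance's values (Green, Dashed, No).
# Triangle counts are read from the slide; Square counts are inferred (assumption).
value_counts = {
    "Triangle": {"Green": 3, "Dashed": 4, "No": 2},
    "Square": {"Green": 2, "Dashed": 3, "No": 6},
}

def laplace(n_ic, n_c, c=N_CLASSES):
    """Laplace-smoothed estimate of P(A_i | C)."""
    return Fraction(n_ic + 1, n_c + c)

def score(cls, values):
    """Prior times the product of the smoothed conditional probabilities."""
    result = Fraction(class_counts[cls], TOTAL)
    for value in values:
        result *= laplace(value_counts[cls][value], class_counts[cls])
    return result

instance = ["Green", "Dashed", "No"]
scores = {cls: score(cls, instance) for cls in class_counts}
prediction = max(scores, key=scores.get)
print({cls: float(s) for cls, s in scores.items()}, prediction)
```

The scores are unnormalized; only their ranking matters for choosing the class.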

128 Test instance (Shapetest.csv): COLOR = GREEN, OUTLINE = DASHED, DOT = NO, SHAPE = ?


130 Confusion matrix:
tp fn
fp tn
True positive rate (TPR), or sensitivity: $\mathrm{TPR} = \frac{TP}{TP + FN}$
Specificity (true negative rate): $\text{Specificity} = \frac{TN}{TN + FP}$
False positive rate: $\mathrm{FPR} = 1 - \text{Specificity} = \frac{FP}{TN + FP}$

131 Confusion matrix:
tp fn
fp tn
MCC (Matthews correlation coefficient): a measure of the quality of a binary classification.
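The slide text gives no formula for MCC. A minimal sketch using the standard definition, MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), is shown below. Returning 0 when the denominator is zero is a common convention and an assumption here; the counts are hypothetical.

```python
from math import sqrt

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient from the four confusion-matrix cells."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

# Hypothetical counts, for illustration only.
print(round(mcc(tp=40, fn=10, fp=5, tn=45), 3))
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction).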

132 Confusion matrix (positive class listed second):
tn fp
fn tp
True positive rate (TPR), or sensitivity: $\mathrm{TPR} = \frac{TP}{TP + FN}$
Specificity (true negative rate): $\text{Specificity} = \frac{TN}{TN + FP}$
False positive rate: $\mathrm{FPR} = 1 - \text{Specificity} = \frac{FP}{TN + FP}$

133 Confusion matrix (positive class listed second):
tn fp
fn tp
MCC (Matthews correlation coefficient): a measure of the quality of a binary classification.

134 Kappa statistic: Cohen's kappa statistic measures inter-rater reliability (sometimes called inter-observer agreement). Inter-rater reliability, or precision, is achieved when your data raters (or collectors) give the same score to the same data item.
Step 1: Calculate $P_o$ (observed agreement): $P_o = (1 + 6)/14 = 0.5$.
Step 2: Calculate $P_e$ (expected agreement): P(Triangle) = (5/14)(4/14) = 20/196, P(Square) = (9/14)(10/14) = 90/196, so $P_e = 90/196 + 20/196 = 110/196 \approx 0.561$.
$\kappa = \frac{P_o - P_e}{1 - P_e} = \frac{0.5 - 0.561}{1 - 0.561} \approx -0.14$
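The two steps can be reproduced from a confusion matrix. In the sketch below the matrix is inferred from the numbers on this slide (1 and 6 correct predictions, actual class totals 5 and 9, predicted class totals 4 and 10); the function name is illustrative and the code is not WEKA's implementation.

```python
from fractions import Fraction

def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = Fraction(sum(matrix[i][i] for i in range(size)), total)   # P_o
    expected = Fraction(0)                                               # P_e
    for i in range(size):
        row_total = sum(matrix[i])
        col_total = sum(row[i] for row in matrix)
        expected += Fraction(row_total, total) * Fraction(col_total, total)
    return (observed - expected) / (1 - expected)

# Rows: actual Triangle, actual Square. Columns: predicted Triangle, predicted Square.
matrix = [[1, 4],
          [3, 6]]
kappa = cohen_kappa(matrix)
print(kappa, float(kappa))
```

A negative kappa means the agreement between predictions and actual classes is lower than what would be expected by chance.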

135
STATUS   FLOOR  DEPT.  OFFICE-SIZE  RECYCLING-BIN?
faculty  four   CS     medium       yes
student  four   EE     large        yes
staff    five   CS     medium       no
student  three  EE     small        yes
staff    four   CS     medium       no

STATUS = student, FLOOR = four, DEPT. = CS, OFFICE-SIZE = small: RECYCLING-BIN = ?


139 ROC Curve. ROC: Receiver Operating Characteristic. Developed by the British in World War II as part of the "Chain Home" radar system, where it was used to analyze radar data to differentiate between enemy aircraft and signal noise. It is a performance graphing method: a plot of true positive rate against false positive rate. Used for evaluating data mining schemes.

140 ROC Curve 140

141 Example ROC Curve 141


143 Why do we need an ROC curve? Consider a scenario: design an ML tool that decides whether a doctor should test a patient for cancer. Training data: family history, age, weight, etc. Training data class: whether the patient ended up having cancer or not. Create the model: the tool assigns each patient a score between 0 and 1. High score: the tool is confident that the patient is at risk of having cancer (the test should be conducted). Low score: the tool is confident that the patient is not at risk of having cancer (the test should not be conducted). Test the model: which evaluation measure? True positive rate: what fraction of ill people were recommended the test? False positive rate: what fraction of not-ill people were recommended the test? False negative rate: what fraction of ill people were not recommended the test? True negative rate: what fraction of not-ill people were not recommended the test? Goal: maximize the TP and TN rates and minimize the FP and FN rates. Before you measure anything, make a choice: what threshold score do you use to decide whether or not a patient needs the test?

144 Consider a scenario: design an ML tool that decides whether a doctor should test a patient for cancer. Training data: family history, age, weight, etc. Training data class: whether the patient ended up having cancer or not. Create the model: the tool assigns each patient a score between 0 and 1. High score: the tool is confident that the patient is at risk of having cancer (the test should be conducted). Low score: the tool is confident that the patient is not at risk of having cancer (the test should not be conducted). Test the model: which evaluation measure? Goal: maximize the TP and TN rates and minimize the FP and FN rates. Before you measure anything, make a choice: what threshold score do you use to decide whether or not a patient needs the test? Everyone with a non-zero score has some risk. Low threshold: a lot of tests. High threshold: mostly people who really have cancer get tested, but there will be false negatives as well (a lot of people with cancer would not be tested).

145-151 [Figures: overlapping distributions of the test result value (or subjective judgement of the likelihood that a case is diseased) for non-diseased and diseased cases, separated by a decision threshold. Moving the threshold from a less aggressive through a moderate to a more aggressive mindset yields successive points on a plot of TPF (sensitivity) against FPF (1 - specificity); together these points form the entire ROC curve. The position of the curve reflects reader skill and/or level of technology.]

152 Sensitivity: the test's ability to correctly detect ill patients, i.e. those who have cancer. $\text{Sensitivity} = \frac{TP}{TP + FN}$ = probability of a positive test given that the patient is ill. Specificity: the test's ability to correctly reject healthy patients, i.e. those who do not have cancer. $\text{Specificity} = \frac{TN}{TN + FP}$ = probability of a negative test given that the patient is not ill. (TP, FN, TN, FP: number of true positives, false negatives, true negatives, false positives.)
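Both quantities follow directly from the four confusion-matrix counts. The sketch below is plain Python with hypothetical counts; the function name is illustrative.

```python
def rates(tp, fn, fp, tn):
    """Sensitivity, specificity and false positive rate from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # ill patients correctly detected
        "specificity": tn / (tn + fp),  # healthy patients correctly rejected
        "fpr": fp / (tn + fp),          # equals 1 - specificity
    }

# Hypothetical counts: 50 ill and 50 healthy patients.
r = rates(tp=40, fn=10, fp=5, tn=45)
print(r)
```

Note that the false positive rate and specificity always sum to 1.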

153 True positive rate: $\mathrm{TPR} = \frac{TP}{TP + FN}$. False positive rate: $\mathrm{FPR} = \frac{FP}{TN + FP}$. (TP, FN, TN, FP: number of true positives, false negatives, true negatives, false positives.) Move the threshold from high to low: the true positive rate increases (you test a higher proportion of those who actually have cancer), and the false positive rate increases (you incorrectly tell more people to get tested when they don't need to).

154 As you step through the threshold values from high to low, you put dots on the above graph from left to right; joining up the dots gives the ROC curve.
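The threshold sweep can be sketched as follows. This is an illustration in plain Python, not WEKA's ROC code: each distinct score is used as a threshold, a patient is recommended the test when the score is at or above the threshold, and the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points for thresholds stepped from high to low.

    labels: 1 = ill, 0 = not ill. Assumes both classes are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nobody is tested
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical tool scores and true outcomes, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
for fpr, tpr in roc_points(scores, labels):
    print(round(fpr, 2), round(tpr, 2))
```

The points move from the bottom-left corner (0, 0) to the top-right corner (1, 1) as the threshold is lowered.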


158 Comparing different classifiers: ROC curves provide a better look at where different learners minimize cost. Which curve is better? The area under the ROC curve (AUC) summarizes how good a classifier is.
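One way to compare two curves numerically is the area under each, computed here with the trapezoidal rule. This is a minimal sketch in plain Python; the two lists of (FPR, TPR) points are hypothetical and stand for two classifiers.

```python
def auc(points):
    """Area under a curve given as (FPR, TPR) points, using the trapezoidal rule."""
    ordered = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between neighbouring points
    return area

# Hypothetical ROC points for two classifiers.
classifier_a = [(0.0, 0.0), (0.1, 0.7), (0.4, 0.9), (1.0, 1.0)]
classifier_b = [(0.0, 0.0), (0.3, 0.5), (0.6, 0.8), (1.0, 1.0)]
print(auc(classifier_a), auc(classifier_b))
```

An area of 0.5 corresponds to the diagonal (random guessing) and 1.0 to a perfect classifier, so the curve with the larger area is usually preferred.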

159 Precision-Recall Curve 159
