Machine Learning with Weka


Machine Learning with Weka. Slides by Anjali Goyal & Ashish Sureka (www.ashish-sureka.in), CS 309 Information Retrieval course, Ashoka University (5 sessions of 1.5 hours each). Note: slides created and edited using existing teaching resources on the Internet.

WEKA: the software. Machine learning / data mining software written in Java (distributed under the GNU General Public License). Used for research, education, and applications. Main features: a comprehensive set of data pre-processing tools, learning algorithms, and evaluation methods; graphical user interfaces (incl. data visualization); an environment for comparing learning algorithms. 2

WEKA: download and install Go to website: https://www.cs.waikato.ac.nz/ml/weka/ 3


WEKA only deals with flat files. Example ARFF file:

@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male }
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes }
@attribute class { present, not_present }
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
... 5

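A minimal sketch of loading such a flat file with Weka's Java API (the filename is illustrative; DataSource picks the right loader from the extension):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadArff {
        public static void main(String[] args) throws Exception {
            // DataSource infers the loader (ARFF, CSV, ...) from the file extension
            DataSource source = new DataSource("heart-disease-simplified.arff");
            Instances data = source.getDataSet();
            // ARFF does not mark the class attribute; by convention use the last one
            data.setClassIndex(data.numAttributes() - 1);
            System.out.println(data.toSummaryString());
        }
    }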


Explorer: pre-processing the data. Data can be imported from a file in various formats: ARFF, CSV. Data can also be read from a URL or from an SQL database (using JDBC). Pre-processing tools in WEKA are called filters. WEKA contains filters for: discretization, normalization, resampling, attribute selection, transforming and combining attributes, ... 8
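As a sketch, applying one of these filters programmatically (here the unsupervised Normalize filter; every filter follows the same setInputFormat / useFilter pattern; iris.arff ships in Weka's data directory):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Normalize;

    public class FilterDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("iris.arff").getDataSet();
            Normalize norm = new Normalize();   // scales numeric attributes to [0,1]
            norm.setInputFormat(data);          // must be called before useFilter
            Instances normalized = Filter.useFilter(data, norm);
            System.out.println(normalized.numInstances() + " instances normalized");
        }
    }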


Iris Dataset 11

Iris Dataset 12

Iris Dataset - ARFF 13

Distinct is the number of distinct values, i.e., the number of values left if you removed all duplicates. Unique is the number of values that appear only once. What do you observe from this graph? The range 4.3-7.9? The colors? The bar counts 5, 6, ... - what do they add up to? Is sepallength a good predictor? 14

Check whether sepalwidth is a good predictor. 15


Which of the four attributes is the best predictor? 19

Data Processing 20

Discretization. Discretization is the process of putting values into buckets so that there are a limited number of possible states (continuous to categorical). Many classification algorithms produce better results on discretized data. 21
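A sketch of equal-width binning with Weka's unsupervised Discretize filter (the bin count of 3 is arbitrary; the Explorer exposes the same options in the filter's object editor):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class DiscretizeDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("iris.arff").getDataSet();
            Discretize disc = new Discretize();
            disc.setBins(3);            // number of buckets per numeric attribute
            disc.setInputFormat(data);
            Instances discretized = Filter.useFilter(data, disc);
            // the first attribute is now nominal, with 3 range labels
            System.out.println(discretized.attribute(0));
        }
    }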


What is the best number of bins? 35

Explorer: data visualization. Visualization is very useful in practice, e.g. it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-d) and pairs of attributes (2-d). To do: rotating 3-d visualizations (Xgobi-style). Color-coded class values. A jitter option to deal with nominal attributes (and to detect hidden data points). A zoom-in function. 36


Which two attributes are linearly correlated? 38


Explorer: attribute selection. A panel that can be used to investigate which (subsets of) attributes are the most predictive ones. Attribute selection methods contain two parts: a search method (best-first, forward selection, random, exhaustive, genetic algorithm, ranking) and an evaluation method (correlation-based, wrapper, information gain, chi-squared, ...). Very flexible: WEKA allows (almost) arbitrary combinations of these two. 47
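A sketch combining one evaluator with one search method through the Java API (CfsSubsetEval with BestFirst, mirroring the Explorer's defaults):

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AttSelDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("iris.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);
            AttributeSelection sel = new AttributeSelection();
            sel.setEvaluator(new CfsSubsetEval());  // correlation-based evaluation
            sel.setSearch(new BestFirst());         // best-first search
            sel.SelectAttributes(data);             // legacy capitalised method name
            for (int idx : sel.selectedAttributes())// selected indices (class index last)
                System.out.println(data.attribute(idx).name());
        }
    }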


How would you add a new feature to an existing dataset such that the new feature is most beneficial? Add a feature that takes a distinct value for each class. How would you add a new feature such that it is least beneficial? Add a feature that takes the same value for all classes. 57

Let's try it with the Iris dataset! 58


Explorer: building classifiers. Classifiers in WEKA are models for predicting nominal or numeric quantities. Implemented learning schemes include decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes nets, ... Meta-classifiers include bagging, boosting, stacking, error-correcting output codes, locally weighted learning, ... 62
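A minimal sketch of building one of these classifiers (J48, Weka's C4.5 decision tree implementation) and printing the learned tree:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BuildClassifier {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("iris.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);
            J48 tree = new J48();
            tree.buildClassifier(data);  // train on the full dataset
            System.out.println(tree);    // prints the pruned decision tree
        }
    }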


Test options for evaluating the model:
- Use training set: the training data is again used for testing the model.
- Supplied test set: the training data is used for model development and an unseen set of data is used for testing the model.
- Cross-validation: a hold-one-fold-out scheme (next slides).
- Percentage split: train on a certain percentage of the data and then test on the rest. 75


Cross Validation. Cross-validation is a method for estimating the accuracy of an inducer by dividing the data into K mutually exclusive subsets (folds) of approximately equal size. It is the simplest and most widely used method for estimating prediction error. 77

We use cross-validation as follows: divide the data into K folds; hold out one fold and fit the model using the remaining data (compute the error rate on the held-out fold); repeat K times. CV error rate: the average over the K error rates we have computed. (Suppose K = 5.) [Diagram: the original data split into 5 folds; in each round K = 1..5, one fold is the testing data and the remaining four folds are the training data.]

How many folds are needed (K = ?). Large K: small bias, large variance, as well as high computational time. Small K: reduced computational time, small variance, large bias. A common choice for K is 5-10. 79
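A sketch of 10-fold cross-validation via Weka's Evaluation class (the seed of 1 matches the Explorer's default):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("iris.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);
            Evaluation eval = new Evaluation(data);
            // trains 10 models internally, each tested on its held-out fold
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());  // confusion matrix
        }
    }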


[Screenshot slides 89-92: classifier output with the confusion matrix; cells labelled tp, fn, fp, tn (or tn, fp, fn, tp, depending on which class is listed first).]

How would you add a new feature to an existing dataset such that the new feature is most beneficial? Add a feature that takes a distinct value for each class. How would you add a new feature such that it is least beneficial? Add a feature that takes the same value for all classes. 93


Let's try it with the Iris dataset! 95


Attribute Selection + Classification (Weather.arff) 104


Discretization. Discretization is the process of putting values into buckets so that there are a limited number of possible states (continuous to categorical). Many classification algorithms produce better results on discretized data. 109


Naïve Bayes Classifier. Consider each attribute and the class label as random variables. Given a record with attributes (A1, A2, ..., An), the goal is to predict the class C. Specifically, we want to find the value of C that maximizes P(C | A1, A2, ..., An). 123
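A sketch of training Weka's NaiveBayes and scoring a single instance; distributionForInstance returns exactly P(C | A1, ..., An) for each class:

    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class NaiveBayesDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("iris.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);
            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(data);
            // posterior P(C | A1..An) for the first instance, one entry per class
            double[] posterior = nb.distributionForInstance(data.instance(0));
            for (int c = 0; c < posterior.length; c++)
                System.out.printf("%s: %.3f%n",
                        data.classAttribute().value(c), posterior[c]);
        }
    }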

Shape Dataset: 124

[Screenshot slides 125-126: the 14-instance shape training data.]

Priors: P(Triangle) = 5/14 (with Laplace smoothing: (5+1)/(14+2) = 0.38); P(Square) = 9/14 (smoothed: (9+1)/(14+2) = 0.63).

Conditional probabilities. Original: P(Ai|C) = Nic/Nc. Laplace: P(Ai|C) = (Nic+1)/(Nc+c), where Nc is the number of training instances in class C and c is the number of classes.

COLOR: Green: Triangle 3 -> 4/7, Square 2 -> 3/11. Yellow: Triangle 0 -> 1/7, Square 4 -> 5/11. Red: Triangle 2 -> 3/7, Square 3 -> 4/11.
OUTLINE: Dashed: Triangle 4 -> 5/7, Square 3 -> 4/11. Solid: Triangle 1 -> 2/7, Square 6 -> 7/11.
DOT: Yes: Triangle 3 -> 4/7, Square 3 -> 4/11. No: Triangle 2 -> 3/7, Square 6 -> 7/11.

For COLOR = GREEN, OUTLINE = DASHED, DOT = NO:
Triangle: 4/7 * 5/7 * 3/7 * 5/14 = 0.062
Square: 3/11 * 4/11 * 7/11 * 9/14 = 0.041
So the predicted SHAPE is Triangle. 127

Test instance (Shapetest.csv): COLOR = GREEN, OUTLINE = DASHED, DOT = NO, SHAPE = ? 128


Confusion matrix (rows = actual class, columns = predicted class; here the positive class is listed first):

tp fn
fp tn

True positive rate (TPR) / Sensitivity = tp / (tp + fn)
True negative rate (TNR) / Specificity = tn / (tn + fp)
False positive rate (FPR) = fp / (fp + tn) = 1 - Specificity

(When the negative class is listed first, the same matrix reads: tn fp / fn tp.)

MCC (Matthews Correlation Coefficient): a measure of the quality of binary classification.
MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))
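A small helper sketch computing these measures from raw counts (plain Java, no Weka dependency; the counts are illustrative - Weka's Evaluation output reports the same quantities):

    public class BinaryMetrics {
        public static void main(String[] args) {
            double tp = 40, fn = 10, fp = 5, tn = 45;  // illustrative counts
            double sensitivity = tp / (tp + fn);       // TPR
            double specificity = tn / (tn + fp);       // TNR
            double fpr = fp / (fp + tn);               // = 1 - specificity
            double mcc = (tp * tn - fp * fn)
                    / Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
            System.out.printf("sens=%.3f spec=%.3f fpr=%.3f mcc=%.3f%n",
                    sensitivity, specificity, fpr, mcc);
        }
    }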

Kappa Statistic: Cohen's kappa statistic measures interrater reliability (sometimes called inter-observer agreement): how often raters (or, here, predicted and actual labels) assign the same label to the same item, corrected for agreement expected by chance. K = (Po - Pe) / (1 - Pe). Step 1: calculate Po (observed agreement): Po = (1 + 6)/14 = 0.5. Step 2: calculate Pe (expected agreement): P(Triangle) = (5/14)*(4/14), P(Square) = (9/14)*(10/14), so Pe = (20/196) + (90/196) = 0.561. K = (0.5 - 0.561)/(1 - 0.561) ≈ -0.14. 134
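The same computation as a sketch (counts taken from the shape example above: 1 of the predicted Triangles and 6 of the predicted Squares agree with the actual labels, out of 14 instances):

    public class KappaDemo {
        public static void main(String[] args) {
            double n = 14;
            double po = (1 + 6) / n;                            // observed agreement
            double pe = (5 / n) * (4 / n) + (9 / n) * (10 / n); // chance agreement
            double kappa = (po - pe) / (1 - pe);
            System.out.printf("Po=%.3f Pe=%.3f kappa=%.3f%n", po, pe, kappa);
        }
    }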

STATUS   FLOOR  DEPT.  OFFICE-SIZE  RECYCLING-BIN?
faculty  four   CS     medium       yes
student  four   EE     large        yes
staff    five   CS     medium       no
student  three  EE     small        yes
staff    four   CS     medium       no

STATUS = student, FLOOR = four, DEPT. = CS, OFFICE-SIZE = small: RECYCLING-BIN = ? 135

Let's try it with the Iris dataset! 136


ROC Curve. ROC: Receiver Operating Characteristic. Developed by the British in World War II as part of the Chain Home radar system, where it was used to analyze radar data to differentiate between enemy aircraft and signal noise. It is a performance graphing method: a plot of the true positive rate against the false positive rate, used for evaluating data mining schemes. 139

ROC Curve 140

Example ROC Curve 141

Example ROC Curve 142

Why do we need a ROC curve? Consider a scenario: we design an ML tool to help a doctor decide whether a patient should be tested for cancer. Training data: family history, age, weight, etc. Training data class: whether the patient ended up having cancer or not. Create the model: the tool assigns each patient a score between 0 and 1. High score: the tool is confident that the patient is at risk of having cancer. Low score: the tool is confident that the patient is not at risk. Test the model: which evaluation measure should we use? True positive rate: how many ill people were recommended the test? False positive rate: how many healthy people were recommended the test? False negative rate: how many ill people were not recommended the test? True negative rate: how many healthy people were not recommended the test? Goal: maximize the TP and TN rates, minimize the FP and FN rates. Before you measure anything, make a choice: what threshold score do you use to decide whether or not a patient needs the test? Everyone with a non-zero score has some risk. Low threshold: a lot of tests. High threshold: only patients the tool is very confident about get tested, but there would be false negatives as well (a lot of people with cancer would not be tested). 143-144

[Slides 145-151: figures. Two overlapping distributions of test result values (or subjective judgements of the likelihood that a case is diseased), one for non-diseased cases and one for diseased cases, with a decision threshold between them. Moving the threshold gives a less aggressive, moderate, or more aggressive mindset; each threshold yields one point on a plot of TPF (sensitivity) vs FPF (1-specificity), and sweeping the threshold traces out the entire ROC curve. A final plot shows how reader skill and/or level of technology shifts the curve.]

Sensitivity: refers to the test's ability to correctly detect ill patients who have cancer. Sensitivity = (no. of true positives) / (no. of true positives + no. of false negatives) = probability of a positive test given that the patient is ill. Specificity: refers to the test's ability to correctly reject healthy patients who do not have cancer. Specificity = (no. of true negatives) / (no. of true negatives + no. of false positives) = probability of a negative test given that the patient is not ill. 152

True positive rate (TPR) = (no. of true positives) / (no. of true positives + no. of false negatives). False positive rate (FPR) = (no. of false positives) / (no. of false positives + no. of true negatives). Move the threshold from high to low: the true positive rate increases (you test a higher proportion of those who actually have cancer), and the false positive rate also increases (you incorrectly tell more people to get tested when they don't need to). 153
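A sketch of this threshold sweep on a handful of (score, label) pairs; the scores here are made up for illustration (with a Weka model you would collect them from its per-instance class probabilities):

    import java.util.Arrays;

    public class RocSweep {
        public static void main(String[] args) {
            double[] scores = {0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.15}; // illustrative
            int[] labels   = {1,    1,    0,    1,    0,    1,    0};     // 1 = positive
            int p = Arrays.stream(labels).sum();  // total positives
            int n = labels.length - p;            // total negatives
            // step the threshold from high to low: each step adds one ROC point,
            // moving left to right across the TPR-vs-FPR graph
            for (double t : scores) {
                int tp = 0, fp = 0;
                for (int i = 0; i < labels.length; i++)
                    if (scores[i] >= t) { if (labels[i] == 1) tp++; else fp++; }
                System.out.printf("threshold=%.2f TPR=%.2f FPR=%.2f%n",
                        t, (double) tp / p, (double) fp / n);
            }
        }
    }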

As you step through the threshold values from high to low, you put dots on the TPR-vs-FPR graph from left to right; joining up the dots gives the ROC curve. 154



Comparing different classifiers: ROC curves provide a better look at where different learners minimize cost. Which curve is better? The area under the ROC curve (AUC) summarizes how good a classifier is: the closer to 1, the better; 0.5 corresponds to random guessing. 158
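As a sketch, the AUC is available directly from Weka's Evaluation object after cross-validation (same setup as the earlier cross-validation example, here with NaiveBayes):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AucDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("iris.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            // area under the ROC curve per class; closer to 1 is better
            for (int c = 0; c < data.numClasses(); c++)
                System.out.printf("%s: AUC=%.3f%n",
                        data.classAttribute().value(c), eval.areaUnderROC(c));
        }
    }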

Precision-Recall Curve 159