WEKA tutorial exercises

These tutorial exercises introduce WEKA and ask you to try out several machine learning, visualization, and preprocessing methods using a wide variety of datasets:

Learners: decision tree learner (J48), instance-based learner (IBk), Naïve Bayes (NB), Naïve Bayes Multinomial (NBM), support vector machine (SMO), association rule learner (Apriori)
Meta-learners: filtered classifier, attribute-selected classifiers (CfsSubsetEval and WrapperSubsetEval)
Visualization: visualize datasets, decision trees, decision boundaries, classification errors
Preprocessing: remove attributes and instances, use supervised and unsupervised discretization, select features, convert strings to word vectors
Testing: on the training set, on a supplied test set, using cross-validation, using TP and FP rates, ROC area, confidence and support of association rules
Datasets: weather.nominal, iris, glass (with variants), vehicle (with variants), kr-vs-kp, waveform-5000, generated, sick, vote, mushroom, letter, ReutersCorn-Train and ReutersGrain-Train, supermarket

Tutorial 1: Introduction to the WEKA Explorer

Set up your environment and start the Explorer. Look at the Preprocess, Classify, and Visualize panels.

In Preprocess:
- load a dataset (weather.nominal) and look at it
- use the dataset editor
- apply a filter (to remove attributes and instances).

In Visualize:
- load a dataset (iris) and visualize it
- examine the instance info (note the discrepancy in numbering between the instance info and the dataset viewer)
- select instances with rectangles; save the new dataset to a file.

In Classify (see the sketch after this tutorial):
- load a dataset (weather.nominal) and classify it with the J48 decision tree learner (test on the training set)
- examine the tree in the Classifier output panel
- visualize the tree (by right-clicking the entry in the result list)
- interpret the classification accuracy and confusion matrix
- test the classifier on a supplied test set
- visualize the classifier errors (by right-clicking the entry in the result list).

Answers to this tutorial are given.
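
The same workflow can also be scripted against Weka's Java API instead of the Explorer. The following is a minimal sketch, assuming a Weka 3.x jar on the classpath; the file paths, and the test file weather-test.arff in particular, are placeholders for wherever your copies of the ARFF files live.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Tutorial1Sketch {
    public static void main(String[] args) throws Exception {
        // Load weather.nominal and mark the last attribute (play) as the class.
        Instances train = DataSource.read("data/weather.nominal.arff");   // placeholder path
        train.setClassIndex(train.numAttributes() - 1);

        // Build a J48 decision tree and print it (same text as the Classifier output panel).
        J48 tree = new J48();
        tree.buildClassifier(train);
        System.out.println(tree);

        // Evaluate on the training set itself, as in the first exercise.
        Evaluation onTrain = new Evaluation(train);
        onTrain.evaluateModel(tree, train);
        System.out.println(onTrain.toSummaryString("\n=== Training set ===", false));
        System.out.println(onTrain.toMatrixString());

        // Evaluate on a supplied test set (weather-test.arff is a placeholder name).
        Instances test = DataSource.read("data/weather-test.arff");
        test.setClassIndex(test.numAttributes() - 1);
        Evaluation onTest = new Evaluation(train);
        onTest.evaluateModel(tree, test);
        System.out.println(onTest.toSummaryString("\n=== Supplied test set ===", false));
    }
}
```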

Tutorial 2: Nearest neighbour learning and decision trees

Introduce the glass dataset, plus the variants glass-minusatt, glass-withnoise, glass-mini-normalized, glass-mini-train and glass-mini-test. Explain how classifier accuracy is measured, and what is meant by class noise and irrelevant attributes.

Experiment with the IBk classifier for nearest neighbour learning (see the sketch after this tutorial):
- load the glass data; list the attribute names and identify the class attribute
- classify using IBk, testing with cross-validation
- repeat using 10 and then 20 nearest neighbours
- repeat all this for the glass-minusatt dataset
- repeat all this for the glass-withnoise dataset
- interpret the results and draw conclusions about IBk.

Perform nearest neighbour classification yourself:
- load glass-mini-normalized and view the data
- pretend that the last instance is a test instance and classify it (use the Visualize panel to help)
- verify your answer by running IBk on glass-mini-train and glass-mini-test.

Experiment with the J48 decision tree learner:
- load the glass data and classify using J48
- visualize the tree and simulate its effect on a particular test instance
- visualize the classifier errors and interpret one of them
- note J48's classification accuracy on glass, glass-minusatt and glass-withnoise
- interpret the results and draw conclusions about J48.

Compare nearest neighbour to decision tree learning:
- draw conclusions about the relative performance of IBk and J48 on glass, glass-minusatt and glass-withnoise.
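
As a rough sketch of the IBk experiments above, assuming the glass data is available as an ARFF file at a placeholder path, the following runs 10-fold cross-validation for k = 1, 10 and 20, and for J48 as a comparison:

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Tutorial2Sketch {
    public static void main(String[] args) throws Exception {
        Instances glass = DataSource.read("data/glass.arff");   // placeholder path
        glass.setClassIndex(glass.numAttributes() - 1);         // the class is the last attribute

        // Nearest-neighbour learning with 1, 10 and 20 neighbours.
        for (int k : new int[] {1, 10, 20}) {
            System.out.printf("IBk (k=%d): %.2f%%%n", k, crossValidate(new IBk(k), glass));
        }

        // Decision tree learner for comparison.
        System.out.printf("J48: %.2f%%%n", crossValidate(new J48(), glass));
    }

    // 10-fold cross-validated accuracy, as reported in the Classify panel.
    static double crossValidate(Classifier c, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(c, data, 10, new Random(1));
        return eval.pctCorrect();
    }
}
```

Repeating the experiment on glass-minusatt or glass-withnoise is just a matter of changing the file name.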

Tutorial 3: Naïve Bayes and support vector machines

Introduce the boundary visualizer tool. Introduce the datasets vehicle, kr-vs-kp, glass, waveform-5000 and generated.

Apply Naïve Bayes (NB) and J48 on several datasets:
- apply NB to vehicle, kr-vs-kp, glass, waveform-5000 and generated, using 10-fold cross-validation
- apply J48 to the same datasets
- summarize the results
- draw an inference about the datasets where NB outperformed J48.

Investigate linear support vector machines:
- introduce the datasets glass, glass-rina, vehicle and vehicle-sub
- apply a support vector machine learner (SMO) to glass-rina, evaluating on the training set
- apply the classification boundary visualizer, and also visualize the classification errors (separately)
- describe the model built and explain the classification errors
- change SMO's complexity parameter C and repeat
- comment on the difference C makes.

Investigate linear and non-linear support vector machines:
- apply SMO to vehicle-sub, again evaluating on the training set
- apply the classification boundary visualizer, and visualize the classifier errors
- change the exponent option of the PolyKernel kernel from 1 to 2 and repeat
- explain the differences in the test results
- add/remove points in the boundary visualizer to change the shape of the decision boundary.

Tutorial 4: Preprocessing

Introduce the datasets sick, vote, mushroom and letter.

Apply discretization (see the sketch after this tutorial):
- explain what discretization is
- load the sick dataset and look at the attributes
- classify using NB, evaluating with cross-validation
- apply the supervised discretization filter and look at the effect (in the Preprocess panel)
- apply unsupervised discretization with different numbers of bins and look at the effect
- use the FilteredClassifier with NB and supervised discretization, evaluating with cross-validation
- repeat using unsupervised discretization with different numbers of bins
- compare and interpret the results.

Apply feature selection using CfsSubsetEval:
- explain what feature selection is
- load the mushroom dataset and apply J48, IBk and NB, evaluating with cross-validation
- select attributes using CfsSubsetEval and GreedyStepwise search
- interpret the results
- use AttributeSelectedClassifier (with CfsSubsetEval and GreedyStepwise search) for the classifiers J48, IBk and NB, evaluating with cross-validation
- interpret the results.

Apply feature selection using WrapperSubsetEval:
- load the vote dataset and apply J48, IBk and NB, evaluating with cross-validation
- select attributes using WrapperSubsetEval with InfoGainAttributeEval and RankSearch, with the J48 classifier
- interpret the results
- use AttributeSelectedClassifier (with WrapperSubsetEval, InfoGainAttributeEval and RankSearch) with the classifiers J48, IBk and NB, evaluating with cross-validation
- interpret the results.

Sampling a dataset:
- load the letter dataset and examine a particular (numeric) attribute
- apply the Resample filter to select half the dataset
- examine the same attribute and comment on the results.
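
The discretization and CFS-based feature selection experiments can also be run programmatically. Below is a minimal sketch, again assuming Weka's Java API and placeholder ARFF paths; wrapping the filter in FilteredClassifier (and the selection in AttributeSelectedClassifier) means discretization and attribute selection are redone inside each cross-validation fold.

```java
import java.util.Random;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.supervised.attribute.Discretize;

public class Tutorial4Sketch {
    public static void main(String[] args) throws Exception {
        // --- Discretization on the sick dataset ---
        Instances sick = DataSource.read("data/sick.arff");          // placeholder path
        sick.setClassIndex(sick.numAttributes() - 1);

        // Naive Bayes on the raw data.
        System.out.printf("NB, raw:          %.2f%%%n", cv(new NaiveBayes(), sick));

        // FilteredClassifier applies supervised discretization inside each fold.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new Discretize());
        fc.setClassifier(new NaiveBayes());
        System.out.printf("NB, discretized:  %.2f%%%n", cv(fc, sick));

        // --- Feature selection on the mushroom dataset ---
        Instances mushroom = DataSource.read("data/mushroom.arff");  // placeholder path
        mushroom.setClassIndex(mushroom.numAttributes() - 1);

        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(new CfsSubsetEval());
        asc.setSearch(new GreedyStepwise());
        asc.setClassifier(new NaiveBayes());
        System.out.printf("NB, CFS-selected: %.2f%%%n", cv(asc, mushroom));
    }

    // 10-fold cross-validated accuracy.
    static double cv(Classifier c, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(c, data, 10, new Random(1));
        return eval.pctCorrect();
    }
}
```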

Tutorial 5: Text mining

Learn how to increase the memory available to WEKA. Introduce the datasets ReutersCorn-Train and ReutersGrain-Train.

Classify articles using binary attributes:
- load ReutersCorn-Train
- apply StringToWordVector, with lower-case tokens, the alphabetic tokenizer, and 2500 words to keep
- examine and interpret the result
- classify using NB and SMO, recording the TP and FP rates for positive instances, and the ROC area
- interpret the results to compare the classifiers
- discuss whether TP or FP is likely to be more important for this problem
- use AttributeSelectedClassifier (with InfoGain and Ranker search, selecting 100 attributes) with the same classifiers
- look at the words that have been retained, and comment
- compare the results for classification with and without attribute selection.

Classify articles using word count attributes (see the sketch after this tutorial):
- load ReutersCorn-Train
- apply StringToWordVector, with lower-case tokens, the alphabetic tokenizer, 2500 words to keep, and word counts turned on
- examine and interpret the results
- classify using Naïve Bayes Multinomial (NBM) and SMO, recording the same figures as above
- compare the results with those above for binary attributes
- undo StringToWordVector and reapply it with word counts turned off
- reclassify with AttributeSelectedClassifier (with InfoGain and Ranker search) using NB and SMO, with 100, 50 and 25 attributes
- compare NB with and without attribute selection, and do the same for SMO
- compare NB with binary attributes against NBM with word count attributes, and do the same for SMO.

Classify unknown instances:
- use NBM models built from ReutersCorn-Train and ReutersGrain-Train to classify a mystery instance (Mystery1)
- repeat using SMO models
- comment on the findings
- use the same NBM models to classify a second mystery instance (Mystery2).
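
A sketch of the word-count variant of this experiment, assuming Weka's Java API and a placeholder path for ReutersCorn-Train; the positive-class index used below is an assumption and should be checked against the attribute's declared value order:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.AlphabeticTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class Tutorial5Sketch {
    public static void main(String[] args) throws Exception {
        // Reuters corn articles; the class says whether the article is about corn.
        Instances corn = DataSource.read("data/ReutersCorn-Train.arff");  // placeholder path
        corn.setClassIndex(corn.numAttributes() - 1);

        // StringToWordVector with lower-case tokens, alphabetic tokenizer,
        // 2500 words to keep, and word counts instead of binary presence.
        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setLowerCaseTokens(true);
        s2wv.setTokenizer(new AlphabeticTokenizer());
        s2wv.setWordsToKeep(2500);
        s2wv.setOutputWordCounts(true);

        // Wrapping the filter in FilteredClassifier rebuilds the dictionary
        // from the training folds only during cross-validation.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(s2wv);
        fc.setClassifier(new NaiveBayesMultinomial());

        Evaluation eval = new Evaluation(corn);
        eval.crossValidateModel(fc, corn, 10, new Random(1));

        int positive = 1;  // assumed index of the positive class value; verify in the ARFF header
        System.out.printf("TP rate: %.3f  FP rate: %.3f  ROC area: %.3f%n",
                eval.truePositiveRate(positive),
                eval.falsePositiveRate(positive),
                eval.areaUnderROC(positive));
    }
}
```

Switching setOutputWordCounts to false and the classifier to NB or SMO reproduces the binary-attribute runs for comparison.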

Tutorial 6: Association rules

Introduce the datasets vote, weather.nominal and supermarket.

Apply an association rule learner (Apriori) (see the sketch after this tutorial):
- load vote, go to the Associate panel, and apply the Apriori learner
- discuss the meaning of the rules
- find out how a rule's confidence is computed
- identify the support of certain rules and the number of instances they predict correctly
- change the number of rules in the output; what is the criterion for the best rules?
- find rules that mean certain things.

Finding association rules manually:
- load weather.nominal and look at the data
- find the support and confidence for a certain rule
- consider rules with multiple parts in the consequent.

Make association rules for the supermarket dataset:
- load supermarket
- generate 30 association rules and discuss some inferences you would make from them.
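
Apriori can also be run from code rather than the Associate panel. A minimal sketch, assuming a placeholder path for the vote ARFF file; setNumRules(30) mirrors the 30 rules requested in the supermarket exercise:

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Tutorial6Sketch {
    public static void main(String[] args) throws Exception {
        // Association rule mining works on nominal data; no class attribute is needed.
        Instances vote = DataSource.read("data/vote.arff");   // placeholder path

        Apriori apriori = new Apriori();
        apriori.setNumRules(30);            // number of rules to report
        apriori.buildAssociations(vote);

        // Each rule is printed with its support, confidence and the number of
        // instances it predicts correctly, mirroring the Associate panel output.
        System.out.println(apriori);
    }
}
```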