Weka: Naïve Bayes Classifier(s)

Lecture 06: Lab Assignment - Weka: Naïve Bayes Classifier(s)

ACKNOWLEDGEMENTS: Today's lab assignment was inspired by the following lab projects: past tense dataset + decision trees <http://coltekin.net/cagri/ml08/lab3n.html> and spam dataset + naïve Bayes <http://www.cs.uml.edu/~jlu1/ta/dm_fall2013/p1.html>.

INFO
Required reading for Lecture 6 and the matching lab assignment:
- Daume (2015): Ch. 7, from page 103 to page 110.

Requirement: be open to discussing the main topics and/or the main problems of the lab assignment with one or more randomly chosen classmates. If you have a problem or a deep insight, do not keep it to yourself: share it! You might solve the problem, or you might get an even deeper insight :-) Your original way of thinking will be enriched by discussions with peers.

Execution time: approx. 2-3 hours.

ATT: datasets can be downloaded from here: <http://stp.lingfil.uu.se/~santinim/ml/2015/datasets/>

Learning objectives
In this lab assignment you are going to explore the behavior of Naïve Bayes classifier(s) (as implemented in Weka) on linguistic data:
- spam dataset, past tense dataset
- NaiveBayes
- NaiveBayesSimple

Pondering our previous experiences
In our previous Weka lab assignments, we used two different datasets, namely the iris dataset and the past tense dataset. These two datasets represent:
- iris flowers: small (150 instances), balanced distribution of instances across three classes (50 instances per class), numerical attributes/features (measurements in cm), categorical class labels (the names of the iris species);
- past tense inflections: largish (4330 instances), many classes (42 class labels), nominal attributes (phonemes), highly unbalanced class distribution.

You explored the past tense dataset and correctly figured out the following facts:
- It is a list of:
  - verbs (ex: <http://coltekin.net/cagri/ml08/verbs.txt>)

  - phonemes (ex: <http://coltekin.net/cagri/ml08/phonemes.txt>)
  - classes (<http://coltekin.net/cagri/ml08/classes.txt>)
  - For your convenience, you can now browse the list here (<http://coltekin.net/cagri/ml08/past-tense.dat>) and see whether it really matches your intuitions about the dataset.
- In summary, we can say that the past tense dataset contains verb lemmas, where the class to be predicted is characterized by past tense formation rules.
- This is a first example of how machine learning can help in solving linguistic issues, although J48 does not seem to be the ideal classifier for this dataset.

J48 implements a decision tree model following the C4.5 algorithm. C4.5 is an algorithm for generating decision trees, developed by Ross Quinlan as an extension of his earlier ID3 algorithm. The ID3 algorithm uses the "Information Gain" measure; C4.5 uses the "Gain Ratio" measure. (Optional reading: <http://cis-linux1.temple.edu/~giorgio/cis587/readings/id3-c45.html>.) If you look at the picture below, you can see that Weka cites the reference for the implemented classifier.

J48 makes use of entropy, which tells us about the "degree of doubt" in the data. J48 selects the attribute to split on by comparing information gains. The following is a quick summary of the J48 algorithm:
1. For each attribute, compute its entropy with respect to the class attribute.
2. Compute and select the attribute (say A) with the highest gain ratio.
3. Divide the data into separate sets according to the values of A.
4. Build a tree in which each branch represents one value of A.
5. For each subtree, repeat this process from step 1.
6. At each iteration, one attribute is removed from consideration.
7. The process stops when there are no attributes left to consider, or when all the data being considered in a subtree have the same value for the class attribute.
8. Inductive bias: shorter trees are preferred over larger trees.
Trees that place high-gain-ratio attributes close to the root are preferred over those that do not (cf. Lecture 1: What is machine learning?, Slide 31: J48/C4.5 and its antecedent ID3 have the same inductive bias). This means that the default parameters of J48 prune the trees drastically (read again the definition of inductive bias in Lecture 1: What is machine learning?, Slide 30; you have now experienced in practice what inductive bias is). It is possible to change these parameters and get an unpruned tree. We are not going into this exploration right now, but if you read all the parameters carefully you will see a confidenceFactor parameter (i.e. the C) and an unpruned parameter (see picture below).

J48 can deal with both nominal (past tense dataset) and numeric (iris dataset) attributes. However, please remember that this is not always the case, since some classifiers do not have this flexibility, for example linear classifiers.

In today's lab assignment we will explore the behaviour of Weka's Naive Bayes implementations. NB is neither a linear classifier nor a divide-and-conquer classifier: it is a probabilistic classifier. How does NB behave with linguistic datasets? Let's carry out this exploration today...

Preliminaries (Repetition)

Preprocess Tab
In the Preprocess tab, you can review the data you are working with. The left section outlines the general information and all the attributes of the dataset. When you select an attribute, the right section of the Explorer window gives you information about the data in that attribute. There is also a visual way of examining the data, which you can see by clicking the Visualize All button. Sometimes visualization is a very powerful tool for reviewing a dataset.

Classify Tab
In the Classify tab, you can create a model by using Choose to select a classifier. If you hover over the name of a classifier after pressing Choose, you will see a tooltip

containing the main information about the classifier you are hovering over (see picture below).

After the desired model has been chosen, we have to tell WEKA how to evaluate the model that has been built. In the Test options frame, the option Use training set means that we want to use the data supplied in the loaded ARFF file. The other three choices are: Supplied test set, where you can supply a different set of data on which to evaluate the model; Cross-validation, which lets WEKA repeatedly build a model on subsets of the supplied data, test it on the held-out part, and average the results; and Percentage split, where WEKA holds out a percentage of the supplied data for testing.

Tasks

G tasks: please provide comprehensive answers to all the questions below.

(1) Start Weka, launch the Explorer window and select the Preprocess tab. Open the past tense dataset. Select the Classify tab to get to the classification part of Weka. Click on Choose and hover over NaiveBayes. Read the tooltip and then select the classifier. In Test options, select 10-fold cross-validation and hit Start. Evaluate the performance of the NB classifier on the past tense dataset (i.e. read the output, that is, the evaluation measures that we have studied so far). What are your conclusions?

(2) Go back to your previous lab (where we used the decision tree classifier J48 on the past tense dataset). Compare the performance of J48 and NaiveBayes. What are your conclusions? If you had to recommend one of the two classifiers based on the Weka output, which one would you recommend?

(3) Open the spambase dataset in the Preprocess panel. How many classes, instances and attributes can be found in the spambase.arff dataset? What type of attributes and what kind of classes?

(4) Run both J48 and NaïveBayes on the spambase dataset. Compare and discuss their behavior on the two different datasets.

(5) Theoretical question: why is naïve Bayes classification called "naïve"? Briefly outline the major ideas of naïve Bayes classification.

VG tasks: please provide comprehensive answers to all the questions below.

(6) Run NaiveBayesSimple on the past tense dataset. You will get an error. Can you interpret what causes this error? (Tip: google the error and compare the description of the NaiveBayesSimple classifier against the description of the NaiveBayes classifier.) What are your conclusions?

(7) Run NaiveBayesSimple on the spambase dataset. You will get an error. Can you interpret what causes this error? (Tip: google the error and compare the description of the NaiveBayesSimple classifier against the description of the NaiveBayes classifier.) What are your conclusions?

To be submitted
A written report (at least 1 page) containing reasoned answers to the tasks and questions above, plus a short section where you summarize your reflections and experience. Submit the report in PDF format to santinim@stp.lingfil.uu.se no later than Fri 27 Nov 2015, 1pm (13:00).

Naming conventions
Please name your PDF report in this way (it will make it easier for me to organize and archive the reports): surname_name_lecturenumberlab_report.pdf (ex: santini_marina_lecture06lab_report.pdf).
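Optional aside, useful when answering task (5): the "naïve" conditional-independence assumption, and the cross-validation procedure you select in Weka's Test options, can both be sketched in a few lines of plain Python. This is a toy illustration, not Weka's implementation; the tiny past-tense-like dataset and all names in it are invented for the example.

```python
from collections import Counter, defaultdict
import math

def train_nb(X, y):
    """Estimate P(class) and P(attribute=value | class) from nominal data."""
    classes = Counter(y)                 # class frequencies
    cond = defaultdict(int)              # (attr_index, value, class) -> count
    values = defaultdict(set)            # attr_index -> observed values
    for xs, cls in zip(X, y):
        for i, v in enumerate(xs):
            cond[(i, v, cls)] += 1
            values[i].add(v)
    return classes, cond, values

def predict_nb(model, xs):
    """Pick argmax_c P(c) * prod_i P(x_i | c): the 'naive' independence step."""
    classes, cond, values = model
    total = sum(classes.values())
    best, best_lp = None, -math.inf
    for cls, n in classes.items():
        lp = math.log(n / total)
        for i, v in enumerate(xs):
            # Laplace (add-one) smoothing avoids zero probabilities
            lp += math.log((cond[(i, v, cls)] + 1) / (n + len(values[i])))
        if lp > best_lp:
            best, best_lp = cls, lp
    return best

def cross_validate(X, y, k):
    """k-fold cross-validation: average accuracy over k held-out folds."""
    correct = 0
    for f in range(k):
        train = [(xs, c) for j, (xs, c) in enumerate(zip(X, y)) if j % k != f]
        test = [(xs, c) for j, (xs, c) in enumerate(zip(X, y)) if j % k == f]
        m = train_nb([xs for xs, _ in train], [c for _, c in train])
        correct += sum(predict_nb(m, xs) == c for xs, c in test)
    return correct / len(X)

# Invented toy data: (final consonant, stem vowel) -> past tense class
X = [("d", "i"), ("t", "i"), ("ng", "i"), ("d", "a"), ("t", "a"), ("ng", "a")]
y = ["reg", "reg", "irreg", "reg", "reg", "irreg"]
model = train_nb(X, y)
print(predict_nb(model, ("ng", "i")))   # -> irreg
print(cross_validate(X, y, 3))
```

The model is "naïve" because the product over attributes in predict_nb treats each attribute as independent given the class, which is rarely true of phonemes in a verb, yet often works well in practice.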
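A second optional aside, on the J48/ID3 discussion earlier in this lab: the difference between ID3's information gain and C4.5's gain ratio can be demonstrated numerically. This is a toy sketch with an invented four-instance dataset, not Weka's code; it shows why gain ratio penalizes attributes with many values.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum p * log2(p) over the class frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """ID3: H(Y) minus the weighted entropy remaining after splitting on one attribute."""
    n = len(labels)
    rem = 0.0
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        rem += len(sub) / n * entropy(sub)
    return entropy(labels) - rem

def gain_ratio(values, labels):
    """C4.5: information gain divided by the entropy of the split itself."""
    split_info = entropy(values)   # entropy of the attribute's own value distribution
    g = info_gain(values, labels)
    return g / split_info if split_info > 0 else 0.0

# Invented data: an id-like attribute (unique value per instance) and a sensible
# two-valued attribute both split the classes perfectly, so both get gain 1.0,
# but gain ratio prefers the two-valued attribute.
labels = ["yes", "yes", "no", "no"]
many_valued = ["a", "b", "c", "d"]
two_valued = ["x", "x", "y", "y"]
print(info_gain(many_valued, labels), info_gain(two_valued, labels))    # 1.0 1.0
print(gain_ratio(many_valued, labels), gain_ratio(two_valued, labels))  # 0.5 1.0
```

This is the inductive-bias point from the algorithm summary in concrete form: by dividing out split_info, C4.5 avoids the ID3 tendency to grow bushy trees on many-valued attributes.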