Lecture 06: LAB Assignment
Weka: Naïve Bayes Classifier(s)

ACKNOWLEDGEMENTS: Today's lab assignment was inspired by the following lab projects: past tense dataset + decision trees: <http://coltekin.net/cagri/ml08/lab3n.html>; spam dataset + naïve Bayes: <http://www.cs.uml.edu/~jlu1/ta/dm_fall2013/p1.html>.

INFO
Required reading for Lecture 6 and the matching lab assignment:
- Daume (2015): Ch 7, from page 103 to page 110.

Requirement: be open to discussing the main topics and/or the main problems of the lab assignment with one or more randomly chosen classmates. If you have a problem or a deep insight, do not keep it to yourself: share it! You might solve the problem, or you might gain an even deeper insight :-) Your original way of thinking will be enriched by discussions with peers.

Execution time: approx. 2-3 hours.

ATT: the datasets can be downloaded from here: <http://stp.lingfil.uu.se/~santinim/ml/2015/datasets/>

Learning objectives
In this lab assignment you are going to explore the behavior of the Naïve Bayes classifier(s) (as implemented in Weka) on linguistic data:
- datasets: spam dataset, past tense dataset
- classifiers: NaiveBayes, (NaiveBayesSimple)

Pondering our previous experiences
In our previous Weka lab assignments, we used two different datasets, namely the iris dataset and the past tense dataset. These two datasets can be characterized as follows:
- iris flowers: small (150 instances), balanced distribution of instances across three classes (50 instances per class), numerical attributes/features (measurements in cm), categorical class labels (the names of the iris species).
- past tense inflections: largish (4330 instances), many classes (42 class labels), nominal attributes (phonemes), highly unbalanced.

You explored the past tense dataset and correctly figured out the following facts:
- It is a list of:
  - verbs (ex: <http://coltekin.net/cagri/ml08/verbs.txt>)
  - phonemes (ex: <http://coltekin.net/cagri/ml08/phonemes.txt>)
  - classes (<http://coltekin.net/cagri/ml08/classes.txt>)
  - For your convenience, you can now browse the list here (<http://coltekin.net/cagri/ml08/past-tense.dat>) and see whether it really matches your intuitions about the dataset.
- In summary, we can say that the past tense dataset contains verb lemmas, where the class to be predicted is characterized by past tense formation rules.
- This is a first example of how machine learning can help in solving linguistic issues, although J48 does not seem to be the ideal classifier for this dataset.

J48 implements a decision tree model following the C4.5 algorithm. C4.5 is an algorithm for generating decision trees, developed by Ross Quinlan as an extension of his earlier ID3 algorithm. ID3 uses the "Information Gain" measure; C4.5 uses the "Gain Ratio" measure. (Optional reading: <http://cis-linux1.temple.edu/~giorgio/cis587/readings/id3-c45.html>.) If you look at the picture below, you can see that Weka cites the reference of the implemented classifier.

J48 makes use of entropy, which measures the "degree of doubt" (uncertainty) in the data. J48 selects the attribute to split on by comparing information gains. The following is a quick summary of the J48 algorithm:
1. For each attribute, compute its entropy with respect to the class attribute.
2. Compute and select the attribute (say A) with the highest gain ratio.
3. Divide the data into separate sets according to the values of A.
4. Build a tree in which each branch represents one value of A.
5. For each subtree, repeat this process from step 1.
6. At each iteration, one attribute is removed from consideration.
7. The process stops when there are no attributes left to consider, or when all the data being considered in a subtree have the same value for the class attribute.
8. Inductive bias: shorter trees are preferred over larger trees.
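Steps 1 and 2 above can be sketched in a few lines of Python. This is only a minimal illustration of the entropy and gain ratio computations, not Weka's actual implementation, and the toy dataset is invented:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Information gain of attribute `attr`, normalized by its split information."""
    total = len(labels)
    # Partition the class labels by the attribute's values.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    # Expected entropy of the class after splitting on this attribute.
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    gain = entropy(labels) - remainder
    # Split information = entropy of the attribute's own value distribution.
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

# Invented toy data: attribute 0 predicts the class perfectly, attribute 1 is noise.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["reg", "reg", "irreg", "irreg"]
print(gain_ratio(rows, labels, 0))  # 1.0 (perfect predictor)
print(gain_ratio(rows, labels, 1))  # 0.0 (uninformative)
```

A decision tree learner would pick attribute 0 here, split the data on its values, and recurse on each subset with that attribute removed.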
Trees that place high gain ratio attributes close to the root are preferred over those that do not (cf. Lecture 1: What is machine learning?, Slide 31: J48/C4.5 and its antecedent ID3 have the same inductive bias). This means that, with its default parameters, J48 prunes the trees drastically (read again the definition of inductive bias in Lecture 1: What is machine learning?, Slide 30: you have now experienced in practice what inductive bias is). It is possible to change these parameters and get an unpruned tree. We are not going into
this exploration right now, but if you read through all the parameters carefully you will see a confidenceFactor parameter (i.e. the C) and an unpruned parameter (see picture below).

J48 can deal with both nominal (past tense dataset) and numeric (iris dataset) attributes. However, please remember that this is not always the case: some classifiers do not have this flexibility, for example linear classifiers.

In today's lab assignment we will explore the behaviour of Weka's Naive Bayes implementations. NB is neither a linear classifier nor a divide-and-conquer classifier: it is a probabilistic classifier. How does NB behave with linguistic datasets? Let's carry out this exploration today...

Preliminaries (Repetition)

Preprocess Tab
In the Preprocess tab, you can review the data you are working with. The left section outlines general information about the dataset and lists all of its attributes. When you select an attribute, the right section of the Explorer window shows information about the data in that attribute. There is also a visual way of examining the data, which you can access by clicking the Visualize All button. Sometimes visualization is a very powerful tool for reviewing a dataset.

Classify Tab
In the Classify tab, you can create a model by using Choose to select a classifier. If you hover over the name of a classifier after pressing Choose, you will see a tooltip
containing the main information about the classifier you are hovering over (see picture below). After the desired model has been chosen, we have to tell Weka how to evaluate the model that is built. In the Test options frame, the option Use training set means that the model is evaluated on the same data supplied in the loaded ARFF file. The other three choices are: Supplied test set, where you can supply a different set of data on which to evaluate the model; Cross-validation, which lets Weka partition the supplied data into subsets (folds), train and test on each fold in turn, and average the results; and Percentage split, where Weka holds out a percentage of the supplied data for testing and builds the model on the rest.

Tasks

G tasks: please provide comprehensive answers to all the questions below.

(1) Start Weka, launch the Explorer window and select the Preprocess tab. Open the past tense dataset. Select the Classify tab to get into the classification panel of Weka. Click on Choose and hover over NaiveBayes. Read the tooltip and then select the classifier. In Test options, select 10-fold cross-validation and hit Start. Evaluate the performance of the NB classifier on the past tense dataset (i.e. read the output, i.e. the evaluation measures that we have studied so far). What are your conclusions?

(2) Go back to your previous lab (where we used the decision tree classifier J48 on the past tense dataset). Compare the performance of J48 and NaiveBayes. What are your conclusions? If you had to recommend one of the two classifiers based on the Weka output, which one would you recommend?
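As background for these two tasks (and for question (5) below), here is a minimal sketch of what a naïve Bayes classifier computes for nominal attributes: it multiplies the class prior by per-attribute likelihoods, assuming the attributes are conditionally independent given the class. This is an illustrative toy, not Weka's implementation; the add-one (Laplace) smoothing and the invented spam/ham data are assumptions made for the example:

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Naive Bayes for nominal attributes, with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.priors = {c: n / len(labels) for c, n in Counter(labels).items()}
        self.class_sizes = Counter(labels)
        # counts[c][i][v] = how often attribute i takes value v within class c
        self.counts = {c: defaultdict(Counter) for c in self.classes}
        self.values = defaultdict(set)  # observed values per attribute
        for row, label in zip(rows, labels):
            for i, v in enumerate(row):
                self.counts[label][i][v] += 1
                self.values[i].add(v)
        return self

    def predict(self, row):
        best_class, best_score = None, -math.inf
        for c in self.classes:
            # log P(c) + sum_i log P(x_i | c), smoothed to avoid zero counts
            score = math.log(self.priors[c])
            for i, v in enumerate(row):
                num = self.counts[c][i][v] + 1
                den = self.class_sizes[c] + len(self.values[i])
                score += math.log(num / den)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

# Invented toy data: classify messages from two nominal features.
rows = [("free", "yes"), ("free", "no"), ("hi", "no"), ("hi", "no")]
labels = ["spam", "spam", "ham", "ham"]
clf = TinyNaiveBayes().fit(rows, labels)
print(clf.predict(("free", "yes")))  # spam
print(clf.predict(("hi", "no")))     # ham
```

The "naïve" part is exactly the independence assumption in predict(): each attribute contributes its own likelihood term as if the other attributes did not exist.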
(3) Open the spambase dataset in the Preprocess panel. How many classes, instances and attributes can be found in the spambase.arff dataset? What type of attributes and what kind of classes?

(4) Run both J48 and NaïveBayes on the spambase dataset. Compare and discuss their behavior on the two different datasets.

(5) Theoretical question: why is naïve Bayes classification called "naïve"? Briefly outline the major ideas of naïve Bayes classification.

VG tasks: please provide comprehensive answers to all the questions below.

(6) Run NaiveBayesSimple on the past tense dataset. You will get an error. Can you try to interpret what causes this error? (Tip: google the error and compare the description of the NaiveBayesSimple classifier against the description of the NaiveBayes classifier.) What are your conclusions?

(7) Run NaiveBayesSimple on the spambase dataset. You will get an error. Can you try to interpret what causes this error? (Tip: google the error and compare the description of the NaiveBayesSimple classifier against the description of the NaiveBayes classifier.) What are your conclusions?

To be submitted
A written report (at least 1 page) containing reasoned answers to the tasks and questions above, plus a short section where you summarize your reflections and experience. Submit the report in PDF format to santinim@stp.lingfil.uu.se no later than Fri 27 Nov 2015, 1pm (13:00).

Naming conventions
Please name your PDF report as follows (it will be easier for me to organize and archive the reports): surname_name_lecturenumberlab_report.pdf (ex: santini_marina_lecture06lab_report.pdf).