Behavioral Data Mining Homework 1
Naïve Bayes Classifier
Yin-Chia Yeh, Hanzhong Ye (Ayden)

Overview:

The goal of this assignment is to apply the naïve Bayes classifier to a data set of labeled textual movie reviews. We implemented the classifier with both Bernoulli and multinomial models. To evaluate the classifier, we ran a 10-fold cross-validation and measured accuracy using both the correct classification ratio and the F1 value. We also computed word frequencies in our model and discussed the words with the top weights.

Implementation:

We have used both multinomial and Bernoulli models to implement our naïve Bayes classifier. The core part of our code is as follows:

Classifier

    def loglikelihood(wordCnt: Double, totalCnt: Double, totalDistinct: Double): Double = {
      val smoothing = 1.0
      Math.log((wordCnt + smoothing) / (totalCnt + totalDistinct * smoothing))
    }

    var negscore = 0.0
    var posscore = 0.0
    inarticle.foreachpair((s, c) => {
      val negwordcnt = negdic.get(s) match {
        case Some(cnt) => cnt
        case None      => 0.0
      }
      val poswordcnt = posdic.get(s) match {
        case Some(cnt) => cnt
        case None      => 0.0
      }
      // Multinomial model: weight each word's log-likelihood by its in-document count c
      negscore += c * loglikelihood(negwordcnt, negallwordcnt, negdistinctwordcnt)
      posscore += c * loglikelihood(poswordcnt, posallwordcnt, posdistinctwordcnt)
      // Bernoulli model (used instead of the two lines above): each word counts once
      // negscore += loglikelihood(negwordcnt, negallwordcnt, negdistinctwordcnt)
      // posscore += loglikelihood(poswordcnt, posallwordcnt, posdistinctwordcnt)
    })
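The loop above accumulates the two class scores but omits the final decision step. A minimal sketch of that step, assuming equal log-priors from the 900 positive and 900 negative training documents (the score values below are illustrative, not taken from a real run):

```scala
object DecisionSketch extends App {
  // Hypothetical per-class log-likelihood totals produced by the scoring loop
  val negscore = -120.5
  val posscore = -118.2
  // Add the class log-prior; with 900 training documents per class the priors are equal
  val negposterior = negscore + Math.log(900.0 / 1800.0)
  val posposterior = posscore + Math.log(900.0 / 1800.0)
  val label = if (posposterior > negposterior) "positive" else "negative"
  println(label) // prints "positive"
}
```

With equal class priors the comparison reduces to the likelihood scores alone, but keeping the prior term makes the sketch correct for unbalanced folds as well.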
After implementation, we did a 10-fold cross-validation and computed two statistics:

1. Correct classification rate (# of correctly classified samples / # of tested samples)
2. F1 value (derived from the precision and recall values)

Model Type Comparison

We have implemented the naïve Bayes classifier using both multinomial and Bernoulli models. First, we observed the influence of model type on the correct classification rate and the F1 value under three chosen Alpha values (0.5, 1, 2), as shown in Figure 1 and Figure 2.

Figure 1. Influence of Model Type on Correct Ratio (with Alpha = 0.5, 1, 2)
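The F1 computation can be sketched as follows; the confusion counts here are illustrative, not taken from our folds:

```scala
object F1Sketch extends App {
  // F1 is the harmonic mean of precision and recall, computed from
  // true positives (tp), false positives (fp), and false negatives (fn)
  def f1(tp: Double, fp: Double, fn: Double): Double = {
    val precision = tp / (tp + fp)
    val recall = tp / (tp + fn)
    2.0 * precision * recall / (precision + recall)
  }
  // Illustrative counts for one fold: precision 0.80, recall about 0.76
  println(f1(80, 20, 25)) // about 0.78
}
```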
Figure 2. Influence of Model Type on F1 Value (with Alpha = 0.5, 1, 2)

From Figure 1 and Figure 2 we can see that, with Alpha = 0.5, 1, and 2, neither the correct ratio nor the F1 value is significantly influenced by the choice of model type. Both the multinomial and Bernoulli models achieve a correct classification rate of around 0.81 and an F1 value of around 0.79. We used the multinomial model in the rest of this report.

Smoothing Term Value

Next, we tested the influence of the Alpha smoothing value in the multinomial model. We picked Alpha values starting at 0.015625, with each subsequent value doubling the previous one, so the values are evenly spaced on an x-axis of log(Alpha)/log 2. Figure 3 and Figure 4 show the influence of the Alpha value on the correct ratio and on the F1 value.
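The swept Alpha values can be generated as below (a sketch; the upper bound 2048 matches the largest value we tested):

```scala
object AlphaGrid extends App {
  // Start at 2^-6 = 0.015625 and double repeatedly; doubling is exact in
  // binary floating point, so the grid is uniform on a log2(Alpha) axis
  val alphas = Iterator.iterate(0.015625)(_ * 2.0).takeWhile(_ <= 2048.0).toList
  println(alphas.size)  // 18 values, exponents -6 through 11
  println(alphas.head)  // 0.015625
  println(alphas.last)  // 2048.0
}
```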
Figure 3. Influence of Alpha Value on Correct Ratio (Multinomial Model)

Figure 4. Influence of Alpha Value on F1 Value (Multinomial Model)
From the figures above we can conclude that from Alpha << 1 up to around Alpha = 32, the correct ratio and the F1 value are not strongly influenced by the Alpha value, staying around 0.8 and 0.78 respectively. From Alpha = 32 both values increase, reaching a peak of around 0.83 at about Alpha = 64. Afterwards both values drop sharply with increasing Alpha, and the system finally fails when Alpha reaches 2048.

Word Weights Analysis:

We have also made a statistical review of the words with the top weights. We counted the words in the training data: 900 positive reviews containing 712103 words (38106 distinct), and 900 negative reviews containing 637078 words (35837 distinct). The top 10 words with the highest weight in each class are shown in Table 1.

Table 1. Top 10 words with the highest weight in the positive and negative training data.

It is obvious that none of these top-ranked words carries any useful information about the reviewer's attitude. However, when we looked into the top 500 most frequently used words in both pools, we did find some meaningful words, as shown in Table 2.
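The per-class statistics above come from a straightforward frequency dictionary; a minimal sketch on toy tokens (the token list is illustrative):

```scala
object WordCountSketch extends App {
  val tokens = List("good", "film", "good", "story", "film", "good")
  // Frequency dictionary: word -> count, as used for the per-class word statistics
  val dic: Map[String, Double] =
    tokens.groupBy(identity).map { case (w, ws) => (w, ws.size.toDouble) }
  println(tokens.size) // total word count: 6
  println(dic.size)    // distinct word count: 3
  println(dic("good")) // 3.0
}
```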
Table 2. Selected words with attitude information from the top 500 most frequently used words

From Table 2 we can see that many positive words such as "like", "good", "great", "interesting", and "perfect" are used more often in positive reviews, while negative words such as "bad", "never", "down", and "old" occur more frequently in negative reviews. We believe these words make sense and contribute to the classification process. These results also suggest future work: we can improve the accuracy of our system by removing stop words, which appear in the top-ranked list but don't carry attitude information. Other strategies to improve our system include using a stemming algorithm, processing n-grams, etc.
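The stop-word removal we propose can be sketched as a simple filter (the stop-word list here is a tiny illustrative sample, not a standard list):

```scala
object StopwordSketch extends App {
  // Illustrative stop-word sample; a real run would use a standard stop-word list
  val stopwords = Set("the", "a", "of", "and", "is", "in", "it", "to")
  val tokens = List("the", "acting", "is", "good", "and", "the", "plot", "is", "interesting")
  // A Set is a String => Boolean, so it can be passed to filterNot directly
  val filtered = tokens.filterNot(stopwords)
  println(filtered) // List(acting, good, plot, interesting)
}
```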