Behavioral Data Mining Homework 1: Naïve Bayes Classifier
Yin-Chia Yeh, Hanzhong Ye (Ayden)


Overview:

The goal of this assignment is to apply the naïve Bayes classifier to a data set of labeled textual movie reviews. We implemented the classifier with both Bernoulli and multinomial models. To evaluate it, we ran a 10-fold cross-validation and measured accuracy in terms of the correct classification ratio and the F1 value. We also calculated word frequencies in our model and discussed the words with the top weights.

Implementation:

We used both the multinomial and the Bernoulli model to implement our naïve Bayes classifier. The core part of our code is as follows:

Classifier

    def logLikelihood(wordCnt: Double, totalCnt: Double, totalDistinct: Double): Double = {
      val smoothing = 1.0
      Math.log((wordCnt + smoothing) / (totalCnt + totalDistinct * smoothing))
    }

    var negScore = 0.0
    var posScore = 0.0
    inArticle.foreachPair { (s, c) =>
      val negWordCnt = negDic.get(s) match { case Some(cnt) => cnt; case None => 0.0 }
      val posWordCnt = posDic.get(s) match { case Some(cnt) => cnt; case None => 0.0 }

      // For the multinomial model
      negScore += c * logLikelihood(negWordCnt, negAllWordCnt, negDistinctWordCnt)
      posScore += c * logLikelihood(posWordCnt, posAllWordCnt, posDistinctWordCnt)

      // For the Bernoulli model
      negScore += logLikelihood(negWordCnt, negAllWordCnt, negDistinctWordCnt)
      posScore += logLikelihood(posWordCnt, posAllWordCnt, posDistinctWordCnt)
    }
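The smoothed log-likelihood and the two scoring variants above can be sketched in Python (our implementation is in Scala; the toy dictionaries and counts below are purely illustrative):

```python
import math

def log_likelihood(word_cnt, total_cnt, total_distinct, smoothing=1.0):
    # Laplace (add-alpha) smoothed estimate of log P(word | class)
    return math.log((word_cnt + smoothing) / (total_cnt + total_distinct * smoothing))

def score(article_counts, class_dic, all_word_cnt, distinct_word_cnt, multinomial=True):
    # article_counts: {word: count in this review}; class_dic: {word: count in the class}
    s = 0.0
    for word, c in article_counts.items():
        ll = log_likelihood(class_dic.get(word, 0.0), all_word_cnt, distinct_word_cnt)
        # Multinomial weights by the in-review count c; Bernoulli counts each word once
        s += c * ll if multinomial else ll
    return s

# Toy example: score one review against a tiny "positive" class dictionary
pos_dic = {"good": 3.0, "great": 2.0}
review = {"good": 2, "boring": 1}
pos_score = score(review, pos_dic, all_word_cnt=5.0, distinct_word_cnt=2.0)
```

The class with the larger score wins, exactly as negScore and posScore are compared in the Scala version.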

After implementation, we ran a 10-fold cross-validation and computed two statistics: 1. the correct classification rate (number of correctly classified samples / number of tested samples), and 2. the F1 value (derived from precision and recall).

Model Type Comparison

First, we observe the influence of the model type on the correct classification rate and the F1 value under three chosen Alpha values (0.5, 1, 2), as shown in Figure 1 and Figure 2.

Figure 1. Influence of Model Type on Correct Ratio (with Alpha = 0.5, 1, 2)
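The two statistics can be computed from per-fold predictions as follows (a minimal Python sketch; the function and variable names are ours):

```python
def evaluate(predicted, actual, positive="pos"):
    # Correct classification rate, plus F1 for the positive class
    tp = sum(1 for p, a in zip(predicted, actual) if p == a == positive)
    fp = sum(1 for p, a in zip(predicted, actual) if p == positive and a != positive)
    fn = sum(1 for p, a in zip(predicted, actual) if p != positive and a == positive)
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    accuracy = correct / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

acc, f1 = evaluate(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "pos"])
```

In 10-fold cross-validation these numbers are computed per fold and averaged.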

Figure 2. Influence of Model Type on F1 Value (with Alpha = 0.5, 1, 2)

From Figure 1 and Figure 2 we can see that with Alpha = 0.5, 1, and 2, neither the correct ratio nor the F1 value is significantly influenced by the choice of model type. Both the multinomial and the Bernoulli model reach a correct classification rate of around 0.81 and an F1 value of around 0.79. We use the multinomial model in the rest of this report.

Smoothing Term Value

We then tested the influence of the Alpha value used for smoothing in the multinomial model. We pick Alpha values starting at 0.015625 and doubling each subsequent value, so that the values are evenly spaced on an x-axis of log2(Alpha). Figure 3 and Figure 4 show the influence of the Alpha value on the correct ratio and on the F1 value.
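The Alpha grid described above can be generated by repeated doubling; a small sketch (the count of 18 steps is our assumption, chosen so the grid runs from 0.015625 up to 2048):

```python
import math

def alpha_grid(start=0.015625, count=18):
    # Doubling values: start * 2^k, which are evenly spaced on a log2 axis
    return [start * 2 ** k for k in range(count)]

alphas = alpha_grid()
log2_axis = [math.log2(a) for a in alphas]  # -6, -5, ..., 11
```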

Figure 3. Influence of Alpha Value on Correct Ratio (Multinomial Model)

Figure 4. Influence of Alpha Value on F1 Value (Multinomial Model)

From the figures above we conclude that from Alpha << 1 up to around Alpha = 32, the correct ratio and the F1 value are not strongly influenced by the Alpha value, staying around 0.8 and 0.78 respectively. From Alpha = 32 both values increase, reaching a peak of about 0.83 at around Alpha = 64. Afterwards both values drop sharply with increasing Alpha, and the system finally fails when Alpha reaches 2048.

Word Weights Analysis:

We also made a statistical review of the words with the top weights. We counted the words in the training data: the 900 positive reviews contain 712,103 words (38,106 distinct), and the 900 negative reviews contain 637,078 words (35,837 distinct). The top 10 words with the most weight in each class are shown in Table 1.

Table 1. Top 10 words with the most weight in the positive and negative training data.

It is obvious that none of these top-ranked words carries any useful information about the reviewer's attitude. However, when we looked at the top 500 most frequently used words in both pools, we did find some meaningful words, as shown in Table 2.
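The per-class word counts behind Table 1 amount to a simple frequency count over each pool of reviews; a Python sketch (tokenization is simplified to whitespace splitting, which is an assumption):

```python
from collections import Counter

def top_words(reviews, n=10):
    # reviews: iterable of raw review strings belonging to one class
    counts = Counter()
    for text in reviews:
        counts.update(text.lower().split())
    return counts.most_common(n)

# Toy example on a single short "review"
top = top_words(["the the the movie was good good"], n=2)
```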

Table 2. Selected words, from the top 500 most frequently used, that carry attitude information

From Table 2 we can see that many positive words such as "like", "good", "great", "interesting", and "perfect" are used more often in positive reviews, while negative words such as "bad", "never", "down", and "old" occur more frequently in negative reviews. We believe these words make sense and contribute to the classification process. These results also suggest that, as future work, we can improve the accuracy of our system by removing stop words, which appear in the top-ranked list but don't carry attitude information. Other strategies to improve our system include using a stemming algorithm, processing n-grams, etc.
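The stop-word removal and n-gram processing proposed above could look like the following sketch (the stop-word list here is a tiny illustrative subset, not one we actually used):

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in"}  # illustrative subset only

def preprocess(text, use_bigrams=True):
    # Drop stop words, then optionally append bigrams of the remaining tokens
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    features = list(tokens)
    if use_bigrams:
        features += [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return features

feats = preprocess("the plot is good and the acting is great")
```

The resulting feature list (unigrams plus bigrams) would simply replace the raw word counts fed to the classifier.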