Scaling Up the Accuracy of Naive Bayes Classifiers: a Decision Tree Hybrid. Ronny Kohavi, Data Mining and Visualization Group, Silicon Graphics, Inc.

The Naive Bayes Classifier. The Naive Bayes classifier computes the probability of each label value given the record, assuming the attributes are conditionally independent given the label. The assumption seems very strong, but: Naive Bayes performs surprisingly well in experiments [Kononenko 1993; Langley & Sage 1994; Kohavi & Sommerfield 1995], and correct classification does not require accurate estimates of the probabilities [Friedman 1996; Domingos & Pazzani 1996].
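
To make the computation concrete, here is a minimal Python sketch of counting-based training and log-space prediction for the classifier described above. The names (train_nb, predict_nb, n_values) are hypothetical, and the Laplace smoothing is an assumption made for the sketch, not something stated on the slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(records, labels):
    """Count label frequencies and (attribute, value, label) co-occurrences."""
    label_counts = Counter(labels)
    cond_counts = defaultdict(int)  # (attr_index, value, label) -> count
    for x, c in zip(records, labels):
        for i, v in enumerate(x):
            cond_counts[(i, v, c)] += 1
    return label_counts, cond_counts

def predict_nb(x, label_counts, cond_counts, n_values=2):
    """Return argmax_c of log P(c) + sum_i log P(x_i | c), Laplace-smoothed."""
    total = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for c, nc in label_counts.items():
        if nc == 0:  # label currently absent (e.g., its fold was deleted)
            continue
        score = math.log(nc / total)  # log P(c)
        for i, v in enumerate(x):
            # Smoothed estimate of P(x_i = v | c); n_values is an assumed
            # number of distinct values per attribute.
            score += math.log((cond_counts[(i, v, c)] + 1) / (nc + n_values))
        if score > best_score:
            best, best_score = c, score
    return best
```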

Interpretability. Census Bureau data on working adults in 1994. Classification task: who makes over $50K?

Sometimes It Even Scales! [Learning curves: accuracy vs. training-set size for DNA and waveform-40.] Two moderately large datasets where Naive Bayes significantly outperforms C4.5 (decision trees).

But Often It Does Not. [Learning curves for chess, shuttle, mushroom, and adult: four datasets where Naive Bayes does not outperform the decision tree.]

And NB Asymptotes Early. [Learning curves for satimage (showing a crossover) and letter.] Naive Bayes starts better but does not improve: it asymptotes early, while C4.5 is still improving long after Naive Bayes has flattened out.

When is Naive Bayes Better? Many irrelevant features: Naive Bayes is very robust to irrelevant features, because the conditional probabilities for an irrelevant feature quickly equalize across labels and hence do not affect the prediction. Predictions require taking many features into account: decision trees suffer from fragmentation in these cases. The assumptions hold, i.e., the features are conditionally independent and roughly equally important (e.g., medical domains).

When are Decision Trees Better? Serial tasks: once the value of a key feature is known, the dependencies and distributions change. A good example is chess. Another view of this: when segmenting the data into subpopulations gives "easier" subproblems. There are key features: some features are much more important than others. In the mushroom dataset, the odor attribute alone gives over 98% accuracy; Naive Bayes never reached that level.

NBTree: a Hybrid. Use a decision tree to segment the data into subproblems and apply Naive Bayes to each one. Decision nodes test attributes exactly as in regular decision trees, but the leaves contain Naive Bayes classifiers. Since Naive Bayes is good at handling many features with relatively little data, it is used where it is most useful: at the leaves.
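
A minimal sketch of the structure this slide describes, reusing predict_nb from the sketch above: internal nodes route on an attribute value exactly like a decision tree, and leaves defer to Naive Bayes. The class name and the majority-label fallback for unseen attribute values are assumptions made for illustration.

```python
class NBTreeNode:
    """Internal node: tests one attribute. Leaf: holds a Naive Bayes model."""
    def __init__(self, attr_index=None, children=None, nb_leaf=None, majority=None):
        self.attr_index = attr_index    # attribute tested here (None at a leaf)
        self.children = children or {}  # attribute value -> child subtree
        self.nb_leaf = nb_leaf          # (label_counts, cond_counts) at a leaf
        self.majority = majority        # fallback label for unseen values

    def classify(self, x):
        if self.nb_leaf is not None:    # leaf: let Naive Bayes decide
            label_counts, cond_counts = self.nb_leaf
            return predict_nb(x, label_counts, cond_counts)
        child = self.children.get(x[self.attr_index])
        return child.classify(x) if child is not None else self.majority
```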

How to Segment the Data. Observation: Naive Bayes is an incremental induction algorithm, so cross-validation can be done fast (linear in the number of instances) by deleting folds, testing them, and inserting them again. Instead of using a direct splitting criterion such as mutual information, Gini, or gain ratio, we use cross-validation to estimate how much a split would help versus creating a Naive Bayes leaf. We don't attempt to fundamentally derive when a split is useful; we try it out.
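
The delete/test/insert idea can be sketched directly on the count-based representation from the first sketch (train_nb/predict_nb): removing a fold only subtracts its counts, so a full k-fold estimate touches each instance a constant number of times. The interleaved fold construction and k=5 are illustrative assumptions, not the paper's exact procedure.

```python
def cv_accuracy(records, labels, k=5):
    """Estimate NB accuracy by k-fold CV using count deletion/insertion."""
    label_counts, cond_counts = train_nb(records, labels)  # counts on all data
    n, correct = len(records), 0
    folds = [range(f, n, k) for f in range(k)]
    for fold in folds:
        for j in fold:                       # delete the fold's counts
            x, c = records[j], labels[j]
            label_counts[c] -= 1
            for i, v in enumerate(x):
                cond_counts[(i, v, c)] -= 1
        correct += sum(                       # test on the held-out fold
            predict_nb(records[j], label_counts, cond_counts) == labels[j]
            for j in fold)
        for j in fold:                       # insert the counts back
            x, c = records[j], labels[j]
            label_counts[c] += 1
            for i, v in enumerate(x):
                cond_counts[(i, v, c)] += 1
    return correct / n
```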

Results: Absolute Differences. Difference in accuracy between NBTree and C4.5, and between NBTree and Naive Bayes; above the zero line means NBTree is better. [Bar chart over the datasets tic-tac-toe, chess, letter, vehicle, vote, monk1, segment, satimage, flare, iris, led24, mushroom, vote1, adult, shuttle, soybean-large, DNA, ionosphere, breast (L), crx, breast (W), german, pima, heart, glass, cleve, waveform-40, glass2, primary-tumor; y-axis: accuracy difference, roughly -10 to +30 points.]

Results: Relative Differences. Relative difference in error between NBTree and C4.5 (NBTree/C4.5), and between NBTree and Naive Bayes (NBTree/NB); below 1.0 means NBTree is better. [Bar chart of the two error ratios over the same datasets; y-axis: error ratio, 0.00 to 1.50.]

Interpretability. The resulting structure is relatively easy to interpret. While NBTrees have complex leaves, there are fewer nodes overall: letter: 2109 nodes (C4.5) versus 251 (NBTree); adult: 2213 versus 137; DNA: 31 versus 3; LED24: 49 versus 1. Many leaves end up as regular decision-tree leaves because they contain a single class.

Summary. NBTree combines decision-tree-based segmentation of the data with Naive Bayes classifiers at the leaves. Induction time is slower, but the asymptotic complexity is the same (the constants are bigger). It scales well to large files. On the three largest files (shuttle, adult, letter), NBTree outperformed both C4.5 and Naive Bayes.