Scaling Up the Accuracy of Naive Bayes Classifiers: a Decision Tree Hybrid
Ronny Kohavi
Data Mining and Visualization Group, Silicon Graphics, Inc.
The Naive Bayes Classifier
The Naive Bayes classifier computes the probability of each label value given the record, assuming the attributes are conditionally independent given the label.
The assumption seems very strong, but:
Naive Bayes performs surprisingly well in experiments [Kononenko 1993; Langley & Sage 1994; Kohavi & Sommerfield 1995].
Correct classification does not require accurate estimates of the probabilities [Friedman 1996; Domingos & Pazzani 1996].
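In symbols, a minimal restatement of the rule above (not on the original slide), with x_1, ..., x_n denoting the record's attribute values and y ranging over the label values:

```latex
\hat{y} \;=\; \operatorname*{arg\,max}_{y}\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The class priors P(y) and conditional probabilities P(x_i | y) are estimated from counts in the training data.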
Interpretability
Census Bureau data on working adults in 1994. Classification task: who makes over $50K?
Sometimes It Even Scales!
[Learning curves for the DNA and waveform-40 datasets: accuracy versus training-set size.]
Two semi-large datasets on which Naive Bayes significantly outperforms C4.5 (decision trees).
But Often It Does Not
[Learning curves for the chess, shuttle, mushroom, and adult datasets: accuracy versus training-set size.]
And NB Asymptotes Early
[Learning curves for the satimage and letter datasets: accuracy versus training-set size, showing a crossover.]
Naive Bayes starts out better but does not improve: C4.5 is still improving while Naive Bayes asymptoted early.
When Is Naive Bayes Better?
Many irrelevant features: Naive Bayes is very robust to irrelevant features, because the conditional probabilities for an irrelevant feature quickly equalize across labels and hence do not affect the prediction (see the sketch after this slide).
Predictions require taking many features into account: decision trees suffer from fragmentation in these cases.
The assumptions hold, i.e., the features are conditionally independent and equally important (e.g., medical domains).
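A minimal sketch (not from the talk) of the equalization claim: for a feature that is independent of the label, the estimated P(x | y) converges to the same value for every label y, so the feature's factor is (nearly) identical across labels and cancels out of the argmax.

```python
import random

random.seed(0)

# Labels are drawn 50/50; the feature x is pure noise, independent of the label.
n = 10_000
data = [(random.choice(["pos", "neg"]), random.choice([0, 1])) for _ in range(n)]

# Estimate the class-conditional probabilities P(x = 1 | y) by counting.
for label in ("pos", "neg"):
    xs = [x for y, x in data if y == label]
    p = sum(xs) / len(xs)
    print(f"P(x=1 | y={label}) = {p:.3f}")

# Both estimates approach 0.5, so the factor P(x | y) is essentially the same
# for every label and does not change argmax_y P(y) * prod_i P(x_i | y).
```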
When Are Decision Trees Better?
Serial tasks: once the value of a key feature is known, dependencies and distributions change. A good example is chess. Another view of this: when segmenting the data into subpopulations gives "easier" subproblems.
There are key features: some features are much more important than others. In the mushroom dataset, the odor attribute alone gives over 98% accuracy; Naive Bayes never got to this level.
NBTree: a Hybrid
Use a decision tree to segment the data into subproblems and apply Naive Bayes to each one.
Decision nodes test attributes as in regular decision trees, but the leaves contain Naive Bayes classifiers.
Since Naive Bayes is good at handling many features with relatively little data, it is used where it is most useful: at the leaves.
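A hedged structural sketch of the hybrid (class and field names are illustrative, not from the paper): internal nodes route a record by the tested attribute's value, and each leaf classifies with a Naive Bayes model fit on the records that reached it.

```python
from dataclasses import dataclass, field


@dataclass
class NBLeaf:
    """A leaf holding a Naive Bayes classifier fit on the records that reach it."""
    nb_model: "NaiveBayes"  # placeholder: any classifier exposing predict(record)

    def classify(self, record):
        return self.nb_model.predict(record)


@dataclass
class DecisionNode:
    """An internal node that tests one attribute, as in a regular decision tree."""
    attribute: str
    children: dict = field(default_factory=dict)  # attribute value -> subtree

    def classify(self, record):
        # Descend into the child matching this record's value for the tested attribute.
        child = self.children[record[self.attribute]]
        return child.classify(record)
```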
How to Segment the Data
Observation: Naive Bayes is an incremental induction algorithm, which means cross-validation can be done fast (linear in the number of instances) by deleting the instances in a fold, testing on them, and inserting them again.
Instead of a direct splitting criterion such as mutual information, Gini, or gain ratio, we use cross-validation to estimate how much a split would help versus creating an NB leaf.
We don't attempt to derive from first principles when a split is useful; we try it out (see the sketch after this slide).
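A minimal sketch of the selection step under stated assumptions: cv_accuracy and split_on are illustrative helpers, and this omits the paper's exact stopping thresholds. The idea is simply to compare the cross-validated accuracy of one NB leaf against each candidate split.

```python
def choose_split(data, attributes, cv_accuracy):
    """Pick the attribute whose split most improves cross-validated
    Naive Bayes accuracy, or None if a single NB leaf is at least as good.

    cv_accuracy(partition) estimates accuracy by cross-validation, training
    one Naive Bayes classifier per block of the partition (assumed helper).
    """
    # Baseline: make this node a single Naive Bayes leaf.
    best_attr, best_accuracy = None, cv_accuracy([data])

    for attr in attributes:
        # Try the split: partition the data by the attribute's values.
        acc = cv_accuracy(split_on(data, attr))
        if acc > best_accuracy:
            best_attr, best_accuracy = attr, acc

    return best_attr  # None means: stop splitting and create an NB leaf


def split_on(data, attr):
    """Group records (dicts) by their value of attr (illustrative helper)."""
    groups = {}
    for record in data:
        groups.setdefault(record[attr], []).append(record)
    return list(groups.values())
```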
Results: Absolute Differences
Difference in accuracy between NBTree and C4.5, and between NBTree and Naive Bayes. Above the zero line means NBTree is better.
[Bar chart of accuracy differences (NBTree - C4.5 and NBTree - NB) on tic-tac-toe, chess, letter, vehicle, vote, monk1, segment, satimage, flare, iris, led24, mushroom, vote1, adult, shuttle, soybean-large, DNA, ionosphere, breast (L), crx, breast (W), german, pima, heart, glass, cleve, waveform-40, glass2, and primary-tumor.]
Results: Relative Differences
Relative difference in error between NBTree and C4.5, and between NBTree and Naive Bayes. Below 1.0 means NBTree is better.
[Bar chart of error ratios (NBTree/C4.5 and NBTree/NB) on the same datasets as the previous slide.]
Interpretability
The resulting structure is relatively easy to interpret.
While NBTrees have complex leaves, there are fewer nodes overall:
Letter: 2109 nodes (C4.5) versus 251 (NBTree)
Adult: 2213 versus 137
DNA: 31 versus 3
LED24: 49 versus 1
Many leaves end up as regular decision tree leaves because they contain a single class.
Summary
NBTree combines decision tree based segmentation of the data with Naive Bayes at the leaves.
Induction time is slower, but the complexity is the same (the constants are bigger).
Scales well: the accuracy is good for large files.
On the three largest files (shuttle, adult, letter), NBTree outperformed both C4.5 and Naive Bayes.