Decision Trees Vibhav Gogate The University of Texas at Dallas
Recap: Supervised Learning
Given: training data with the desired output.
Assumption: there exists a function f which transforms input x into output f(x).
To do: find an approximation to f.
Classification: the output f(x) is discrete.
What makes learning hard? Issues to keep in mind throughout.
Notes: A discrete feature can appear only once (or not at all) along the unique path from the root to a leaf.
Question: Can I test on Humidity with a threshold of 95? YES, because each threshold turns the continuous attribute into a different discrete (binary) feature.
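To make the threshold idea concrete, here is a minimal sketch (not from the lecture) of how candidate thresholds for a continuous attribute such as Humidity are commonly generated: sort the distinct observed values and take midpoints between adjacent values. Each resulting test "Humidity < t" behaves like a separate binary feature, so different thresholds on the same attribute may appear along the same root-to-leaf path.

```python
def candidate_thresholds(values):
    """Midpoints between adjacent distinct values of a continuous attribute.

    Each midpoint t defines a candidate binary test of the form value < t.
    """
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

# Example: observed Humidity values
print(candidate_thresholds([70, 90, 95, 80, 70]))  # [75.0, 85.0, 92.5]
```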
[Figure: a decision tree over continuous features, grown one split at a time; each internal node applies a threshold test such as x2 < 5.]
Can you put a bound on the number of leaf nodes?
The following questions may arise in your mind!
How do we choose the best attribute, i.e., which property to test at a node?
When do we declare a particular node a leaf?
What type of tree should we prefer: smaller, larger, balanced, etc.?
If a leaf node is impure (has both positive and negative examples), what should we do?
What if some attribute value is missing?
Choosing the Best Attribute
Fundamental principle underlying tree creation: simplicity (prefer smaller trees).
Occam's Razor: the simplest model that explains the data should be preferred.
Each node divides the data into subsets.
Heuristic: make each subset as pure as possible.
Choosing the Best Attribute: Information Gain Heuristic
Entropy, denoted by H, is a measure of impurity: H(S) = - sum_c p_c log2 p_c.
Gain = current impurity - new impurity, i.e., the reduction in impurity; we maximize the gain:
Gain(S, A) = H(S) - sum_v (|S_v| / |S|) H(S_v).
The second term is the expected entropy after the split (weigh each bin by the amount of data that falls in it).
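A minimal sketch of the entropy and information-gain computations defined above (this is illustrative code, not the lecture's own implementation; the dict-based example representation is an assumption):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Impurity of a list of class labels, in bits: H = -sum_c p_c log2 p_c."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Reduction in entropy from splitting `examples` on `attribute`.

    `examples` is a list of dicts mapping attribute names to values.
    The expected entropy weighs each branch by the fraction of examples
    routed into it.
    """
    n = len(labels)
    branches = {}
    for x, y in zip(examples, labels):
        branches.setdefault(x[attribute], []).append(y)
    expected = sum(len(ys) / n * entropy(ys) for ys in branches.values())
    return entropy(labels) - expected
```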
[Figure: entropy of a binary label distribution: 1 bit when examples are 50% positive and 50% negative, 0 when all examples are negative or all examples are positive.]
When do I play tennis?
Decision Tree
Is the decision tree correct? Let's check whether the split on the Wind attribute is correct. We need to show that the Wind attribute has the highest information gain.
When do I play tennis?
Wind attribute: 5 records match. Note: calculate the entropy only on the examples that were routed to our branch of the tree (Outlook = Rain).
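As a worked check, assuming the standard PlayTennis data (Mitchell) and reusing the entropy and information_gain helpers from the sketch above: the Outlook = Rain branch contains 5 examples (3 Yes, 2 No), and splitting them on Wind gives two pure children, so Wind has the highest possible gain there.

```python
# The 5 examples routed to Outlook = Rain (values assumed from Mitchell's
# PlayTennis table); only these count when scoring the split.
rain = [
    {"Wind": "Weak",   "Humidity": "High"},    # Yes
    {"Wind": "Weak",   "Humidity": "Normal"},  # Yes
    {"Wind": "Strong", "Humidity": "Normal"},  # No
    {"Wind": "Weak",   "Humidity": "Normal"},  # Yes
    {"Wind": "Strong", "Humidity": "High"},    # No
]
labels = ["Yes", "Yes", "No", "Yes", "No"]

print(entropy(labels))                             # ~0.971 bits before the split
print(information_gain(rain, labels, "Wind"))      # ~0.971 (both children pure)
print(information_gain(rain, labels, "Humidity"))  # ~0.020, so Wind wins
```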
Practical Issues in Decision Tree Learning
Overfitting
When to stop growing a tree?
Handling non-boolean attributes
Handling missing attribute values
Sources of Overfitting
Noise.
A small number of examples associated with each leaf: what if only one example is associated with a leaf? Can you believe it?
Coincidental regularities.
Generalization is the most important criterion: your method should work well on examples you have not seen before.
Avoiding Overfitting
Two approaches:
(1) Stop growing the tree when a data split is not statistically significant.
(2) Grow the tree fully, then post-prune (see the sketch after this slide).
Key issue: what is the correct tree size?
Divide the data into a training set and a validation set (random noise in the two sets might be different).
Apply a statistical test to estimate whether expanding a particular node is likely to produce an improvement beyond the training set.
Add a complexity penalty.
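A rough sketch of the grow-then-post-prune idea, using a held-out validation set (reduced-error pruning). The nested-dict tree representation here is my own assumption, not the lecture's: a leaf is {"label": c}; an internal node is {"attribute": a, "branches": {value: subtree}, "majority": c}.

```python
def predict(tree, x):
    """Follow attribute tests until a leaf is reached."""
    while "label" not in tree:
        tree = tree["branches"][x[tree["attribute"]]]
    return tree["label"]

def accuracy(tree, examples, labels):
    return sum(predict(tree, x) == y for x, y in zip(examples, labels)) / len(labels)

def prune(tree, root, val_x, val_y):
    """Bottom-up: collapse a subtree into its majority-class leaf whenever
    that does not hurt accuracy on the validation set."""
    if "label" in tree:
        return tree
    for v, sub in tree["branches"].items():
        tree["branches"][v] = prune(sub, root, val_x, val_y)
    before = accuracy(root, val_x, val_y)
    saved = dict(tree)                      # remember the internal node
    tree.clear()
    tree["label"] = saved["majority"]       # tentatively replace it by a leaf
    if accuracy(root, val_x, val_y) < before:
        tree.clear()
        tree.update(saved)                  # pruning hurt: restore the subtree
    return tree
```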
Rule Post-Pruning
Induce the decision tree from the full training set (allowing it to overfit).
Convert the decision tree to a set of rules.
Prune each rule by removing any precondition whose removal improves the estimated accuracy (estimate accuracy using a validation set); a rough sketch of this step follows.
Sort the rules by their estimated accuracy.
Classify new instances using the sorted sequence of rules.
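A minimal sketch of the per-rule pruning step, not the lecture's implementation. A rule is represented as (preconditions, predicted_class), where preconditions is a list of (attribute, value) tests; these names are illustrative assumptions.

```python
def rule_matches(preconditions, x):
    return all(x.get(a) == v for a, v in preconditions)

def rule_accuracy(rule, val_x, val_y):
    """Accuracy of the rule on the validation examples it covers."""
    pre, cls = rule
    covered = [(x, y) for x, y in zip(val_x, val_y) if rule_matches(pre, x)]
    if not covered:
        return 0.0
    return sum(y == cls for _, y in covered) / len(covered)

def prune_rule(rule, val_x, val_y):
    """Greedily drop any precondition whose removal improves estimated accuracy."""
    pre, cls = list(rule[0]), rule[1]
    improved = True
    while improved and pre:
        improved = False
        for cond in list(pre):
            shorter = [c for c in pre if c != cond]
            if rule_accuracy((shorter, cls), val_x, val_y) > rule_accuracy((pre, cls), val_x, val_y):
                pre, improved = shorter, True
                break
    return pre, cls
```

The pruned rules would then be sorted by estimated accuracy and applied in that order to classify new instances.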
Handling Missing Values
Some attribute values are missing. Example: patient data; you don't expect blood-test results for everyone.
Options:
Treat the missing value as just another value.
Ignore instances having missing values (problematic, because we are throwing away data).
Assign the most common value of the attribute.
Assign the most common value among examples of the same class; a small sketch of these fill-in strategies appears below.
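A small sketch (assumed, not the lecture's code) of the simple fill-in strategies: impute with the most common value overall, or with the most common value among training examples of the same class.

```python
from collections import Counter

def most_common_value(examples, attribute):
    """Most frequent non-missing value of `attribute` (None marks missing)."""
    values = [x[attribute] for x in examples if x[attribute] is not None]
    return Counter(values).most_common(1)[0][0]

def impute(x, attribute, examples, labels, target_class=None):
    """Fill in x's missing attribute; if target_class is given, use only
    training examples of that class when computing the most common value."""
    if x[attribute] is not None:
        return x[attribute]
    if target_class is not None:
        examples = [e for e, y in zip(examples, labels) if y == target_class]
    return most_common_value(examples, attribute)
```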
Handling Missing Values: Probabilistic approach
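The slide does not spell out the method; a common probabilistic treatment (in the spirit of C4.5's fractional instances) sends a fraction of the example down each branch, weighted by how frequent each attribute value is among the training examples at that node, and combines the resulting leaf predictions. The sketch below assumes the same nested-dict tree representation as the pruning sketch above; the implementation details are my own assumptions.

```python
from collections import Counter, defaultdict

def class_distribution(tree, x, node_examples):
    """Return {class: probability} for instance x, splitting x fractionally
    across branches whenever the tested attribute value is missing (None).

    `node_examples` are the training examples that reached this node; they
    estimate how likely each branch is.
    """
    if "label" in tree:
        return {tree["label"]: 1.0}
    attr = tree["attribute"]
    if x.get(attr) is not None:                      # value observed: follow one branch
        branch_examples = [e for e in node_examples if e[attr] == x[attr]]
        return class_distribution(tree["branches"][x[attr]], x, branch_examples)
    # value missing: weight each branch by its frequency among node_examples
    counts = Counter(e[attr] for e in node_examples if e[attr] is not None)
    total = sum(counts.values())
    dist = defaultdict(float)
    for value, subtree in tree["branches"].items():
        weight = counts[value] / total if total else 1.0 / len(tree["branches"])
        branch_examples = [e for e in node_examples if e[attr] == value]
        for cls, p in class_distribution(subtree, x, branch_examples).items():
            dist[cls] += weight * p
    return dict(dist)
```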
Summary: Decision Trees
Representation
Tree growth: choosing the best attribute
Overfitting and pruning
Special cases: missing attributes and continuous attributes
Many forms in practice: CART, ID3, C4.5