CS480 Introduction to Machine Learning Decision Trees. Edith Law

Frameworks of machine learning
- Supervised Learning (e.g., classification)
- Unsupervised Learning
- Reinforcement Learning 2

Overview What is the idea behind decision trees? What kind of functions are we learning with the decision trees? What is the training and testing procedure for decision trees? What can we do to ensure that the learned decision tree generalizes to future examples? What is the inductive bias of decision trees? What are the pros and cons of decision trees? 3


Prediction is about finding questions that matter
Suppose we are given some data about students' preferences for courses.

student  course type  course time  difficulty  grade  rating
s1       AI           morning      easy        90     like
s1       ML           afternoon    easy        87     like
s2       AI           morning      hard        72     nah
s3       theory       morning      hard        79     nah
s3       systems      evening      hard        85     nah
s4       systems      morning      hard        66     like 5

Prediction is about finding questions that matter
You: Is the course under consideration a ML course?
Me: Yes
You: Has this student taken any other AI courses?
Me: Yes
You: Has this student liked most AI courses?
Me: No
You: I predict this student will not like this course. 6

Prediction is about finding questions that matter
[Decision tree:]
isML?
  no  -> nah
  yes -> takenOtherAI?
           no  -> morning?
                    no  -> like
                    yes -> nah
           yes -> likedOtherAI?
                    no  -> nah
                    yes -> like
You: Is the course under consideration a ML course?
Me: Yes
You: Has this student taken any other AI courses?
Me: Yes
You: Has this student liked most AI courses?
Me: No
You: I predict this student will not like this course. 7

Learning Decision Trees
Given a set of training data in the form of examples (e.g., <user, course> pairs with a rating), construct questions that you can ask. In machine learning language:
- example = a set of feature values, e.g., <AI, morning, easy, 90, like>, where "like" is the label / target class
- question = constructed based on features, e.g., takenOtherAI? grade > 80%? class time?
- answer to the question = determined by the feature values: yes/no, or categorical (e.g., morning, afternoon, evening) 8
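To make the feature/question distinction concrete, here is a small sketch (the names are illustrative, not from the slides) of one example and two questions in Python:

```python
# One training example from the course-rating data, as a feature dict.
example = {"course_type": "AI", "time": "morning",
           "difficulty": "easy", "grade": 90, "label": "like"}

# A question is a function of the feature values that returns an answer.
def grade_above_80(ex):
    return ex["grade"] > 80   # yes/no question on a numeric feature

def class_time(ex):
    return ex["time"]         # categorical question: morning/afternoon/evening

print(grade_above_80(example))  # True
print(class_time(example))      # morning
```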

Learning Decision Trees
Learning is about searching for the best tree to describe the data.
- We could enumerate all possible trees, and evaluate each tree using the training set or test set.
- How many trees are there given 3 features?
- There are too many possible trees! (finding the optimal tree is NP-hard)
It is computationally infeasible to consider all the trees, so decision trees must be built greedily by asking:
- If I could ask one question, what question would I ask?
- What is the question that would be most helpful in guessing whether the student will enjoy the course?
Each node represents a question that splits the data; so learning a decision tree amounts to choosing what the internal nodes should be. 9

Learning Decision Trees
These questions can take many forms:
- radius > 17.5
- radius in [12, 18]
- grade is in {A, B, C, D, F}
- grade >= B
- color is RED
- 2*radius - 3*texture > 16 10

Decision Tree 11

Decision Tree [figure contrasting an uninformative split with an informative split] 12

Overview What is the idea behind decision trees? What kind of functions are we learning with the decision trees? What is the training and testing procedure for decision trees? What can we do to ensure that the learned decision tree generalizes to future examples? What is the inductive bias of decision trees? What are the pros and cons of decision trees? 13

Supervised Learning
Problem Setting:
- Set of possible instances X
- Unknown target function f : X -> Y
- Set of function hypotheses H = { h | h : X -> Y }
The learning algorithm:
- input: training examples (x_i, y_i)
- output: hypothesis h in H that best approximates the target function f
The set of all hypotheses that can be output by a learning algorithm is called the hypothesis space. 14

Decision Tree Learning
Problem Setting:
- Set of possible instances X: each instance is a feature vector
- Unknown target function f : X -> Y: y = 1 if a student likes the course, y = 0 if not
- Set of function hypotheses H = { h | h : X -> Y }: each hypothesis h is a decision tree
The learning algorithm:
- input: training examples (x_i, y_i)
- output: hypothesis h in H that best approximates the target function f 15

Interpreting Decision Trees
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak) 16

Another Example: Cancer Recurrence Prediction
The Wisconsin Breast Cancer (Prognosis) dataset consists of 30 features of the cancer cell nuclei: 10 base features (radius, texture, perimeter, smoothness, concavity, etc.), each summarized by its mean, standard error, and max. For each example, the outcome is N = no recurrence or R = recurrence.

radius  texture  perimeter  outcome
18.02   27.60    117.5      N
17.99   10.38    112.8      N
23.51   24.27    155.1      R 17

Example: Cancer Recurrence Prediction
What does a node represent? A partitioning of the input space.
- internal nodes: a test or question (discrete features branch on all values; real features branch on a threshold value)
- leaf nodes: include the examples that satisfy the tests along the branch
R = recurrence, N = no recurrence
Each example falls in precisely one leaf; each leaf typically contains more than one example. 18

Interpreting Decision Trees
One can always convert a decision tree into an equivalent set of if-then rules. (R = recurrence, N = no recurrence) 19

Interpreting Decision Trees
One can always convert a decision tree into an equivalent set of if-then rules, as well as calculate an estimated probability of recurrence. (R = recurrence, N = no recurrence) 20

Interpreting Decision Trees
[Figure: a decision tree with threshold tests on x1 and x2 (thresholds θ1, θ2, θ3, θ4) partitions the (x1, x2) plane into axis-aligned rectangular regions A through E.] 21

Interpreting Decision Trees 22

Interpreting Decision Trees Ishwaran H. and Rao J.S. (2009) 23

Which kinds of functions can decision trees express? For decision trees, the hypothesis space is the set of all possible finite discrete functions (i.e., functions whose output is a finite set of categories) that can be learned based on the data. Every finite discrete function can be represented by some decision tree. 24

Which kinds of functions can decision trees express?
Any boolean function can be fully expressed:
- each entry in the truth table can be one path (very inefficient!)
- most boolean functions can be encoded more compactly.
Some functions are harder to encode:
- parity function: returns 1 iff an even number of inputs are 1; an exponentially big decision tree, O(2^M), would be needed
- majority function: returns 1 iff more than half the inputs are 1.
Many other functions can be approximated by a boolean function.
With real-valued features, decision trees are good at problems in which the class label is constant in large connected axis-orthogonal regions of the input space. 25

Decision Boundaries for Real-Valued Features 26

Overview What is the idea behind decision trees? What kind of functions are we learning with the decision trees? What is the training and testing procedure for decision trees? What can we do to ensure that the learned decision tree generalizes to future examples? What is the inductive bias of decision trees? What are the pros and cons of decision trees? 27

Example: Cancer Outcome Prediction
Suppose we get a new instance: radius = 16, texture = 12. How do we classify it? Simple procedure (R = recurrence, N = no recurrence):
- at every node, test the corresponding attribute
- follow the appropriate branch of the tree
- at a leaf, either predict the class of the majority of the examples at that leaf, or sample from the probabilities of the two classes. 28

Decision Tree Testing Algorithm
DecisionTreeTest(tree, test point)
  if tree is of the form LEAF(guess) then
    return guess
  else if tree is of the form NODE(f, left, right) then
    if f = no in test point then
      return DecisionTreeTest(left, test point)
    else
      return DecisionTreeTest(right, test point)
    end if
  end if 29
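The testing pseudocode translates almost line-for-line into Python. A minimal sketch, assuming trees are encoded as nested tuples and feature values are "yes"/"no" strings (the example tree structure is assumed, following the course-rating sketch earlier):

```python
# Trees: ("LEAF", guess) or ("NODE", feature, left, right).
# A test point is a dict mapping feature names to "yes"/"no".

def decision_tree_test(tree, point):
    if tree[0] == "LEAF":
        return tree[1]                               # return the stored guess
    _, f, left, right = tree
    if point[f] == "no":
        return decision_tree_test(left, point)       # follow the "no" branch
    return decision_tree_test(right, point)          # follow the "yes" branch

# A simplified version of the isML? / takenOtherAI? tree (structure assumed).
tree = ("NODE", "isML",
        ("LEAF", "nah"),
        ("NODE", "takenOtherAI",
         ("LEAF", "like"),
         ("LEAF", "nah")))
print(decision_tree_test(tree, {"isML": "yes", "takenOtherAI": "yes"}))  # nah
```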

Learning Decision Trees
Most algorithms developed for learning decision trees are variations on a core algorithm that employs a recursive, top-down procedure to grow a tree (possibly until it classifies all training data correctly), e.g., ID3 (Quinlan, 1986) and C4.5 (Quinlan, 1993). 30

Decision Tree Training Algorithm
Given a set of labeled training instances:
1. If all the training instances have the same class, create a leaf with that class label and exit. Else:
2. Pick the best test to split the data on.
3. Split the training set according to the value of the outcome of the test.
4. Recursively repeat steps 1-3 on each subset of the training data. 31

Decision Tree Training Algorithm
DecisionTreeTrain(data, remaining features)
  guess <- most frequent answer in data
  if the labels in data are unambiguous then
    return LEAF(guess)
  else if remaining features is empty then
    return LEAF(guess)
  else
    for all f in remaining features do
      NO <- the subset of data on which f = no
      YES <- the subset of data on which f = yes
      score(f) <- # of majority-vote answers in NO + # of majority-vote answers in YES
    end for
    f <- the feature with maximal score(f)
    NO <- the subset of data on which f = no
    YES <- the subset of data on which f = yes
    left <- DecisionTreeTrain(NO, remaining features \ {f})
    right <- DecisionTreeTrain(YES, remaining features \ {f})
    return NODE(f, left, right)
  end if 32
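A direct Python transcription of DecisionTreeTrain, under the same tuple encoding as the testing sketch; the scoring rule is the majority-vote count from the pseudocode, and the tiny dataset is invented for illustration:

```python
from collections import Counter

def decision_tree_train(data, features):
    """data: list of (feature_dict, label); features: list of binary feature names."""
    labels = [y for _, y in data]
    guess = Counter(labels).most_common(1)[0][0]      # most frequent answer
    if len(set(labels)) == 1 or not features:         # unambiguous, or nothing left
        return ("LEAF", guess)

    def score(f):
        no = [y for x, y in data if x[f] == "no"]
        yes = [y for x, y in data if x[f] == "yes"]
        # number of points we would get right by majority vote in each child
        return (Counter(no).most_common(1)[0][1] if no else 0) + \
               (Counter(yes).most_common(1)[0][1] if yes else 0)

    f = max(features, key=score)
    no_data = [(x, y) for x, y in data if x[f] == "no"]
    yes_data = [(x, y) for x, y in data if x[f] == "yes"]
    rest = [g for g in features if g != f]
    return ("NODE", f,
            decision_tree_train(no_data, rest) if no_data else ("LEAF", guess),
            decision_tree_train(yes_data, rest) if yes_data else ("LEAF", guess))

# Toy data: the label follows feature "a" exactly, so "a" becomes the root.
data = [({"a": "yes", "b": "no"}, "like"),
        ({"a": "no", "b": "no"}, "nah"),
        ({"a": "yes", "b": "yes"}, "like"),
        ({"a": "no", "b": "yes"}, "nah")]
print(decision_tree_train(data, ["a", "b"]))
# ('NODE', 'a', ('LEAF', 'nah'), ('LEAF', 'like'))
```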

What is a good test?
The test should provide information about the class label. e.g., you are given 40 examples: 30 positives, 10 negatives. Consider two tests, T1 and T2, that would split the examples as follows: T1 into (20+, 10-) and (10+, 0-); T2 into (15+, 7-) and (15+, 3-). Which is best? Intuitively, we prefer an attribute that separates the training instances as well as possible. How would we quantify this mathematically? 33

Notion of information
Consider three cases: a fair die, a fair coin, a biased coin. There are different amounts of uncertainty in the observed outcomes. 34

Information Content
Let E be an event that occurs with probability P(E). If we are told that E has occurred with certainty, then we receive I(E) bits of information, where
I(E) = log2(1 / P(E))
You can also think of information as the amount of surprise in the outcome. For example, if P(E) = 1, then I(E) = 0. A fair coin flip provides log2 2 = 1 bit of information; a fair die roll provides log2 6 = 2.58 bits of information. 35
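The coin and die examples can be checked numerically; a minimal sketch:

```python
import math

# Information content I(E) = log2(1 / P(E)).
def info(p):
    return math.log2(1.0 / p)

print(info(1/2))   # fair coin flip: 1.0 bit
print(info(1/6))   # fair die roll: ~2.58 bits
print(info(1.0))   # a certain event carries 0 bits of information
```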

Information Content
[Figure: probabilities P(x_i) with which each letter x_i appears in English text, and the corresponding information content I(x_i).] The lower the probability, the higher the information content / surprise. 36

Entropy
Given an information source S which emits k symbols from an alphabet {s1, ..., sk} with probabilities {p1, ..., pk}, where each emission is independent of the others: what is the average amount of information we expect from the output of S?
H(S) = Σ_{i=1}^{k} p_i I(s_i) = Σ_{i=1}^{k} p_i log2(1 / p_i) = - Σ_{i=1}^{k} p_i log2 p_i
H(S) is the entropy of S. 37

Entropy
H(S) = Σ_i p_i log2(1 / p_i)
Several ways to think about entropy:
- average amount of information per symbol
- average amount of surprise when observing the symbol
- uncertainty the observer has before seeing the symbol
- average number of bits needed to communicate the symbols 38

Binary Classification
We try to classify a sample of the data S using a decision tree. Suppose we have p positive samples and n negative samples. What is the entropy of this dataset?
H(S) = - (p / (p + n)) log2 (p / (p + n)) - (n / (p + n)) log2 (n / (p + n)) 39

Binary Classification
e.g., you are given 40 examples: 30 positives, 10 negatives.
H(S) = - (p / (p + n)) log2 (p / (p + n)) - (n / (p + n)) log2 (n / (p + n))
     = - (3/4) log2 (3/4) - (1/4) log2 (1/4) = 0.811 40
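The 0.811 figure is easy to verify; a small sketch of the binary entropy function:

```python
import math

# Binary entropy of a sample with p positives and n negatives.
def entropy(p, n):
    total = p + n
    h = 0.0
    for k in (p, n):
        if k:                       # 0 * log 0 is taken to be 0
            q = k / total
            h -= q * math.log2(q)
    return h

print(round(entropy(30, 10), 3))    # 0.811
```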

Binary Classification
H(S) = - (p / (p + n)) log2 (p / (p + n)) - (n / (p + n)) log2 (n / (p + n))
Entropy measures the impurity of S:
- entropy is 0 if all members of S belong to the same class
- entropy is 1 if there is an equal number of positive and negative examples. 41

Conditional Entropy
The conditional entropy, H(y | x), is the average specific conditional entropy of y given the values of x:
H(y | x) = Σ_v P(x = v) H(y | x = v)
Interpretation: the expected number of bits needed to transmit y if both the emitter and receiver know the possible values of x (but before they are told x's specific value). 42

What is a good test?
A good test/question should provide information about the class label. e.g., you are given 40 examples: 30 positives, 10 negatives, split by T1 into (20+, 10-) and (10+, 0-), and by T2 into (15+, 7-) and (15+, 3-).
H(S) = - (3/4) log2 (3/4) - (1/4) log2 (1/4) = 0.811
H(S | T1) = (30/40) [- (20/30) log2 (20/30) - (10/30) log2 (10/30)] + (10/40) [0] = 0.688
H(S | T2) = (22/40) [- (15/22) log2 (15/22) - (7/22) log2 (7/22)] + (18/40) [- (15/18) log2 (15/18) - (3/18) log2 (3/18)] = 0.788 43
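The same arithmetic can be written as a weighted average over branches (branch counts are the T1/T2 splits from the slide; the exact values are 0.6887 and 0.7889, which the slide truncates to 0.688 and 0.788):

```python
import math

def entropy(p, n):
    h = 0.0
    for k in (p, n):
        if k:
            q = k / (p + n)
            h -= q * math.log2(q)
    return h

# Conditional entropy H(S | test): weighted average of branch entropies.
def cond_entropy(branches, total):
    return sum((p + n) / total * entropy(p, n) for p, n in branches)

print(round(cond_entropy([(20, 10), (10, 0)], 40), 2))  # T1: 0.69
print(round(cond_entropy([(15, 7), (15, 3)], 40), 2))   # T2: 0.79
```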

Information Gain
Suppose there is a feature called x. The information gain is the reduction in entropy that would be obtained by knowing the value of x:
IG(S, x) = H(S) - H(S | x)
Equivalently, suppose one has to transmit y: how many bits (on average) would it save if both the transmitter and receiver knew x? 44

Information Gain
A good test/question should provide information about the class label. e.g., you are given 40 examples: 30 positives, 10 negatives. Which test/question, T1 or T2, gives the higher information gain? 45

Information Gain
Which attribute is the best classifier?
S = {9+, 5-}, H(S) = 0.94
humidity: high -> S = {3+, 4-}, H(S|high) = 0.985; normal -> S = {6+, 1-}, H(S|normal) = 0.592
wind: weak -> S = {6+, 2-}, H(S|weak) = 0.811; strong -> S = {3+, 3-}, H(S|strong) = 1.00 46

Information Gain
Which attribute is the best classifier?
S = {9+, 5-}, H(S) = 0.94
humidity: high -> {3+, 4-}, H = 0.985; normal -> {6+, 1-}, H = 0.592
wind: weak -> {6+, 2-}, H = 0.811; strong -> {3+, 3-}, H = 1.00
Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151
Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.0) = 0.048 47
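The humidity/wind gains can be recomputed from the raw counts (computing the entropies exactly rather than from the slide's rounded intermediates, which shifts the humidity gain from 0.151 to about 0.152):

```python
import math

def entropy(p, n):
    h = 0.0
    for k in (p, n):
        if k:
            q = k / (p + n)
            h -= q * math.log2(q)
    return h

# Information gain for a test: root has p positives and n negatives,
# branches is a list of (positives, negatives) per branch.
def gain(branches, p, n):
    total = p + n
    return entropy(p, n) - sum((bp + bn) / total * entropy(bp, bn)
                               for bp, bn in branches)

print(round(gain([(3, 4), (6, 1)], 9, 5), 2))  # humidity: 0.15
print(round(gain([(6, 2), (3, 3)], 9, 5), 2))  # wind: 0.05
```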

Decision Tree Training Algorithm
Given a set of labeled training instances:
1. If all the training instances have the same class, create a leaf with that class label and exit. Else:
2. Pick the best test to split the data on.
3. Split the training set according to the value of the outcome of the test.
4. Recursively repeat steps 1-3 on each subset of the training data.
How do we pick the best test? For classification, choose the test with the highest information gain; for regression, choose the test with the lowest mean-squared error. 48

Decision Tree Training as Search
We can think of decision tree learning as searching in a space of hypotheses for one that fits the training examples. The hypothesis space searched by the algorithm is the set of possible trees. It begins with an empty tree, then considers progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data. 49

Example: ID3
ID3 is a top-down, greedy search algorithm for decision trees. It:
- maintains only a single current hypothesis as it searches, and does not do any backtracking
- has no ability to determine alternative decision trees that are consistent with the data, and so runs the risk of converging to a locally optimal solution
- handles noisy data by accepting hypotheses that imperfectly fit the data. 50

Special Cases
What if the outcome of the test is not binary? The number of possible values influences the information gain: the more possible values, the higher the gain, even though the attribute could nonetheless be irrelevant. We could transform the attribute into one (or many) binary attributes. C4.5 (the most popular decision tree construction algorithm) uses only binary tests: attribute = value (discrete) or attribute < value (continuous). 51
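One way to turn a k-valued attribute into k binary "attribute = value" tests, in the C4.5 style described above, is a one-hot-style transformation; a sketch (the helper name and data are made up):

```python
# Replace a k-valued attribute with k binary "attribute=value" features.
def binarize(examples, attribute, values):
    out = []
    for ex in examples:
        b = dict(ex)
        for v in values:
            b[f"{attribute}={v}"] = "yes" if ex[attribute] == v else "no"
        del b[attribute]            # drop the original multi-valued attribute
        out.append(b)
    return out

ex = [{"time": "morning"}, {"time": "evening"}]
print(binarize(ex, "time", ["morning", "afternoon", "evening"])[0])
# {'time=morning': 'yes', 'time=afternoon': 'no', 'time=evening': 'no'}
```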

Special Cases
Suppose feature j is real-valued. How do we choose a finite set of thresholds for tests of the form x_j > c?
- choose midpoints of the observed data values x_{1,j}, ..., x_{m,j}
- choose midpoints of the observed data values with different y values.
It can be shown (Fayyad, 1992) that the value of the threshold that maximizes information gain must lie at the boundary between adjacent examples (in the sorted list) that differ in their target classification. 52
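Fayyad's observation gives a simple recipe for candidate thresholds; a sketch with invented data:

```python
# Candidate thresholds for a real-valued feature: midpoints between
# consecutive sorted values whose class labels differ.
def candidate_thresholds(values, labels):
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

radius = [14.0, 15.0, 16.0, 18.0]
outcome = ["N", "N", "R", "R"]
print(candidate_thresholds(radius, outcome))   # [15.5]
```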

Overview What is the idea behind decision trees? What kind of functions are we learning with the decision trees? What is the training and testing procedure for decision trees? What can we do to ensure that the learned decision tree generalizes to future examples? What is the inductive bias of decision trees? What are the pros and cons of decision trees? 53

Longer Trees = Worse Performance
Decision tree construction proceeds until all leaves are pure (i.e., all examples in a leaf are from the same class). As the tree grows, the performance on the test set (generalization performance) can start to degrade. 54

Overfitting
We say that a hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (including instances beyond the training set).
Definition: Given a hypothesis space H, a hypothesis h in H is said to overfit the training data if there exists some alternative hypothesis h' in H, such that h has a smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.
Why does this happen? How can it be possible for the tree h to fit the training examples better than h', but for it to perform more poorly over subsequent examples? 55

Overfitting
This can happen:
- when the training examples contain random errors or noise
- when the training data is noise free, but there are coincidental regularities in the dataset, e.g., some irrelevant attribute happens to partition the examples very well, despite being unrelated to the actual target function.
In one study (Mingers, 1989b), overfitting was found to decrease the accuracy of the learned decision trees by 10-25% on the problems examined. 56

Overfitting: How to Avoid
Remove some nodes to get better generalization!
- Early stopping: stop growing the tree when further splitting the data does not improve performance on the validation set.
- Post-pruning: grow a full tree, then prune it by eliminating lower nodes that have low information gain on the validation set.
In general, post-pruning is better: it allows you to deal with cases where a single attribute is not informative by itself, but a combination of attributes is informative. 57

Early Stopping: Criteria
- maximum depth exceeded
- maximum running time exceeded
- all children nodes are sufficiently homogeneous
- all children nodes have too few training examples
- the reduction in cost (e.g., estimated by cross-validation) is small 58

Post-Pruning (or Reduced Error Pruning)
- Split the data set into a training set and a validation set.
- Grow a large tree (e.g., until each leaf is pure).
- For each node: evaluate the validation set accuracy of pruning the subtree rooted at that node.
- Greedily remove the node (with its corresponding subtree) whose removal most improves validation set accuracy, replacing it by a leaf with the majority class of the corresponding examples (i.e., the most common classification of the training examples affiliated with that node).
- Stop when pruning starts hurting accuracy on the validation set. 59
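The pruning loop above can be sketched recursively: each internal node stores the majority label of its training examples, and a subtree is collapsed to a leaf whenever the leaf does at least as well on the validation examples that reach it. The tree encoding and data below are illustrative, not from the slides:

```python
# Nodes: ("NODE", feature, left, right, majority_label); leaves: ("LEAF", label).
def classify(tree, point):
    if tree[0] == "LEAF":
        return tree[1]
    _, f, left, right, _ = tree
    return classify(left if point[f] == "no" else right, point)

def prune(tree, val):
    """Prune bottom-up, using the validation examples that reach this subtree."""
    if tree[0] == "LEAF":
        return tree
    _, f, left, right, maj = tree
    left = prune(left, [(x, y) for x, y in val if x[f] == "no"])
    right = prune(right, [(x, y) for x, y in val if x[f] == "yes"])
    kept = ("NODE", f, left, right, maj)
    if not val:
        return kept
    leaf_correct = sum(y == maj for _, y in val)              # if collapsed to a leaf
    tree_correct = sum(classify(kept, x) == y for x, y in val)
    return ("LEAF", maj) if leaf_correct >= tree_correct else kept

# The split on b is a coincidental regularity: on validation data, every
# a=yes example is "like" regardless of b, so the b subtree gets pruned.
tree = ("NODE", "a",
        ("LEAF", "nah"),
        ("NODE", "b", ("LEAF", "like"), ("LEAF", "nah"), "like"),
        "like")
val = [({"a": "yes", "b": "yes"}, "like"),
       ({"a": "yes", "b": "no"}, "like"),
       ({"a": "no", "b": "no"}, "nah")]
print(prune(tree, val))
# ('NODE', 'a', ('LEAF', 'nah'), ('LEAF', 'like'), 'like')
```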

Pruning: Effects
Any leaf nodes added due to coincidental regularities are likely to be pruned, because the same regularities are unlikely to appear in the validation set. 60

Overview What is the idea behind decision trees? What kind of functions are we learning with the decision trees? What is the training and testing procedure for decision trees? What can we do to ensure that the learned decision tree generalizes to future examples? What is the inductive bias of decision trees? What are the pros and cons of decision trees? 61

Inductive Bias 62

Inductive Bias
[Figure: alternating patterns of A and B labels.] Which type of solution are we more likely to prefer in the absence of data that narrows down the relevant concept? 63

Inductive Bias
There are many trees consistent with the examples. How do greedy decision tree algorithms (e.g., ID3) choose?
- They go from simple to complex.
- They select trees that place attributes with higher information gain closer to the root.
- They prefer to make decisions by looking at as few features as possible.
So, for decision trees, the inductive biases are:
- Occam's razor: prefer the simplest hypothesis that fits the data; shorter trees are better than longer trees.
- Trees that place high-information-gain attributes closer to the root are preferred over those that do not. 64

Inductive Bias
The inductive bias is the set of assumptions that the learner makes about the target function (i.e., the function that maps input to output) that enables it to generalize to future instances.
- Restriction bias: limit the hypothesis space (e.g., linear regression models)
- Preference bias: impose an ordering on the hypothesis space (e.g., decision trees) 65

Overview What is the idea behind decision trees? What kind of functions are we learning with the decision trees? What is the training and testing procedure for decision trees? What can we do to ensure that the learned decision tree generalizes to future examples? What is the inductive bias of decision trees? What are the pros and cons of decision trees? 66

Advantages of Decision Trees
- provide a general representation of classification rules
- the learned function is easy to interpret
- fast learning algorithms
- good accuracy in practice, with many applications in industry 67

Limitations of Decision Trees
- Sensitivity: the exact tree output may be sensitive to small changes in the data; with many features, tests may not be meaningful.
- Good for learning (non-linear) piecewise axis-orthogonal decision boundaries, but not for learning functions with smooth, curvilinear boundaries. 68

What you should know
- How to use a decision tree to classify a new example
- How to build a decision tree using an information-theoretic approach
- How to detect (and fix) overfitting in decision trees
- How to handle both discrete and continuous attributes and outputs
- What inductive bias means 69