Decision Tree. Machine Learning. Hamid Beigy. Sharif University of Technology. Fall 1396

Table of contents

1. Introduction
2. Decision tree classification
3. Building decision trees
4. ID3 Algorithm

Introduction

1. The decision tree is a classic and natural model of learning.
2. It is closely related to the notion of divide and conquer.
3. A decision tree partitions the instance space into axis-parallel regions, each labeled with a class value.
4. Why decision trees?
   - Interpretable; popular in medical applications because they mimic the way a doctor thinks.
   - Can model discrete outcomes nicely.
   - Can be very powerful and as complex as you need them.
   - C4.5 and CART decision trees are very popular.

Decision tree classification

1. Structure of decision trees
   - Each internal node tests an attribute.
   - Each branch corresponds to an attribute value.
   - Each leaf node assigns a classification.
2. Decision tree for PlayTennis
   [Figure: the root tests Outlook [9+,5-]. Sunny [2+,3-] leads to a Humidity test (High -> No [0+,3-], Normal -> Yes [2+,0-]); Overcast leads directly to Yes [4+,0-]; Rain [3+,2-] leads to a Wind test (Strong -> No [0+,2-], Light -> Yes [3+,0-]).]
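To make this structure concrete, here is a minimal sketch (mine, not from the slides) that stores the PlayTennis tree as nested tuples and classifies an example; the representation and names are illustrative assumptions.

```python
# Internal nodes are (attribute, {value: subtree}) pairs; leaves are labels.
play_tennis_tree = (
    "Outlook",
    {
        "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
        "Overcast": "Yes",
        "Rain": ("Wind", {"Strong": "No", "Light": "Yes"}),
    },
)

def classify(tree, example):
    """Walk from the root to a leaf, following the branch that matches
    the example's value for each tested attribute."""
    while isinstance(tree, tuple):      # still at an internal node
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree                         # a leaf, i.e., a class label

# A Sunny/Hot/High/Light day is classified No (don't play).
print(classify(play_tennis_tree, {"Outlook": "Sunny", "Temperature": "Hot",
                                  "Humidity": "High", "Wind": "Light"}))
```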

Building decision trees

The idea of binary classification trees is not unlike that of the histogram classifier: partition the feature space into cells and assign a label to each cell.

[Figure 9.2: decision surfaces of (a) a histogram classifier; (b) a linear classifier; (c) a tree classifier.]

Building decision trees (growing trees)

The growing process is based on recursively subdividing the feature space. Usually the subdivisions split an existing region into two smaller regions (i.e., binary splits), and for simplicity the splits are perpendicular to one of the feature axes. [Figure 9.3: growing a recursive binary tree on X = [0,1]^2.]

Often the splitting process is based on the training data and is designed to separate data with different labels as much as possible. In such constructions the splits, and hence the tree structure itself, are data dependent, which causes major difficulties for the analysis (and tuning) of these methods. Alternatively, the splitting and subdivision can be chosen independently of the training data; this approach is more amenable to analysis, as in Dyadic Decision Trees and Recursive Dyadic Partitions (Figure 9.4). A common strategy is to first grow a very large tree, which may have poor generalization characteristics, and then prune it to avoid overfitting.

Building decision trees (trees and partitions)

How do trees relate to partitions? Any decision tree can be associated with a partition of the input space X, and vice versa. In particular, a Recursive Dyadic Partition (RDP) can be associated with a binary tree; in fact, this is the most efficient way of describing an RDP. Each leaf of the tree corresponds to a cell of the partition, and the internal nodes correspond to the partition cells generated during the construction of the tree. The orientation of the dyadic split alternates between the levels of the tree (at the root level the split is along the horizontal axis, at the next level along the vertical axis, and so on). The tree is called dyadic because cells are always split at the midpoint along one coordinate axis, so the side lengths of all cells are dyadic (i.e., powers of 2).

1. Decision trees recursively subdivide the feature space.
2. The test variable specifies the division.

[Figure 9.4: example of growing a Recursive Dyadic Partition on X = [0,1]^2.]

Building decision trees (example)

Training examples for the concept PlayTennis:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
1    Sunny     Hot          High      Light   No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Light   Yes
4    Rain      Mild         High      Light   Yes
5    Rain      Cool         Normal    Light   Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Light   No
9    Sunny     Cool         Normal    Light   Yes
10   Rain      Mild         Normal    Light   Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Light   Yes
14   Rain      Mild         High      Strong  No

How will ID3 construct a decision tree? Build the tree using Gain(·).

Building decision trees (cont.)

How to build a decision tree?
1. Start at the top of the tree.
2. Grow it by splitting attributes one by one.
3. Assign leaf nodes.
4. When we get to the bottom, prune the tree to prevent overfitting.

How to choose a test variable for an internal node? Choosing different impurity measures results in different algorithms; we describe ID3.

[Figure: node impurity measures for two-class classification as a function of the proportion p in class 2: Gini index, entropy (scaled to pass through (0.5, 0.5)), and misclassification error.]
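As a quick illustration (a sketch of mine, not from the slides), the three standard two-class impurity measures from the figure can be computed as functions of the class-2 proportion p:

```python
import math

def gini(p):
    """Gini index for a two-class node: 2p(1-p)."""
    return 2 * p * (1 - p)

def entropy(p):
    """Entropy in bits: -p log2 p - (1-p) log2 (1-p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def misclassification(p):
    """Misclassification error: 1 - max(p, 1-p)."""
    return 1 - max(p, 1 - p)

# All three vanish for pure nodes (p = 0 or 1) and peak at p = 0.5.
for p in (0.0, 0.25, 0.5):
    print(p, gini(p), entropy(p), misclassification(p))
```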

Building decision trees (cont.)

ID3 uses information gain to choose a test variable for an internal node. The information gain of D relative to attribute A is the expected reduction in entropy due to splitting on A:

Gain(D, A) = H(D) - \sum_{v \in values(A)} \frac{|D_v|}{|D|} H(D_v)

where D_v = \{x \in D : x.A = v\} is the set of examples in D where attribute A has value v.
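The definition translates directly into code. Below is a minimal sketch (mine; the dataset layout, a list of dicts with a "label" key, is an assumption for illustration):

```python
import math
from collections import Counter

def entropy(examples):
    """H(D) = -sum_c p_c log2 p_c over the class labels in D."""
    n = len(examples)
    counts = Counter(x["label"] for x in examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(examples, attribute):
    """Gain(D, A) = H(D) - sum_v |D_v|/|D| * H(D_v)."""
    n = len(examples)
    remainder = 0.0
    for v in {x[attribute] for x in examples}:
        subset = [x for x in examples if x[attribute] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder

data = [{"Wind": "Strong", "label": "No"}, {"Wind": "Light", "label": "Yes"},
        {"Wind": "Light", "label": "Yes"}, {"Wind": "Strong", "label": "Yes"}]
print(gain(data, "Wind"))   # 0.311 bits of entropy removed by the Wind test
```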

ID3 Algorithm (example)

For the PlayTennis training examples above:

H(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94 bits
H(D_{Humidity=High}) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985 bits
H(D_{Humidity=Normal}) = -(6/7) log2(6/7) - (1/7) log2(1/7) = 0.592 bits

Gain(D, Humidity) = 0.94 - (7/14)(0.985) - (7/14)(0.592) = 0.151 bits
Gain(D, Wind) = 0.94 - (8/14)(0.811) - (6/14)(1.0) = 0.048 bits

where 0.811 is the entropy of the eight Wind = Light examples (six positive, two negative).
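These numbers can be verified in a few lines (a standalone sketch of mine):

```python
import math

def h(p, n):
    """Entropy in bits of a node with p positive and n negative examples."""
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c)

print(round(h(9, 5), 3))                                  # H(D) = 0.94
print(round(h(3, 4), 3), round(h(6, 1), 3))               # Humidity: 0.985 and 0.592
print(round(h(9, 5) - 7/14*h(3, 4) - 7/14*h(6, 1), 3))    # 0.152 (slide rounds to 0.151)
print(round(h(9, 5) - 8/14*h(6, 2) - 6/14*h(3, 3), 3))    # Gain(D, Wind) = 0.048
```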

ID3 Algorithm (choosing the root)

Evaluating the information gain of each attribute on the full dataset:

Gain(D, Humidity) = 0.151 bits
Gain(D, Wind) = 0.048 bits
Gain(D, Temperature) = 0.029 bits
Gain(D, Outlook) = 0.246 bits

Outlook has the highest gain, so it becomes the root of the tree [9+,5-], with branches Sunny [2+,3-], Overcast [4+,0-], and Rain [3+,2-].

ID3 Algorithm (growing the subtrees)

The Overcast branch [4+,0-] is pure and becomes a Yes leaf. For the Sunny branch:

Gain(D_Sunny, Humidity) = 0.97 bits
Gain(D_Sunny, Wind) = 0.02 bits
Gain(D_Sunny, Temperature) = 0.57 bits

so Humidity is tested under Sunny (High -> No [0+,3-], Normal -> Yes [2+,0-]); similarly, Wind is tested under Rain (Strong -> No [0+,2-], Light -> Yes [3+,0-]), yielding the PlayTennis tree shown earlier.

Incremental learning of decision trees: ID3 cannot be trained incrementally; ID4, ID5, and ID5R are examples of incremental induction of decision trees.

Reading: P. E. Utgoff, "Incremental Induction of Decision Trees", Machine Learning, Vol. 4, pp. 161-186, 1989.
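Putting the pieces together, here is a compact sketch of the full ID3 recursion (my own simplification: no pruning and no missing-value handling; examples are dicts with a "label" key):

```python
import math
from collections import Counter

def entropy(examples):
    n = len(examples)
    counts = Counter(x["label"] for x in examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(examples, attribute):
    n = len(examples)
    rem = 0.0
    for v in {x[attribute] for x in examples}:
        subset = [x for x in examples if x[attribute] == v]
        rem += len(subset) / n * entropy(subset)
    return entropy(examples) - rem

def id3(examples, attributes):
    """Return a leaf label or an (attribute, {value: subtree}) node."""
    labels = {x["label"] for x in examples}
    if len(labels) == 1:                 # pure node -> leaf
        return labels.pop()
    if not attributes:                   # no tests left -> majority leaf
        return Counter(x["label"] for x in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))
    branches = {}
    for v in {x[best] for x in examples}:
        subset = [x for x in examples if x[best] == v]
        branches[v] = id3(subset, [a for a in attributes if a != best])
    return (best, branches)

# Usage: tree = id3(examples, ["Outlook", "Temperature", "Humidity", "Wind"])
```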

Inductive Bias in ID3

Types of biases:
1. Preference (search) bias: puts a priority on choosing among hypotheses.
2. Language bias: puts a restriction on the set of hypotheses considered.

Which bias is better?
1. Preference bias is more desirable,
2. because the learner works within a complete hypothesis space that is assured to contain the unknown concept.

Inductive bias of ID3:
1. Shorter trees are preferred over longer trees.
2. Occam's razor: prefer the simplest hypothesis that fits the data.
3. Trees that place high-information-gain attributes close to the root are preferred over those that do not.

Overfitting in ID3

How can we avoid overfitting?
1. Prevention
   - Stop training (growing) before the tree reaches the point where it overfits.
   - Select attributes that are relevant (i.e., will be useful in the decision tree); requires some predictive measure of relevance.
2. Avoidance
   - Allow the tree to overfit, then improve its generalization capability.
   - Hold out a validation set.
3. Detection and recovery
   - Let the problem happen, detect when it does, and recover afterward.
   - Build the model, then remove (prune) the elements that contribute to overfitting.

How to select the best tree?
1. Training and validation sets: use a separate set of examples (distinct from the training set) for evaluation.
2. Statistical test: use all data for training, but apply a statistical test to estimate the overfitting.
3. Complexity measure: define a measure of model complexity and halt growing when this measure is minimized.

Pruning algorithms

Reduced-error pruning.

[Figure: accuracy as a function of the number of nodes in the tree; training error keeps decreasing while the true error eventually increases.]

Pruning algorithms (reduced-error pruning)

A cross-validation approach using separate training and validation sets. Pruning a node replaces the subtree rooted at it with a leaf carrying the majority label of the associated examples.

Reduced-Error-Pruning(D)
  Partition D into D_train (training/growing) and D_validation (validation/pruning)
  Build tree T using ID3 on D_train
  UNTIL accuracy on D_validation decreases DO
    FOR each non-leaf node candidate in T
      Temp[candidate] <- Prune(T, candidate)
      Accuracy[candidate] <- Test(Temp[candidate], D_validation)
    T <- Temp with the best value of Accuracy (best increase; greedy)
  RETURN (pruned) T

[Figure: the effect of reduced-error pruning on error, showing accuracy on the training data, on the test data, and of the post-pruned tree on the test data, as a function of tree size.]
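The same greedy loop in Python (a sketch of mine; it uses a mutable dict-based tree, and all names are illustrative assumptions):

```python
from collections import Counter

# Mutable tree nodes: internal {"attr": A, "branches": {value: node}},
# leaf {"leaf": label}. Examples are dicts with a "label" key.

def classify(node, x):
    while "leaf" not in node:
        node = node["branches"][x[node["attr"]]]
    return node["leaf"]

def accuracy(tree, examples):
    return sum(classify(tree, x) == x["label"] for x in examples) / len(examples)

def internal_nodes(node, reaching):
    """Yield (node, training examples reaching it) for every internal node."""
    if "leaf" in node:
        return
    yield node, reaching
    for v, child in node["branches"].items():
        yield from internal_nodes(child, [x for x in reaching if x[node["attr"]] == v])

def reduced_error_prune(tree, d_train, d_validation):
    """Greedily prune while validation accuracy does not decrease.
    Assumes every node is reached by at least one training example."""
    best_acc = accuracy(tree, d_validation)
    while True:
        best = None
        for node, reaching in list(internal_nodes(tree, d_train)):
            saved = dict(node)
            majority = Counter(x["label"] for x in reaching).most_common(1)[0][0]
            node.clear(); node["leaf"] = majority        # tentatively prune
            acc = accuracy(tree, d_validation)
            node.clear(); node.update(saved)             # undo the prune
            if acc >= best_acc:
                best_acc, best = acc, (node, majority)
        if best is None:                                 # no prune helps -> stop
            return tree
        node, majority = best
        node.clear(); node["leaf"] = majority            # commit the best prune
```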

Pruning algorithms

1. Reduced-error pruning
2. Pessimistic error pruning
3. Minimum error pruning
4. Critical value pruning
5. Cost-complexity pruning

Reading:
1. F. Esposito, D. Malerba, and G. Semeraro, "A Comparative Analysis of Methods for Pruning Decision Trees", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 19, No. 5, pp. 476-491, May 1997.
2. S. R. Safavian and D. Landgrebe, "A Survey of Decision Tree Classifier Methodology", IEEE Trans. on Systems, Man, and Cybernetics, Vol. 21, No. 3, pp. 660-674, 1991.

Continuous Valued Attributes

Two methods for handling continuous attributes:
1. Discretization (e.g., histogramming): break real-valued attributes into ranges in advance.
   Example: high = {Temp > 35C}, med = {10C < Temp <= 35C}, low = {Temp <= 10C}.
2. Thresholded splits: a test A <= a produces the subsets A <= a and A > a. Information gain is calculated the same way as for discrete splits.

How to find the split with the highest gain? Sort the examples by the value of the attribute and consider candidate thresholds midway between adjacent examples whose labels differ (the original slide illustrates this with a small table of lengths, labels, and candidate thresholds).
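A sketch (mine) of the threshold search; the example values are illustrative stand-ins for the slide's length/label table:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(pairs):
    """pairs: (value, label) tuples for one continuous attribute.
    Returns (threshold, gain) for the binary split value <= threshold."""
    pairs = sorted(pairs)
    values = [v for v, _ in pairs]
    labels = [l for _, l in pairs]
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if labels[i - 1] == labels[i]:
            continue                      # only label changes can be optimal
        t = (values[i - 1] + values[i]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        g = (base - len(left) / len(pairs) * entropy(left)
                  - len(right) / len(pairs) * entropy(right))
        if g > best[1]:
            best = (t, g)
    return best

# Illustrative data: the threshold 54.0 separates the two No's on the left.
print(best_threshold([(40, "No"), (48, "No"), (60, "Yes"),
                      (72, "Yes"), (80, "Yes"), (90, "No")]))
```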

Missing Data

Problem: what if some examples are missing values of attribute A? Attribute values may be unknown during training or testing, e.g., an example <..., Blood-Test = ?, ...>, because measuring them has low priority or costs too much.

- Training: we must evaluate Gain(D, A) even though for some x in D the value of A is not given.
- Testing: we must classify a new example x without knowing the value of A.

Solution: incorporate a guess for the missing value into the calculation of Gain(D, A).

Consider the PlayTennis dataset with the Humidity value of day 8 missing:

Day 8: Sunny, Mild, ???, Light -> No

What is the decision tree? The root split on Outlook [9+,5-] (Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]) is unchanged, but evaluating Gain(D_Sunny, Humidity) requires a guess for day 8's Humidity.
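One simple guess (a sketch of mine; the slides only name the strategy, so this "most common value among examples with the same label" rule is one standard instantiation):

```python
from collections import Counter

def fill_missing(examples, attribute, missing="???"):
    """Replace missing values of `attribute` with the most common value
    among examples that share the same label (a simple guess)."""
    by_label = {}
    for x in examples:
        if x[attribute] != missing:
            by_label.setdefault(x["label"], Counter())[x[attribute]] += 1
    filled = []
    for x in examples:
        if x[attribute] == missing:
            guess = by_label[x["label"]].most_common(1)[0][0]
            x = dict(x, **{attribute: guess})
        filled.append(x)
    return filled

# Among the other "No" examples the most common Humidity is High,
# so the missing value is guessed to be High.
data = [
    {"Humidity": "High", "label": "No"}, {"Humidity": "High", "label": "No"},
    {"Humidity": "Normal", "label": "No"}, {"Humidity": "???", "label": "No"},
]
print(fill_missing(data, "Humidity")[3])
```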

Attributes with Many Values

Problem: if an attribute has many values, such as Date, Gain(·) will select it. (Why? Splitting on it yields many small, nearly pure subsets, driving the second term of the gain toward zero.)

One approach: use GainRatio instead of Gain:

Gain(D, A) = H(D) - \sum_{v \in values(A)} \frac{|D_v|}{|D|} H(D_v)

GainRatio(D, A) = \frac{Gain(D, A)}{SplitInformation(D, A)}

SplitInformation(D, A) = -\sum_{v \in values(A)} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}

SplitInformation grows with the number of values of A, so dividing by it penalizes attributes with more values.

What is its inductive bias? A preference bias (for a lower branching factor), expressed via GainRatio(·).

An alternative attribute-selection measure: the Gini index.
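A sketch (mine) of GainRatio on the list-of-dicts layout used earlier:

```python
import math
from collections import Counter

def entropy(examples):
    n = len(examples)
    counts = Counter(x["label"] for x in examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def split_information(examples, attribute):
    """-sum_v |D_v|/|D| log2 |D_v|/|D|: the entropy of the split itself."""
    n = len(examples)
    counts = Counter(x[attribute] for x in examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain_ratio(examples, attribute):
    """Assumes the attribute takes at least two values (SplitInformation > 0)."""
    n = len(examples)
    rem = 0.0
    for v in {x[attribute] for x in examples}:
        subset = [x for x in examples if x[attribute] == v]
        rem += len(subset) / n * entropy(subset)
    return (entropy(examples) - rem) / split_information(examples, attribute)

# A Date-like attribute unique to each example maximizes Gain but also
# maximizes SplitInformation (log2 n), so its GainRatio stays modest.
```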

Handling Attributes With Different Costs

Problem: in some learning tasks the instance attributes may have associated costs.

Solutions (replace the gain criterion with a cost-sensitive one):

1. Extended ID3: Gain(S, A) / Cost(A)
2. Tan and Schlimmer: Gain^2(S, A) / Cost(A)
3. Nunez: \frac{2^{Gain(S, A)} - 1}{(Cost(A) + 1)^w}, where w \in [0, 1] is a constant.
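The three criteria side by side (a sketch of mine; the gain and cost values are illustrative):

```python
def extended_id3(gain, cost):
    return gain / cost

def tan_schlimmer(gain, cost):
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):
    """w in [0, 1] trades off information gain against attribute cost."""
    return (2 ** gain - 1) / (cost + 1) ** w

# A cheap test with modest gain can outrank an expensive, slightly
# better one under any of these criteria:
print(extended_id3(0.15, 1.0), extended_id3(0.25, 10.0))   # 0.15 vs 0.025
print(nunez(0.15, 1.0), nunez(0.25, 10.0))                 # ~0.078 vs ~0.057
```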

Regression Trees

The appropriate impurity measure for regression is the mean squared error over the subset of X reaching node m: in a regression tree, the goodness of a split is measured by the mean squared error from the estimated value.

[Figure: a model-tree node tests attribute A against a threshold A_1; the True branch predicts with a local model F_1(A - A_1) and the False branch with F_2(A - A_1).]

References:
1. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Belmont, CA: Wadsworth International Group, 1984.
2. D. Malerba, F. Esposito, M. Ceci, and A. Appice, "Top-Down Induction of Model Trees with Regression and Splitting Nodes", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 25, No. 5, May 2003.
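A sketch (mine) of split goodness by mean squared error, with each child predicted by its mean (the usual regression-tree leaf estimate):

```python
def mse(ys):
    """Mean squared error around the subset mean (the leaf's estimate)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def split_mse(xs, ys, threshold):
    """Weighted MSE of the two children for the split x <= threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    return len(left) / n * mse(left) + len(right) / n * mse(right)

# Illustrative data with two plateaus around y = 1 and y = 5:
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.1, 0.9, 1.0, 5.2, 4.8, 5.0]
print(split_mse(xs, ys, 3.0))    # ~0.017: the split separates the plateaus
print(split_mse(xs, ys, 10.0))   # ~2.21: the left child mixes both plateaus
```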

Other types of decision trees

Univariate trees: the test at each internal node uses only one of the input attributes.

Multivariate trees: the test at each internal node can use all input attributes. For example, consider a dataset with numerical attributes: the test can be a weighted linear combination of some input attributes. Let X = (x_1, x_2) be the input attributes; then f(X) = w_0 + w_1 x_1 + w_2 x_2 can be used as the test at an internal node, such as f(X) > 0 (True -> YES, False -> NO).

Reading: C. E. Brodley and P. E. Utgoff, "Multivariate Decision Trees", Machine Learning, Vol. 19, pp. 45-77, 1995.
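A sketch (mine) of such a multivariate (oblique) test; the weights are illustrative, not learned:

```python
def linear_test(x, w0=-5.0, w1=1.0, w2=1.0):
    """Multivariate node test: route by the sign of w0 + w1*x1 + w2*x2."""
    return w0 + w1 * x[0] + w2 * x[1] > 0

# A univariate tree would need a staircase of axis-parallel splits to
# approximate this single oblique boundary x1 + x2 = 5.
print(linear_test((4.0, 2.0)))   # True  -> YES branch
print(linear_test((1.0, 1.0)))   # False -> NO branch
```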


More information

CSL465/603 - Machine Learning

CSL465/603 - Machine Learning CSL465/603 - Machine Learning Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Introduction CSL465/603 - Machine Learning 1 Administrative Trivia Course Structure 3-0-2 Lecture Timings Monday 9.55-10.45am

More information

Notes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1

Notes on The Sciences of the Artificial Adapted from a shorter document written for course (Deciding What to Design) 1 Notes on The Sciences of the Artificial Adapted from a shorter document written for course 17-652 (Deciding What to Design) 1 Ali Almossawi December 29, 2005 1 Introduction The Sciences of the Artificial

More information

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering

More information

Multi-label Classification via Multi-target Regression on Data Streams

Multi-label Classification via Multi-target Regression on Data Streams Multi-label Classification via Multi-target Regression on Data Streams Aljaž Osojnik 1,2, Panče Panov 1, and Sašo Džeroski 1,2,3 1 Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia 2 Jožef Stefan

More information

Constructive Induction-based Learning Agents: An Architecture and Preliminary Experiments

Constructive Induction-based Learning Agents: An Architecture and Preliminary Experiments Proceedings of the First International Workshop on Intelligent Adaptive Systems (IAS-95) Ibrahim F. Imam and Janusz Wnek (Eds.), pp. 38-51, Melbourne Beach, Florida, 1995. Constructive Induction-based

More information

Chinese Language Parsing with Maximum-Entropy-Inspired Parser

Chinese Language Parsing with Maximum-Entropy-Inspired Parser Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art

More information

On-Line Data Analytics

On-Line Data Analytics International Journal of Computer Applications in Engineering Sciences [VOL I, ISSUE III, SEPTEMBER 2011] [ISSN: 2231-4946] On-Line Data Analytics Yugandhar Vemulapalli #, Devarapalli Raghu *, Raja Jacob

More information

Calibration of Confidence Measures in Speech Recognition

Calibration of Confidence Measures in Speech Recognition Submitted to IEEE Trans on Audio, Speech, and Language, July 2010 1 Calibration of Confidence Measures in Speech Recognition Dong Yu, Senior Member, IEEE, Jinyu Li, Member, IEEE, Li Deng, Fellow, IEEE

More information

Applications of data mining algorithms to analysis of medical data

Applications of data mining algorithms to analysis of medical data Master Thesis Software Engineering Thesis no: MSE-2007:20 August 2007 Applications of data mining algorithms to analysis of medical data Dariusz Matyja School of Engineering Blekinge Institute of Technology

More information

Softprop: Softmax Neural Network Backpropagation Learning

Softprop: Softmax Neural Network Backpropagation Learning Softprop: Softmax Neural Networ Bacpropagation Learning Michael Rimer Computer Science Department Brigham Young University Provo, UT 84602, USA E-mail: mrimer@axon.cs.byu.edu Tony Martinez Computer Science

More information

RANKING AND UNRANKING LEFT SZILARD LANGUAGES. Erkki Mäkinen DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF TAMPERE REPORT A ER E P S I M S

RANKING AND UNRANKING LEFT SZILARD LANGUAGES. Erkki Mäkinen DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF TAMPERE REPORT A ER E P S I M S N S ER E P S I M TA S UN A I S I T VER RANKING AND UNRANKING LEFT SZILARD LANGUAGES Erkki Mäkinen DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF TAMPERE REPORT A-1997-2 UNIVERSITY OF TAMPERE DEPARTMENT OF

More information

Edexcel GCSE. Statistics 1389 Paper 1H. June Mark Scheme. Statistics Edexcel GCSE

Edexcel GCSE. Statistics 1389 Paper 1H. June Mark Scheme. Statistics Edexcel GCSE Edexcel GCSE Statistics 1389 Paper 1H June 2007 Mark Scheme Edexcel GCSE Statistics 1389 NOTES ON MARKING PRINCIPLES 1 Types of mark M marks: method marks A marks: accuracy marks B marks: unconditional

More information

Multi-label classification via multi-target regression on data streams

Multi-label classification via multi-target regression on data streams Mach Learn (2017) 106:745 770 DOI 10.1007/s10994-016-5613-5 Multi-label classification via multi-target regression on data streams Aljaž Osojnik 1,2 Panče Panov 1 Sašo Džeroski 1,2,3 Received: 26 April

More information

Cooperative evolutive concept learning: an empirical study

Cooperative evolutive concept learning: an empirical study Cooperative evolutive concept learning: an empirical study Filippo Neri University of Piemonte Orientale Dipartimento di Scienze e Tecnologie Avanzate Piazza Ambrosoli 5, 15100 Alessandria AL, Italy Abstract

More information

Semi-Supervised Face Detection

Semi-Supervised Face Detection Semi-Supervised Face Detection Nicu Sebe, Ira Cohen 2, Thomas S. Huang 3, Theo Gevers Faculty of Science, University of Amsterdam, The Netherlands 2 HP Research Labs, USA 3 Beckman Institute, University

More information

Active Learning. Yingyu Liang Computer Sciences 760 Fall

Active Learning. Yingyu Liang Computer Sciences 760 Fall Active Learning Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven,

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

Algebra 1, Quarter 3, Unit 3.1. Line of Best Fit. Overview

Algebra 1, Quarter 3, Unit 3.1. Line of Best Fit. Overview Algebra 1, Quarter 3, Unit 3.1 Line of Best Fit Overview Number of instructional days 6 (1 day assessment) (1 day = 45 minutes) Content to be learned Analyze scatter plots and construct the line of best

More information

Reducing Features to Improve Bug Prediction

Reducing Features to Improve Bug Prediction Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science

More information

Using focal point learning to improve human machine tacit coordination

Using focal point learning to improve human machine tacit coordination DOI 10.1007/s10458-010-9126-5 Using focal point learning to improve human machine tacit coordination InonZuckerman SaritKraus Jeffrey S. Rosenschein The Author(s) 2010 Abstract We consider an automated

More information

A. What is research? B. Types of research

A. What is research? B. Types of research A. What is research? Research = the process of finding solutions to a problem after a thorough study and analysis (Sekaran, 2006). Research = systematic inquiry that provides information to guide decision

More information

Learning Methods for Fuzzy Systems

Learning Methods for Fuzzy Systems Learning Methods for Fuzzy Systems Rudolf Kruse and Andreas Nürnberger Department of Computer Science, University of Magdeburg Universitätsplatz, D-396 Magdeburg, Germany Phone : +49.39.67.876, Fax : +49.39.67.8

More information

Numeracy Medium term plan: Summer Term Level 2C/2B Year 2 Level 2A/3C

Numeracy Medium term plan: Summer Term Level 2C/2B Year 2 Level 2A/3C Numeracy Medium term plan: Summer Term Level 2C/2B Year 2 Level 2A/3C Using and applying mathematics objectives (Problem solving, Communicating and Reasoning) Select the maths to use in some classroom

More information

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17. Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link

More information

Lab 1 - The Scientific Method

Lab 1 - The Scientific Method Lab 1 - The Scientific Method As Biologists we are interested in learning more about life. Through observations of the living world we often develop questions about various phenomena occurring around us.

More information

Axiom 2013 Team Description Paper

Axiom 2013 Team Description Paper Axiom 2013 Team Description Paper Mohammad Ghazanfari, S Omid Shirkhorshidi, Farbod Samsamipour, Hossein Rahmatizadeh Zagheli, Mohammad Mahdavi, Payam Mohajeri, S Abbas Alamolhoda Robotics Scientific Association

More information

Switchboard Language Model Improvement with Conversational Data from Gigaword

Switchboard Language Model Improvement with Conversational Data from Gigaword Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword

More information

Using dialogue context to improve parsing performance in dialogue systems

Using dialogue context to improve parsing performance in dialogue systems Using dialogue context to improve parsing performance in dialogue systems Ivan Meza-Ruiz and Oliver Lemon School of Informatics, Edinburgh University 2 Buccleuch Place, Edinburgh I.V.Meza-Ruiz@sms.ed.ac.uk,

More information

Linking Task: Identifying authors and book titles in verbose queries

Linking Task: Identifying authors and book titles in verbose queries Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,

More information

INPE São José dos Campos

INPE São José dos Campos INPE-5479 PRE/1778 MONLINEAR ASPECTS OF DATA INTEGRATION FOR LAND COVER CLASSIFICATION IN A NEDRAL NETWORK ENVIRONNENT Maria Suelena S. Barros Valter Rodrigues INPE São José dos Campos 1993 SECRETARIA

More information

NORTH CAROLINA VIRTUAL PUBLIC SCHOOL IN WCPSS UPDATE FOR FALL 2007, SPRING 2008, AND SUMMER 2008

NORTH CAROLINA VIRTUAL PUBLIC SCHOOL IN WCPSS UPDATE FOR FALL 2007, SPRING 2008, AND SUMMER 2008 E&R Report No. 08.29 February 2009 NORTH CAROLINA VIRTUAL PUBLIC SCHOOL IN WCPSS UPDATE FOR FALL 2007, SPRING 2008, AND SUMMER 2008 Authors: Dina Bulgakov-Cooke, Ph.D., and Nancy Baenen ABSTRACT North

More information

Multi-Lingual Text Leveling

Multi-Lingual Text Leveling Multi-Lingual Text Leveling Salim Roukos, Jerome Quin, and Todd Ward IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 {roukos,jlquinn,tward}@us.ibm.com Abstract. Determining the language proficiency

More information

AP Calculus AB. Nevada Academic Standards that are assessable at the local level only.

AP Calculus AB. Nevada Academic Standards that are assessable at the local level only. Calculus AB Priority Keys Aligned with Nevada Standards MA I MI L S MA represents a Major content area. Any concept labeled MA is something of central importance to the entire class/curriculum; it is a

More information

Radius STEM Readiness TM

Radius STEM Readiness TM Curriculum Guide Radius STEM Readiness TM While today s teens are surrounded by technology, we face a stark and imminent shortage of graduates pursuing careers in Science, Technology, Engineering, and

More information

Defragmenting Textual Data by Leveraging the Syntactic Structure of the English Language

Defragmenting Textual Data by Leveraging the Syntactic Structure of the English Language Defragmenting Textual Data by Leveraging the Syntactic Structure of the English Language Nathaniel Hayes Department of Computer Science Simpson College 701 N. C. St. Indianola, IA, 50125 nate.hayes@my.simpson.edu

More information

CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH

CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH ISSN: 0976-3104 Danti and Bhushan. ARTICLE OPEN ACCESS CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH Ajit Danti 1 and SN Bharath Bhushan 2* 1 Department

More information

Measurement. When Smaller Is Better. Activity:

Measurement. When Smaller Is Better. Activity: Measurement Activity: TEKS: When Smaller Is Better (6.8) Measurement. The student solves application problems involving estimation and measurement of length, area, time, temperature, volume, weight, and

More information

Version Space. Term 2012/2013 LSI - FIB. Javier Béjar cbea (LSI - FIB) Version Space Term 2012/ / 18

Version Space. Term 2012/2013 LSI - FIB. Javier Béjar cbea (LSI - FIB) Version Space Term 2012/ / 18 Version Space Javier Béjar cbea LSI - FIB Term 2012/2013 Javier Béjar cbea (LSI - FIB) Version Space Term 2012/2013 1 / 18 Outline 1 Learning logical formulas 2 Version space Introduction Search strategy

More information

Introduction to Questionnaire Design

Introduction to Questionnaire Design Introduction to Questionnaire Design Why this seminar is necessary! Bad questions are everywhere! Don t let them happen to you! Fall 2012 Seminar Series University of Illinois www.srl.uic.edu The first

More information

Improving Simple Bayes. Abstract. The simple Bayesian classier (SBC), sometimes called

Improving Simple Bayes. Abstract. The simple Bayesian classier (SBC), sometimes called Improving Simple Bayes Ron Kohavi Barry Becker Dan Sommereld Data Mining and Visualization Group Silicon Graphics, Inc. 2011 N. Shoreline Blvd. Mountain View, CA 94043 fbecker,ronnyk,sommdag@engr.sgi.com

More information

COMPUTER-ASSISTED INDEPENDENT STUDY IN MULTIVARIATE CALCULUS

COMPUTER-ASSISTED INDEPENDENT STUDY IN MULTIVARIATE CALCULUS COMPUTER-ASSISTED INDEPENDENT STUDY IN MULTIVARIATE CALCULUS L. Descalço 1, Paula Carvalho 1, J.P. Cruz 1, Paula Oliveira 1, Dina Seabra 2 1 Departamento de Matemática, Universidade de Aveiro (PORTUGAL)

More information

Model Ensemble for Click Prediction in Bing Search Ads

Model Ensemble for Click Prediction in Bing Search Ads Model Ensemble for Click Prediction in Bing Search Ads Xiaoliang Ling Microsoft Bing xiaoling@microsoft.com Hucheng Zhou Microsoft Research huzho@microsoft.com Weiwei Deng Microsoft Bing dedeng@microsoft.com

More information

A Neural Network GUI Tested on Text-To-Phoneme Mapping

A Neural Network GUI Tested on Text-To-Phoneme Mapping A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis

More information

The Good Judgment Project: A large scale test of different methods of combining expert predictions

The Good Judgment Project: A large scale test of different methods of combining expert predictions The Good Judgment Project: A large scale test of different methods of combining expert predictions Lyle Ungar, Barb Mellors, Jon Baron, Phil Tetlock, Jaime Ramos, Sam Swift The University of Pennsylvania

More information

Predicting Students Performance with SimStudent: Learning Cognitive Skills from Observation

Predicting Students Performance with SimStudent: Learning Cognitive Skills from Observation School of Computer Science Human-Computer Interaction Institute Carnegie Mellon University Year 2007 Predicting Students Performance with SimStudent: Learning Cognitive Skills from Observation Noboru Matsuda

More information

Sociology 521: Social Statistics and Quantitative Methods I Spring Wed. 2 5, Kap 305 Computer Lab. Course Website

Sociology 521: Social Statistics and Quantitative Methods I Spring Wed. 2 5, Kap 305 Computer Lab. Course Website Sociology 521: Social Statistics and Quantitative Methods I Spring 2012 Wed. 2 5, Kap 305 Computer Lab Instructor: Tim Biblarz Office hours (Kap 352): W, 5 6pm, F, 10 11, and by appointment (213) 740 3547;

More information

Learning From the Past with Experiment Databases

Learning From the Past with Experiment Databases Learning From the Past with Experiment Databases Joaquin Vanschoren 1, Bernhard Pfahringer 2, and Geoff Holmes 2 1 Computer Science Dept., K.U.Leuven, Leuven, Belgium 2 Computer Science Dept., University

More information

Math 96: Intermediate Algebra in Context

Math 96: Intermediate Algebra in Context : Intermediate Algebra in Context Syllabus Spring Quarter 2016 Daily, 9:20 10:30am Instructor: Lauri Lindberg Office Hours@ tutoring: Tutoring Center (CAS-504) 8 9am & 1 2pm daily STEM (Math) Center (RAI-338)

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

Visit us at:

Visit us at: White Paper Integrating Six Sigma and Software Testing Process for Removal of Wastage & Optimizing Resource Utilization 24 October 2013 With resources working for extended hours and in a pressurized environment,

More information

Artificial Neural Networks written examination

Artificial Neural Networks written examination 1 (8) Institutionen för informationsteknologi Olle Gällmo Universitetsadjunkt Adress: Lägerhyddsvägen 2 Box 337 751 05 Uppsala Artificial Neural Networks written examination Monday, May 15, 2006 9 00-14

More information

Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming

Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming Data Mining VI 205 Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming C. Romero, S. Ventura, C. Hervás & P. González Universidad de Córdoba, Campus Universitario de

More information

Getting Started with TI-Nspire High School Science

Getting Started with TI-Nspire High School Science Getting Started with TI-Nspire High School Science 2012 Texas Instruments Incorporated Materials for Institute Participant * *This material is for the personal use of T3 instructors in delivering a T3

More information

STT 231 Test 1. Fill in the Letter of Your Choice to Each Question in the Scantron. Each question is worth 2 point.

STT 231 Test 1. Fill in the Letter of Your Choice to Each Question in the Scantron. Each question is worth 2 point. STT 231 Test 1 Fill in the Letter of Your Choice to Each Question in the Scantron. Each question is worth 2 point. 1. A professor has kept records on grades that students have earned in his class. If he

More information

Alignment of Australian Curriculum Year Levels to the Scope and Sequence of Math-U-See Program

Alignment of Australian Curriculum Year Levels to the Scope and Sequence of Math-U-See Program Alignment of s to the Scope and Sequence of Math-U-See Program This table provides guidance to educators when aligning levels/resources to the Australian Curriculum (AC). The Math-U-See levels do not address

More information

Multivariate k-nearest Neighbor Regression for Time Series data -

Multivariate k-nearest Neighbor Regression for Time Series data - Multivariate k-nearest Neighbor Regression for Time Series data - a novel Algorithm for Forecasting UK Electricity Demand ISF 2013, Seoul, Korea Fahad H. Al-Qahtani Dr. Sven F. Crone Management Science,

More information

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

12- A whirlwind tour of statistics

12- A whirlwind tour of statistics CyLab HT 05-436 / 05-836 / 08-534 / 08-734 / 19-534 / 19-734 Usable Privacy and Security TP :// C DU February 22, 2016 y & Secu rivac rity P le ratory bo La Lujo Bauer, Nicolas Christin, and Abby Marsh

More information

AUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH 1. INTRODUCTION

AUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH 1. INTRODUCTION JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 11/2007, ISSN 1642-6037 Marek WIŚNIEWSKI *, Wiesława KUNISZYK-JÓŹKOWIAK *, Elżbieta SMOŁKA *, Waldemar SUSZYŃSKI * HMM, recognition, speech, disorders

More information

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Email Marilyn A. Walker Jeanne C. Fromer Shrikanth Narayanan walker@research.att.com jeannie@ai.mit.edu shri@research.att.com

More information

Detecting English-French Cognates Using Orthographic Edit Distance

Detecting English-French Cognates Using Orthographic Edit Distance Detecting English-French Cognates Using Orthographic Edit Distance Qiongkai Xu 1,2, Albert Chen 1, Chang i 1 1 The Australian National University, College of Engineering and Computer Science 2 National

More information

Software Maintenance

Software Maintenance 1 What is Software Maintenance? Software Maintenance is a very broad activity that includes error corrections, enhancements of capabilities, deletion of obsolete capabilities, and optimization. 2 Categories

More information

Mining Association Rules in Student s Assessment Data

Mining Association Rules in Student s Assessment Data www.ijcsi.org 211 Mining Association Rules in Student s Assessment Data Dr. Varun Kumar 1, Anupama Chadha 2 1 Department of Computer Science and Engineering, MVN University Palwal, Haryana, India 2 Anupama

More information

TOPICS LEARNING OUTCOMES ACTIVITES ASSESSMENT Numbers and the number system

TOPICS LEARNING OUTCOMES ACTIVITES ASSESSMENT Numbers and the number system Curriculum Overview Mathematics 1 st term 5º grade - 2010 TOPICS LEARNING OUTCOMES ACTIVITES ASSESSMENT Numbers and the number system Multiplies and divides decimals by 10 or 100. Multiplies and divide

More information

FRAMEWORK FOR IDENTIFYING THE MOST LIKELY SUCCESSFUL UNDERPRIVILEGED TERTIARY STUDY BURSARY APPLICANTS

FRAMEWORK FOR IDENTIFYING THE MOST LIKELY SUCCESSFUL UNDERPRIVILEGED TERTIARY STUDY BURSARY APPLICANTS South African Journal of Industrial Engineering August 2017 Vol 28(2), pp 59-77 FRAMEWORK FOR IDENTIFYING THE MOST LIKELY SUCCESSFUL UNDERPRIVILEGED TERTIARY STUDY BURSARY APPLICANTS R. Steynberg 1 * #,

More information

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com

More information

Evaluating and Comparing Classifiers: Review, Some Recommendations and Limitations

Evaluating and Comparing Classifiers: Review, Some Recommendations and Limitations Evaluating and Comparing Classifiers: Review, Some Recommendations and Limitations Katarzyna Stapor (B) Institute of Computer Science, Silesian Technical University, Gliwice, Poland katarzyna.stapor@polsl.pl

More information

A Bootstrapping Model of Frequency and Context Effects in Word Learning

A Bootstrapping Model of Frequency and Context Effects in Word Learning Cognitive Science 41 (2017) 590 622 Copyright 2016 Cognitive Science Society, Inc. All rights reserved. ISSN: 0364-0213 print / 1551-6709 online DOI: 10.1111/cogs.12353 A Bootstrapping Model of Frequency

More information