Decision Tree. Machine Learning. Hamid Beigy. Sharif University of Technology. Fall 1396


Table of contents
1 Introduction
2 Decision tree classification
3 Building decision trees
4 ID3 Algorithm

Introduction
1 The decision tree is a classic and natural model of learning.
2 It is closely related to the notion of divide and conquer.
3 A decision tree partitions the instance space into axis-parallel regions, each labeled with a class value.
4 Why decision trees?
  Interpretable; popular in medical applications because they mimic the way a doctor thinks.
  Can model discrete outcomes nicely.
  Can be very powerful; can be as complex as you need them to be.
  C4.5 and CART decision trees are very popular.

Decision tree classification
1 Structure of decision trees
  Each internal node tests an attribute.
  Each branch corresponds to an attribute value.
  Each leaf node assigns a classification.
2 Decision tree for PlayTennis
  Outlook? [9+,5-]
    Sunny [2+,3-]: Humidity?
      High: No [0+,3-]
      Normal: Yes [2+,0-]
    Overcast [4+,0-]: Yes
    Rain [3+,2-]: Wind?
      Strong: No [0+,2-]
      Light: Yes [3+,0-]
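To make this structure concrete, here is a minimal Python sketch (not from the slides; the dictionary layout and names are my own) that encodes the PlayTennis tree above and classifies a new example by following the matching branches from the root to a leaf:

```python
# A minimal sketch: internal nodes are dicts that name the attribute they test
# and map each attribute value to a subtree; leaves are plain class labels.
playtennis_tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {
            "attribute": "Humidity",
            "branches": {"High": "No", "Normal": "Yes"},
        },
        "Overcast": "Yes",
        "Rain": {
            "attribute": "Wind",
            "branches": {"Strong": "No", "Light": "Yes"},
        },
    },
}

def classify(tree, example):
    """Follow the branch matching the example's attribute value until a leaf."""
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["attribute"]]]
    return tree

print(classify(playtennis_tree,
               {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Strong"}))  # Yes
```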

Decision surface
The idea of binary classification trees is not unlike that of the histogram classifier: partition the feature space into cells and label each cell. (Figure 9.2: decision surfaces of (a) a histogram classifier, (b) a linear classifier, and (c) a tree classifier on a two-class point cloud.)

Building decision trees
1 Decision trees recursively subdivide the feature space. The growing process recursively splits existing regions into two smaller regions (binary splits); for simplicity, the splits are perpendicular to one of the feature axes. A large tree grown this way has poor generalization characteristics, so it is then pruned to avoid overfitting. (Figure 9.3: growing a recursive binary tree on X = [0, 1]^2.)
2 The test variable specifies the division. Often the splitting process is based on the training data and is designed to separate data with different labels as much as possible; the splits, and hence the tree structure itself, are then data dependent, which causes major difficulties for the analysis (and tuning) of these methods. Alternatively, the splits can be chosen independently of the training data, as in Dyadic Decision Trees and Recursive Dyadic Partitions (RDPs); see the sketch after this slide. Any decision tree can be associated with a partition of the input space X and vice versa, and an RDP is most efficiently described by a binary tree: each leaf corresponds to a cell of the partition, the orientation of the dyadic split alternates between the levels of the tree, and cells are always split at the midpoint along one coordinate axis, so the side lengths of all cells are dyadic (i.e., powers of 2). (Figure 9.4: example of Recursive Dyadic Partition growing on X = [0, 1]^2.)
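As an illustration of the data-independent construction (a sketch under my own assumptions, not code from the text), the following grows a recursive dyadic partition of X = [0, 1]^2 to a fixed depth, always splitting a cell at its midpoint and alternating the split axis between levels:

```python
def recursive_dyadic_partition(cell=((0.0, 1.0), (0.0, 1.0)), depth=0, max_depth=3):
    """Return the leaf cells of a depth-limited RDP of [0,1]^2.

    Each cell is ((x_lo, x_hi), (y_lo, y_hi)); every split is at the midpoint of
    one coordinate, and the split axis alternates between levels, so all cell
    side lengths are dyadic.
    """
    if depth == max_depth:
        return [cell]
    axis = depth % 2                      # alternate: x at even depths, y at odd
    lo, hi = cell[axis]
    mid = (lo + hi) / 2.0
    left, right = list(cell), list(cell)
    left[axis], right[axis] = (lo, mid), (mid, hi)
    return (recursive_dyadic_partition(tuple(left), depth + 1, max_depth)
            + recursive_dyadic_partition(tuple(right), depth + 1, max_depth))

cells = recursive_dyadic_partition(max_depth=3)
print(len(cells))   # 8 leaf cells, each a dyadic rectangle
print(cells[0])     # ((0.0, 0.25), (0.0, 0.5)): the first cell after three splits
```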

Building decision trees (example)
Training examples for the concept PlayTennis:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
1    Sunny     Hot          High      Light   No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Light   Yes
4    Rain      Mild         High      Light   Yes
5    Rain      Cool         Normal    Light   Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Light   No
9    Sunny     Cool         Normal    Light   Yes
10   Rain      Mild         Normal    Light   Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Light   Yes
14   Rain      Mild         High      Strong  No

ID3 builds the tree (Build-DT) using Gain(·). How will ID3 construct a decision tree?

Building decision trees (cont.)
How to build a decision tree?
1 Start at the top of the tree.
2 Grow it by splitting attributes one by one.
3 Assign leaf nodes.
4 When we get to the bottom, prune the tree to prevent overfitting.
How do we choose a test variable for an internal node? Choosing different impurity measures results in different algorithms; we describe ID3. (Figure: node impurity measures for two-class classification as a function of the proportion p in class 2 — misclassification error, Gini index, and entropy, with the cross-entropy scaled to pass through (0.5, 0.5).)
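To make the impurity curves in that figure concrete, here is a small sketch (my own, not part of the slides) of the three two-class impurity measures as functions of the proportion p of class 2:

```python
import math

def misclassification_error(p):
    """1 - max(p, 1 - p): error rate when predicting the majority class."""
    return 1.0 - max(p, 1.0 - p)

def gini_index(p):
    """2 p (1 - p) for two classes."""
    return 2.0 * p * (1.0 - p)

def entropy(p):
    """-p log2 p - (1 - p) log2 (1 - p), in bits (0 at p = 0 or 1)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

for p in (0.1, 0.3, 0.5):
    print(p, round(misclassification_error(p), 3),
          round(gini_index(p), 3), round(entropy(p), 3))
```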

Building decision trees (cont.)
ID3 uses information gain to choose a test variable for an internal node. The information gain of D relative to attribute A is the expected reduction in entropy due to splitting on A:
  Gain(D, A) = H(D) - Σ_{v ∈ values(A)} (|D_v| / |D|) H(D_v)
where D_v = {x ∈ D : x.A = v} is the set of examples in D for which attribute A has value v.
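A minimal sketch of these two quantities (my own code; the example format, a list of (attribute-dict, label) pairs, is an assumption):

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum_c p_c log2 p_c over the class labels in D."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute):
    """Gain(D, A) = H(D) - sum_v (|D_v|/|D|) H(D_v), examples = [(attrs, label), ...]."""
    labels = [label for _, label in examples]
    gain = entropy(labels)
    for v in {attrs[attribute] for attrs, _ in examples}:
        subset = [label for attrs, label in examples if attrs[attribute] == v]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

toy = [({"Wind": "Strong"}, "No"), ({"Wind": "Light"}, "Yes"), ({"Wind": "Light"}, "Yes")]
print(round(information_gain(toy, "Wind"), 3))  # 0.918: splitting on Wind makes both subsets pure
```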

ID3 Algorithm
Consider the PlayTennis training examples above.
  H(D) = -(9/14) log(9/14) - (5/14) log(5/14) = 0.94 bits
  H(D_{Humidity=High}) = -(3/7) log(3/7) - (4/7) log(4/7) = 0.985 bits
  H(D_{Humidity=Normal}) = -(6/7) log(6/7) - (1/7) log(1/7) = 0.592 bits
  Gain(D, Humidity) = 0.94 - (7/14)(0.985) - (7/14)(0.592) = 0.151 bits
  Gain(D, Wind) = 0.94 - (8/14)(0.811) - (6/14)(1.0) = 0.048 bits

ID3 Algorithm (cont.)
Gains for the whole training set D:
  Gain(D, Humidity) = 0.151 bits
  Gain(D, Wind) = 0.048 bits
  Gain(D, Temperature) = 0.029 bits
  Gain(D, Outlook) = 0.246 bits
Outlook has the highest information gain, so it is selected as the root attribute. Its branches partition the examples into Sunny [2+, 3-], Overcast [4+, 0-], and Rain [3+, 2-].

ID3 Algorithm (cont.)
For the Sunny branch (D_Sunny, [2+, 3-]):
  Gain(D_Sunny, Humidity) = 0.97 bits
  Gain(D_Sunny, Wind) = 0.02 bits
  Gain(D_Sunny, Temperature) = 0.57 bits
So Humidity is selected below Sunny; Overcast is pure (Yes), and Wind is selected below Rain. The resulting decision tree is the PlayTennis tree shown earlier:
  Outlook? [9+,5-]
    Sunny [2+,3-]: Humidity?  (High: No [0+,3-], Normal: Yes [2+,0-])
    Overcast [4+,0-]: Yes
    Rain [3+,2-]: Wind?  (Strong: No [0+,2-], Light: Yes [3+,0-])
Incremental learning of decision trees: ID3 cannot be trained incrementally; ID4, ID5, and ID5R are examples of incremental induction of decision trees.
Reading: P.E. Utgoff, "Incremental Induction of Decision Trees", Machine Learning, Vol. 4, pp. 161-186, 1989.
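Putting the pieces together, here is a compact ID3 sketch (my own code, following the procedure on these slides) that recomputes the gains and rebuilds this tree from the PlayTennis examples:

```python
import math
from collections import Counter

# (Outlook, Temperature, Humidity, Wind) -> PlayTennis, from the table above.
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
DATA = [
    ("Sunny", "Hot", "High", "Light", "No"),    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Light", "Yes"), ("Rain", "Mild", "High", "Light", "Yes"),
    ("Rain", "Cool", "Normal", "Light", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Light", "No"),
    ("Sunny", "Cool", "Normal", "Light", "Yes"), ("Rain", "Mild", "Normal", "Light", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Light", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
EXAMPLES = [(dict(zip(ATTRS, row[:-1])), row[-1]) for row in DATA]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(examples, attribute):
    g = entropy([y for _, y in examples])
    for v in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == v]
        g -= (len(subset) / len(examples)) * entropy(subset)
    return g

def id3(examples, attributes):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                  # pure node -> leaf
        return labels[0]
    if not attributes:                         # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))
    tree = {"attribute": best, "branches": {}}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        tree["branches"][v] = id3(subset, [a for a in attributes if a != best])
    return tree

print(round(gain(EXAMPLES, "Outlook"), 3))  # ~0.247 (the slide rounds to 0.246): the largest gain
print(id3(EXAMPLES, ATTRS))                 # the tree shown on this slide
```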

Inductive Bias in ID3
Types of biases
1 Preference (search) bias: puts a priority on choosing a hypothesis.
2 Language bias: puts a restriction on the set of hypotheses considered.
Which bias is better?
1 Preference bias is more desirable.
2 Because the learner works within a complete hypothesis space that is assured to contain the unknown concept.
Inductive bias of ID3
1 Shorter trees are preferred over longer trees.
2 Occam's razor: prefer the simplest hypothesis that fits the data.
3 Trees that place high-information-gain attributes close to the root are preferred over those that do not.

Overfitting in ID3
How can we avoid overfitting?
1 Prevention
  Stop training (growing) before the tree reaches the point where it overfits.
  Select attributes that are relevant (i.e., will be useful in the decision tree); this requires some predictive measure of relevance.
2 Avoidance
  Allow the tree to overfit, then improve its generalization capability.
  Hold out a validation set.
3 Detection and recovery
  Let the problem happen, detect when it does, and recover afterward.
  Build the model, then remove (prune) the elements that contribute to overfitting.
How do we select the best tree?
1 Training and validation set: use a separate set of examples (distinct from the training set) for evaluation.
2 Statistical test: use all data for training, but apply a statistical test to estimate the overfitting.
3 Complexity measure: define a measure of tree complexity and halt growing when this measure is minimized.

Pruning algorithms
Reduced-Error Pruning (a cross-validation approach using training and validation sets):
Reduced-Error-Pruning(D)
  Partition D into D_train (training/growing set) and D_validation (validation/pruning set).
  Build tree T using ID3 on D_train.
  Until accuracy on D_validation decreases, do:
    For each non-leaf node candidate in T:
      Temp[candidate] ← Prune(T, candidate), i.e., replace the subtree rooted at candidate with a leaf labeled with the majority label of its associated examples.
      Accuracy[candidate] ← Test(Temp[candidate], D_validation)
    T ← the Temp[candidate] with the best accuracy (greedy: best increase).
  Return the pruned tree T.
(Figures: as the number of nodes in the tree grows, the training error keeps decreasing while the true error eventually increases; likewise, accuracy on the training data keeps improving while accuracy on the test data falls off, and the post-pruned tree retains the higher test accuracy.)
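A minimal sketch of the greedy pruning loop above (my own code; the nested-dict tree layout, the stored majority labels, and the tiny validation set are illustrative assumptions, not the lecturer's implementation):

```python
import copy

# Internal nodes: {"attribute", "branches", "majority"}; leaves: plain labels.
# "majority" is the majority label of the training examples reaching the node;
# it becomes the leaf label if that node is pruned.
tree = {
    "attribute": "Outlook", "majority": "Yes",
    "branches": {
        "Sunny": {"attribute": "Humidity", "majority": "No",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"attribute": "Wind", "majority": "Yes",
                 "branches": {"Strong": "No", "Light": "Yes"}},
    },
}

# A small held-out validation (pruning) set of (attributes, label) pairs; illustrative only.
validation = [
    ({"Outlook": "Sunny", "Humidity": "High", "Wind": "Light"}, "No"),
    ({"Outlook": "Rain", "Humidity": "Normal", "Wind": "Light"}, "Yes"),
    ({"Outlook": "Overcast", "Humidity": "High", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
]

def classify(node, x):
    while isinstance(node, dict):
        node = node["branches"].get(x[node["attribute"]], node["majority"])
    return node

def accuracy(node, examples):
    return sum(classify(node, x) == y for x, y in examples) / len(examples)

def internal_nodes(node, path=()):
    """Yield the path (sequence of branch values) to every internal node."""
    if isinstance(node, dict):
        yield path
        for value, child in node["branches"].items():
            yield from internal_nodes(child, path + (value,))

def pruned_copy(tree, path):
    """Return a copy of the tree with the node at `path` replaced by its majority leaf."""
    new_tree = copy.deepcopy(tree)
    if not path:
        return new_tree["majority"]
    node = new_tree
    for value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1]] = node["branches"][path[-1]]["majority"]
    return new_tree

def reduced_error_pruning(tree, validation):
    best_acc = accuracy(tree, validation)
    while isinstance(tree, dict):
        # Greedily try pruning each internal node and keep the best candidate.
        candidates = [pruned_copy(tree, path) for path in internal_nodes(tree)]
        cand = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(cand, validation) < best_acc:   # pruning now hurts -> stop
            break
        tree, best_acc = cand, accuracy(cand, validation)
    return tree

print(reduced_error_pruning(tree, validation))
# With this toy validation set, the Wind test under Rain is pruned away
# while the Humidity test under Sunny survives.
```

In real use the tree would be grown on D_train with ID3, recording each node's majority label during growing.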

Pruning algorithms
1 Reduced Error Pruning
2 Pessimistic Error Pruning
3 Minimum Error Pruning
4 Critical Value Pruning
5 Cost-Complexity Pruning
Reading
1 F. Esposito, D. Malerba, and G. Semeraro, "A Comparative Analysis of Methods for Pruning Decision Trees", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 19, No. 5, pp. 476-491, May 1997.
2 S. R. Safavian and D. Landgrebe, "A Survey of Decision Tree Classifier Methodology", IEEE Trans. on Systems, Man, and Cybernetics, Vol. 21, No. 3, pp. 660-674, 1991.

Continuous Valued Attributes
Two methods for handling continuous attributes:
1 Discretization (e.g., histogramming): break real-valued attributes into ranges in advance.
  Example: high = {Temp > 35C}, med = {10C < Temp ≤ 35C}, low = {Temp ≤ 10C}.
2 Thresholded splits: a test A ≤ a produces the subsets A ≤ a and A > a.
Information gain is calculated the same way as for discrete splits.
How do we find the split with the highest gain? Sort by the attribute and consider thresholds midway between consecutive examples whose labels differ:
  length:     10   15   21   28   32   40   50
  label:       -    +    +    -    +    +    -
  thresholds: 12.5, 24.5, 30, 45
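A small sketch of that threshold search (my own code, using the toy data above): sort by the attribute, form candidate thresholds midway between adjacent examples with different labels, and keep the one with the highest information gain:

```python
import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(points):
    """points: list of (value, label). Return (threshold, gain) maximizing information gain."""
    points = sorted(points)
    base = entropy([y for _, y in points])
    best = (None, -1.0)
    for (v1, y1), (v2, y2) in zip(points, points[1:]):
        if y1 == y2:
            continue                      # only boundaries where the label changes matter
        t = (v1 + v2) / 2.0
        left = [y for v, y in points if v <= t]
        right = [y for v, y in points if v > t]
        g = base - (len(left) / len(points)) * entropy(left) \
                 - (len(right) / len(points)) * entropy(right)
        if g > best[1]:
            best = (t, g)
    return best

data = [(10, "-"), (15, "+"), (21, "+"), (28, "-"), (32, "+"), (40, "+"), (50, "-")]
# Evaluates exactly the candidate thresholds 12.5, 24.5, 30, 45 from the slide;
# 12.5 is returned (threshold 45 achieves the same gain).
print(best_threshold(data))
```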

Missing Data
Problem: what if some examples are missing values of attribute A? Unknown attribute values can occur during training or testing (e.g., <Fever = Normal, Blood-Test = ?, ...>), sometimes because a measurement has low priority or its cost is too high.
  Training: evaluate Gain(D, A) when for some x ∈ D the value of A is not given.
  Testing: classify a new example x without knowing the value of A.
Solution: incorporate a guess about the missing value into the calculation of Gain(D, A).
Consider the PlayTennis dataset in which the Humidity value of day 8 (Sunny, Mild, ???, Light, No) is missing. What is the resulting decision tree?
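One simple way to incorporate such a guess (a sketch of one common strategy, not necessarily the one intended in the lecture; C4.5, for instance, instead distributes such examples fractionally across branches) is to fill the missing value with the most common observed value among examples of the same class before computing Gain(D, A):

```python
from collections import Counter

MISSING = "???"

def fill_missing(examples, attribute):
    """Replace missing values of `attribute` with the most common observed value
    among examples of the same class (falling back to the overall mode)."""
    observed = [x[attribute] for x, _ in examples if x[attribute] != MISSING]
    overall_mode = Counter(observed).most_common(1)[0][0]
    filled = []
    for x, y in examples:
        if x[attribute] == MISSING:
            same_class = [v[attribute] for v, label in examples
                          if label == y and v[attribute] != MISSING]
            guess = Counter(same_class).most_common(1)[0][0] if same_class else overall_mode
            x = {**x, attribute: guess}
        filled.append((x, y))
    return filled

# A tiny illustrative subset: the last example mimics day 8 with Humidity missing.
examples = [
    ({"Humidity": "High"}, "No"), ({"Humidity": "High"}, "No"),
    ({"Humidity": "Normal"}, "Yes"), ({"Humidity": MISSING}, "No"),
]
print(fill_missing(examples, "Humidity"))   # the missing value is guessed as "High"
```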

Attributes with Many Values
Problem: if an attribute has many values (such as Date), Gain(·) will select it. (Why? Splitting on it yields many small, nearly pure subsets, driving the conditional entropy toward zero.)
One approach: use GainRatio instead of Gain:
  Gain(D, A) = H(D) - Σ_{v ∈ values(A)} (|D_v| / |D|) H(D_v)
  GainRatio(D, A) = Gain(D, A) / SplitInformation(D, A)
  SplitInformation(D, A) = -Σ_{v ∈ values(A)} (|D_v| / |D|) log(|D_v| / |D|)
SplitInformation grows with the number of values of A, i.e., it penalizes attributes with more values.
What is its inductive bias? A preference bias (for a lower branching factor) expressed via GainRatio(·).
An alternative attribute-selection measure is the Gini index.
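A small sketch (my own code) of GainRatio versus Gain on a toy set where a Date-like attribute takes a distinct value on every example:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(examples, attribute):
    g = entropy([y for _, y in examples])
    for v in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == v]
        g -= (len(subset) / len(examples)) * entropy(subset)
    return g

def split_information(examples, attribute):
    """Entropy of the partition induced by the attribute's values."""
    return entropy([x[attribute] for x, _ in examples])

def gain_ratio(examples, attribute):
    return gain(examples, attribute) / split_information(examples, attribute)

# "Date" splits 4 examples into 4 singletons: maximal gain, but also maximal
# split information, so its gain ratio is lower than Wind's.
examples = [({"Date": d, "Wind": w}, y) for d, w, y in
            [("d1", "Strong", "No"), ("d2", "Light", "Yes"),
             ("d3", "Light", "Yes"), ("d4", "Strong", "No")]]
for a in ("Date", "Wind"):
    print(a, round(gain(examples, a), 3), round(gain_ratio(examples, a), 3))
```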

Handling Attributes With Different Costs
Problem: in some learning tasks the instance attributes may have associated measurement costs.
Solutions: replace the gain with a cost-sensitive selection measure.
1 Extended ID3: Gain(S, A) / Cost(A)
2 Tan and Schlimmer: Gain²(S, A) / Cost(A)
3 Nunez: (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] is a constant.
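A tiny sketch (my own) of the three cost-sensitive measures, comparing a cheap, weakly informative attribute with an expensive, more informative one:

```python
def extended_id3(gain, cost):
    return gain / cost

def tan_schlimmer(gain, cost):
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):
    """w in [0, 1] controls how strongly the cost is penalized."""
    return (2 ** gain - 1) / (cost + 1) ** w

# (gain=0.15, cost=1) vs. (gain=0.25, cost=10): all three measures favor the cheap test here.
for name, f in [("extended ID3", extended_id3),
                ("Tan & Schlimmer", tan_schlimmer),
                ("Nunez", nunez)]:
    print(name, round(f(0.15, 1.0), 3), round(f(0.25, 10.0), 3))
```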

Regression Tree
In a regression tree the leaves carry numerical estimates, and the appropriate impurity measure is the mean square error: the goodness of a split is measured by the mean square error of the targets of the examples reaching a node, taken around the node's estimated value.
(Figure: a model-tree node testing A ≤ A1, with the True branch using the local model F1(A - A1) and the False branch using F2(A - A1).)
References
1 L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Belmont, CA: Wadsworth International Group, 1984.
2 D. Malerba, F. Esposito, M. Ceci, and A. Appice, "Top-Down Induction of Model Trees with Regression and Splitting Nodes", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 25, No. 5, pp. 612-625, May 2004.
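A short sketch (my own code) of the mean-square-error split criterion on a toy one-dimensional regression problem, where the goodness of a threshold split is the weighted MSE of the two children around their mean estimates:

```python
def mse(values):
    """Mean square error around the node's estimate (the mean of its targets)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_mse(points, threshold):
    """Weighted MSE of the two subsets produced by the test x <= threshold."""
    left = [y for x, y in points if x <= threshold]
    right = [y for x, y in points if x > threshold]
    n = len(points)
    return (len(left) / n) * mse(left) + (len(right) / n) * mse(right)

# Toy 1-D regression data: (x, y) pairs; splitting at x <= 3 separates two flat regions.
points = [(1, 1.0), (2, 1.1), (3, 0.9), (4, 5.0), (5, 5.2), (6, 4.8)]
print(round(split_mse(points, 3), 4))   # small: each child is almost constant
print(round(split_mse(points, 1), 4))   # much larger: the right child mixes both regions
```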

Other types of decision trees
Univariate trees: the test at each internal node uses only one of the input attributes.
Multivariate trees: the test at each internal node can use all input attributes.
  For example, consider a data set with numerical attributes. The test can be a weighted linear combination of some input attributes: if X = (x1, x2) are the input attributes, then f(X) = w0 + w1 x1 + w2 x2 can be used as the test at an internal node, branching on f(X) > 0.
  (Figure: a node testing (w0 + w1 x1 + w2 x2) > 0, with True and False branches leading to YES and NO leaves.)
Reading
C. E. Brodley and P. E. Utgoff, "Multivariate Decision Trees", Machine Learning, Vol. 19, pp. 45-77, 1995.
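A minimal sketch (my own; the weights are illustrative) of a multivariate node whose test is the weighted linear combination f(X) = w0 + w1 x1 + w2 x2 > 0:

```python
def linear_test(weights, x):
    """Multivariate node test: w0 + w1*x1 + ... + wd*xd > 0."""
    w0, *ws = weights
    return w0 + sum(w * xi for w, xi in zip(ws, x)) > 0

def classify(x, weights=(-1.0, 1.0, 1.0)):
    """One multivariate internal node with YES/NO leaves (illustrative weights)."""
    return "YES" if linear_test(weights, x) else "NO"

print(classify((0.8, 0.5)))   # -1.0 + 0.8 + 0.5 > 0  -> YES
print(classify((0.2, 0.3)))   # -1.0 + 0.2 + 0.3 <= 0 -> NO
```

Unlike a univariate test, the resulting decision boundary (x1 + x2 = 1 here) is oblique rather than axis-parallel.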

Other types of decision trees Decision Trees ariate Trees univariate trees, the test at each internal node just uses only one of input 26 ttributes.! Univariate Trees ivariate Univariate " Trees In univariate trees trees, the test at each internal node just uses only one of input multivariate In attributes. univariate trees, the trees, test the at test each atinternal each internal node can nodeuse justall uses input onlyattributes. one of input or! example: Multivariate attributes. Consider Treesa data set with numerical attributes. # The " test In can multivariate be made trees, using the the test weighted at each linear internal combination node can use of all some input input attributes. attributes. Multivariate trees or example " For X=(x example: Consider a data set with numerical attributes. In multivariate 1,x 2 ) be the input attributes. Let f (X)=w # The test can trees, be made the using test the at weighted each internal linear combination node 0 +w can 1 x of some use 1 +w 2 x all 2 can be used for input input attributes. st at an internal node. Such as f (x) > 0. attributes. " For example X=(x 1,x 2 ) be the input attributes. Let f (X)=w 0 +w 1 x 1 +w 2 x 2 can be used for test at an internal node. Such as f (x) > 0. (w 0 +w 1 x 1 +w 2 x 2 )>0 (w 0 +w 1 x 1 +w 2 x 2 )>0 Decision Trees True True False False YES YES [11+, NO2-] [11+, NO2-] ing! Reading. E. Brodley " C. E. Brodley and P. E. Utgoff, Multivariate Vol. 19, References and P. E. Utgoff, Multivariate Decision Trees, Machine Learning, Vol. 19, p. 45-77, pp. 1995. 45-77, 1995. Machine Learning Machine Learning Hamid Beigy (Sharif University of Technology) Decision Tree Fall 1396 22 / 24