CMPT 882 Machine Learning
Decision Tree Learning
Lecture Scribe for Week 7, February 20th
By: Mona Vajihollahi (mvajihol@sfu.ca)

Overview:
Introduction
Decision Tree Hypothesis Space
  Parity Function
  Decision Graph
  Algebraic Decision Diagram
Inductive Bias
  Occam's Razor
Avoiding Overfitting the Data
  Training and Validation
  Reduced Error Pruning
  Rule Post-Pruning
Other Issues in Decision Tree Learning
  k-ary Attribute Values
  Real Attribute Values
  Attribute Cost
  Missing Values
  Other Splitting Criteria
References
Introduction

Decision Tree Learning is a method for approximating discrete-valued functions using decision trees. Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances; this disjunction approximates the target function. Decision tree learning is robust to noisy data and is suitable for target functions with discrete output values. Most decision tree learning algorithms use a top-down greedy search through the space of possible decision trees to find the tree that best fits the training data. ID3, a basic decision tree learning algorithm, constructs the tree by answering the question "which attribute is the best classifier?" at each step. The answer is the attribute with the highest Information Gain, which is the expected reduction in entropy[1] caused by partitioning the examples according to that attribute.

Figure 1: The entropy function of a Boolean classifier.

The following example (Exercise 3.2, p. 77, Machine Learning, Tom Mitchell) illustrates the use of entropy and information gain. Consider the following set of training examples:

    Instance   Classification   a1   a2
       1             +           T    T
       2             +           T    T
       3             -           T    F
       4             +           F    F
       5             -           F    T
       6             -           F    T

[1] The concept of entropy plays an important role in information theory. Roughly speaking, entropy is a mathematical formulation of the uncertainty and/or the amount of information in a data set. The (Shannon) entropy of a variable X is defined as:

    Entropy(X) = - Σ_x P(x) log2 P(x)
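The top-down, greedy construction performed by ID3 can be sketched in a few lines of Python. This is a simplified illustration, not Mitchell's full algorithm: it assumes examples are given as (attribute-dict, label) pairs and omits details such as handling attribute values that do not appear in a subset.

```python
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def gain(examples, attr):
    """Expected reduction in entropy from splitting on attr."""
    labels = [y for x, y in examples]
    g = entropy(labels)
    for v in set(x[attr] for x, y in examples):
        sub = [y for x, y in examples if x[attr] == v]
        g -= len(sub) / len(examples) * entropy(sub)
    return g

def id3(examples, attrs):
    """Return a nested-dict decision tree, or a bare label for a leaf."""
    labels = [y for x, y in examples]
    if len(set(labels)) == 1:            # pure node: make a leaf
        return labels[0]
    if not attrs:                        # no tests left: majority leaf
        return max(set(labels), key=labels.count)
    best = max(attrs, key=lambda a: gain(examples, a))   # greedy choice
    tree = {best: {}}
    for v in set(x[best] for x, y in examples):
        sub = [(x, y) for x, y in examples if x[best] == v]
        tree[best][v] = id3(sub, [a for a in attrs if a != best])
    return tree

# For the truth table of A OR B, the learned tree tests A first:
data = [({"A": a, "B": b}, int(a or b)) for a in (0, 1) for b in (0, 1)]
print(id3(data, ["A", "B"]))   # {'A': {0: {'B': {0: 0, 1: 1}}, 1: 1}}
```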
Figure 1 depicts the form of the entropy function relative to a Boolean classification, as the proportion p+ of positive examples varies between 0 and 1. It is clear that when the data set contains an equal number of positive and negative examples (i.e., p+ = 1/2), the entropy function takes its largest value, which is 1. In the above example there are 3 positive instances and 3 negative instances, so by the above discussion we can conclude that Entropy(S) = 1. Alternatively, we can calculate the entropy using the following equation:

    Entropy(S) = - p+ log2 p+ - p- log2 p-

So:

    Entropy(S) = - (1/2) log2 (1/2) - (1/2) log2 (1/2) = 1

Now, we can calculate the Information Gain of a2 relative to this training set, using the following equation:

    Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

So:

    Gain(S, a2) = Entropy(S) - (|S_{a2=T}| / |S|) Entropy(S_{a2=T}) - (|S_{a2=F}| / |S|) Entropy(S_{a2=F})
                = 1 - (2/3) * 1 - (1/3) * 1 = 0

Decision Tree Hypothesis Space

The hypothesis space in decision tree learning is the set of all possible decision trees. Decision tree learning algorithms search through this space to find a decision tree that correctly classifies the training data. In the resulting decision tree, each internal node performs a test on an attribute x and branches according to the result, while leaf nodes specify the class h(x). Thus classification is very easy using the decision tree, splitting based on these tests. The space of all decision trees is a complete space of finite discrete-valued functions. Consequently, every Boolean function can be represented by a decision tree, and there are usually several different decision trees for a single Boolean function. Let us look at the use of decision trees for representing Boolean functions in some examples (Ex. 3.1, p. 77, Machine Learning, Tom Mitchell). In the following examples we start by testing variable A. However, in actual decision tree building (for learning) this may not be a good idea, and we should choose the variable with the highest information gain for the root.
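The hand calculations above can be verified with a short script (a sketch; entropy is measured in bits, and the training set is the six-instance example above):

```python
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(examples, attr, label="cls"):
    """Expected reduction in entropy from splitting on attr."""
    total = entropy([e[label] for e in examples])
    n = len(examples)
    for v in set(e[attr] for e in examples):
        subset = [e[label] for e in examples if e[attr] == v]
        total -= len(subset) / n * entropy(subset)
    return total

# Training set from Exercise 3.2 (Mitchell).
S = [
    {"a1": "T", "a2": "T", "cls": "+"},
    {"a1": "T", "a2": "T", "cls": "+"},
    {"a1": "T", "a2": "F", "cls": "-"},
    {"a1": "F", "a2": "F", "cls": "+"},
    {"a1": "F", "a2": "T", "cls": "-"},
    {"a1": "F", "a2": "T", "cls": "-"},
]

print(entropy([e["cls"] for e in S]))   # 1.0
print(information_gain(S, "a2"))        # ≈ 0.0 (up to floating-point rounding)
```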
a) A ∧ ¬B

The tree first tests A; if A = F the output is 0, and if A = T it tests B, outputting 1 when B = F and 0 when B = T. (T and F denote the attribute values on the branches, while 1 and 0 are used for the outputs at the leaves, to distinguish between attribute values and output results.)

b) A ∨ (B ∧ C)

The tree tests A; if A = T the output is 1, and if A = F it tests B and then C. In the second branch (A = F) we must test all the remaining variables, which is not desirable. Entropy tries to address this kind of problem.
c) A XOR B

The tree tests A, and in both branches it must then test B. For the XOR function entropy does not tell us anything: no single attribute provides any information gain, so we have to test every variable. Also, we cannot find any nice dependencies between the variables. Another example of such a function is the parity function.

Parity Function

With the parity function we can never conclude that the result is 1 or 0 until we have tested all the attributes. When representing the parity function with a decision tree we have to split on every variable and keep building the tree all the way down. Learners usually like to decompose the problem, and that is why these functions are difficult for decision trees and Bayesian classifiers to learn. Therefore, the parity function is often used for testing different learners.

d) (A ∧ B) ∨ (C ∧ D)

Unlike parity and XOR, we have a break here (we do not need to test all the variables). The tree tests A and B, and in the branches where A ∧ B fails it tests C and D. The marked subtrees, which test C and D, are identical. To address this problem, that is, to avoid replication, we could use a different representation, such as a graph. We can obtain a more compact representation by using a Decision Graph.
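The claim that entropy is no help for parity can be checked directly: at the root of a tree for the 3-bit parity function, every attribute has zero information gain. A small sketch (the attribute names A, B, C are illustrative):

```python
from itertools import product
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

# Truth table of the 3-bit parity function.
rows = [(a, b, c, (a + b + c) % 2) for a, b, c in product([0, 1], repeat=3)]
labels = [r[3] for r in rows]

for i, name in enumerate("ABC"):
    g = entropy(labels)
    for v in (0, 1):
        subset = [r[3] for r in rows if r[i] == v]
        g -= len(subset) / len(rows) * entropy(subset)
    print(name, g)   # gain is 0.0 for every attribute
```

Since every root split looks equally useless, a greedy gain-based learner such as ID3 has no guidance and the tree must grow to full depth.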
Decision Graph

Decision graphs, such as the one depicted in Figure 2, are generalizations of decision trees, having decision nodes and leaves. The feature that distinguishes decision graphs from decision trees is that decision graphs may also contain joins. A join is represented by two nodes having a common child; it specifies that two subsets have some common properties and hence can be considered as one subset. The manner in which objects are categorized by decision graphs is the same as for decision trees. Each decision tree and decision graph defines a categorization (i.e., some partitioning of the object space into disjoint categories). The set of decision functions representable by graphs is exactly the same as the set representable by trees. However, the sets of categorizations which can enter into the definition of a decision function are different. For example, the categorizations for (A ∧ B) ∨ (C ∧ D) given in Figure 2.a and Figure 2.b are different, since the decision tree partitions the object space into 7 categories, while the decision graph partitions it into 2 categories [1].

Figure 2: a) A decision tree for (A ∧ B) ∨ (C ∧ D); b) a decision graph for (A ∧ B) ∨ (C ∧ D) [1].

Algebraic Decision Diagram

Another variant of decision trees are Algebraic Decision Diagrams (ADDs), which are used for representing target functions. ADDs are a compact, efficiently manipulable data structure for representing real-valued functions over Boolean variables (B^n → R). They generalize a tree-structured representation by allowing nodes to have multiple parents, leading to the recombination of isomorphic subgraphs and hence to a possible reduction in the representation size [2].
Inductive Bias

It was mentioned before that the hypothesis space of ID3 is the power set of the instances X. This means that there is no restrictive bias for decision tree learning. A restrictive bias, also called a language bias, limits the set of available hypotheses, and we have no such bias for ID3. On the other hand, there is a preference bias, or search bias, for decision tree learning. The preference bias orders the available hypotheses, and in ID3 there is a preference for:
1) Shorter trees.
2) Trees with high information gain attributes near the root.
This bias often goes by the name of Occam's razor[2].

Occam's Razor

Occam's razor says: prefer the simplest hypothesis that fits the data. In other words, the inductive bias of ID3 says that if there is a shorter tree, it should be chosen. So the preference is for shorter trees, but this is not always guaranteed; in fact, ID3 is a greedy algorithm, so it cannot always find the shortest tree. Another question that arises regarding Occam's razor is: what is a short tree? A tree with fewer nodes, or a shallow tree? The general idea behind Occam's razor is that we want to avoid splitting as much as possible; therefore, shallower trees are better for us. There are a number of arguments for and against Occam's razor. There has been much research on this topic, but we cannot yet clearly answer the question: what is bias, really?

Avoiding Overfitting the Data

As discussed before, in ID3 the tree is built just deeply enough to perfectly classify the training data. This strategy can produce trees that overfit the training examples. Consider the following concept C:

    C(X1) = 1
    C(X2) = 1
    C(X3) = 0
    C(X4) = 0
    C(X5) = 0

Now suppose that the training set contains only the instances X3, X4 and X5. Then the learner would classify X1 and X2 as 0, i.e. H(X1) = 0 and H(X2) = 0, which is not correct. Here the learner has overfitted the data.
Given a hypothesis space H and a hypothesis h ∈ H, let error_train(h) be the error of the hypothesis h over the training examples and error_D(h) be the error of the hypothesis over the entire distribution D of the data. Then h is said to overfit the training data if there exists some other hypothesis h' ∈ H such that error_train(h) < error_train(h') and error_D(h) > error_D(h').

[2] William of Occam was one of the first philosophers to discuss the question of preferring short hypotheses, around the year 1320.
Overfitting usually happens when there are random errors or noise in the training examples, or when there is only a small number of training examples. In the latter case we usually have coincidental regularities: some attribute happens to partition the training examples very well but is irrelevant to the real target function. To see the effect of an error in the training data on the learned decision tree, consider the following example. Suppose that in the entire distribution of the data we have:

    If SUNNY = T then PLAYTENNIS = T

But the training set contains a noisy instance in which SUNNY = T and yet PLAYTENNIS = F (along with values for the other attributes X1, X2, ...). So in reality the tree should not split any further in the branch SUNNY = T (Figure 3.a), but with the above instance in the training set the learned tree has to keep splitting on attributes (X1, ...) below SUNNY = T (Figure 3.b). While the resulting tree classifies the training examples well, it fails to classify the real data. Obviously, we have overfitting here.

Figure 3: a) The decision tree for the real target function; b) the learned decision tree.

In general, it is very hard to avoid overfitting, because usually all we have is the training data. One way is to turn to statistics: we can do a statistical analysis and use the gathered information to answer the question "how representative is the sample?" If the sample is not representative enough, we will have overfitting. For example, in this way statisticians may calculate the error for a given concept and then give a probability measure over all concepts. However, if we have some assumptions about the real target function, it is much easier to find out when overfitting occurs. It is worth mentioning that there is a close relation between inductive bias and overfitting, and having a good inductive bias can help us avoid overfitting.
Training and Validation

Among the different approaches that have been proposed to avoid overfitting, the most common one is the training and validation set approach. In this approach the available training data is separated into two sets: the training set and the validation set. The training set is then used to learn the hypothesis, and the validation set is used to evaluate the accuracy of the resulting hypothesis. In this way we provide a check against overfitting, and we use the validation set to prune the tree whenever it overfits the data. The size of the validation set is also of significant importance, because it should be large enough to provide a good sample of the instances. Although there are statistical formulas for computing the required size of the validation set, one common heuristic is to use 1/3 of the available examples for the validation set and the remaining 2/3 for training. There are different approaches for pruning a tree using the validation set. Two of those approaches are:
1) Reduced Error Pruning
2) Rule Post-Pruning

Reduced Error Pruning

In this approach the data is split into training and validation sets, which are then used in pruning. Pruning a decision node means removing its subtree, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node. In Reduced Error Pruning each node is considered for pruning. The pruning algorithm is as follows:

Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it).
2. Greedily remove the node that most improves the accuracy on the validation set.

In this way each pruning is tested against the validation set, and the smallest accurate version of the tree is produced. To address the problem of limited data, other techniques with the same underlying idea have been proposed; those techniques usually partition the available data several times in different ways and then average the results.
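The pruning loop above can be sketched as follows. The Node class, the toy tree (with a spurious split on a hypothetical attribute x1 below sunny = "T"), and the validation set are all illustrative assumptions, not part of the lecture:

```python
class Node:
    """A decision tree node; a leaf has label set, an internal node tests attr."""
    def __init__(self, attr=None, branches=None, label=None, majority=None):
        self.attr = attr                  # attribute tested at this node
        self.branches = branches or {}    # attribute value -> child Node
        self.label = label                # class label if this is a leaf
        self.majority = majority if majority is not None else label

def classify(node, x):
    while node.label is None:
        node = node.branches[x[node.attr]]
    return node.label

def accuracy(root, data):
    return sum(classify(root, x) == y for x, y in data) / len(data)

def decision_nodes(node):
    if node.label is not None:
        return []
    return [node] + [d for c in node.branches.values() for d in decision_nodes(c)]

def reduced_error_prune(root, validation):
    """Greedily turn decision nodes into majority-class leaves for as long as
    accuracy on the validation set does not drop."""
    while True:
        base = accuracy(root, validation)
        candidates = []
        for node in decision_nodes(root):
            node.label = node.majority        # temporarily prune this node...
            candidates.append((accuracy(root, validation), node))
            node.label = None                 # ...and undo
        if not candidates:
            return
        best_acc, best = max(candidates, key=lambda t: t[0])
        if best_acc < base:
            return                            # further pruning is harmful
        best.label, best.branches = best.majority, {}

# Toy tree: a noisy training example forced a split on x1 below sunny = "T".
spurious = Node(attr="x1", majority="Yes",
                branches={"T": Node(label="No"), "F": Node(label="Yes")})
root = Node(attr="sunny", majority="No",
            branches={"T": spurious, "F": Node(label="No")})
validation = [({"sunny": "T", "x1": "T"}, "Yes"),
              ({"sunny": "T", "x1": "F"}, "Yes"),
              ({"sunny": "F", "x1": "T"}, "No"),
              ({"sunny": "F", "x1": "F"}, "No")]

print(accuracy(root, validation))   # 0.75: the spurious split misclassifies one case
reduced_error_prune(root, validation)
print(accuracy(root, validation))   # 1.0: the x1 split has been pruned away
```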
Rule Post-Pruning

Another approach for pruning the learned tree is Rule Post-Pruning. In this approach we convert the decision tree into an equivalent set of logical rules and then prune each rule. Pruning in the logical framework is equivalent to generalizing the rules. The generalized rules are then sorted according to their estimated accuracy and applied in that order. Here the validation set is used to estimate the accuracy of the resulting rules.
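A minimal sketch of the rule-pruning step, assuming the tree for A ∨ B has already been converted into one rule per root-to-leaf path (the helper names and the validation data are illustrative):

```python
def matches(preconds, x):
    """True if example x satisfies every (attribute, value) precondition."""
    return all(x[a] == v for a, v in preconds)

def rule_accuracy(preconds, label, data):
    """Accuracy of the rule 'IF preconds THEN label' over the examples it covers."""
    covered = [(x, y) for x, y in data if matches(preconds, x)]
    if not covered:
        return 0.0
    return sum(y == label for x, y in covered) / len(covered)

def prune_rule(preconds, label, data):
    """Greedily drop preconditions whose removal does not lower accuracy.
    Dropping a precondition generalizes the rule (it covers more examples)."""
    preconds = list(preconds)
    improved = True
    while improved:
        improved = False
        for p in list(preconds):
            trial = [q for q in preconds if q != p]
            if rule_accuracy(trial, label, data) >= rule_accuracy(preconds, label, data):
                preconds = trial
                improved = True
                break
    return preconds

# Rules read off a tree for A OR B (paths: A=T -> 1; A=F,B=T -> 1; A=F,B=F -> 0).
rules = [
    ([("A", "T")], 1),
    ([("A", "F"), ("B", "T")], 1),
    ([("A", "F"), ("B", "F")], 0),
]
validation = [({"A": "T", "B": "T"}, 1), ({"A": "T", "B": "F"}, 1),
              ({"A": "F", "B": "T"}, 1), ({"A": "F", "B": "F"}, 0)]

pruned = [(prune_rule(pre, label, validation), label) for pre, label in rules]
print(pruned)   # the A=F, B=T path generalizes to the rule IF B=T THEN 1
```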
Other Issues in Decision Tree Learning

There are still more topics that have been addressed in decision tree learning. Some of these topics are:

k-ary Attribute Values
Problem: When an attribute has many values, it will be selected because of its high information gain.
Solution: Use the gain ratio.

Real Attribute Values
Problem: The attributes tested in the decision nodes must be discrete-valued.
Solution: Define new discrete-valued attributes by partitioning the real attribute values into discrete intervals.

Attribute Cost
Problem: Sometimes different attributes have different costs. How can we find the tree with the lowest cost?
Solution: Include the cost of each attribute in the Gain function.

Missing Values
Problem: In some examples the value of an attribute is missing.
Solution: There are different strategies using the training examples and sorting through the tree (e.g., if node n tests A and the value of A is missing, assign the most common value of A among the training examples at node n).

Other Splitting Criteria
We can use different measures for selecting attributes. For example, we can measure not only the entropy but also the size of a set when selecting an attribute.

References
[1] Jonathan J. Oliver; Decision Graphs — An Extension of Decision Trees, 1992.
[2] Robert St-Aubin, Jesse Hoey and Craig Boutilier; APRICODD: Approximate Policy Construction using Decision Diagrams.
[3] Tom M. Mitchell; Machine Learning, 1997.
[4] Entropy on the World Wide Web, http://www.math.psu.edu/gunesch/entropy.html