Topic 4: Automatic Knowledge Acquisition, PART II

Contents
5.1 The Bottleneck of Knowledge Acquisition
5.2 Inductive Learning: Decision Trees
5.3 Converting Decision Trees into Rules
5.4 Generating Decision Trees: Information Gain

Deriving Decision Trees from Case Data
5. Automatic Knowledge Acquisition

Deriving Decision Trees: Random-Tree

There are various ways to derive decision trees from case data. A simple method, called Random-Tree, is described on the next few slides.
We assume that all given attributes are discrete (not continuous).
We assume that the expert classification is binary (yes or no, true or false, treat or don't-treat, etc.).

Let's take a case study: will someone play tennis, given the weather?

Case | Outlook | Temp. | Humidity | Wind | Play?
(a table of nine cases; most cell values were lost from the source; cases 3 and 6 have Outlook = overcast)
To develop a decision tree with Random-Tree, we select an attribute at random, e.g., Humidity.
We make the root node of the tree with this attribute:

Humidity?

We then list for each branch the cases which fit that branch:

Humidity? with one branch covering cases 5, 6, 8, 9 and the other covering cases 1, 2, 3, 4, 7
(the attribute values labelling the branches were lost from the source)
If all of the cases of a branch share the same conclusion, we make this branch a leaf, and just show the decision. This is not the case here, since the decisions are mixed on both branches.

For non-terminal nodes, we then select a second attribute at random, and create the branch:

Humidity? at the root, with a Wind? node under each branch; the Wind? nodes split cases 5, 6, 8, 9 into {5, 6} and {8, 9}, and cases 1, 2, 3, 4, 7 into {1, 3, 4, 7} and {2}
And then repeat the previous steps until finished:

(final tree, shown as a figure in the source: Humidity? at the root, Wind? nodes beneath it, and an Outlook? node, with an overcast branch, further down)

Trees will differ depending on the order in which attributes are used. Trees may be smaller or larger (number of nodes, depth) than others. Later, we will look at a means to produce compact trees (ID3-Tree).
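The Random-Tree procedure just described (make a leaf when all cases on a branch share the same conclusion, otherwise split on a randomly chosen attribute and recurse) can be sketched in Python. The toy cases and the tree encoding below are illustrative assumptions, not the (partly lost) tennis table from the slides:

```python
import random

def random_tree(cases, attributes, target="Play?"):
    """Derive a decision tree from case data, choosing split
    attributes at random. A tree is either a bare decision (a leaf)
    or a tuple (attribute, {value: subtree})."""
    decisions = {c[target] for c in cases}
    if len(decisions) == 1:               # all cases agree: make a leaf
        return decisions.pop()
    attr = random.choice(attributes)      # Random-Tree: no clever selection
    rest = [a for a in attributes if a != attr]
    return (attr, {v: random_tree([c for c in cases if c[attr] == v],
                                  rest, target)
                   for v in {c[attr] for c in cases}})

def classify(tree, case):
    """Walk the tree from the root to a leaf for one case."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[case[attr]]
    return tree
```

Because the split attribute is random, repeated runs can produce differently shaped (but equally consistent) trees, which is exactly the point made above.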
Other Models: The Restaurant Problem

Will people wait for a table to be available? Assume that an expert provided the following decision tree (figure not reproduced). (But... is Sociology an exact science? Or is Medicine?...)

Data from experts (their observations and diagnostics) guide us.
Perfection is unreachable (even for experts; see X4 and the previous tree).
Goal: an equal (or better) successful prediction rate on unseen instances, compared with the human expert.
The best tree for this data might be (figure not reproduced):
Smaller than the expert's tree (this is an advantage: Occam's Razor).
Both trees agree on the root and two branches.
The best (the only?) measure of quality is the prediction rate on unseen instances.
Later, we will use Information Theory to obtain trees as good as this one.

For another problem, the following tree was produced from a set of training data (figure not reproduced):
What's wrong with this tree? Experts notice it. Non-experts do not, nor would a program.
The problem is that the training data had no cases of diabetic women on their first pregnancy who were renally insufficient. The tree DID cover all observed cases, but not all possible cases! The wrong recommendation would be given in these cases.

Deriving Rules from Decision Trees
RULE EXTRACTION

Traversing the tree from root to leaves produces one rule per path (the rules themselves were not reproduced in the source).
We focus on rules for the class NO, because there are fewer of them.
The other class is defined using negation by default (as in Prolog).

Simplification of a Rule
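Extraction of one rule per root-to-leaf path can be sketched as follows; the (attribute, branches) tree encoding and the rule representation (a list of attribute/value conditions plus a conclusion) are my own illustrative choices:

```python
def extract_rules(tree, conditions=()):
    """Traverse the tree from the root to the leaves, emitting one
    rule (conditions, conclusion) per path."""
    if not isinstance(tree, tuple):       # a leaf ends a finished rule
        return [(list(conditions), tree)]
    attr, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += extract_rules(subtree, conditions + ((attr, value),))
    return rules

def rules_for_class(rules, cls):
    """Keep only the rules for one class (e.g. NO); the other class is
    then covered by negation by default, as in Prolog."""
    return [r for r in rules if r[1] == cls]
```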
RULE SIMPLIFICATION

Sometimes, we can delete conditions in our rules without affecting the results produced by the rules (the original and simplified rules were not reproduced in the source):
"pregnant" is implied by this being the patient's first pregnancy, so it can be dropped.
Dropping "renal insufficiency" actually improves the working of the rules, because those with renal insufficiency who are diabetic and on their first pregnancy should also not be treated.

A rule can be simplified by dropping some of its conditions, where dropping them will not affect the decision the rule makes. There are two main ways to drop conditions:
1. The logical approach: where one condition is logically implied by another, the implied condition can be dropped:
   pregnant & first-pregnancy: BUT first-pregnancy implies pregnant!
   Age > 23 & Age > 42: BUT Age > 42 implies Age > 23!
2. The statistical approach: where one condition can be dropped without changing the decisions made by the rule over a set of data, drop it. OR BETTER: when dropping the condition leaves unchanged OR IMPROVES the decisions made, drop it.
RULE SIMPLIFICATION: the logical approach

Algorithm:
For each rule:
  For each condition:
    If another condition of this rule logically implies this one, delete this one.

Logical implication can be derived from the training set: a condition X is implied by a condition Y if X is true whenever Y is true.
E.g., Age > 23 is true whenever Age > 52 is true.
E.g., pregnant is true whenever first pregnancy is true.

RULE SIMPLIFICATION: the statistical approach

We need a test data set: a set of data, with the expert's classification, which was NOT used to derive the set of rules. Thus, two sets of data: one to derive the rules, another to simplify them.

To test the precision of a rule set:
1. Set SCORE to 0.
2. For each case in the test set, apply the rules to the case data to produce a conclusion. If the estimated conclusion is the same as the expert's, increment SCORE.
3. PRECISION = SCORE / number of cases.
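Both ingredients of this slide (implication derived from the training set, and the precision score) are easy to sketch. The rule format and the first-matching-rule applier with a default class are assumptions carried over from the extraction step:

```python
def implies(cases, y, x):
    """Condition X is implied by condition Y if X is true whenever Y
    is true over the training set. x and y are predicates on a case."""
    return all(x(c) for c in cases if y(c))

def apply_rules(rules, case, default="yes"):
    """The first rule whose conditions all hold fires; otherwise the
    default class applies (negation by default)."""
    for conditions, conclusion in rules:
        if all(case.get(a) == v for a, v in conditions):
            return conclusion
    return default

def precision(rules, test_cases, target="Play?", default="yes"):
    """SCORE / number of cases, as in the procedure above."""
    score = sum(1 for c in test_cases
                if apply_rules(rules, c, default) == c[target])
    return score / len(test_cases)
```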
RULE SIMPLIFICATION: the statistical approach

To simplify rules:
1. Test the precision of the rules.
2. For each rule:
   For each condition of the rule:
     Make a copy of the rule set with this condition deleted.
     Test the precision of the new rule set on the test data.
     If the precision is equal to or better than the original precision:
       Replace the original rule set with the copy.
       Replace the original precision with this one.
       Restart Step 2.
3. We get here when no more conditions can be deleted. The rules are maximally simple.

The decision tree from above made a mistake in that it does not deal with cases with renal insufficiency who are diabetic and on their first pregnancy. Assuming there are such cases in our test database, the statistical approach would lead to the renal insufficiency condition being dropped from our first rules, as doing so would improve precision.
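The greedy loop above can be sketched as follows (self-contained, with the same assumed rule format; each accepted deletion removes one condition, so the "Restart Step 2" loop terminates):

```python
def apply_rules(rules, case, default="yes"):
    # first matching rule fires; otherwise negation by default
    for conditions, conclusion in rules:
        if all(case.get(a) == v for a, v in conditions):
            return conclusion
    return default

def precision(rules, cases, target="class"):
    return sum(apply_rules(rules, c) == c[target] for c in cases) / len(cases)

def simplify(rules, test_cases, target="class"):
    """Drop conditions while the precision on the test set stays equal
    or improves; restart the scan after every accepted deletion."""
    best = precision(rules, test_cases, target)
    restart = True
    while restart:                         # 'Restart Step 2'
        restart = False
        for i, (conds, concl) in enumerate(rules):
            for j in range(len(conds)):
                trial = list(rules)        # copy with one condition gone
                trial[i] = (conds[:j] + conds[j + 1:], concl)
                p = precision(trial, test_cases, target)
                if p >= best:              # equal or better: keep it
                    rules, best, restart = trial, p, True
                    break
            if restart:
                break
    return rules
```

On a tiny hypothetical test set where first-preg implies pregnant, the "pregnant" condition is dropped and nothing more.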
Deleting Subsumed Rules

RULE DELETION

More complete training data would produce a better tree (figure not reproduced).
Is this tree better or worse? It is more complex (larger) and has redundancy. But for a doctor, it has better semantics. And for a machine, it has better predictive accuracy on the test set.
RULE DELETION

Let's look at the rules from this case (each concludes the class NO):

(no renal-insuff & pregnant & diabetes & first-preg) -> NO
(renal-insuff & no press & pregnant & diabetes & first-preg) -> NO
(renal-insuff & press) -> NO

Simplifying these rules using logic gives:

(no renal-insuff & diabetes & first-preg) -> NO
(renal-insuff & no press & diabetes & first-preg) -> NO
(renal-insuff & press) -> NO

Looking at predictive accuracy, we see that deleting "no renal-insuff" from the first rule does not change the predictions:

(diabetes & first-preg) -> NO
(renal-insuff & no press & diabetes & first-preg) -> NO
(renal-insuff & press) -> NO

Now, the second rule can't fire unless the first does. So, we can delete the second rule (its cases are a subset of the cases of the first).
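The subsumption argument (the second rule's conditions are a superset of the first's, with the same conclusion, so it never contributes a different decision) can be sketched as below. The condition names follow the slide; representing the negated conditions as ("attr", "no") pairs, and the NO conclusions, are my assumptions, since the exact rule text was lost from the source:

```python
def subsumes(general, specific):
    """A rule subsumes another when its conditions are a strict subset
    of the other's and the conclusions agree: the more specific rule
    is then redundant."""
    (g_conds, g_concl), (s_conds, s_concl) = general, specific
    return g_concl == s_concl and set(g_conds) < set(s_conds)

def drop_subsumed(rules):
    """Delete every rule subsumed by some other rule in the set."""
    return [r for i, r in enumerate(rules)
            if not any(i != j and subsumes(q, r)
                       for j, q in enumerate(rules))]
```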
RULE DELETION

As with deleting conditions from a rule, we can apply the same methods to deleting rules:
1. The logical approach: where one rule is logically implied by another, the implied rule can be dropped.
2. The statistical approach: where one rule can be dropped without worsening the predictive accuracy of the rule set as a whole, delete the rule.

Producing Optimal Decision Trees: ID3-Tree
Pseudo-code to generate a decision tree (not reproduced in the source).

Selecting the best attribute for the root:
Random-Tree selects an attribute at random for the root of the tree. This approach instead tries to select the best attribute for the root: we seek the attribute which most determines the expert's decision.
ID3 assesses each attribute in terms of how much it helps to make a decision. Using the attribute splits the cases into smaller subsets; the closer these subsets are to being purely one of the decision classes, the better. The formula used is called Information Gain.
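Since the slide's pseudo-code was not reproduced, here is a hedged Python sketch of the idea: identical in shape to Random-Tree, except that the split attribute is chosen by information gain rather than at random (the gain formula itself is developed on the following slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p))."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def gain(cases, attr, target):
    """Entropy before the split minus the weighted entropy after it."""
    before = entropy([c[target] for c in cases])
    after = sum(
        len(sub) / len(cases) * entropy(sub)
        for sub in ([c[target] for c in cases if c[attr] == v]
                    for v in {c[attr] for c in cases}))
    return before - after

def id3(cases, attributes, target="Play?"):
    """Like Random-Tree, but split on the attribute with the greatest
    information gain."""
    labels = [c[target] for c in cases]
    if len(set(labels)) == 1:             # pure subset: make a leaf
        return labels[0]
    attr = max(attributes, key=lambda a: gain(cases, a, target))
    rest = [a for a in attributes if a != attr]
    return (attr, {v: id3([c for c in cases if c[attr] == v], rest, target)
                   for v in {c[attr] for c in cases}})
```

On a toy data set where Outlook perfectly determines the decision and Wind is irrelevant, ID3 picks Outlook for the root, whereas Random-Tree would pick either attribute with equal probability.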
Information

Suppose we have a set of cases, and the expert judges whether to treat the patient or not. In 50% of the cases the expert proposes treatment, and in the other 50% proposes no treatment.
For a given new case, without looking at attributes, the probability of treatment is 50% (we have no information to favor treatment or not).
Now, assume we use an attribute to split our cases into two sets:
Set 1: treatment recommended in 75% of cases
Set 2: treatment recommended in 25% of cases
Now, in each subset, we have more information as to what decision to make -> Information Gain.

How to Calculate the Information Gain of an Attribute

Firstly, we calculate the information contained before the split. The formula we use is H (for entropy):
H(p,q) = -p * log2(p) - q * log2(q)
...where p is the probability of decision 1 and q is the probability of the reverse decision.
In our previous case, initially p = 50%, q = 50%:
H(p,q) = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 1 (no information)
Information = 1 - Entropy = 1 - H(p,q)
Alternative formulas (not reproduced in the source): both give equal values, and the values are always between 0 and 1.

Special Cases:
H(1/3, 2/3) = H(2/3, 1/3) = 0.92 bits of entropy
H(1/2, 1/2) = 1 bit of entropy (no information)
H(1, 0) = 0 bits (maximum information)
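These special cases can be checked numerically with a small binary-entropy function (a sketch; since the slide's two alternative formulas were not reproduced, only the H(p, q) form from the previous slide is used):

```python
from math import log2

def H(p, q):
    """Binary entropy -p*log2(p) - q*log2(q), taking 0*log2(0) = 0."""
    return -sum(x * log2(x) for x in (p, q) if x > 0)

# H(1/3, 2/3) == H(2/3, 1/3), about 0.92 bits of entropy
# H(1/2, 1/2) == 1 bit of entropy (no information)
# H(1, 0)     == 0 bits (maximum information)
```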
How to Calculate the Information Gain of an Attribute

Initially p = 50%, q = 50%:
H(p,q) = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 1 (no information)
Splitting the data, we get:
H1(p,q) = -0.75 * log2(0.75) - 0.25 * log2(0.25) = 0.81
H2(p,q) = -0.25 * log2(0.25) - 0.75 * log2(0.75) = 0.81
We derive the total entropy of the two subsets by weighting each entropy measure by the probability of its set. Let's assume the first set is 2/3 of the cases:
H_new(p,q) = 0.66 * H1(p,q) + 0.34 * H2(p,q) = 0.81

Given that the original entropy of the case data was 1.0, and the entropy of the cases divided by the attribute is 0.811, we have an information gain of 0.189.
The idea is: we look at each of the attributes in turn, and choose the attribute which gives us the greatest gain in information.
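The computation above, reproduced numerically (the 2/3 : 1/3 split of the cases between the two subsets is the slide's assumption):

```python
from math import log2

def H(p, q):
    # binary entropy, taking 0*log2(0) = 0
    return -sum(x * log2(x) for x in (p, q) if x > 0)

H_before = H(0.5, 0.5)             # 1.0: entropy before the split
H1 = H(0.75, 0.25)                 # subset 1: treat in 75% of cases
H2 = H(0.25, 0.75)                 # subset 2: treat in 25% of cases
H_after = (2/3) * H1 + (1/3) * H2  # weight each subset by its probability
info_gain = H_before - H_after     # about 0.189
```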
The Restaurant case revisited (worked through on four slides of figures, not reproduced in the source).