Decision Tree Learning CSE 6003 Machine Learning and Reasoning
Outline
- What is Decision Tree Learning?
- What is a Decision Tree?
- Decision Tree Examples
- Decision Trees to Rules
- Decision Tree Construction
- Decision Tree Algorithms
- Decision Tree Overfitting
Paradigms of Machine Learning
- Neural Networks
- Genetic Algorithms
- Decision Trees
- Bayesian Learning
The decision tree technique is one of these machine learning paradigms.
Learning Types
- Supervised Learning: Classification (Decision Tree Learning, Bayesian Learning, Nearest Neighbour, Neural Networks, Support Vector Machines), Regression
- Unsupervised Learning: Clustering, Association Analysis, Sequence Analysis, Summarization, Descriptive Statistics, Outlier Analysis, Scoring
Decision Tree Learning belongs to the supervised learning type.
Decision Tree Learning
Decision Tree Learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. It is robust to noisy data, capable of learning disjunctive expressions, and one of the most widely used methods for inductive inference.
[Figure: example decision tree for a hiring decision, with tests such as Salary < 1M, Job = teacher, and Age < 30, and leaves labelled Good or Bad.]
Decision Tree Representation
Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance.
- Each node in the tree specifies a test of some attribute of the instance.
- Each branch descending from a node corresponds to one of the possible values of that attribute.
Decision Trees
A decision tree is a tree where:
- internal nodes are simple decision rules on one or more attributes,
- each branch corresponds to an attribute value,
- leaf nodes are predicted class labels.
Decision trees are used for deciding between several courses of action.

Training data (buys_computer):
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Learned tree: the root tests age; for <=30 the tree tests student (no -> no, yes -> yes); for 31..40 it predicts yes; for >40 it tests credit_rating (fair -> yes, excellent -> no).
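To make the representation concrete, here is a minimal Python sketch (not from the lecture) of this learned tree; the nested-dict encoding and the classify helper are illustrative choices, not a prescribed implementation:

    # The buys_computer tree from the slide, as nested dicts: internal nodes
    # test one attribute, branches map attribute values to subtrees or leaves.
    tree = {"attribute": "age", "branches": {
        "<=30":   {"attribute": "student",
                   "branches": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40":    {"attribute": "credit_rating",
                   "branches": {"fair": "yes", "excellent": "no"}},
    }}

    def classify(node, instance):
        # Sort the instance down the tree from the root to a leaf.
        while isinstance(node, dict):
            node = node["branches"][instance[node["attribute"]]]
        return node  # the leaf is the predicted class label

    print(classify(tree, {"age": "<=30", "student": "yes"}))  # -> yes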
Decision Tree Applications
Decision trees have been used for:
1. Classification: assigning each instance one of several class labels (class1, class2, ...).
2. Data reduction: attributes that never appear in the tree can be dropped. [Figure: a tree over the initial attribute set {A1, A2, A3, A4, A5, A6} tests only A4, A1, and A6, giving the reduced attribute set {A1, A4, A6}.]
Decision Tree Example
A credit card company receives thousands of applications for new cards. Each application contains information about an applicant: age, marital status, annual salary, outstanding debts, credit rating, etc.
Problem: to decide whether an application should be approved, i.e., to classify applications into two categories, approved and not approved.
Decision Tree Example (Cont.) Approved or not
Decision Tree Example (Cont.) Decision nodes and leaf nodes (classes)
Decision Tree Example (Cont.) Construct a classification model from the data. Use the model to classify future loan applications into Yes (approved) and No (not approved). What is the class for the following case/instance?
Use the Decision Tree (Cont.)
The class for this instance is No. Once the tree is trained, a new instance is classified by starting at the root and following the path dictated by the test results for this instance.
Decision Tree Example
Problem: decide whether to wait for a table at a restaurant. Attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Decision Tree Example (Cont.) Classification of examples is positive (T) or negative (F)
Decision Tree Example (Cont.) Here is the true tree for deciding whether to wait
Decision Trees to Rules
Decision Trees to Rules
It is easy to derive a rule set from a decision tree: write one rule for each path from the root to a leaf, so the tree can be represented as if-then rules.
Example: IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No
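As an illustration, a minimal sketch of this conversion, assuming the nested-dict tree encoding used earlier (the PlayTennis tree below is the standard one from the example; the function name is hypothetical):

    def tree_to_rules(node, conditions=()):
        # Leaf: the accumulated conditions form the body of one rule.
        if not isinstance(node, dict):
            body = " AND ".join(f"({a} = {v})" for a, v in conditions)
            return [f"IF {body} THEN PlayTennis = {node}"]
        rules = []
        for value, child in node["branches"].items():
            rules += tree_to_rules(child,
                                   conditions + ((node["attribute"], value),))
        return rules

    tree = {"attribute": "Outlook", "branches": {
        "Sunny":    {"attribute": "Humidity",
                     "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"attribute": "Wind",
                     "branches": {"Strong": "No", "Weak": "Yes"}},
    }}
    for rule in tree_to_rules(tree):
        print(rule)
    # e.g. IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No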
Decision Tree Construction
Decision Tree
- Each node tests some attribute of the instance.
- Instances are represented by attribute-value pairs.
- Attributes with high information gain are placed close to the root; the root is the best attribute for classification.
Which attribute is the best classifier? The answer is based on information gain.
Entropy
Entropy specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S.
In general, for m class labels:
Entropy(S) = -\sum_{i=1}^{m} p_i \log_2 p_i
Example for two class labels:
Entropy(S) = -p_1 \log_2 p_1 - p_2 \log_2 p_2
Entropy
[Figure: entropy of a two-class set as a function of the class proportion, peaking at 1 bit when both classes are equally likely.]
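A small sketch of this formula in Python (the counts-based interface and function name are illustrative):

    from math import log2

    def entropy(class_counts):
        # Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions.
        total = sum(class_counts)
        return -sum((c / total) * log2(c / total)
                    for c in class_counts if c > 0)

    print(entropy([9, 5]))   # 9 positive / 5 negative -> ~0.940 bits
    print(entropy([7, 7]))   # maximally mixed -> 1.0 bit
    print(entropy([14, 0]))  # pure set -> 0.0 bits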
Information Gain
Measures the expected reduction in entropy given the value of some attribute A:
Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)
Values(A): set of all possible values for attribute A
S_v: subset of S for which attribute A has value v
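A minimal sketch of this definition over a list of attribute-value records (the record layout, function name, and toy data are assumptions for illustration):

    from math import log2

    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    def information_gain(examples, attribute, target):
        # Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)
        def class_counts(rows):
            labels = [r[target] for r in rows]
            return [labels.count(l) for l in set(labels)]
        gain = entropy(class_counts(examples))
        for v in {r[attribute] for r in examples}:
            subset = [r for r in examples if r[attribute] == v]
            gain -= len(subset) / len(examples) * entropy(class_counts(subset))
        return gain

    toy = [{"Wind": "Weak", "Play": "Yes"}, {"Wind": "Weak", "Play": "No"},
           {"Wind": "Strong", "Play": "No"}, {"Wind": "Strong", "Play": "No"}]
    print(information_gain(toy, "Wind", "Play"))  # ~0.311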
Decision Tree Example Which attribute first?
Decision Tree Example (Cont.)
Decision Tree Example (Cont.)
Entropy(S) = -(9/14) \log_2 (9/14) - (5/14) \log_2 (5/14) = 0.940
Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
For example:
Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_{Weak}) - (6/14) Entropy(S_{Strong}) = 0.940 - (8/14)(0.811) - (6/14)(1.0) = 0.048
Gain(S, Humidity) = Entropy(S) - (7/14) Entropy(S_{High}) - (7/14) Entropy(S_{Normal}) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151
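These values can be checked numerically; a sketch, assuming the standard per-branch PlayTennis class counts (Weak = [6+, 2-], Strong = [3+, 3-], High = [3+, 4-], Normal = [6+, 1-]):

    from math import log2

    def H(pos, neg):
        # Binary entropy of a (positive, negative) count pair.
        total = pos + neg
        return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

    S = H(9, 5)                                          # Entropy(S) ~ 0.940
    gain_wind = S - (8/14) * H(6, 2) - (6/14) * H(3, 3)
    gain_humidity = S - (7/14) * H(3, 4) - (7/14) * H(6, 1)
    print(round(S, 3), round(gain_wind, 3), round(gain_humidity, 3))
    # 0.94 0.048 0.152 (the slide's rounded intermediate values give 0.151)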
Decision Tree Example (Cont.)
Decision Tree Construction
Which attribute is next? Outlook is placed at the root; the Overcast branch is a pure Yes leaf, while the Sunny and Rain branches still need a test. For the Sunny subset (entropy 0.970):
Gain(S_{Sunny}, Wind) = 0.970 - (2/5)(1.0) - (3/5)(0.918) = 0.019
Gain(S_{Sunny}, Humidity) = 0.970 - (3/5)(0.0) - (2/5)(0.0) = 0.970
Gain(S_{Sunny}, Temperature) = 0.970 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.570
So Humidity is chosen under the Sunny branch.
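Again as a numeric check (a sketch; the branch counts for the five Sunny days, e.g. Humidity High = [0+, 3-], are the standard ones and assumed here):

    from math import log2

    def H(counts):
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c)

    s = H([2, 3])                                # S_Sunny = [2+, 3-] -> ~0.970
    g_hum  = s - (3/5) * H([0, 3]) - (2/5) * H([2, 0])
    g_wind = s - (2/5) * H([1, 1]) - (3/5) * H([1, 2])
    g_temp = s - (2/5) * H([0, 2]) - (2/5) * H([1, 1]) - (1/5) * H([1, 0])
    print(round(g_hum, 3), round(g_wind, 3), round(g_temp, 3))
    # 0.971 0.02 0.571 (the slide rounds to 0.970, 0.019, 0.570)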
Decision Tree Example (Cont.)
[Figure: the final tree. Outlook = Overcast -> Yes [D3,D7,D12,D13]; Outlook = Sunny -> Humidity: High -> No [D1,D2,D8], Normal -> Yes [D9,D11]; Outlook = Rain -> Wind: Weak -> Yes [D4,D5,D10], Strong -> No [D6,D14].]
Another Example
At the weekend you can:
- go shopping,
- watch a movie,
- play tennis, or
- just stay in.
What you do depends on three things:
- the weather (windy, rainy or sunny);
- how much money you have (rich or poor);
- whether your parents are visiting.
Another Example (Cont.)
Another Example
height  hair   eyes   class
short   blond  blue   +
tall    blond  brown  -
tall    red    blue   +
short   dark   blue   -
tall    dark   blue   -
tall    blond  blue   +
tall    dark   brown  -
short   blond  brown  -

I(3+, 5-) = -3/8 \log_2 3/8 - 5/8 \log_2 5/8 = 0.954434003

Height: short (1+, 2-), tall (2+, 3-)
Gain(height) = 0.954434003 - 3/8 \cdot I(1+, 2-) - 5/8 \cdot I(2+, 3-)
= 0.954434003 - 3/8(-1/3 \log_2 1/3 - 2/3 \log_2 2/3) - 5/8(-2/5 \log_2 2/5 - 3/5 \log_2 3/5) = 0.003228944

Hair: blond (2+, 2-), red (1+, 0-), dark (0+, 3-)
Gain(hair) = 0.954434003 - 4/8(-2/4 \log_2 2/4 - 2/4 \log_2 2/4) - 1/8 \cdot 0 - 3/8 \cdot 0 = 0.954434003 - 0.5 = 0.454434003

Eyes: blue (3+, 2-), brown (0+, 3-)
Gain(eyes) = 0.954434003 - 5/8(-3/5 \log_2 3/5 - 2/5 \log_2 2/5) - 3/8 \cdot 0 = 0.954434003 - 0.606844122 = 0.347589881

Hair is the best attribute.
Another Example (Cont.)
Splitting on hair:
- dark: {short, dark, blue: -}, {tall, dark, blue: -}, {tall, dark, brown: -}, all negative
- red: {tall, red, blue: +}, positive
- blond: {short, blond, blue: +}, {tall, blond, brown: -}, {tall, blond, blue: +}, {short, blond, brown: -}, mixed, so this branch needs a further test (eyes separates it perfectly); the full recursion is sketched below.
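The whole procedure can be summarized as a recursive algorithm. Below is a minimal ID3-style sketch (an illustrative implementation, not the lecture's code; the helper names and nested-dict tree encoding are assumptions) run on the eight instances above:

    from math import log2

    def entropy(rows, target):
        labels = [r[target] for r in rows]
        return -sum((labels.count(l) / len(rows))
                    * log2(labels.count(l) / len(rows)) for l in set(labels))

    def gain(rows, attr, target):
        # Entropy reduction from splitting rows on attr.
        rem = sum(len(sub) / len(rows) * entropy(sub, target)
                  for v in {r[attr] for r in rows}
                  for sub in [[r for r in rows if r[attr] == v]])
        return entropy(rows, target) - rem

    def id3(rows, attributes, target):
        labels = {r[target] for r in rows}
        if len(labels) == 1:                 # pure subset -> leaf
            return labels.pop()
        if not attributes:                   # no tests left -> majority leaf
            vals = [r[target] for r in rows]
            return max(set(vals), key=vals.count)
        best = max(attributes, key=lambda a: gain(rows, a, target))
        rest = [a for a in attributes if a != best]
        return {"attribute": best,
                "branches": {v: id3([r for r in rows if r[best] == v],
                                    rest, target)
                             for v in {r[best] for r in rows}}}

    data = [{"height": h, "hair": c, "eyes": e, "class": k} for h, c, e, k in [
        ("short", "blond", "blue", "+"), ("tall", "blond", "brown", "-"),
        ("tall", "red", "blue", "+"),    ("short", "dark", "blue", "-"),
        ("tall", "dark", "blue", "-"),   ("tall", "blond", "blue", "+"),
        ("tall", "dark", "brown", "-"),  ("short", "blond", "brown", "-")]]
    print(id3(data, ["height", "hair", "eyes"], "class"))
    # root splits on hair; the blond branch then splits on eyes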
Decision Tree Algorithms
Decision Tree Algorithms
- ID3, Quinlan (1981): tries to reduce the expected number of comparisons.
- C4.5, Quinlan (1993): an extension of ID3; just starting to be used in data mining applications; also used for rule induction.
- CART, Breiman, Friedman, Olshen, and Stone (1984): Classification and Regression Trees.
- CHAID, Kass (1980): the oldest decision tree algorithm; well established in the database marketing industry.
- QUEST, Loh and Shih (1997).
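For practical use, CART-style trees are available off the shelf; for instance, scikit-learn's DecisionTreeClassifier implements an optimized CART variant. A small usage sketch, assuming scikit-learn is installed (the numeric encoding of the buys_computer attributes is an illustrative choice):

    from sklearn.tree import DecisionTreeClassifier, export_text

    # First 8 rows of the buys_computer table, encoded numerically:
    # age: <=30=0, 31..40=1, >40=2; income: low=0, medium=1, high=2;
    # student: no=0, yes=1; credit_rating: fair=0, excellent=1
    X = [[0, 2, 0, 0], [0, 2, 0, 1], [1, 2, 0, 0], [2, 1, 0, 0],
         [2, 0, 1, 0], [2, 0, 1, 1], [1, 0, 1, 1], [0, 1, 0, 0]]
    y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]

    clf = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits
    clf.fit(X, y)
    print(export_text(clf, feature_names=["age", "income", "student", "credit"]))
    print(clf.predict([[1, 2, 0, 0]]))  # a 31..40, high-income, non-student case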
Frequency of Usage
Complexity of Tree Induction
Assume m attributes and n training instances, with tree depth O(log n).
Building a tree: O(m n log n)
Total cost: O(m n log n)
Decision Tree Advantages and Disadvantages
Positives (+)
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Rule extraction from trees (can be re-represented as if-then-else rules)
+ Easy to implement
+ Can handle a large number of features
+ Does not require any prior knowledge of the data distribution
Negatives (-)
- Cannot handle complicated relationships between features
- Problems with lots of missing data
- Output attribute must be categorical
- Limited to one output attribute
- Difficulty in designing an optimal decision tree
- Overlap, especially when the number of classes is large