CLASS 4, APRIL 2018 CHAPTER 9 CLASSIFICATION AND REGRESSION TREES DAY 2 PREDICTING PRICES OF TOYOTA CARS ROGER BOHN APRIL 2018 Notes based on: Data Mining for Business Analytics, Shmueli et al., and Data Mining with Rattle and R, G. Williams. Middle section of slides is almost the same as for the previous class. 1 WEB SITE/TRITONED UPDATES New entries: Resources for studying R; Sony Entertainment + 1 more project description; trade ideas and looking for teammates on projects. Homework due Friday noon for today, and Saturday for your project proposals. Lecture notes, before or after class. Menu item. https://bda2020.wordpress.com/2018/04/04/latest-syllabusassignments-and-notes/
SESSION LEARNING GOALS Demonstrate the analytic flow for analyzing big data. Begin to practice it. BDA as an art, as well as a science. Introduce decision tree models = CART = very different than classical econometric regression models. Key concepts of BDA: Holdout sample. Transform the data to physically meaningful, or useful, or both. Many others. CART = CLASSIFICATION TREES One of a dozen common mining models (algorithms). Data + Algorithm -> Predictions; Predictions - Actual -> Model performance. Relatively straightforward
TREES AND RULES Goal: Classify or predict an outcome based on a set of predictors The output is a set of rules Example: Goal: classify a record as will accept credit card offer or will not accept Rule might be IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0 (nonacceptor) Also called CART, Decision Trees, or just Trees Rules are represented by tree diagrams 5 6
KEY IDEAS Recursive partitioning: Repeatedly split the records into two parts To achieve maximum homogeneity within the new parts Choosing the next variable Pruning the tree: Simplify the tree by pruning minor branches to avoid overfitting 7 RECURSIVE PARTITIONING 8
RECURSIVE PARTITIONING STEPS Pick one of the predictor variables, x_i. Pick a value of x_i, say s_i, that divides the training data into two (not necessarily equal) portions. Measure how pure or homogeneous each of the resulting portions is. Pure = containing records of mostly one class. The algorithm tries different values of x_i and s_i to maximize purity in the initial split. After you get a maximum-purity split, repeat the process for a second split, and so on. 9 EXAMPLE: RIDING MOWERS Goal: Classify 24 households as owning or not owning riding mowers. Predictors = Income, Lot Size 10
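The steps above can be sketched in a few lines. This is an illustrative Python toy (the course itself uses R), not any library's actual implementation: for one numeric predictor it tries every candidate split value s_i and keeps the one whose two portions are purest, measuring purity as the weighted majority-class share.

```python
def purity(labels):
    """Fraction of the majority class in a portion (1.0 = pure)."""
    if not labels:
        return 1.0
    majority = max(labels.count(c) for c in set(labels))
    return majority / len(labels)

def best_split(xs, ys):
    """Return (split_value, weighted_purity) maximizing purity of x < s vs x >= s."""
    best_s, best_p = None, -1.0
    for s in sorted(set(xs)):
        left  = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        n = len(ys)
        p = (len(left) / n) * purity(left) + (len(right) / n) * purity(right)
        if p > best_p:
            best_s, best_p = s, p
    return best_s, best_p

# Toy data loosely echoing the riding-mower idea: low incomes are mostly non-owners
incomes = [33, 43, 47, 52, 60, 64, 69, 85, 93, 110]
owner   = [0,  0,  0,  0,  1,  1,  1,  1,  1,  1]
s, p = best_split(incomes, owner)   # best split at income = 60, both sides pure
```

A full tree algorithm then recurses: it runs this search over every variable, keeps the single best (variable, split) pair, and repeats inside each resulting portion.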
Income Lot_Size Ownership
60.0   18.4     owner
85.5   16.8     owner
64.8   21.6     owner
61.5   20.8     owner
87.0   23.6     owner
110.1  19.2     owner
108.0  17.6     owner
82.8   22.4     owner
69.0   20.0     owner
93.0   20.8     owner
51.0   22.0     owner
81.0   20.0     owner
75.0   19.6     non-owner
52.8   20.8     non-owner
64.8   17.2     non-owner
43.2   20.4     non-owner
84.0   17.6     non-owner
49.2   17.6     non-owner
59.4   16.0     non-owner
66.0   18.4     non-owner
47.4   16.4     non-owner
33.0   18.8     non-owner
51.0   14.0     non-owner
63.0   14.8     non-owner
11 12
HOW TO SPLIT Order records according to one variable, say income Take a predictor value, say 60 (the first record) and divide records into those with income >= 60 and those < 60 Measure resulting purity (homogeneity) of class in each resulting portion Try all other split values Repeat for other variable(s) Select the one variable & split that yields the most purity increase THE FIRST SPLIT: INCOME = 60
SECOND SPLIT: LOT SIZE = 21 AFTER ALL SPLITS
WHAT ABOUT CATEGORICAL VARIABLES? Examine all possible ways in which the categories can be split. E.g., categories A, B, C can be split 3 ways: {A} and {B, C}; {B} and {A, C}; {C} and {A, B}. With many categories, the number of splits explodes (Toyota car models). Computation will bog down. How many ways to split 30 models into two groups? 2^29 - 1, roughly 5 x 10^8. MEASURING IMPURITY 18
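The count of distinct binary splits of k unordered categories is 2^(k-1) - 1: each nonempty subset is paired with its complement, and the two halves of each pair are the same split. A two-line check (illustrative Python, not from the course materials):

```python
def n_binary_splits(k):
    """Number of distinct two-way splits of k unordered categories."""
    return 2 ** (k - 1) - 1

n_binary_splits(3)    # the 3 splits listed above: {A}|{BC}, {B}|{AC}, {C}|{AB}
n_binary_splits(30)   # roughly half a billion candidate splits for 30 car models
```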
GINI INDEX Gini index for rectangle A containing m records: I(A) = 1 - sum over classes k of p_k^2, where p_k = proportion of cases in rectangle A that belong to class k. I(A) = 0 when all cases belong to the same class. Max value when all classes are equally represented (= 0.50 in the binary case). Note: XLMiner uses a variant called the delta splitting rule. 19 ENTROPY Entropy(A) = - sum over classes k of p_k log2(p_k), where p_k = proportion of cases in rectangle A that belong to class k. Entropy ranges between 0 (most pure) and log2(number of classes) (equal representation of classes). 20
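Both impurity measures are one-liners. A minimal Python sketch, taking each node's class proportions as input:

```python
from math import log2

def gini(props):
    """Gini impurity: I(A) = 1 - sum(p_k^2) over class proportions p_k."""
    return 1 - sum(p * p for p in props)

def entropy(props):
    """Entropy: -sum(p_k * log2(p_k)), with the convention 0 * log2(0) = 0."""
    return -sum(p * log2(p) for p in props if p > 0)

gini([1.0])          # pure node -> 0
gini([0.5, 0.5])     # 50/50 binary node -> 0.5, the binary maximum
entropy([0.5, 0.5])  # 50/50 binary node -> 1.0 = log2(2)
```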
IMPURITY AND RECURSIVE PARTITIONING Obtain overall impurity measure (weighted avg. of individual rectangles) At each successive stage, compare this measure across all possible splits in all variables Choose the split that reduces impurity the most Which variable, and where to split it. Chosen split points become nodes on the tree 21 FIRST SPLIT THE TREE
TREE AFTER ALL SPLITS The first split is on Income, then the next split is on Lot Size for both the low income group (at lot size 21) and the high income split (at lot size 20) Decision node Terminal node (leaf)
The class in this portion of the first split (those with income >= 60) is owner: 11 owners and 5 nonowners. The next split for this group will be on the basis of lot size, splitting at 20. TREE STRUCTURE Split points become nodes on the tree (circles with the split value in the center). Rectangles represent leaves (terminal points, no further splits, classification value noted). Numbers on lines between nodes indicate # cases. Read down the tree to derive a rule. E.g., if lot size < 19 and income > 84.75, then class = owner. 26
Read down the tree to derive rules: If Income < 60 AND Lot Size < 21, classify as Nonowner. DETERMINING LEAF NODE LABEL Each leaf node's label is determined by voting of the records within it, and by the cutoff value. Records within each leaf node are from the training data. Default cutoff = 0.5 means that the leaf node's label is the majority class. Cutoff = 0.75 requires 75% or more of the records in the leaf to be 1s in order to label it a 1 node. 28
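Leaf labeling by vote with a cutoff can be written in one function. An illustrative Python sketch (names are my own, not from the text):

```python
def leaf_label(labels, cutoff=0.5):
    """Label a leaf 1 if its proportion of class-1 training records reaches the cutoff."""
    p1 = sum(labels) / len(labels)
    return 1 if p1 >= cutoff else 0

leaf = [1, 1, 1, 0]              # a leaf that is 75% class 1
leaf_label(leaf)                 # default cutoff 0.5: majority vote -> label 1
leaf_label(leaf, cutoff=0.8)     # stricter cutoff: 75% < 80% -> label 0
```

Raising the cutoff above 0.5 trades more false negatives for fewer false positives, which is useful when mislabeling a 0 as a 1 is costly.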
THE OVERFITTING PROBLEM 29 FULL TREES ARE COMPLEX AND OVERFIT THE DATA The natural end of the process is 100% purity in each leaf. This overfits the data: the model ends up fitting noise. Consider Example 2, Loan Acceptance, with more records and more variables than the Riding Mower data; the full tree is very complex.
Full trees are too complex; they end up fitting noise, overfitting the data. OVERFITTING PRODUCES POOR PREDICTIVE PERFORMANCE: PAST A CERTAIN POINT IN TREE COMPLEXITY, THE ERROR RATE ON NEW DATA STARTS TO INCREASE
PRUNING CART lets the tree grow to full extent, then prunes it back. The idea is to find the point at which the validation error is at a minimum. Generate successively smaller trees by pruning leaves. At each pruning stage, multiple trees are possible. Use cost complexity to choose the best tree at that stage. WHICH BRANCH TO CUT AT EACH STAGE OF PRUNING? CC(T) = Err(T) + alpha L(T), where CC(T) = cost complexity of a tree, Err(T) = proportion of misclassified records, L(T) = number of leaves (tree size), and alpha = penalty factor attached to tree size (set by user). Among trees of a given size, choose the one with lowest CC. Do this for each size of tree (stage of pruning).
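Cost-complexity pruning is available directly in scikit-learn, where ccp_alpha plays the role of the penalty factor alpha in CC(T) = Err(T) + alpha L(T). This is a Python sketch on synthetic data (the course itself works in R, where rpart's cp parameter is the analogue):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the case study this would be the Corolla records
X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.4, random_state=1)

# Grow the full tree, then compute the sequence of candidate penalty values
full = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
path = full.cost_complexity_pruning_path(X_tr, y_tr)

# Refit one pruned tree per alpha; keep the alpha with best validation accuracy
scores = []
for a in path.ccp_alphas:
    t = DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_tr, y_tr)
    scores.append(t.score(X_va, y_va))
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
```

This mirrors the slide's recipe exactly: grow full, generate the nested sequence of smaller trees, and choose by validation error.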
TREE INSTABILITY If 2 or more variables are of roughly equal importance, which one CART chooses for the first split can depend on the initial partition into training and validation. A different partition into training/validation could lead to a different initial split. This can cascade down and produce a very different tree from the first training/validation partition. Solution: try many different training/validation splits, i.e., cross-validation. [Plot: cross-validated error vs. tree size, marking the estimated cv error, the std. error of the estimate, the minimum-error tree, and the smallest tree within 1 std. error of the minimum (7 splits). Conclusion: with future data, grow the tree to 7 splits.]
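The instability is easy to demonstrate: fit trees on two different random training partitions of the same data and compare which variable each tree puts at the root. A Python sketch on synthetic data (with correlated predictors, the two root features often differ; on any one run they may happen to agree):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=3)

roots = []
for seed in (0, 1):
    # Same data, different train/validation partition
    X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.5, random_state=seed)
    t = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    roots.append(t.tree_.feature[0])   # index of the variable used at the root split

# roots[0] and roots[1] may differ even though the underlying data are identical
```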
ADVANTAGES OF TREES Easy to use, understand. Produce rules that are easy to interpret & implement. Variable selection & reduction is automatic. Do not require the assumptions of statistical models: completely distribution-free, aka non-parametric. Can work without extensive handling of missing data. CART DISADVANTAGES May not perform well where there is structure in the data that is not well captured by horizontal or vertical splits. Very simple; don't always give best fits. Disadvantage of single trees: instability and poor predictive performance. We will improve CART later in the course with Random Forests. 38
SUMMARY Classification and Regression Trees are an easily understandable and transparent method for predicting or classifying new records A tree is a graphical representation of a set of rules Trees must be pruned to avoid over-fitting of the training data As trees do not make any assumptions about the data structure, they usually require large samples 39 40
TOYOTA COROLLA PRICES Case analysis Higher or lower than median price? 1436 records, 38 attributes 41 LOTS OF VARIABLES. WHICH TO USE? Look for unimportant variables Little variation in the data Zero correlation to final outcome (not always safe) Look for groups of variables (with high correlation) Probably irrelevant to price (use domain knowledge )
Figure 11.17 Corrgram of the correlations among the variables in the mtcars data frame. Rows and columns have been reordered using principal components analysis. 46
GGPAIRS
# Uncomment these lines and install if necessary:
# install.packages('GGally')
# install.packages('ggplot2')
# install.packages('scales')
# install.packages('memisc')

library(ggplot2)
library(GGally)
library(scales)
data(diamonds)

# Sample 10,000 rows so the pairs plot stays fast
diasamp <- diamonds[sample(nrow(diamonds), 10000), ]
ggpairs(diasamp, lower = list(continuous = wrap("points", shape = ".")))
48
COROLLA TREE: AGE, KM, 49 50
EVALUATE RESULT: CONFUSION MATRIX Calculate the model using only training data. Evaluate the model using only validation data. 51 HOW TO BUILD IN PRACTICE Start with every plausible variable. Throw out the obviously unimportant (radio). Let the algorithm decide what belongs in the final model. Do NOT screen heavily, unless you have to. Do throw out obvious junk. Think hard about categorical variables with lots of categories: they become lots of dummy variables! Blows up. Consolidate categories based on causal similarity, or small sample size. Consider pruning highly correlated variables (at least at first). All choices are tentative. 52
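The train-then-evaluate discipline looks like this in code. A minimal Python sketch with synthetic data standing in for the Toyota records: the tree is fit only on the training rows, and the confusion matrix is computed only on the holdout rows.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-outcome data; 30% held out for validation
X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit on training rows only
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# Confusion matrix on validation rows only: rows = actual, columns = predicted
cm = confusion_matrix(y_va, clf.predict(X_va))
```

Evaluating on the same rows used for fitting would overstate performance, which is exactly the overfitting point made earlier.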
REPORT WHAT YOU DID Don't omit too many variables unless you are sure they don't matter. For some variables, yes, you can be sure! (Radio) 53 TYPICAL RESULTS: WHAT ARE THE KEY VARIABLES? Age (in months), Km traveled, Air conditioning, Weight. Do these match our understanding of cars? Our domain knowledge? 54
AIR CONDITIONING AC: Yes/No. Automatic AC: Yes/No. The model thinks these are 2 independent variables. Use outside knowledge: convert this to 1 variable with 3 levels. 55 56
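The consolidation is a one-liner per row. An illustrative Python/pandas sketch (the column names AC and Automatic_AC and the level names are assumptions, chosen to match the slide, not the actual Corolla file):

```python
import pandas as pd

# Toy rows: two yes/no dummies the model would otherwise treat as independent
df = pd.DataFrame({
    "AC":           [1, 1, 0, 1],
    "Automatic_AC": [0, 1, 0, 0],
})

def ac_level(row):
    """Collapse the two dummies into one 3-level factor using domain knowledge."""
    if row["Automatic_AC"] == 1:
        return "automatic"   # automatic AC implies the car has AC
    if row["AC"] == 1:
        return "manual"
    return "none"

df["ac_type"] = df.apply(ac_level, axis=1)
```

This is the feature-creation idea from the session goals: the two dummies are not independent (automatic AC implies AC), so one ordered 3-level variable encodes the domain knowledge directly.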
KEY CONCEPTS Decision trees. Key concept #1: Use a holdout sample to evaluate performance. Train/validate/test. Confusion matrix for classification problems (discrete results). Key concept #2: Overfitting. Models always overfit; the holdout sample tells how badly. Concept: tuning a model for better results. Key concept #0: Variables have physical/economic/business meanings. Key concept #3: Transforming the data for better fits and better insight/understanding of results, aka feature creation. Concept: cleaning data to get rid of data errors. Key concept #4: Nonlinearity. Key concept #5: Knowing causality is wonderful, but for many purposes not necessary. The data mining process flow.