
CLASS 4, APRIL 2018
CHAPTER 9: CLASSIFICATION AND REGRESSION TREES, DAY 2
PREDICTING PRICES OF TOYOTA CARS
Roger Bohn, April 2018

Notes based on: Data Mining for Business Analytics, Shmueli et al., and Data Mining with Rattle and R, G. Williams. The middle section of slides is almost the same as for the previous class.

WEB SITE/TRITONED UPDATES
- New entries: resources for studying R; Sony Entertainment + 1 more project description; trade ideas and looking for teammates on projects.
- Homework due Friday noon for today, and Saturday for your project proposals.
- Lecture notes posted before or after class (menu item): https://bda2020.wordpress.com/2018/04/04/latest-syllabusassignments-and-notes/

SESSION LEARNING GOALS
- Demonstrate the analytic flow for analyzing big data. Begin to practice it.
- BDA as an art, as well as a science.
- Introduce decision tree models = CART = very different from classical econometric regression models.
- Key concepts of BDA: holdout sample; transforming the data to be physically meaningful, or useful, or both; many others.

C A R T = CLASSIFICATION TREES
- One of a dozen common mining models (algorithms).
- Data + Algorithm -> Predictions
- Predictions - Actual -> Model performance
- Relatively straightforward.

TREES AND RULES
- Goal: classify or predict an outcome based on a set of predictors.
- The output is a set of rules.
- Example goal: classify a record as "will accept credit card offer" or "will not accept". A rule might be: IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0 (nonacceptor).
- Also called CART, decision trees, or just trees.
- Rules are represented by tree diagrams.

KEY IDEAS
- Recursive partitioning: repeatedly split the records into two parts, to achieve maximum homogeneity within the new parts; then choose the next variable.
- Pruning the tree: simplify the tree by pruning minor branches to avoid overfitting.

RECURSIVE PARTITIONING

RECURSIVE PARTITIONING STEPS
- Pick one of the predictor variables, x_i.
- Pick a value of x_i, say s_i, that divides the training data into two (not necessarily equal) portions.
- Measure how pure or homogeneous each of the resulting portions is. Pure = containing records of mostly one class.
- The algorithm tries different values of x_i and s_i to maximize purity in the initial split.
- After you get a maximum-purity split, repeat the process for a second split, and so on.

EXAMPLE: RIDING MOWERS
- Goal: classify 24 households as owning or not owning riding mowers.
- Predictors = Income, Lot Size.

Income   Lot_Size   Ownership
 60.0      18.4     owner
 85.5      16.8     owner
 64.8      21.6     owner
 61.5      20.8     owner
 87.0      23.6     owner
110.1      19.2     owner
108.0      17.6     owner
 82.8      22.4     owner
 69.0      20.0     owner
 93.0      20.8     owner
 51.0      22.0     owner
 81.0      20.0     owner
 75.0      19.6     non-owner
 52.8      20.8     non-owner
 64.8      17.2     non-owner
 43.2      20.4     non-owner
 84.0      17.6     non-owner
 49.2      17.6     non-owner
 59.4      16.0     non-owner
 66.0      18.4     non-owner
 47.4      16.4     non-owner
 33.0      18.8     non-owner
 51.0      14.0     non-owner
 63.0      14.8     non-owner

HOW TO SPLIT
- Order records according to one variable, say Income.
- Take a predictor value, say 60 (the first record), and divide records into those with Income >= 60 and those with Income < 60.
- Measure the resulting purity (homogeneity) of class in each resulting portion.
- Try all other split values.
- Repeat for the other variable(s).
- Select the one variable & split that yields the largest purity increase.

THE FIRST SPLIT: INCOME = 60
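The split search described above can be sketched in a few lines. This is an illustrative Python sketch, not the course's R code: the majority-fraction purity score and the toy income data are simplified stand-ins for the riding-mower example.

```python
# Sketch of one recursive-partitioning step: for a single predictor x,
# try every observed value as a split point s and keep the split whose
# two portions are most "pure" (records of mostly one class).
def purity(labels):
    """Fraction of records in the majority class (1.0 = perfectly pure)."""
    if not labels:
        return 1.0
    majority = max(labels.count(c) for c in set(labels))
    return majority / len(labels)

def best_split(x, y):
    """Return (split_value, weighted_purity) maximizing purity of both portions."""
    best_s, best_score = None, -1.0
    n = len(x)
    for s in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi < s]
        right = [yi for xi, yi in zip(x, y) if xi >= s]
        if not left or not right:
            continue  # a split must produce two non-empty portions
        score = (len(left) * purity(left) + len(right) * purity(right)) / n
        if score > best_score:
            best_s, best_score = s, score
    return best_s, best_score

# Toy data in the spirit of the riding-mower example (income vs. ownership):
income = [33, 43, 51, 60, 64, 69, 85, 93]
owner = [0, 0, 0, 1, 1, 1, 1, 1]
print(best_split(income, owner))  # (60, 1.0): splitting at income 60 is perfectly pure
```

With real data the two portions are rarely perfectly pure; the algorithm simply keeps the best score and then recurses on each portion.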

SECOND SPLIT: LOT SIZE = 21

AFTER ALL SPLITS

WHAT ABOUT CATEGORICAL VARIABLES?
- Examine all possible ways in which the categories can be split. E.g., categories A, B, C can be split 3 ways:
  {A} and {B, C}
  {B} and {A, C}
  {C} and {A, B}
- With many categories, the number of splits explodes (Toyota car models) and computation bogs down. How many ways to split 30 models into two groups? 2^29 - 1, roughly 5 x 10^8.

MEASURING IMPURITY

GINI INDEX
Gini index for rectangle A:

    I(A) = 1 - sum_{k=1}^{m} p_k^2

where p_k = proportion of cases in rectangle A that belong to class k, and m = number of classes.
- I(A) = 0 when all cases belong to the same class.
- Maximum value when all classes are equally represented (= 0.50 in the binary case).
- Note: XLMiner uses a variant called the delta splitting rule.

ENTROPY

    entropy(A) = - sum_{k=1}^{m} p_k log2(p_k)

where p_k = proportion of cases in rectangle A that belong to class k.
- Entropy ranges between 0 (most pure) and log2(m) (equal representation of all m classes).
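Both impurity measures are one-liners over the class proportions. A minimal Python sketch (for illustration; the class labels are made up):

```python
# Gini index and entropy for a rectangle (node) A, given the class
# labels of the records it contains.
from collections import Counter
from math import log2

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(gini(["owner"] * 8))                   # 0.0: pure node
print(gini(["owner"] * 4 + ["non"] * 4))     # 0.5: maximum for the binary case
print(entropy(["owner"] * 4 + ["non"] * 4))  # 1.0: equals log2(2)
```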

IMPURITY AND RECURSIVE PARTITIONING
- Obtain an overall impurity measure (weighted average over the individual rectangles).
- At each successive stage, compare this measure across all possible splits in all variables.
- Choose the split that reduces impurity the most: which variable, and where to split it.
- Chosen split points become nodes on the tree.

FIRST SPLIT: THE TREE
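The "weighted average of individual rectangles" step can be made concrete. An illustrative Python sketch (the labels are toy data): the impurity of a split is the size-weighted average of the Gini index of its two rectangles, and the split that maximizes the reduction from the parent's impurity wins.

```python
# Overall impurity of a candidate split, and the reduction it achieves.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(left_labels, right_labels):
    """Weighted average of the Gini index of the two rectangles."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) * gini(left_labels)
            + len(right_labels) * gini(right_labels)) / n

parent = ["owner"] * 3 + ["non"] * 3      # Gini = 0.5
left, right = ["owner"] * 3, ["non"] * 3  # a perfectly pure split
print(gini(parent) - split_impurity(left, right))  # reduction = 0.5
```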

TREE AFTER ALL SPLITS
The first split is on Income; then the next split is on Lot Size for both the low-income group (at lot size 21) and the high-income group (at lot size 20).
Legend: decision node; terminal node (leaf).

The majority class in this portion of the first split (those with Income >= 60) is owner: 11 owners and 5 nonowners. The next split for this group will be on the basis of Lot Size, splitting at 20.

TREE STRUCTURE
- Split points become nodes on the tree (circles with the split value in the center).
- Rectangles represent leaves (terminal points, no further splits, classification value noted).
- Numbers on the lines between nodes indicate the number of cases.
- Read down the tree to derive a rule. E.g., if Lot Size < 19 and Income > 84.75, then class = owner.

Read down the tree to derive rules: if Income < 60 AND Lot Size < 21, classify as Nonowner.

DETERMINING THE LEAF NODE LABEL
- Each leaf node's label is determined by voting of the records within it, and by the cutoff value.
- Records within each leaf node are from the training data.
- The default cutoff of 0.5 means that the leaf node's label is the majority class.
- Cutoff = 0.75: requires 75% or more of the records in the leaf to be 1's in order to label it a "1" node.
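The voting rule above reduces to comparing a proportion against the cutoff. A minimal Python sketch with made-up votes:

```python
# A leaf's label comes from the proportion of class-1 training records
# in it, compared against a cutoff (0.5 = simple majority vote).
def leaf_label(labels, cutoff=0.5):
    p1 = sum(labels) / len(labels)
    return 1 if p1 >= cutoff else 0

votes = [1, 1, 0, 1]                  # 75% of this leaf's records are class 1
print(leaf_label(votes))              # 1 with the default 0.5 cutoff
print(leaf_label(votes, cutoff=0.8))  # 0: 75% falls short of the 0.8 cutoff
```

Raising the cutoff above 0.5 trades more false negatives for fewer false positives, which is useful when labeling a record "1" is costly.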

THE OVERFITTING PROBLEM

FULL TREES ARE COMPLEX AND OVERFIT THE DATA
- The natural end of the process is 100% purity in each leaf.
- This overfits the data: the tree ends up fitting noise in the data.
- Consider Example 2, Loan Acceptance, with more records and more variables than the Riding Mower data: the full tree is very complex.

Full trees are too complex: they end up fitting noise, overfitting the data.

OVERFITTING PRODUCES POOR PREDICTIVE PERFORMANCE
Past a certain point in tree complexity, the error rate on new data starts to increase.

PRUNING
- CART lets the tree grow to full extent, then prunes it back.
- The idea is to find the point at which the validation error is at a minimum.
- Generate successively smaller trees by pruning leaves.
- At each pruning stage, multiple trees are possible.
- Use cost complexity to choose the best tree at that stage.

WHICH BRANCH TO CUT AT EACH STAGE OF PRUNING?

    CC(T) = Err(T) + alpha * L(T)

where CC(T) = cost complexity of a tree, Err(T) = proportion of misclassified records, L(T) = number of leaves (tree size), and alpha = penalty factor attached to tree size (set by the user).
- Among trees of a given size, choose the one with the lowest CC.
- Do this for each size of tree (stage of pruning).
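The cost-complexity criterion is just an error term plus a size penalty. An illustrative Python sketch, with hypothetical (error rate, leaf count) candidates, of how a larger alpha pushes the choice toward smaller trees:

```python
# Cost-complexity criterion CC(T) = Err(T) + alpha * L(T): among the
# candidate pruned trees at a stage, choose the one with the lowest CC.
def cost_complexity(err, n_leaves, alpha):
    return err + alpha * n_leaves

# Hypothetical candidates at one pruning stage: (error rate, # leaves).
candidates = [(0.10, 8), (0.12, 5), (0.18, 2)]
alpha = 0.01
best = min(candidates, key=lambda t: cost_complexity(t[0], t[1], alpha))
print(best)  # (0.12, 5): its CC of 0.17 beats 0.18 and 0.20
```

With alpha = 0 the lowest-error (largest) tree always wins; a large alpha makes the 2-leaf tree win despite its higher error.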

TREE INSTABILITY
- If 2 or more variables are of roughly equal importance, which one CART chooses for the first split can depend on the initial partition into training and validation sets.
- A different partition into training/validation could lead to a different initial split.
- This can cascade down and produce a very different tree from the first training/validation partition.
- The solution is to try many different training/validation splits: cross-validation.

[Figure: estimated CV error vs. tree size, with the standard error of each estimate. The smallest tree within 1 standard error of the minimum error has 7 splits, so with future data, grow the tree to 7 splits.]
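The "smallest tree within 1 standard error of the minimum" selection can be sketched directly. This Python sketch uses hypothetical CV numbers chosen to reproduce the 7-split answer in the figure; it is an illustration of the rule, not the course's R output.

```python
# The "1 standard error" rule used with cross-validation: take the
# smallest tree whose CV error is within one standard error of the
# minimum CV error.
def one_se_tree(sizes, cv_errors, std_errors):
    """sizes/cv_errors/std_errors are parallel lists, one entry per tree."""
    i_min = min(range(len(cv_errors)), key=cv_errors.__getitem__)
    threshold = cv_errors[i_min] + std_errors[i_min]
    ok = [s for s, e in zip(sizes, cv_errors) if e <= threshold]
    return min(ok)

# Hypothetical CV results: number of splits, CV error, std. error.
sizes = [2, 4, 7, 10, 15]
errors = [0.30, 0.24, 0.20, 0.19, 0.21]
ses = [0.02, 0.02, 0.02, 0.02, 0.02]
print(one_se_tree(sizes, errors, ses))  # 7: smallest tree within 0.19 + 0.02
```

The minimum-error tree here has 10 splits, but the 7-split tree is statistically indistinguishable from it and simpler, so the rule prefers it.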

ADVANTAGES OF TREES
- Easy to use and understand.
- Produce rules that are easy to interpret & implement.
- Variable selection & reduction is automatic.
- Do not require the assumptions of statistical models: completely distribution-free, aka non-parametric.
- Can work without extensive handling of missing data.

CART DISADVANTAGES
- May not perform well where there is structure in the data that is not well captured by horizontal or vertical splits.
- Very simple; doesn't always give the best fits.
- Disadvantage of single trees: instability and poor predictive performance.
- We will improve on CART later in the course with Random Forests.

SUMMARY
- Classification and Regression Trees are an easily understandable and transparent method for predicting or classifying new records.
- A tree is a graphical representation of a set of rules.
- Trees must be pruned to avoid over-fitting the training data.
- Because trees do not make any assumptions about the data structure, they usually require large samples.

TOYOTA COROLLA PRICES
- Case analysis: higher or lower than the median price?
- 1436 records, 38 attributes.

LOTS OF VARIABLES. WHICH TO USE?
- Look for unimportant variables: little variation in the data; zero correlation to the final outcome (not always safe).
- Look for groups of variables (with high correlation).
- Probably irrelevant to price (use domain knowledge).
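The "little variation" screen is easy to automate. An illustrative Python sketch (the column names and values are hypothetical stand-ins for the Corolla attributes): a column with a single value cannot contribute to any split, so it is a safe first candidate to drop.

```python
# First screening pass: flag columns with no variation, which cannot
# help any split.
def constant_columns(table):
    """table: dict mapping column name -> list of values."""
    return [name for name, values in table.items() if len(set(values)) <= 1]

cars = {
    "Price": [9500, 13500, 21500, 8950],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "Petrol"],
    "Cylinders": [4, 4, 4, 4],  # no variation: a candidate to drop
}
print(constant_columns(cars))  # ['Cylinders']
```

Zero-correlation screens are riskier, as the slide notes: a variable can be uncorrelated with the outcome overall yet matter inside a branch of the tree.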


[Figure 11.17: Corrgram of the correlations among the variables in the mtcars data frame (gear, am, drat, mpg, vs, qsec, wt, disp, cyl, hp, carb). Rows and columns have been reordered using principal components analysis.]

GGPAIRS

# Uncomment these lines and install if necessary:
# install.packages('GGally')
# install.packages('ggplot2')
# install.packages('scales')
# install.packages('memisc')

library(ggplot2)
library(GGally)
library(scales)
data(diamonds)

diasamp = diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diasamp, params = c(shape = I("."), outlier.shape = I(".")))

COROLLA TREE: AGE, KM,

EVALUATE RESULT: CONFUSION MATRIX
- Calculate the model using only the training data.
- Evaluate the model using only the validation data.

HOW TO BUILD IN PRACTICE
- Start with every plausible variable.
- Throw out the obviously unimportant (e.g., radio).
- Let the algorithm decide what belongs in the final model. Do NOT screen heavily, unless you have to.
- Do throw out obvious junk.
- Think hard about categorical variables with lots of categories: they become lots of dummy variables, which blows up the model.
- Consolidate categories based on causal similarity, or small sample size.
- Consider pruning highly correlated variables (at least at first).
- All choices are tentative.
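A confusion matrix is just a cross-tabulation of actual vs. predicted classes on the holdout records. An illustrative Python sketch with made-up validation labels:

```python
# A 2x2 confusion matrix computed on the validation (holdout) records
# only, comparing actual vs. predicted classes.
def confusion_matrix(actual, predicted):
    m = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        m[(a, p)] += 1
    return m

actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
cm = confusion_matrix(actual, predicted)
print(cm[(1, 1)], cm[(0, 0)])  # 3 3: correct positives and correct negatives
accuracy = (cm[(1, 1)] + cm[(0, 0)]) / len(actual)
print(accuracy)                # 0.75
```

The off-diagonal cells, cm[(1, 0)] and cm[(0, 1)], are the two kinds of misclassification; with an asymmetric cutoff (e.g., 0.75) you deliberately trade one for the other.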

REPORT WHAT YOU DID
- Don't omit too many variables unless you are sure they don't matter. For some variables, yes, you can be sure! (Radio.)

TYPICAL RESULTS: WHAT ARE THE KEY VARIABLES?
- Age (in months)
- Km traveled
- Air conditioning
- Weight
Do these match our understanding of cars? Our domain knowledge?

AIR CONDITIONING
- AC: yes/no
- Automatic AC: yes/no
- The model thinks this is 2 independent variables.
- Use outside knowledge: convert this to 1 variable with 3 levels.
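The consolidation is a one-line mapping. An illustrative Python sketch, assuming (as domain knowledge suggests) that automatic AC implies the car has AC:

```python
# Collapse the two yes/no AC dummies into a single 3-level feature.
def ac_level(has_ac, has_automatic_ac):
    if has_automatic_ac:
        return "automatic"
    if has_ac:
        return "manual"
    return "none"

print(ac_level(True, True))    # automatic
print(ac_level(True, False))   # manual
print(ac_level(False, False))  # none
```

This is exactly the "transform the data to be physically meaningful" idea from the session goals: one 3-level feature encodes the real structure that two independent dummies cannot.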

KEY CONCEPTS
Decision trees.
- Key concept #1: Use a holdout sample to evaluate performance: train/validate/test. Confusion matrix for classification problems (discrete results).
- Key concept #2: Overfitting. Models always overfit; the holdout sample tells you how badly. Related concept: tuning a model for better results.
- Key concept #0: Variables have physical/economic/business meanings.
- Key concept #3: Transforming the data for better fits and better insight/understanding of the result, aka feature creation. Related concept: cleaning data to get rid of data errors.
- Key concept #4: Nonlinearity.
- Key concept #5: Knowing causality is wonderful, but for many purposes not necessary.
- The data mining process flow.