Overview of TreeNet Technology Stochastic Gradient Boosting

Overview of TreeNet Technology Stochastic Gradient Boosting Dan Steinberg January 2009

Introduction to TreeNet: Stochastic Gradient Boosting

- A powerful new approach to machine learning and function approximation developed by Jerome H. Friedman at Stanford University
  - Co-author of CART with Breiman, Olshen and Stone
  - Author of MARS, PRIM, Projection Pursuit, COSA, RuleFit and more
- Very strong for classification and regression problems
- Builds on the notions of committees of experts and boosting, but is substantially different in key implementation details
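For readers who want to experiment, scikit-learn's GradientBoostingClassifier implements Friedman's stochastic gradient boosting algorithm. A minimal sketch follows; the parameter values are illustrative assumptions, and this is scikit-learn's open-source implementation, not the TreeNet product itself.

```python
# Stochastic gradient boosting with scikit-learn (illustrative settings)
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Many small trees, each correcting its predecessors' errors
model = GradientBoostingClassifier(
    n_estimators=200,    # number of trees in the ensemble
    max_depth=3,         # keep each tree small (a deliberately weak learner)
    learning_rate=0.1,   # shrink each tree's contribution
    subsample=0.5,       # the "stochastic" part: sample half the data per tree
    random_state=0,
)
model.fit(X_tr, y_tr)
print(round(model.score(X_te, y_te), 3))
```

The `subsample` and `learning_rate` arguments correspond to the sampling and shrinkage ideas discussed throughout these slides.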

Aspects of TreeNet

- Built on CART trees, and thus:
  - immune to outliers
  - handles missing values automatically
  - results invariant under order-preserving transformations of variables, so there is no need to ever consider functional form revision (log, sqrt, power)
- Highly discriminatory variable selector; effective with thousands of predictors
- Detects and identifies important interactions
  - can be used to easily test for the presence of interactions and their degree
- Resistant to overtraining; generalizes well
- Can be remarkably accurate with little effort; should easily outperform conventional models

Adapting to Major Errors in Data

- TreeNet is a machine learning technology designed to recognize patterns in historical data
- Ideally the data TreeNet learns from will be accurate
- In some circumstances, however, the most important variable, namely the dependent variable, is itself subject to error. This is known as mislabeled data.
- Good examples of mislabeled data can be found in:
  - medical diagnoses
  - insurance claim fraud: historical data for "not fraud" actually includes undetected fraud, so some of the 0s are actually 1s, which complicates learning (possibly fatally)
- TreeNet manages such data successfully

Some TreeNet Successes

- 2008 DMA Targeted Marketing: First Runner-Up
- 2007 DMA Targeted Marketing: 1st Place Winner
- 2006 PAKDD competition (customer type discrimination): 3rd place, with a model built in one day (1st-place accuracy 81.9%, TreeNet accuracy 81.2%)
- 2005 BI-CUP, University of Chile (60 competitors): 1st Place
- 2004 KDDCup: Most Accurate (classification accuracy)
- 2003 Duke University/NCR Teradata CRM modeling competition: Most Accurate and Best Top-Decile Lift on both in-time and out-of-time samples
- A major financial services company has tested TreeNet across a broad range of targeted marketing and risk models for the past 2 years:
  - TreeNet consistently outperforms previous best models (around 10% AUROC)
  - TreeNet models can be built in a fraction of the time previously devoted
  - TreeNet reveals previously undetected predictive power in data

Multi-tree Methods and Their Single-Tree Ancestors

Multi-tree methods have been under development since the early 1990s. The most important variants (and dates of published articles) are:

- Bagger (Breiman, 1996, "Bootstrap Aggregation")
- Boosting (Freund and Schapire, 1995)
- Multiple Additive Regression Trees (Friedman, 1999, aka MART or TreeNet)
- RandomForests (Breiman, 2001)

Work continues, with major refinements underway (Friedman in collaboration with Salford Systems).

Multi-tree Methods: Simplest Case

- Grow a tree on training data
- Find a way to grow another, different tree (change something in the setup)
- Repeat many times, e.g. 500 replications
- Average the results or create a voting scheme, e.g. relate the probability of default (PD) to the fraction of trees predicting default for a given case

The beauty of the method is that every new tree starts with a complete set of data. Any one tree can run out of data, but when that happens we just start again with a new tree and all the data (before sampling). Prediction is via voting.
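The bootstrap-and-vote scheme above can be sketched as follows; this is a toy illustration with assumed names and settings, not TreeNet's own procedure.

```python
# Simplest multi-tree case: grow trees on bootstrap resamples, predict by voting
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
rng = np.random.default_rng(1)

trees = []
for _ in range(50):                        # e.g. 50 replications
    idx = rng.integers(0, len(X), len(X))  # bootstrap resample of the data
    t = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    trees.append(t)

# Fraction of trees voting "1" for each case: a probability-like score
votes = np.mean([t.predict(X) for t in trees], axis=0)
pred = (votes >= 0.5).astype(int)
print(round(float(np.mean(pred == y)), 3))
```

Each replication sees a full-sized (resampled) data set, which is the point made above: no single tree ever exhausts the data for the ensemble.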

Automated Multiple Tree Generation

- The earliest multi-model methods recommended taking several good candidates and averaging them; examples considered as few as 3 trees
- It is too difficult to generate multiple models manually; it is hard enough to get one good model
- How do we generate different trees?
  - Bagger: random re-weighting of the data via bootstrap resampling; reweight at random and regrow, with every repetition independent of the others
  - RandomForests: random splits; the tree itself is grown at least partly at random
  - Boosting: reweighting the data based on prior success in correctly classifying a case; high weights on difficult-to-classify cases
  - TreeNet: boosting with major refinements; each tree attempts to correct the errors made by its predecessors
    - Each tree is linked to its predecessors, like a series expansion where the addition of terms progressively improves the predictions

TreeNet (aka MART)

We focus on TreeNet because:

- It is the method used in many successful real-world studies
- We have found it to be more accurate than the other methods
- Many people are affected by a TreeNet model these days
  - A major new fraud detection engine uses TreeNet
  - David Cossock of Yahoo has recently published a paper on uses of TreeNet in web search
- Dramatic new capabilities include:
  - Graphical display of the impact of any predictor
  - New automated ways to test for the existence of interactions
  - New ways to identify and rank interactions
  - Ability to constrain the model: allow some interactions and disallow others
  - A method to recast a TreeNet model as a logistic regression (TreeNet 3.0)

TreeNet Process

- Begin with one very small tree as the initial model
  - Could be as small as ONE split generating 2 terminal nodes
  - A typical model will have 3-5 splits in a tree, generating 4-6 terminal nodes
  - Output is a probability (e.g. of default)
  - The model is intentionally weak
- Compute residuals for this simple model (prediction error) for every record in the data (even for a classification model)
- Grow a second small tree to predict the residuals from the first tree
- The new model is now: Tree 1 + Tree 2
- Compute residuals from this new 2-tree model and grow a 3rd tree to predict the revised residuals
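The process above is, at heart, residual fitting. A minimal least-squares sketch with small regression trees follows; the names and settings are illustrative, not TreeNet's implementation.

```python
# Residual boosting: each small tree predicts the errors of the ensemble so far
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 400)   # noisy nonlinear target

trees, pred = [], np.zeros(len(y))
for _ in range(100):
    resid = y - pred                             # errors of the current model
    t = DecisionTreeRegressor(max_leaf_nodes=4,  # 3 splits, 4 terminal nodes
                              random_state=0)
    t.fit(X, resid)                              # grow tree on the residuals
    pred += 0.1 * t.predict(X)                   # small, downweighted update
    trees.append(t)

print(round(float(np.mean((y - pred) ** 2)), 4)) # training MSE shrinks
```

After the loop, the model is literally Tree 1 + Tree 2 + ... as described above, with each tree's contribution scaled down before it is added.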

TreeNet: Trees Incrementally Revise Predicted Scores

[Figure: Tree 1 + Tree 2 + Tree 3, summed]

- The first tree is grown on the original target; it is an intentionally weak model
- The 2nd tree is grown on the residuals from the first; its predictions improve the first tree
- The 3rd tree is grown on the residuals from the model consisting of the first two trees
- Every tree produces at least one positive and at least one negative node. Red reflects a relatively large positive node, and deep blue a relatively negative node.
- The total score is obtained by finding the relevant terminal node in every tree in the model and summing across all trees

TreeNet: Sample Individual Trees

[Figure: three sample trees with splits such as EQ2CUST_STF < 5.04, COST2INC < 65.4, TOTAL_DEPS < 65.5M, and EQ2TOT_AST < 2.8; terminal-node scores range from -0.228 to +0.140 around an intercept of 0.172. The predicted response sums the intercept and the relevant terminal-node score from every tree.]

TreeNet Methodology: Key Points

- Trees are kept small
- Updates are small (downweighted), like a partial-adjustment model
  - Update factors can be as small as .01, .001, .0001, meaning the model prediction changes by very small amounts in each training cycle
- Random subsets of the training data are used in each cycle; never train on all the training data in any one cycle
- Highly problematic cases are IGNORED: if the model prediction starts to diverge substantially from the observed data, that data will not be used in further updates
- Cross-validation is used for self-testing in small data sets
- The model can be tuned to optimize:
  - area under the ROC curve
  - logistic likelihood (deviance)
  - classification accuracy
  - lift achieved in a specified percentile of the predicted-probability-ranked data
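These key points map roughly onto scikit-learn's gradient boosting parameters, as the sketch below shows. The parameter values are assumptions for illustration; TreeNet's own defaults, self-test, and robustness mechanisms differ in detail.

```python
# Small trees, small downweighted updates, random subsets, and a held-out
# self-test that stops training when validation loss stalls (illustrative)
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=600, random_state=2)
model = GradientBoostingClassifier(
    max_leaf_nodes=6,        # trees are kept small
    learning_rate=0.01,      # updates are small (downweighted)
    subsample=0.5,           # random subset of training data each cycle
    n_estimators=300,        # upper bound on trees grown
    validation_fraction=0.2, # held-out data for self-testing
    n_iter_no_change=20,     # stop when validation loss stops improving
    random_state=2,
).fit(X, y)
print(model.n_estimators_)   # trees actually grown before stopping
```

With a small learning rate, each cycle nudges the prediction only slightly, which is exactly the slow-learning behavior described on this slide.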

Why Does TreeNet Work?

- Slow learning: the method "peels the onion", extracting very small amounts of information in any one learning cycle
- TreeNet can leverage hints available in the data across a number of predictors
  - Where feasible, TreeNet can successfully include more variables than traditional models
- It can capture substantial nonlinearity and complex interactions of high degree
- TreeNet self-protects against errors in the dependent variable (vital for fraud studies): if a record is actually a 1 but is misrecorded in the data as a 0, and TreeNet recognizes it as a 1, it will not attempt to get this record "correct"

Multiple Additive Regression Trees

- Friedman originally named his methodology MART because the method generates small trees which are summed to obtain an overall score
- The model can be thought of as a series expansion approximating the true functional relationship
- We can think of each small tree as a mini-scorecard, making use of possibly different combinations of variables
  - Each mini-scorecard is designed to offer a slight improvement by correcting and refining its predecessors
- Because each tree starts at the root node and can use all of the available data, a TreeNet model can never run out of data no matter how many trees are built
  - We have TreeNet consumer default models in production consisting of 2,000-3,000 trees

Selecting the Optimal Model

- TreeNet first grows a large number of trees
- We then evaluate the performance of the ensemble at every sequential number of trees, starting with the first
  - Start with 1 tree, then 2 trees (1st + 2nd), then 3 trees (1st + 2nd + 3rd), etc.
- Criteria currently available:
  - classification accuracy
  - log-likelihood
  - ROC (area under curve)
  - lift in the top P percentile (often the top decile)
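The tree-by-tree evaluation described above can be sketched with scikit-learn's staged prediction API; this is an illustration, not TreeNet's own summary screen.

```python
# Score the ensemble after 1 tree, 2 trees, ..., N trees on test data,
# then pick the size that maximizes area under the ROC curve
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

model = GradientBoostingClassifier(n_estimators=200, random_state=3)
model.fit(X_tr, y_tr)

# staged_predict_proba yields the ensemble's predictions at every size
aucs = [roc_auc_score(y_te, p[:, 1])
        for p in model.staged_predict_proba(X_te)]
best = int(np.argmax(aucs)) + 1   # optimal number of trees under ROC
print(best, round(max(aucs), 3))
```

Substituting accuracy, log-likelihood, or top-decile lift for ROC area in the loop reproduces the other criteria listed above.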

TreeNet Summary Screen

Displays ROC on the train and test samples at all ensemble sizes.

                          CXE        Class Error  ROC Area   Lift
Optimal Number of Trees:  739        1069         643        163
Optimal Criterion:        0.4224394  0.1803279    0.8862029  1.9626555

How the Different Criteria Select Different Response Profiles

[Figure: artificial data; the red curve is the truth, compared against the CXE (logistic) fit and the best-accuracy fit.]

Interpreting TN Models

As TN models consist of hundreds or even thousands of trees, there is no useful way to represent the model via a display of one or two trees. However, the model can be summarized in a variety of ways:

- Partial dependency plots: exhibit the relationship between the target and any predictor as captured by the model
- Variable importance rankings: these stable rankings give an excellent assessment of the relative importance of predictors
- ROC curves: TN models produce scores that are typically unique for each scored record, allowing records to be ranked from best to worst; the ROC curve and the area under it reveal how successful the ranking is
- Confusion matrix: using an adjustable score threshold, this matrix displays the model's false positive and false negative rates
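Two of these summaries have simple hand-rolled analogues. The sketch below (illustrative, not TreeNet's output) ranks predictors by importance and traces a partial dependency curve by sweeping one predictor over a grid while holding the others at their observed values.

```python
# Variable importance ranking and a hand-computed partial dependency curve
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=4)
model = GradientBoostingClassifier(random_state=4).fit(X, y)

# Importance: summed split improvements across all trees, highest first
ranking = model.feature_importances_.argsort()[::-1]
top = int(ranking[0])

# Partial dependency: sweep the top predictor over a grid, hold the other
# columns fixed, and average the model's predicted probability at each point
grid = np.linspace(X[:, top].min(), X[:, top].max(), 25)
curve = []
for v in grid:
    Xv = X.copy()
    Xv[:, top] = v
    curve.append(model.predict_proba(Xv)[:, 1].mean())
print(top, round(curve[0], 3), round(curve[-1], 3))
```

Plotting `curve` against `grid` gives the partial dependency plot described above (TreeNet plots the log-odds rather than the probability).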

TN Summary: Variable Importance Ranking

Based on the actual use of variables in the trees on training data (summed improvements).

TN Classification Accuracy: Test Data

The threshold can be adjusted to reflect unbalanced classes and rare events.
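Adjusting the score threshold, as noted above, trades false negatives against false positives. A sketch with scikit-learn on an imbalanced sample (illustrative settings, not TreeNet's confusion matrix screen):

```python
# Lowering the classification threshold catches more of the rare class
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# ~90% class 0, ~10% class 1: an unbalanced, rare-event setting
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=5)
model = GradientBoostingClassifier(random_state=5).fit(X, y)
p = model.predict_proba(X)[:, 1]          # predicted probability of class 1

pred05 = (p >= 0.5).astype(int)           # default threshold
pred02 = (p >= 0.2).astype(int)           # lowered threshold
r05 = float(np.mean(pred05[y == 1]))      # recall on the rare class
r02 = float(np.mean(pred02[y == 1]))
print(round(r05, 3), round(r02, 3))
```

The lower threshold can only add predicted positives, so rare-class recall never decreases; the cost is more false positives, which is exactly the trade-off the adjustable threshold exposes.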

TreeNet: Partial Dependency Plot

[Figure: partial dependency plot; the Y-axis shows log-odds.]

Place Knot Locations on the Graph for a Smooth

Smooth Depicted in Green

Generate SAS Code for the Smooth

Can also obtain Java, C, and PMML. Other languages coming: SQL, Visual Basic.

Dealing with Monotonicity

[Figure: an undesired, non-monotone response profile.]

Impose a Constraint

[Figure: the constrained smooth is generated.]

Interaction Detection

- TreeNet models based on 2-node trees automatically EXCLUDE interactions
  - The model may be highly nonlinear, but it is by definition strictly additive
  - Every term in the model is based on a single variable (a single split)
  - Use this as the baseline: the best possible additive model (an automated GAM)
- Build a TreeNet model on larger trees (the default is 6 nodes)
  - Permits up to 5-way interactions, though in practice the behavior is closer to 3-way
- Conduct an informal likelihood-ratio test: TN(2-node) vs TN(6-node)
  - Large differences signal important interactions
- In TreeNet 2.0, interactions can be located via 3-D two-variable dependency plots
- In TreeNet 3.0, variables participating in interactions are ranked using new methodology developed by Friedman and Salford Systems
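The 2-node versus larger-tree comparison above can be reproduced in miniature: on a target that is a pure two-way interaction, an ensemble of two-leaf trees (strictly additive by construction) should fit markedly worse than one allowed larger trees. A sketch with scikit-learn and illustrative settings:

```python
# Additive (2-leaf trees) vs interaction-capable (6-leaf trees) boosting
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(1000, 2))
y = X[:, 0] * X[:, 1]                     # a pure two-way interaction

additive = GradientBoostingRegressor(max_leaf_nodes=2, n_estimators=300,
                                     random_state=7).fit(X, y)
interact = GradientBoostingRegressor(max_leaf_nodes=6, n_estimators=300,
                                     random_state=7).fit(X, y)

mse_add = float(np.mean((y - additive.predict(X)) ** 2))
mse_int = float(np.mean((y - interact.predict(X)) ** 2))
print(round(mse_add, 4), round(mse_int, 4))
```

A large gap between the two errors signals that interactions matter, which is the informal likelihood-ratio-style test described on the slide.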

Interactions Ranking Report

Variables are ranked according to their degree of interaction, measured by how much the model would be hurt if the interaction were suppressed. Interactions involving NMOS are the most important.

What Interacts with NMOS?

Here we see that NMOS primarily interacts with PAY_FREQUENCY_2$; all other interactions with NMOS have much smaller impacts.

Example of an Important Interaction

The slope reverses due to the interaction: the dominant pattern is downward sloping, but a key segment defined by the 3rd variable is upward sloping.

Examples

- Corporate default
- Analysis of bank ratings

Corporate Default Scorecard

- Mid-sized bank
- Around 200 defaults and around 200 indeterminate/slow accounts: a classic small-sample problem
- Standard financial statements available; a variety of ratios created
- All key variables had some fraction of missing data: from 2% to 6% in a set of 10 core predictors, and up to 35% missing in a set of 20 core predictors
- A single CART tree involving just 6 predictors yields a cross-validated ROC of 0.6523

TreeNet Analysis of Corporate Default Data: Summary Display

The summary reports the progress of the model as it evolves with an increasing number of trees. Markers indicate the models optimizing entropy, misclassification rate, and ROC.

Corporate Default Model: TreeNet

1789 trees in the model with the best test-sample area under the ROC curve:

                   CXE       Class Error  ROC Area   Lift
Optimal N Trees:   2000      1930         1789       1911
Optimal Criterion: 0.534914  0.3485293    0.7164921  1.9231729

- TreeNet cross-validated ROC = 0.71649, far better than the single tree
- Able to make use of more variables thanks to the many trees: the single CART tree uses 6 predictors, TreeNet more than 20
  - Some of these variables were missing in 35% or more of all accounts
- Graphical displays can be extracted to describe the model

Predictive Variable: Financial Ratio 1

[Figure: partial dependency of risk (probability of default) on Ratio 1, Sales / Current Assets; the vertical axis spans roughly -0.004 to 0.003 over ratio values from 10 to 60.]

TreeNet can trace out the impact of a predictor on PD.

Predictive Variable: Financial Ratio 2

[Figure: partial dependency of risk on Ratio 2, Return on Assets (ROA), Operating Profit / Total Assets; the vertical axis spans roughly -0.0004 to 0.0006 over ratio values from -0.9 to 1.0.]

Additive vs Interaction Model (2-node vs 6-node trees)

2-node trees model:
                   CXE      Class Error  ROC Area  Lift
Optimal N Trees:   2000     1812         1978      1763
Optimal Criterion: 0.54538  0.39331      0.70749   1.84881

6-node trees model:
Optimal N Trees:   2000     1930         1789      1911
Optimal Criterion: 0.53491  0.34852      0.71649   1.92317

Two-node trees do not permit interactions of any kind, as each term in the model involves a single predictor in isolation; 2-node-tree models are highly nonlinear but strictly additive. Trees with three or more nodes allow progressively higher-order interactions. Running the model with different tree sizes allows simple discovery of the precise amount of interaction required for maximum-performance models.

Bank Ratings: Regression Example

- Build a predictive model for the average of major bank ratings: a scaled average of S&P, Moody's, and Fitch ratings
- Challenges include:
  - a small database (66 banks)
  - prevalent missing values (up to 35 missing in any predictor)
  - a linear regression model is impossible to build because of the missings
  - relationships expected to be nonlinear
- 66 banks, 25 potential predictors

Cross-Validated Performance: Predicting Rating Score

[Figure: CV mean absolute error versus model size (number of trees).]

The optimal model is achieved at around 860 trees, with a cross-validated mean absolute error of 0.87; the target variable ranges from 1 to 10.

Variable Importance Ranking

Ranks variables according to their contributions to the overall variation in the target variable.

Country Contribution to Risk Score

[Figure: distribution of risk score by country (AT, BE, CA, CH, DE, DK, ES, FR, GB, GR, IE, IT, JP, LU, NL, PT, SE, US).]

Swiss banks tend to be rated low risk.

Bank Specialization

[Figure: distribution by specialization: Specialised Governmental Credit Inst., Savings Bank, Real Estate / Mortgage Bank, Non-banking Credit Institution, Investment Bank/Securities House, Cooperative Bank, Commercial Bank.]

Commercial banks and investment banks are rated higher risk.

Scale: Total Deposits

[Figure: histogram of total deposits, ranging from 0 to 6e+08.]

A step function in the size of the bank.

ROAE

[Figure: histogram of return on average equity (ROAE), ranging roughly from -20 to 40.]

Cost-to-Income Ratio: Impact on Risk Score

[Figure: histogram of the cost-to-income ratio, ranging roughly from 20 to 100.]

A high cost-to-income ratio increases the risk score.

Equity to Total Assets

[Figure: histogram of equity to total assets, ranging roughly from 0 to 15.]

Greater risk is forecast when equity is a large share of total assets.

References

- Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
- Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In L. Saitta (ed.), Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, pp. 148-156.
- Friedman, J. H. (1999). Stochastic gradient boosting. Technical report, Statistics Department, Stanford University.
- Friedman, J. H. (1999). Greedy function approximation: a gradient boosting machine. Technical report, Statistics Department, Stanford University.
- Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The Elements of Statistical Learning. Springer.