Decision Boundary. Hemant Ishwaran and J. Sunil Rao


define impurity using the log-rank test. As in CART, growing a tree by reducing impurity ensures that terminal nodes are populated by individuals with similar behavior. In the case of a survival tree, terminal nodes are composed of patients with similar survival. The terminal node value in a survival tree is the survival function, estimated using those patients within the terminal node. This differs from classification and regression trees, where terminal node values are a single value (the estimated class label or predicted value for the response, respectively). Figure 3 shows an example of a survival tree.

Hemant Ishwaran and J. Sunil Rao

See also Decision Trees, Advanced Techniques in Constructing; Recursive Partitioning

Further Readings

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
LeBlanc, M., & Crowley, J. (1993). Survival trees by goodness of split. Journal of the American Statistical Association, 88, 457-467.
Segal, M. R. (1988). Regression trees for censored data. Biometrics, 44, 35-47.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36, 111-147.

Decision Trees, Advanced Techniques in Constructing

Decision trees such as classification, regression, and survival trees offer the medical decision maker a comprehensive way to calculate predictors and decision rules in a variety of commonly encountered data settings. However, the performance of decision trees on external data sets can sometimes be poor. Aggregating decision trees is a simple way to improve performance, and in some instances aggregated tree predictors can exhibit state-of-the-art performance.

Decision Boundary

Decision trees, by their very nature, are simple and intuitive to understand.
For example, a binary classification tree assigns data by dropping a data point (case) down the tree and moving either left or right through nodes depending on the value of a given variable. The nature of a binary tree ensures that each case is assigned to a unique terminal node. The value for the terminal node (the predicted outcome) defines how the case is classified. By following the path as a case moves down the tree to its terminal node, the decision rule for that case can be read directly off the tree. Such a rule is simple to understand, as it is nothing more than a sequence of simple rules strung together. The decision boundary, on the other hand, is a more abstract concept. Decision boundaries are estimated by a collection of decision rules for cases taken together or, in the case of decision trees, the boundary produced in the predictor space between classes by the decision tree. Unlike decision rules, decision boundaries are difficult to visualize and interpret for data involving more than one or two variables. However, when the data involve only a few variables, the decision boundary is a powerful way to visualize a classifier and to study its performance. Consider Figure 1. On the left-hand side is the classification tree for a prostate data set. Here, the outcome is presence or absence of prostate cancer, and the independent variables are prostate-specific antigen (PSA) and tumor volume, both transformed on the log scale. Each case in the data is classified uniquely depending on the value of these two variables. For example, the leftmost terminal node in Figure 1 is composed of those patients with tumor volumes less than 7.51 and PSA levels less than 2.59 (on the log scale). Terminal node values are assigned by majority voting (i.e., the predicted outcome is the class label with the largest frequency). For this node, there are 5 nondiseased and 1 diseased patients, and thus the predicted class label is nondiseased.
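Reading a decision rule off the tree amounts to chaining simple comparisons. Below is a minimal sketch of this traversal; only the first two splits (tumor volume < 7.51 and PSA < 2.59, on the log scale) are taken from the text, and the remaining branches are hypothetical stand-ins rather than the actual tree of Figure 1:

```python
def classify(log_tumor_vol, log_psa):
    # Drop a case down the tree: each node compares one variable to a
    # threshold and sends the case left or right.
    if log_tumor_vol < 7.51:
        if log_psa < 2.59:
            return "nondiseased"  # leftmost terminal node, by majority vote
        return "diseased"          # illustrative stand-in for deeper splits
    return "diseased"              # illustrative stand-in for deeper splits

print(classify(7.0, 2.0))  # a case with small tumor volume and low PSA
```

The decision rule for the printed case reads directly off the path: tumor volume < 7.51 and PSA < 2.59, hence nondiseased.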
The right-hand side of Figure 1 displays the decision boundary for the tree. The dark-shaded region is the space of all values for PSA and tumor volume that would be classified as nondiseased, whereas the light-shaded regions are those values classified as diseased. Superimposed on the figure, using white and light-gray dots, are the observed data points from the original data. Light-gray points are truly diseased patients, whereas white points are truly nondiseased patients. Most of the light-gray points fall in the light-shaded region of the decision space and, likewise, most of the white points fall in the dark-shaded region of the decision space, thus showing that the classifier is classifying a large fraction of the data correctly. Some data points are misclassified, though. For example, there are several light-gray points in the center of the plot falling in the dark-shaded region. As well, there are four light-gray points with small tumor volumes and PSA values falling in the dark-shaded region. The misclassified data points in the center of the decision space are especially troublesome. These points are being misclassified because the decision space for the tree is rectangular. If the decision boundary were smoother, then these points would not be misclassified. The nonsmooth nature of the decision boundary is a well-known deficiency of classification trees and can seriously degrade performance, especially in complex decision problems involving many variables.

Figure 1   Decision tree (left-hand side) and decision boundary (right-hand side) for prostate cancer data with prostate-specific antigen (PSA) and tumor volume as independent variables (both transformed on the log scale)
Note: Barplots under the terminal nodes of the decision tree indicate the proportion of cases classified as diseased or nondiseased, with the predicted class label determined by majority voting. The decision boundary shows how the tree classifies a new patient based on PSA and tumor volume. Gray-shaded points identify diseased patients, and white points identify nondiseased patients from the data.
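The rectangular form of a tree's decision space can be made concrete by evaluating a tree rule over a grid of predictor values. The rule below keeps only the first two splits from Figure 1 and is an illustrative stand-in for the full tree; `D` marks grid cells classified as diseased:

```python
def diseased(log_tumor_vol, log_psa):
    # Stand-in tree rule: only the first two splits from Figure 1.
    return not (log_tumor_vol < 7.51 and log_psa < 2.59)

rows = []
for psa10 in range(40, 0, -10):  # log PSA = 4.0, 3.0, 2.0, 1.0 (top to bottom)
    row = "".join(
        "D" if diseased(tv, psa10 / 10) else "."  # tv = log tumor volume 4..9
        for tv in range(4, 10)
    )
    rows.append(row)
print("\n".join(rows))
```

The printed grid shows axis-aligned rectangular regions: every boundary between `D` and `.` runs parallel to one of the axes, which is exactly why points near an oblique class boundary get misclassified.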
Instability of Decision Trees

Decision trees, such as classification trees, are known to be unstable. That is, if the original data set is changed (perturbed) in some way, then the classifier constructed from the altered data can be surprisingly different from the original classifier. This is an undesirable property, especially if small perturbations to the data lead to substantial differences. This property can be demonstrated using the prostate data set of Figure 1. However, to show this, it is important to first agree on a method for perturbing the data. One technique that can be used is to employ bootstrap resampling. A bootstrap sample is a special type of resampling procedure. A data point is randomly selected from the data and then returned (i.e., the data are sampled with replacement). This process is repeated n times, where n is the sample size. The resulting bootstrap sample consists of n data points but will contain replicated data. On average, a bootstrap sample draws only approximately 63% of the original data.

A total of 1,000 different bootstrap samples of the prostate data were drawn. A classification tree was calculated for each of these 1,000 samples. The top panel of plots in Figure 2 shows decision boundaries for four of these trees (bootstrap samples 2, 5, 25, and 1,000; note that Tree 1 is the classification tree from Figure 1, based on the original data). One can see clearly that the decision spaces differ quite substantially, thus providing clear evidence of the instability. It is also interesting to note how some of the trees have better decision spaces than the original tree (recall Figure 1; also see Tree 1 in Figure 2). For example, Trees 2, 5, 25, and 1,000 identify some or all of the four problematic light-gray points appearing within the lower quadrant of the dark-shaded region of the original decision space. As well, Trees 5, 25, and 1,000 identify some of the problematic light-gray points appearing within the center of the original decision space. An important lesson that emerges from this example is not only that decision trees can be unstable but also that trees constructed from different perturbations of the original data can produce decision boundaries that in some instances have better behavior than the original decision space (over certain regions). Thus, it stands to reason that, if one could combine many such trees, the classifier formed by aggregating the trees might have better overall performance. In other words, the whole may be greater than the sum of the parts, and one may be able to capitalize on the inherent instability, using aggregation to produce more accurate classifiers.
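The bootstrap draw is easy to simulate. A sample of n points drawn with replacement contains, in expectation, a fraction 1 - (1 - 1/n)^n of the distinct original points, which approaches 1 - 1/e, or about 63%, as n grows (a quick check on hypothetical data):

```python
import random

def bootstrap_sample(data, rng):
    # Draw len(data) points with replacement: each point is "selected
    # and then returned" before the next draw.
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)
data = list(range(10_000))  # hypothetical data set of n = 10,000 cases
sample = bootstrap_sample(data, rng)

# Fraction of distinct original points that appear in the sample;
# expected value is 1 - (1 - 1/n)**n, close to 1 - 1/e (about 0.63).
frac = len(set(sample)) / len(data)
print(round(frac, 2))
```

The sample has the full size n, but roughly a third of it is duplicates of points already drawn.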
Bagging

This idea in fact is the basis for a powerful method referred to as bootstrap aggregation, or simply bagging. Bagging can be used for many kinds of predictors, not just decision trees. The basic premise for bagging is that, if the underlying predictor is unstable, then aggregating the predictor over multiple bootstrap samples will produce a more accurate, and more stable, procedure. To bag a classification tree, the procedure is as follows (bagging can be applied to regression trees and survival trees in a similar fashion):

1. Draw a bootstrap sample of the original data.
2. Construct a classification tree using the data from Step 1.
3. Repeat Steps 1 and 2 many times, independently.
4. Calculate an aggregated classifier from the trees formed in Steps 1 to 3, using majority voting to combine their predictions.

The bottom panel of plots in Figure 2 shows the decision boundary for the bagged classifier as a function of the number of trees (based on the same prostate data as before). The first plot is the original classifier based on all the data (Tree 1). The second plot is the bagged classifier composed of Tree 1 and the bootstrap tree derived using the first bootstrap sample. The third plot is the bagged classifier using Tree 1 and the first four bootstrapped trees, and so forth. As the number of trees increases, the bagged classifier becomes more refined. Even the decision boundary for the bagged classifier using only five trees (third plot) is substantially smoother than the original classifier and is able to better classify problematic cases. By 1,000 trees (last plot), the bagged classifier's decision boundary is fully defined. The accuracy of the bagged classifier is substantially better than any single bootstrapped tree. Table 1 records the misclassification (error) rate for the bagged predictor against the averaged error rate for the 1,000 bootstrapped trees.
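Steps 1 to 4 of the bagging procedure can be sketched in a few lines. To keep the sketch short, a decision stump (a one-split tree) stands in for a full classification tree, and the two-class data are synthetic Gaussian clouds, not the prostate data:

```python
import numpy as np

def fit_stump(X, y):
    # One-split "tree": choose the feature, threshold, and polarity with
    # the lowest training error (a stand-in for a full classification tree).
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (0, 1):
                err = np.mean(((X[:, j] > t).astype(int) ^ pol) != y)
                if err < best_err:
                    best_err, best = err, (j, t, pol)
    return best

def predict_stump(stump, X):
    j, t, pol = stump
    return (X[:, j] > t).astype(int) ^ pol

rng = np.random.default_rng(0)
# Synthetic two-class data: two shifted Gaussian clouds.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Steps 1-3: bootstrap, fit, repeat. Step 4: aggregate by majority vote.
stumps = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    stumps.append(fit_stump(X[idx], y[idx]))
votes = np.stack([predict_stump(s, X) for s in stumps])
bagged = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (bagged == y).mean())
```

Each bootstrap fit sees a different perturbation of the data, so the fitted splits differ from tree to tree; the majority vote smooths these differences out, which is the mechanism behind the smoother boundaries in Figure 2.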
The first column is the overall error rate, the second column is the error rate for diseased patients, and the third column is the error rate for nondiseased patients. Error rates were calculated using out-of-bag data. Recall that each bootstrap sample uses, on average, approximately 63% of the original data. The remaining 37% of the data is called out-of-bag and serves as test data, as it is not used in constructing the tree. Table 1 shows that the bagged classifier is substantially more accurate than any given tree.

Figure 2   The top row shows the decision boundary for specific bootstrapped trees (1,000 trees were used in total), and the bottom row shows successively aggregated (bagged) decision trees
Note: Bagged trees are more robust to noise (stable) because they utilize information from more than one tree. The most stable bagged tree is the one on the extreme right-hand side, which shows the decision boundary using 1,000 trees.

Random Forests

Random forests is a refinement of bagging that can yield even more accurate predictors. The method works like bagging by using bootstrapping and aggregation but includes an additional step that is designed to encourage independence of the trees. This effect is often most pronounced when the data contain many variables. To create a random forest classifier, the procedure is as follows (regression forests and random survival forests can be constructed using the same principle):

1. Draw a bootstrap sample of the original data.
2. Construct a classification tree using the data from Step 1. For each node in the tree, determine the optimal split for the node using M randomly selected independent variables.
3. Repeat Steps 1 and 2 many times, independently.
4. Calculate an aggregated classifier from the trees formed in Steps 1 to 3, using majority voting to combine their predictions.

Step 2 is the crucial step distinguishing forests from bagging. Unlike bagging, each bootstrapped tree is constructed using different variables, and not all variables are used (at most M are considered at each node in the tree-growing process). Considerable empirical evidence has shown that forests can be substantially more accurate because of this feature.
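The extra randomization in Step 2 can be sketched by restricting each base learner's candidate split variables to a random subset of size M. As before, a decision stump stands in for a full tree (in a real forest the subset is redrawn at every node, not once per tree), and the data are synthetic with only two of five variables carrying signal:

```python
import numpy as np

def fit_stump(X, y, features):
    # Best single split, with candidate variables restricted to `features`,
    # the random subset of size M that distinguishes forests from bagging.
    best_err, best = np.inf, None
    for j in features:
        for t in np.unique(X[:, j]):
            for pol in (0, 1):
                err = np.mean(((X[:, j] > t).astype(int) ^ pol) != y)
                if err < best_err:
                    best_err, best = err, (j, t, pol)
    return best

rng = np.random.default_rng(1)
n, p, M = 200, 5, 2  # hypothetical sizes: n cases, p variables, M tried per fit
X = rng.normal(0, 1, (n, p))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only variables 0 and 1 carry signal

stumps = []
for _ in range(50):
    idx = rng.integers(0, n, size=n)               # Step 1: bootstrap sample
    feats = rng.choice(p, size=M, replace=False)   # Step 2: M random variables
    stumps.append(fit_stump(X[idx], y[idx], feats))

# Step 4: aggregate by majority vote.
votes = np.stack([(X[:, j] > t).astype(int) ^ pol for j, t, pol in stumps])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (forest_pred == y).mean())
```

Because each fit sees a different variable subset, the base learners are less correlated with one another than bagged trees would be, which is the source of the accuracy gains the entry describes.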
Table 1   Misclassification (error) rate (in percentage) for the bagged classifier (1,000 trees) and the single-tree classifier

Classifier     All     Diseased     Nondiseased
Bagged tree    27.2    2.           25.9
Single tree    3.9     3.7          33.0

Boosting

Boosting is another related technique that has some similarities to bagging, although its connection is not as direct. It too can produce accurate classifiers through a combination of reweighting and aggregation. To create a boosted tree classifier, the following procedure can be used (although other methods are also available in the literature):

1. Draw a bootstrap sample from the original data, giving each observation an equal chance (i.e., weight) of appearing in the sample.
2. Build a classification tree using the bootstrap data and classify each of the observations, keeping track of which ones are classified incorrectly or correctly.
3. For those observations that were incorrectly classified, increase their weight, and correspondingly decrease the weight assigned to observations that were correctly classified.
4. Draw another bootstrap sample using the newly updated observation weights (i.e., those observations that were previously incorrectly classified will have a greater chance of appearing in the next bootstrap sample).
5. Repeat Steps 2 to 4 many times.
6. Calculate an aggregated classifier from the trees formed in Steps 1 to 5, using majority voting to combine their predictions.

The idea of reweighting observations adaptively is key to boosting's performance gains. In a sense, the algorithm tends to focus more and more on observations that are difficult to classify. There has been much work in the literature on studying the operating characteristics of boosting, primarily motivated by the fact that the approach can produce significant gains in prediction accuracy over a single tree classifier. Again, as with bagging, boosting is a general algorithm that can be applied to more than tree-based classifiers. While these aggregation algorithms were initially thought to destroy the simple, interpretable structure (topology) produced by a single tree classifier, recent work has shown that, in fact, treelike structures (with respect to the decision boundary) are often maintained, and interpretable structure about how the predictors interact with one another can still be gleaned.

Hemant Ishwaran and J. Sunil Rao

See also Decision Tree: Introduction; Recursive Partitioning

Further Readings

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans (Society for Industrial and Applied Mathematics CBMS-NSF Monographs, No. 38). Philadelphia: SIAM.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine learning: Proceedings of the 13th International Conference (pp. 148-156). San Francisco: Morgan Kaufmann.
Ishwaran, H., Kogalur, U. B., Blackstone, E. H., & Lauer, M. S. (2008). Random survival forests. Annals of Applied Statistics, 2(3), 841-860.
Rao, J. S., & Potts, W. J. E. (1997). Visualizing bagged decision trees. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining. Newport Beach, CA: AAAI Press.

Decision Trees, Construction

A decision model is a mathematical formulation of a decision problem that compares alternative choices in a formal process by calculating their expected outcome. The decision tree is a graphical representation of a decision model that represents the basic elements of the model. The key elements of the model are the possible choices, information about chance events, and the preferences of the decision maker. The choices are the alternatives being compared in the decision model. The information consists of an enumeration of the events that may occur consequent to the choice and the probabilities of each of their outcomes. Preferences are