Decision Tree. Machine Learning. Hamid Beigy. Sharif University of Technology. Fall 1396


Table of contents
1 Introduction
2 Decision tree classification
3 Building decision trees
4 ID3 Algorithm

Introduction
1 The decision tree is a classic and natural model of learning.
2 It is closely related to the notion of divide and conquer.
3 A decision tree partitions the instance space into axis-parallel regions, each labeled with a class value.
4 Why decision trees?
  Interpretable; popular in medical applications because they mimic the way a doctor thinks.
  Can model discrete outcomes nicely.
  Can be very powerful; can be as complex as you need them to be.
  C4.5 and CART decision trees are very popular.

Decision tree classification
1 Structure of decision trees
  Each internal node tests an attribute.
  Each branch corresponds to an attribute value.
  Each leaf node assigns a classification.
2 Decision tree for PlayTennis
  Outlook? [9+,5-]
    Sunny [2+,3-]: Humidity?
      High: No [0+,3-]
      Normal: Yes [2+,0-]
    Overcast [4+,0-]: Yes
    Rain [3+,2-]: Wind?
      Strong: No [0+,2-]
      Light: Yes [3+,0-]
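To make this structure concrete, here is a minimal Python sketch (not from the slides; the dictionary layout and names are my own) that encodes the PlayTennis tree above and classifies a new example by following the matching branches from the root to a leaf:

```python
# A minimal sketch: internal nodes are dicts that name the attribute they test
# and map each attribute value to a subtree; leaves are plain class labels.
playtennis_tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {
            "attribute": "Humidity",
            "branches": {"High": "No", "Normal": "Yes"},
        },
        "Overcast": "Yes",
        "Rain": {
            "attribute": "Wind",
            "branches": {"Strong": "No", "Light": "Yes"},
        },
    },
}

def classify(tree, example):
    """Follow the branch matching the example's attribute value until a leaf."""
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["attribute"]]]
    return tree

print(classify(playtennis_tree,
               {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Strong"}))  # Yes
```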

Decision surface
The idea of binary classification trees is not unlike that of the histogram classifier: partition the feature space into cells and label each cell. (Figure 9.2: decision surfaces of (a) a histogram classifier, (b) a linear classifier, and (c) a tree classifier on a two-class point cloud.)

Building decision trees
1 Decision trees recursively subdivide the feature space. The growing process recursively splits existing regions into two smaller regions (binary splits); for simplicity, the splits are perpendicular to one of the feature axes. A large tree grown this way has poor generalization characteristics, so it is then pruned to avoid overfitting. (Figure 9.3: growing a recursive binary tree on X = [0, 1]^2.)
2 The test variable specifies the division. Often the splitting process is based on the training data and is designed to separate data with different labels as much as possible; the splits, and hence the tree structure itself, are then data dependent, which causes major difficulties for the analysis (and tuning) of these methods. Alternatively, the splits can be chosen independently of the training data, as in Dyadic Decision Trees and Recursive Dyadic Partitions (RDPs); see the sketch after this slide. Any decision tree can be associated with a partition of the input space X and vice versa, and an RDP is most efficiently described by a binary tree: each leaf corresponds to a cell of the partition, the orientation of the dyadic split alternates between the levels of the tree, and cells are always split at the midpoint along one coordinate axis, so the side lengths of all cells are dyadic (i.e., powers of 2). (Figure 9.4: example of Recursive Dyadic Partition growing on X = [0, 1]^2.)
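As an illustration of the data-independent construction (a sketch under my own assumptions, not code from the text), the following grows a recursive dyadic partition of X = [0, 1]^2 to a fixed depth, always splitting a cell at its midpoint and alternating the split axis between levels:

```python
def recursive_dyadic_partition(cell=((0.0, 1.0), (0.0, 1.0)), depth=0, max_depth=3):
    """Return the leaf cells of a depth-limited RDP of [0,1]^2.

    Each cell is ((x_lo, x_hi), (y_lo, y_hi)); every split is at the midpoint of
    one coordinate, and the split axis alternates between levels, so all cell
    side lengths are dyadic.
    """
    if depth == max_depth:
        return [cell]
    axis = depth % 2                      # alternate: x at even depths, y at odd
    lo, hi = cell[axis]
    mid = (lo + hi) / 2.0
    left, right = list(cell), list(cell)
    left[axis], right[axis] = (lo, mid), (mid, hi)
    return (recursive_dyadic_partition(tuple(left), depth + 1, max_depth)
            + recursive_dyadic_partition(tuple(right), depth + 1, max_depth))

cells = recursive_dyadic_partition(max_depth=3)
print(len(cells))   # 8 leaf cells, each a dyadic rectangle
print(cells[0])     # ((0.0, 0.25), (0.0, 0.5)): the first cell after three splits
```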

Building decision trees (example)
Training examples for the concept PlayTennis:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
1    Sunny     Hot          High      Light   No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Light   Yes
4    Rain      Mild         High      Light   Yes
5    Rain      Cool         Normal    Light   Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Light   No
9    Sunny     Cool         Normal    Light   Yes
10   Rain      Mild         Normal    Light   Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Light   Yes
14   Rain      Mild         High      Strong  No

ID3 builds the tree (Build-DT) using Gain(·). How will ID3 construct a decision tree?

Building decision trees (cont.)
How to build a decision tree?
1 Start at the top of the tree.
2 Grow it by splitting attributes one by one.
3 Assign leaf nodes.
4 When we get to the bottom, prune the tree to prevent overfitting.
How do we choose a test variable for an internal node? Choosing different impurity measures results in different algorithms; we describe ID3. (Figure: node impurity measures for two-class classification as a function of the proportion p in class 2 — misclassification error, Gini index, and entropy, with the cross-entropy scaled to pass through (0.5, 0.5).)
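To make the impurity curves in that figure concrete, here is a small sketch (my own, not part of the slides) of the three two-class impurity measures as functions of the proportion p of class 2:

```python
import math

def misclassification_error(p):
    """1 - max(p, 1 - p): error rate when predicting the majority class."""
    return 1.0 - max(p, 1.0 - p)

def gini_index(p):
    """2 p (1 - p) for two classes."""
    return 2.0 * p * (1.0 - p)

def entropy(p):
    """-p log2 p - (1 - p) log2 (1 - p), in bits (0 at p = 0 or 1)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

for p in (0.1, 0.3, 0.5):
    print(p, round(misclassification_error(p), 3),
          round(gini_index(p), 3), round(entropy(p), 3))
```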

Building decision trees (cont.)
ID3 uses information gain to choose a test variable for an internal node. The information gain of D relative to attribute A is the expected reduction in entropy due to splitting on A:
  Gain(D, A) = H(D) - Σ_{v ∈ values(A)} (|D_v| / |D|) H(D_v)
where D_v = {x ∈ D : x.A = v} is the set of examples in D for which attribute A has value v.
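A minimal sketch of these two quantities (my own code; the example format, a list of (attribute-dict, label) pairs, is an assumption):

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum_c p_c log2 p_c over the class labels in D."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute):
    """Gain(D, A) = H(D) - sum_v (|D_v|/|D|) H(D_v), examples = [(attrs, label), ...]."""
    labels = [label for _, label in examples]
    gain = entropy(labels)
    for v in {attrs[attribute] for attrs, _ in examples}:
        subset = [label for attrs, label in examples if attrs[attribute] == v]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

toy = [({"Wind": "Strong"}, "No"), ({"Wind": "Light"}, "Yes"), ({"Wind": "Light"}, "Yes")]
print(round(information_gain(toy, "Wind"), 3))  # 0.918: splitting on Wind makes both subsets pure
```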

ID3 Algorithm
Consider the PlayTennis training examples above.
  H(D) = -(9/14) log(9/14) - (5/14) log(5/14) = 0.94 bits
  H(D_{Humidity=High}) = -(3/7) log(3/7) - (4/7) log(4/7) = 0.985 bits
  H(D_{Humidity=Normal}) = -(6/7) log(6/7) - (1/7) log(1/7) = 0.592 bits
  Gain(D, Humidity) = 0.94 - (7/14)(0.985) - (7/14)(0.592) = 0.151 bits
  Gain(D, Wind) = 0.94 - (8/14)(0.811) - (6/14)(1.0) = 0.048 bits

ID3 Algorithm (cont.)
Gains for the whole training set D:
  Gain(D, Humidity) = 0.151 bits
  Gain(D, Wind) = 0.048 bits
  Gain(D, Temperature) = 0.029 bits
  Gain(D, Outlook) = 0.246 bits
Outlook has the highest information gain, so it is selected as the root attribute. Its branches partition the examples into Sunny [2+, 3-], Overcast [4+, 0-], and Rain [3+, 2-].

ID3 Algorithm (cont.)
For the Sunny branch (D_Sunny, [2+, 3-]):
  Gain(D_Sunny, Humidity) = 0.97 bits
  Gain(D_Sunny, Wind) = 0.02 bits
  Gain(D_Sunny, Temperature) = 0.57 bits
So Humidity is selected below Sunny; Overcast is pure (Yes), and Wind is selected below Rain. The resulting decision tree is the PlayTennis tree shown earlier:
  Outlook? [9+,5-]
    Sunny [2+,3-]: Humidity?  (High: No [0+,3-], Normal: Yes [2+,0-])
    Overcast [4+,0-]: Yes
    Rain [3+,2-]: Wind?  (Strong: No [0+,2-], Light: Yes [3+,0-])
Incremental learning of decision trees: ID3 cannot be trained incrementally; ID4, ID5, and ID5R are examples of incremental induction of decision trees.
Reading: P.E. Utgoff, "Incremental Induction of Decision Trees", Machine Learning, Vol. 4, pp. 161-186, 1989.
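Putting the pieces together, here is a compact ID3 sketch (my own code, following the procedure on these slides) that recomputes the gains and rebuilds this tree from the PlayTennis examples:

```python
import math
from collections import Counter

# (Outlook, Temperature, Humidity, Wind) -> PlayTennis, from the table above.
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
DATA = [
    ("Sunny", "Hot", "High", "Light", "No"),    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Light", "Yes"), ("Rain", "Mild", "High", "Light", "Yes"),
    ("Rain", "Cool", "Normal", "Light", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Light", "No"),
    ("Sunny", "Cool", "Normal", "Light", "Yes"), ("Rain", "Mild", "Normal", "Light", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Light", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
EXAMPLES = [(dict(zip(ATTRS, row[:-1])), row[-1]) for row in DATA]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(examples, attribute):
    g = entropy([y for _, y in examples])
    for v in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == v]
        g -= (len(subset) / len(examples)) * entropy(subset)
    return g

def id3(examples, attributes):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                  # pure node -> leaf
        return labels[0]
    if not attributes:                         # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))
    tree = {"attribute": best, "branches": {}}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        tree["branches"][v] = id3(subset, [a for a in attributes if a != best])
    return tree

print(round(gain(EXAMPLES, "Outlook"), 3))  # ~0.247 (the slide rounds to 0.246): the largest gain
print(id3(EXAMPLES, ATTRS))                 # the tree shown on this slide
```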

Inductive Bias in ID3
Types of biases
1 Preference (search) bias: puts a priority on choosing a hypothesis.
2 Language bias: puts a restriction on the set of hypotheses considered.
Which bias is better?
1 Preference bias is more desirable.
2 Because the learner works within a complete hypothesis space that is assured to contain the unknown concept.
Inductive bias of ID3
1 Shorter trees are preferred over longer trees.
2 Occam's razor: prefer the simplest hypothesis that fits the data.
3 Trees that place high-information-gain attributes close to the root are preferred over those that do not.

Overfitting in ID3
How can we avoid overfitting?
1 Prevention
  Stop training (growing) before the tree reaches the point where it overfits.
  Select attributes that are relevant (i.e., will be useful in the decision tree); this requires some predictive measure of relevance.
2 Avoidance
  Allow the tree to overfit, then improve its generalization capability.
  Hold out a validation set.
3 Detection and recovery
  Let the problem happen, detect when it does, and recover afterward.
  Build the model, then remove (prune) the elements that contribute to overfitting.
How do we select the best tree?
1 Training and validation set: use a separate set of examples (distinct from the training set) for evaluation.
2 Statistical test: use all data for training, but apply a statistical test to estimate the overfitting.
3 Complexity measure: define a measure of tree complexity and halt growing when this measure is minimized.

Pruning algorithms
Reduced-Error Pruning (a cross-validation approach using training and validation sets):
Reduced-Error-Pruning(D)
  Partition D into D_train (training/growing set) and D_validation (validation/pruning set).
  Build tree T using ID3 on D_train.
  Until accuracy on D_validation decreases, do:
    For each non-leaf node candidate in T:
      Temp[candidate] ← Prune(T, candidate), i.e., replace the subtree rooted at candidate with a leaf labeled with the majority label of its associated examples.
      Accuracy[candidate] ← Test(Temp[candidate], D_validation)
    T ← the Temp[candidate] with the best accuracy (greedy: best increase).
  Return the pruned tree T.
(Figures: as the number of nodes in the tree grows, the training error keeps decreasing while the true error eventually increases; likewise, accuracy on the training data keeps improving while accuracy on the test data falls off, and the post-pruned tree retains the higher test accuracy.)
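A minimal sketch of the greedy pruning loop above (my own code; the nested-dict tree layout, the stored majority labels, and the tiny validation set are illustrative assumptions, not the lecturer's implementation):

```python
import copy

# Internal nodes: {"attribute", "branches", "majority"}; leaves: plain labels.
# "majority" is the majority label of the training examples reaching the node;
# it becomes the leaf label if that node is pruned.
tree = {
    "attribute": "Outlook", "majority": "Yes",
    "branches": {
        "Sunny": {"attribute": "Humidity", "majority": "No",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"attribute": "Wind", "majority": "Yes",
                 "branches": {"Strong": "No", "Light": "Yes"}},
    },
}

# A small held-out validation (pruning) set of (attributes, label) pairs; illustrative only.
validation = [
    ({"Outlook": "Sunny", "Humidity": "High", "Wind": "Light"}, "No"),
    ({"Outlook": "Rain", "Humidity": "Normal", "Wind": "Light"}, "Yes"),
    ({"Outlook": "Overcast", "Humidity": "High", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
]

def classify(node, x):
    while isinstance(node, dict):
        node = node["branches"].get(x[node["attribute"]], node["majority"])
    return node

def accuracy(node, examples):
    return sum(classify(node, x) == y for x, y in examples) / len(examples)

def internal_nodes(node, path=()):
    """Yield the path (sequence of branch values) to every internal node."""
    if isinstance(node, dict):
        yield path
        for value, child in node["branches"].items():
            yield from internal_nodes(child, path + (value,))

def pruned_copy(tree, path):
    """Return a copy of the tree with the node at `path` replaced by its majority leaf."""
    new_tree = copy.deepcopy(tree)
    if not path:
        return new_tree["majority"]
    node = new_tree
    for value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1]] = node["branches"][path[-1]]["majority"]
    return new_tree

def reduced_error_pruning(tree, validation):
    best_acc = accuracy(tree, validation)
    while isinstance(tree, dict):
        # Greedily try pruning each internal node and keep the best candidate.
        candidates = [pruned_copy(tree, path) for path in internal_nodes(tree)]
        cand = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(cand, validation) < best_acc:   # pruning now hurts -> stop
            break
        tree, best_acc = cand, accuracy(cand, validation)
    return tree

print(reduced_error_pruning(tree, validation))
# With this toy validation set, the Wind test under Rain is pruned away
# while the Humidity test under Sunny survives.
```

In real use the tree would be grown on D_train with ID3, recording each node's majority label during growing.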

Pruning algorithms
1 Reduced Error Pruning
2 Pessimistic Error Pruning
3 Minimum Error Pruning
4 Critical Value Pruning
5 Cost-Complexity Pruning
Reading
1 F. Esposito, D. Malerba, and G. Semeraro, "A Comparative Analysis of Methods for Pruning Decision Trees", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 19, No. 5, pp. 476-491, May 1997.
2 S. R. Safavian and D. Landgrebe, "A Survey of Decision Tree Classifier Methodology", IEEE Trans. on Systems, Man, and Cybernetics, Vol. 21, No. 3, pp. 660-674, 1991.

Continuous Valued Attributes
Two methods for handling continuous attributes:
1 Discretization (e.g., histogramming): break real-valued attributes into ranges in advance.
  Example: high = {Temp > 35C}, med = {10C < Temp ≤ 35C}, low = {Temp ≤ 10C}.
2 Thresholded splits: a test A ≤ a produces the subsets A ≤ a and A > a.
Information gain is calculated the same way as for discrete splits.
How do we find the split with the highest gain? Sort by the attribute and consider thresholds midway between consecutive examples whose labels differ:
  length:     10   15   21   28   32   40   50
  label:       -    +    +    -    +    +    -
  thresholds: 12.5, 24.5, 30, 45
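A small sketch of that threshold search (my own code, using the toy data above): sort by the attribute, form candidate thresholds midway between adjacent examples with different labels, and keep the one with the highest information gain:

```python
import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(points):
    """points: list of (value, label). Return (threshold, gain) maximizing information gain."""
    points = sorted(points)
    base = entropy([y for _, y in points])
    best = (None, -1.0)
    for (v1, y1), (v2, y2) in zip(points, points[1:]):
        if y1 == y2:
            continue                      # only boundaries where the label changes matter
        t = (v1 + v2) / 2.0
        left = [y for v, y in points if v <= t]
        right = [y for v, y in points if v > t]
        g = base - (len(left) / len(points)) * entropy(left) \
                 - (len(right) / len(points)) * entropy(right)
        if g > best[1]:
            best = (t, g)
    return best

data = [(10, "-"), (15, "+"), (21, "+"), (28, "-"), (32, "+"), (40, "+"), (50, "-")]
# Evaluates exactly the candidate thresholds 12.5, 24.5, 30, 45 from the slide;
# 12.5 is returned (threshold 45 achieves the same gain).
print(best_threshold(data))
```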

Missing Data
Problem: what if some examples are missing values of attribute A? Unknown attribute values can occur during training or testing (e.g., <Fever = Normal, Blood-Test = ?, ...>), sometimes because a measurement has low priority or its cost is too high.
  Training: evaluate Gain(D, A) when for some x ∈ D the value of A is not given.
  Testing: classify a new example x without knowing the value of A.
Solution: incorporate a guess about the missing value into the calculation of Gain(D, A).
Consider the PlayTennis dataset in which the Humidity value of day 8 (Sunny, Mild, ???, Light, No) is missing. What is the resulting decision tree?
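One simple way to incorporate such a guess (a sketch of one common strategy, not necessarily the one intended in the lecture; C4.5, for instance, instead distributes such examples fractionally across branches) is to fill the missing value with the most common observed value among examples of the same class before computing Gain(D, A):

```python
from collections import Counter

MISSING = "???"

def fill_missing(examples, attribute):
    """Replace missing values of `attribute` with the most common observed value
    among examples of the same class (falling back to the overall mode)."""
    observed = [x[attribute] for x, _ in examples if x[attribute] != MISSING]
    overall_mode = Counter(observed).most_common(1)[0][0]
    filled = []
    for x, y in examples:
        if x[attribute] == MISSING:
            same_class = [v[attribute] for v, label in examples
                          if label == y and v[attribute] != MISSING]
            guess = Counter(same_class).most_common(1)[0][0] if same_class else overall_mode
            x = {**x, attribute: guess}
        filled.append((x, y))
    return filled

# A tiny illustrative subset: the last example mimics day 8 with Humidity missing.
examples = [
    ({"Humidity": "High"}, "No"), ({"Humidity": "High"}, "No"),
    ({"Humidity": "Normal"}, "Yes"), ({"Humidity": MISSING}, "No"),
]
print(fill_missing(examples, "Humidity"))   # the missing value is guessed as "High"
```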

Attributes with Many Values
Problem: if an attribute has many values (such as Date), Gain(·) will select it. (Why? Splitting on it yields many small, nearly pure subsets, driving the conditional entropy toward zero.)
One approach: use GainRatio instead of Gain:
  Gain(D, A) = H(D) - Σ_{v ∈ values(A)} (|D_v| / |D|) H(D_v)
  GainRatio(D, A) = Gain(D, A) / SplitInformation(D, A)
  SplitInformation(D, A) = -Σ_{v ∈ values(A)} (|D_v| / |D|) log(|D_v| / |D|)
SplitInformation grows with the number of values of A, i.e., it penalizes attributes with more values.
What is its inductive bias? A preference bias (for a lower branching factor) expressed via GainRatio(·).
An alternative attribute-selection measure is the Gini index.
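A small sketch (my own code) of GainRatio versus Gain on a toy set where a Date-like attribute takes a distinct value on every example:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(examples, attribute):
    g = entropy([y for _, y in examples])
    for v in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == v]
        g -= (len(subset) / len(examples)) * entropy(subset)
    return g

def split_information(examples, attribute):
    """Entropy of the partition induced by the attribute's values."""
    return entropy([x[attribute] for x, _ in examples])

def gain_ratio(examples, attribute):
    return gain(examples, attribute) / split_information(examples, attribute)

# "Date" splits 4 examples into 4 singletons: maximal gain, but also maximal
# split information, so its gain ratio is lower than Wind's.
examples = [({"Date": d, "Wind": w}, y) for d, w, y in
            [("d1", "Strong", "No"), ("d2", "Light", "Yes"),
             ("d3", "Light", "Yes"), ("d4", "Strong", "No")]]
for a in ("Date", "Wind"):
    print(a, round(gain(examples, a), 3), round(gain_ratio(examples, a), 3))
```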

Handling Attributes With Different Costs
Problem: in some learning tasks the instance attributes may have associated measurement costs.
Solutions: replace the gain with a cost-sensitive selection measure.
1 Extended ID3: Gain(S, A) / Cost(A)
2 Tan and Schlimmer: Gain²(S, A) / Cost(A)
3 Nunez: (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] is a constant.
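A tiny sketch (my own) of the three cost-sensitive measures, comparing a cheap, weakly informative attribute with an expensive, more informative one:

```python
def extended_id3(gain, cost):
    return gain / cost

def tan_schlimmer(gain, cost):
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):
    """w in [0, 1] controls how strongly the cost is penalized."""
    return (2 ** gain - 1) / (cost + 1) ** w

# (gain=0.15, cost=1) vs. (gain=0.25, cost=10): all three measures favor the cheap test here.
for name, f in [("extended ID3", extended_id3),
                ("Tan & Schlimmer", tan_schlimmer),
                ("Nunez", nunez)]:
    print(name, round(f(0.15, 1.0), 3), round(f(0.25, 10.0), 3))
```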

Regression Tree
In a regression tree the leaves carry numerical estimates, and the appropriate impurity measure is the mean square error: the goodness of a split is measured by the mean square error of the targets of the examples reaching a node, taken around the node's estimated value.
(Figure: a model-tree node testing A ≤ A1, with the True branch using the local model F1(A - A1) and the False branch using F2(A - A1).)
References
1 L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Belmont, CA: Wadsworth International Group, 1984.
2 D. Malerba, F. Esposito, M. Ceci, and A. Appice, "Top-Down Induction of Model Trees with Regression and Splitting Nodes", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 25, No. 5, pp. 612-625, May 2004.
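A short sketch (my own code) of the mean-square-error split criterion on a toy one-dimensional regression problem, where the goodness of a threshold split is the weighted MSE of the two children around their mean estimates:

```python
def mse(values):
    """Mean square error around the node's estimate (the mean of its targets)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_mse(points, threshold):
    """Weighted MSE of the two subsets produced by the test x <= threshold."""
    left = [y for x, y in points if x <= threshold]
    right = [y for x, y in points if x > threshold]
    n = len(points)
    return (len(left) / n) * mse(left) + (len(right) / n) * mse(right)

# Toy 1-D regression data: (x, y) pairs; splitting at x <= 3 separates two flat regions.
points = [(1, 1.0), (2, 1.1), (3, 0.9), (4, 5.0), (5, 5.2), (6, 4.8)]
print(round(split_mse(points, 3), 4))   # small: each child is almost constant
print(round(split_mse(points, 1), 4))   # much larger: the right child mixes both regions
```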

Other types of decision trees
Univariate trees: the test at each internal node uses only one of the input attributes.
Multivariate trees: the test at each internal node can use all input attributes.
  For example, consider a data set with numerical attributes. The test can be a weighted linear combination of some input attributes: if X = (x1, x2) are the input attributes, then f(X) = w0 + w1 x1 + w2 x2 can be used as the test at an internal node, branching on f(X) > 0.
  (Figure: a node testing (w0 + w1 x1 + w2 x2) > 0, with True and False branches leading to YES and NO leaves.)
Reading
C. E. Brodley and P. E. Utgoff, "Multivariate Decision Trees", Machine Learning, Vol. 19, pp. 45-77, 1995.
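A minimal sketch (my own; the weights are illustrative) of a multivariate node whose test is the weighted linear combination f(X) = w0 + w1 x1 + w2 x2 > 0:

```python
def linear_test(weights, x):
    """Multivariate node test: w0 + w1*x1 + ... + wd*xd > 0."""
    w0, *ws = weights
    return w0 + sum(w * xi for w, xi in zip(ws, x)) > 0

def classify(x, weights=(-1.0, 1.0, 1.0)):
    """One multivariate internal node with YES/NO leaves (illustrative weights)."""
    return "YES" if linear_test(weights, x) else "NO"

print(classify((0.8, 0.5)))   # -1.0 + 0.8 + 0.5 > 0  -> YES
print(classify((0.2, 0.3)))   # -1.0 + 0.2 + 0.3 <= 0 -> NO
```

Unlike a univariate test, the resulting decision boundary (x1 + x2 = 1 here) is oblique rather than axis-parallel.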

Other types of decision trees Decision Trees ariate Trees univariate trees, the test at each internal node just uses only one of input 26 ttributes.! Univariate Trees ivariate Univariate " Trees In univariate trees trees, the test at each internal node just uses only one of input multivariate In attributes. univariate trees, the trees, test the at test each atinternal each internal node can nodeuse justall uses input onlyattributes. one of input or! example: Multivariate attributes. Consider Treesa data set with numerical attributes. # The " test In can multivariate be made trees, using the the test weighted at each linear internal combination node can use of all some input input attributes. attributes. Multivariate trees or example " For X=(x example: Consider a data set with numerical attributes. In multivariate 1,x 2 ) be the input attributes. Let f (X)=w # The test can trees, be made the using test the at weighted each internal linear combination node 0 +w can 1 x of some use 1 +w 2 x all 2 can be used for input input attributes. st at an internal node. Such as f (x) > 0. attributes. " For example X=(x 1,x 2 ) be the input attributes. Let f (X)=w 0 +w 1 x 1 +w 2 x 2 can be used for test at an internal node. Such as f (x) > 0. (w 0 +w 1 x 1 +w 2 x 2 )>0 (w 0 +w 1 x 1 +w 2 x 2 )>0 Decision Trees True True False False YES YES [11+, NO2-] [11+, NO2-] ing! Reading. E. Brodley " C. E. Brodley and P. E. Utgoff, Multivariate Vol. 19, References and P. E. Utgoff, Multivariate Decision Trees, Machine Learning, Vol. 19, p. 45-77, pp. 1995. 45-77, 1995. Machine Learning Machine Learning Hamid Beigy (Sharif University of Technology) Decision Tree Fall 1396 22 / 24