Decision Tree Learning


CMPT 882 Machine Learning
Lecture Scribe for Week 7, February 20th
By: Mona Vajihollahi, mvajihol@sfu.ca

Overview:
  Introduction
  Decision Tree Hypothesis Space
    Parity Function
    Decision Graph
    Algebraic Decision Diagram
  Inductive Bias
    Occam's Razor
  Avoiding Overfitting the Data
    Training and Validation
    Reduced Error Pruning
    Rule Post-Pruning
  Other Issues in Decision Tree Learning
    k-ary Attribute Values
    Real Attribute Values
    Attribute Cost
    Missing Values
    Other Splitting Criteria
  References

Introduction

Decision Tree Learning is a method for approximating discrete-valued functions using decision trees. A decision tree represents a disjunction of conjunctions of constraints on the attribute values of instances, and this disjunction approximates the target function. Decision tree learning is robust to noisy data and is suitable for target functions with discrete output values. Most decision tree learning algorithms perform a top-down greedy search through the space of possible decision trees to find the tree that best fits the training data. ID3, a basic decision tree learning algorithm, constructs the tree by answering the question "which attribute is the best classifier?" at each step. The answer is the attribute with the highest Information Gain, which is the expected reduction in entropy (1) caused by partitioning the examples according to that attribute.

[Figure 1: The entropy function of a Boolean classifier.]

The following example (Exercise 3.2, p. 77, Machine Learning, Tom Mitchell) illustrates the use of entropy and information gain. Consider the following set of training examples:

Instance    Classification    a1    a2
   1              +           T     T
   2              +           T     T
   3              -           T     F
   4              +           F     F
   5              -           F     T
   6              -           F     T

(1) The concept of entropy plays an important role in information theory. Roughly speaking, entropy is a mathematical formulation of the uncertainty and/or the amount of information in a data set. The (Shannon) entropy of a variable X is defined as:

  H(X) = - Σ_x P(X = x) log2 P(X = x)
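The entropy and information-gain values derived in the next few paragraphs can be checked with the short Python sketch below. This is only an illustration of the definitions above; the helper names (entropy, information_gain) and the list-of-pairs encoding of the training table are choices of mine, not part of the original notes.

from math import log2

def entropy(labels):
    # Shannon entropy (in bits) of a list of Boolean class labels.
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)            # proportion of positive examples
    if p in (0.0, 1.0):                      # a pure sample carries no uncertainty
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def information_gain(examples, attribute):
    # Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v).
    labels = [label for label, _ in examples]
    gain = entropy(labels)
    for v in {attrs[attribute] for _, attrs in examples}:
        subset = [label for label, attrs in examples if attrs[attribute] == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

# The six training examples above as (classification, attribute-values) pairs.
S = [(1, {"a1": "T", "a2": "T"}), (1, {"a1": "T", "a2": "T"}),
     (0, {"a1": "T", "a2": "F"}), (1, {"a1": "F", "a2": "F"}),
     (0, {"a1": "F", "a2": "T"}), (0, {"a1": "F", "a2": "T"})]

print(entropy([label for label, _ in S]))   # 1.0
print(information_gain(S, "a1"))            # about 0.082
print(information_gain(S, "a2"))            # 0.0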

Figure 1 depicts the form of the entropy function relative to a Boolean classification as the proportion p of positive examples varies between 0 and 1. When the data set contains an equal number of positive and negative examples (i.e. p = 1/2), the entropy function takes its largest value, 1. In the above example there are 3 positive instances and 3 negative instances, so we can conclude that Entropy(S) = 1. Alternatively, we can calculate the entropy using the following equation, where p+ and p- are the proportions of positive and negative examples:

  Entropy(S) = - p+ log2(p+) - p- log2(p-)

So:

  Entropy(S) = - (1/2) log2(1/2) - (1/2) log2(1/2) = 1

Now we can calculate the information gain of a2 relative to this training set, using the following equation:

  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

So:

  Gain(S, a2) = Entropy(S) - (|S_{a2=T}| / |S|) Entropy(S_{a2=T}) - (|S_{a2=F}| / |S|) Entropy(S_{a2=F})
              = 1 - (2/3)(1) - (1/3)(1) = 0

Decision Tree Hypothesis Space

The hypothesis space in decision tree learning is the set of all possible decision trees. Decision tree learning algorithms search through this space to find a decision tree that correctly classifies the training data. In the resulting decision tree, each internal node performs a test on some attribute x and branches according to the result, and each leaf node specifies a class h(x). Classification is therefore very easy: we simply walk down the tree, splitting according to the outcome of each test.

The space of all decision trees is a complete space of finite discrete-valued functions. Consequently, every Boolean function can be represented by a decision tree, and there are usually several different decision trees for a single Boolean function. Let us see the use of decision trees in representing Boolean functions in some examples (Exercise 3.1, p. 77, Machine Learning, Tom Mitchell). In the following examples we start by testing variable A. However, in actual decision tree building (for learning) this may not be a good idea, and we should choose the variable with the highest information gain for the root.

a) A ∧ ¬B

[Decision tree for A ∧ ¬B, rooted at A: the A = F branch is a leaf, and the A = T branch tests B. In the diagrams, T and F label both the branches (attribute values) and the leaves (output results); the two uses are kept distinct.]

b) A ∨ [B ∧ C]

[Decision tree for A ∨ [B ∧ C], rooted at A: the A = T branch is a leaf, and the A = F branch tests B and then C.]

In the second branch we must test all variables, which is not desirable. Entropy tries to address this kind of problem.

c) A XOR B

[Decision tree for A XOR B, rooted at A, with a test on B in both branches.]

For the XOR function entropy does not tell us anything, so we have to test every variable; nor can we find any nice dependencies between the variables. Another example of such a function is the parity function.

Parity Function

For the parity function we can never conclude that the result is 1 or 0 until we have tested all the attributes. When representing the parity function with a decision tree we have to split on every variable and keep building the tree all the way down. Learners usually like to decompose the problem, and that is why these functions are difficult for decision trees and Bayesian classifiers to learn. Therefore, the parity function is often used for testing different learners. As a concrete check, the information gain of every attribute at the root of XOR is zero, as shown in the sketch after this list of examples.

d) (A ∧ B) ∨ (C ∧ D)

Unlike parity and XOR we have a break here: we do not need to test all variables.

[Decision tree for (A ∧ B) ∨ (C ∧ D), rooted at A, then B, with identical subtrees testing C and then D in the two "failure" branches. The marked subtrees are identical.]

The marked subtrees are identical. To address this problem, that is, to avoid replication, we could use a different representation, like a graph. We can obtain a more compact representation by using a Decision Graph.
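The claim that entropy gives no guidance for XOR-like targets is easy to verify. The following self-contained sketch (my own illustration, with a hypothetical two-attribute truth table) computes the root information gain of each attribute for two-bit XOR; both gains come out to exactly zero, so a greedy, gain-driven learner has nothing to prefer.

from math import log2

def entropy(labels):
    p = sum(labels) / len(labels)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Two-bit XOR truth table as (label, {attribute: value}) pairs.
xor = [(0, {"A": 0, "B": 0}), (1, {"A": 0, "B": 1}),
       (1, {"A": 1, "B": 0}), (0, {"A": 1, "B": 1})]

for attribute in ("A", "B"):
    gain = entropy([y for y, _ in xor])
    for v in (0, 1):
        subset = [y for y, x in xor if x[attribute] == v]
        gain -= len(subset) / len(xor) * entropy(subset)
    print(attribute, gain)    # both gains are 0.0: the root split is uninformative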

Decision Graph

Decision graphs, such as the one depicted in Figure 2, are generalizations of decision trees, having decision nodes and leaves. The feature that distinguishes decision graphs from decision trees is that decision graphs may also contain joins. A join is represented by two nodes having a common child, and it specifies that two subsets share some common properties and hence can be considered as one subset. The manner in which objects are categorized by decision graphs is the same as for decision trees. Each decision tree and decision graph defines a categorization (i.e., some partitioning of the object space into disjoint categories). The set of decision functions representable by graphs is exactly the same as the set representable by trees. However, the sets of categorizations that can enter into the definition of a decision function are different. For example, the categorizations for (A ∧ B) ∨ (C ∧ D) given in Figure 2.a and Figure 2.b are different, since the decision tree partitions the object space into 7 categories, while the decision graph partitions it into 2 categories [1].

[Figure 2: a) a decision tree for (A ∧ B) ∨ (C ∧ D); b) a decision graph for (A ∧ B) ∨ (C ∧ D). [1]]

Algebraic Decision Diagram

Another variant of decision trees are Algebraic Decision Diagrams (ADDs), which are used for representing target functions. ADDs are a compact, efficiently manipulable data structure for representing real-valued functions over Boolean variables, B^n → R. They generalize a tree-structured representation by allowing nodes to have multiple parents, leading to the recombination of isomorphic subgraphs and hence to a possible reduction in the representation size [2].
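To make the join concrete, here is a small sketch of my own (using an ad-hoc nested-tuple encoding, not the representation of [1]) of a decision graph for (A ∧ B) ∨ (C ∧ D): the subgraph that tests C and D is a single shared object with two parents, instead of being duplicated as in the tree of Figure 2.a.

# Leaves are 0 or 1; internal nodes are (attribute, {value: child}) pairs.
c_and_d = ("C", {1: ("D", {1: 1, 0: 0}), 0: 0})       # shared subgraph deciding C and D

graph = ("A", {1: ("B", {1: 1, 0: c_and_d}),           # A = 1 and B = 1  ->  1
               0: c_and_d})                             # both "failure" branches join here

def classify(node, instance):
    # Follow the graph from the root until an integer leaf is reached.
    while not isinstance(node, int):
        attribute, children = node
        node = children[instance[attribute]]
    return node

print(classify(graph, {"A": 1, "B": 1, "C": 0, "D": 0}))   # 1, since A and B hold
print(classify(graph, {"A": 0, "B": 0, "C": 1, "D": 1}))   # 1, since C and D hold
print(classify(graph, {"A": 1, "B": 0, "C": 1, "D": 0}))   # 0

Because c_and_d is one shared object reachable from two parents, the C/D test is stored only once; this sharing is exactly the join described above.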

Inductive Bias

It was mentioned before that the hypothesis space of ID3 is the power set of the instances X. This means that there is no restrictive bias in decision tree learning. A restrictive bias, also called a language bias, limits the set of available hypotheses, and ID3 has no such bias. On the other hand, there is a preference bias, or search bias, in decision tree learning. The preference bias orders the available hypotheses, and in ID3 there is a preference for:

1) Shorter trees.
2) Trees with high information gain attributes near the root.

This bias often goes by the name of Occam's razor (2).

Occam's Razor

Occam's razor says: prefer the simplest hypothesis that fits the data. In other words, the inductive bias of ID3 says that if there is a shorter tree, it will be preferred. The preference is for shorter trees, but finding them is not guaranteed: ID3 is a greedy algorithm, so it cannot always find the shortest tree. Another question that arises regarding Occam's razor is: what is a "short" tree? A tree with fewer nodes, or a shallow tree? The general idea behind Occam's razor here is that we want to avoid splitting as much as possible; therefore, shallower trees are better for us. There are a number of arguments both for and against Occam's razor. There has been a great deal of research on this topic, but we still cannot clearly answer the question: what is bias, really?

Avoiding Overfitting the Data

As discussed before, in ID3 the tree is built just deeply enough to perfectly classify the training data. This strategy can produce trees that overfit the training examples. Consider the following concept C:

X1    1
X2    1
X3    0
X4    0
X5    0

Now suppose that the training set contains only the instances X3, X4 and X5. Then the learner would classify X1 and X2 as 0, i.e. h(X1) = 0 and h(X2) = 0, which is not correct. Here the learner has overfitted the data.

Given a hypothesis space H and a hypothesis h ∈ H, let error_train(h) be the error of the hypothesis h over the training examples and error_D(h) be its error over the entire distribution D of the data. Then h is said to overfit the training data if there exists some other hypothesis h' ∈ H such that:

  error_train(h) < error_train(h')   and   error_D(h) > error_D(h')

(2) William of Occam was one of the first philosophers to discuss the question of preferring short hypotheses, around the year 1320.

Overfitting usually happens when there are random errors or noise in the training examples, or when there is only a small number of training examples. In the latter case we often get coincidental regularities: some attribute happens to partition the training examples very well but is irrelevant to the real target function.

To see the effect of an error in the training data on the learned decision tree, consider the following example. Suppose that in the entire distribution of the data we have:

  If SUNNY = T then PLAYTENNIS = T

But the training set contains a (noisy) instance whose attribute values X1, X2, ... include SUNNY = T while its classification is PLAYTENNIS = F. In reality the tree should not split any further in the SUNNY = T branch (Figure 3.a), but with this instance in the training set the learned tree has to keep splitting on every attribute below SUNNY in the SUNNY = T branch (Figure 3.b). While the resulting tree classifies the training examples well, it fails to classify the real data; clearly, we have overfitting here.

[Figure 3: a) the decision tree for the real target function; b) the learned decision tree, which keeps splitting below SUNNY = T.]

In general it is very hard to avoid overfitting, because usually all we have is the training data. One way is to turn to statistics: we can do a statistical analysis and use the gathered information to answer the question, how representative is the sample? If the sample is not representative enough, we will have overfitting. For example, in this way statisticians may calculate the error for a given concept and then give a probability measure over all concepts. However, if we have some assumptions about the real target function, it becomes much easier to detect when overfitting occurs. It is worth mentioning that there is a close relation between inductive bias and overfitting, and a good inductive bias can help us avoid overfitting.
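This effect is easy to reproduce with an off-the-shelf tree learner. The sketch below is only an illustration, not part of the original notes: it assumes scikit-learn and NumPy are installed, and the data size, the 15% label-flip noise rate, and the depth limit are arbitrary choices of mine. Labels generated from (A ∧ B) ∨ (C ∧ D) over several irrelevant attributes are corrupted with noise, and an unpruned tree is compared with a depth-limited one on a held-out split; typically the unpruned tree fits the noisy training set almost perfectly but generalizes worse than the shallower tree.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 12))                     # 12 Boolean attributes, most irrelevant
y_true = (X[:, 0] & X[:, 1]) | (X[:, 2] & X[:, 3])         # target: (A and B) or (C and D)
y = np.where(rng.random(300) < 0.15, 1 - y_true, y_true)   # flip 15% of the labels (noise)

X_train, y_train = X[:200], y[:200]                        # training split
X_val, y_val = X[200:], y[200:]                            # held-out split

full = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0).fit(X_train, y_train)

for name, tree in (("unpruned", full), ("depth-limited", shallow)):
    print(name, "train:", tree.score(X_train, y_train), "validation:", tree.score(X_val, y_val))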

Training and Validation

Among the different approaches that have been proposed to avoid overfitting, the most common is the training and validation set approach. In this approach the available training data is separated into two sets: a training set and a validation set. The training set is used to learn the hypothesis, and the validation set is used to evaluate the accuracy of the resulting hypothesis. In this way we get a check against overfitting, and we use the validation set to prune the tree whenever it overfits the data. The size of the validation set also matters, because it should be large enough to provide a good sample of the instances. Although there are statistical formulas for computing the required size of the validation set, one common heuristic is to use 1/3 of the available examples for the validation set and the remaining 2/3 for training.

There are different approaches to pruning a tree using the validation set. Two of them are:

1) Reduced Error Pruning
2) Rule Post-Pruning

Reduced Error Pruning

In this approach the data is split into training and validation sets, and both are used in pruning. Pruning a decision node means removing its subtree, making it a leaf node, and assigning it the most common classification affiliated with that node. In Reduced Error Pruning each node is considered for pruning. The pruning algorithm, written out as code in the sketch after this description, is as follows:

Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it).
2. Greedily remove the node whose removal most improves accuracy on the validation set.

In this way each pruning step is tested on the validation set, and the smallest version of the most accurate subtree is produced. To address the problem of limited data, other techniques have been proposed that share the same underlying idea: they partition the available data several times in different ways and then average the results.

Rule Post-Pruning

Another approach to pruning the learned tree is Rule Post-Pruning. In this approach we convert the decision tree into an equivalent set of logical rules and then prune each rule. Pruning in the logical framework amounts to generalizing the rules. The generalized rules are then sorted according to their accuracy and applied in that order. Here the validation set is used to estimate the accuracy of the resulting rules.
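The Reduced Error Pruning loop above might be transcribed into code as in the following sketch (a rough rendering of my own; the nested-dict tree format, in which every internal node stores the majority training label observed at that node, is an assumption rather than anything prescribed by the notes).

import copy

# A node is either a leaf, {"leaf": label}, or an internal node,
# {"attribute": name, "children": {value: child}, "majority": majority_training_label}.

def classify(node, instance):
    while "leaf" not in node:
        node = node["children"][instance[node["attribute"]]]
    return node["leaf"]

def accuracy(tree, validation):
    # Fraction of (instance, label) pairs in the validation set classified correctly.
    return sum(classify(tree, x) == y for x, y in validation) / len(validation)

def internal_node_paths(node, path=()):
    # Yield the path of branch values leading to every internal node.
    if "leaf" not in node:
        yield path
        for value, child in node["children"].items():
            yield from internal_node_paths(child, path + (value,))

def pruned_copy(tree, path):
    # Copy the tree, turning the node at `path` into a leaf labelled with its majority class.
    new_tree = copy.deepcopy(tree)
    node = new_tree
    for value in path:
        node = node["children"][value]
    majority = node["majority"]
    node.clear()
    node["leaf"] = majority
    return new_tree

def reduced_error_pruning(tree, validation):
    best = copy.deepcopy(tree)
    while True:
        candidates = [pruned_copy(best, p) for p in internal_node_paths(best)]
        if not candidates:                        # the tree is already a single leaf
            return best
        challenger = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(challenger, validation) < accuracy(best, validation):
            return best                           # further pruning is harmful
        best = challenger                         # greedily keep the most helpful pruning

Calling reduced_error_pruning with the learned tree and a list of (instance, label) validation pairs returns the pruned tree; each accepted pruning removes at least one internal node, so the loop always terminates.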

Other Issues in Decision Tree Learning

There are still more topics that have been addressed in decision tree learning. Some of them are:

k-ary Attribute Values
Problem: when an attribute has many values, it will tend to be selected because of its high information gain.
Solution: use the gain ratio instead.

Real Attribute Values
Problem: the attributes tested in the decision nodes must be discrete-valued.
Solution: define new discrete-valued attributes by partitioning the real attribute values into discrete intervals.

Attribute Cost
Problem: sometimes different attributes have different costs. How can we find the tree with the lowest cost?
Solution: include the cost of each attribute in the gain function.

Missing Values
Problem: in some examples the value of an attribute is missing.
Solution: there are different strategies that use the training examples while sorting through the tree (e.g. if node n tests A and the value of A is missing, assign the most common value of A among the training examples at node n).

Other Splitting Criteria
We can use different measures for selecting attributes. For example, we can take into account not only the entropy but also the size of a set when selecting an attribute.

References

[1] Jonathan J. Oliver; Decision Graphs: An Extension of Decision Trees, 1992.
[2] Robert St-Aubin, Jesse Hoey and Craig Boutilier; APRICODD: Approximate Policy Construction using Decision Diagrams.
[3] Tom M. Mitchell; Machine Learning, 1997.
[4] Entropy on the World Wide Web, http://www.math.psu.edu/gunesch/entropy.html