Decision Tree Learning

Ronald J. Williams
CSU520, Spring 2008

Decision Tree Example
[Figure: a decision tree for "Interesting?". The root tests Shape (circle, square, triangle); the circle branch tests Color (red: Yes, blue: No, green: Yes); the square branch tests Size (large: Yes, small: No); the triangle branch is No.]
Interesting=Yes when ((Shape=circle) AND ((Color=red) OR (Color=green))) OR ((Shape=square) AND (Size=large))
Decision Trees: Slide 2

Inducing Decision Trees from Data
Suppose we have a set of training data and want to construct a decision tree consistent with that data.
One trivial way: construct a tree that essentially just reproduces the training data, with one path to a leaf for each example. This has no hope of generalizing.
Better way: the ID3 algorithm tries to construct more compact trees; it uses information-theoretic ideas to create the tree recursively.
Decision Trees: Slide 3

Inducing a decision tree: example
Suppose our tree is to determine whether it's a good day to play tennis, based on attributes representing weather conditions.
Input attributes:
  Attribute     Possible Values
  Outlook       Sunny, Overcast, Rain
  Temperature   Hot, Mild, Cool
  Humidity      High, Normal
  Wind          Strong, Weak
The target attribute is PlayTennis, with values Yes or No.
Decision Trees: Slide 4

Training Data
[Table of the 14 training examples D1-D14 with their attribute values and PlayTennis labels; not reproduced in this transcript.]
Decision Trees: Slide 5

Essential Idea
Main question: Which attribute test should be placed at the root? In this example, there are 4 possibilities.
Once we have an answer to this question, apply the same idea recursively to the resulting subtrees.
Base case: all data in a subtree give rise to the same value for the target attribute. In this case, make that subtree a leaf with the appropriate label.
Decision Trees: Slide 6
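
A minimal Python sketch of this recursive construction, assuming a list-of-dicts representation for the training examples. The placeholder choose_attribute and the toy rows below are my own; the slides that follow replace that placeholder with the information-gain criterion.

```python
from collections import Counter

def choose_attribute(examples, attributes, target):
    # Placeholder criterion: the later slides replace this with maximum information gain
    return attributes[0]

def build_tree(examples, attributes, target):
    """Recursive skeleton of the 'Essential Idea' slide (a sketch, not full ID3)."""
    labels = [ex[target] for ex in examples]
    # Base case: every example in this subtree has the same target value -> leaf
    if len(set(labels)) == 1:
        return labels[0]
    # If no attributes are left to test, fall back to the majority label
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise choose an attribute to test here and recurse on each branch
    best = choose_attribute(examples, attributes, target)
    branches = {}
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}

# Tiny toy usage (made-up rows, just to show the interface)
rows = [
    {"Outlook": "Sunny", "Wind": "Weak", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "PlayTennis": "Yes"},
    {"Outlook": "Sunny", "Wind": "Strong", "PlayTennis": "No"},
]
print(build_tree(rows, ["Outlook", "Wind"], "PlayTennis"))
```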

For example, suppose we decided that Wind should be used as the root. The resulting split of the data looks like this:
  Wind = Strong: D2, D6, D7, D11, D12, D14 (PlayTennis: 3 Yes, 3 No)
  Wind = Weak: D1, D3, D4, D5, D8, D9, D10, D13 (PlayTennis: 6 Yes, 2 No)
Is this a good test to split on? Or would one of the other three attributes be better?
Decision Trees: Slide 7

Digression: Information & Entropy
Suppose we want to encode and transmit a long sequence of symbols from the set {a, c, e, g}, drawn randomly according to the following probability distribution D:
  Symbol       a     c     e     g
  Probability  1/8   1/8   1/4   1/2
Since there are 4 symbols, one possibility is to use 2 bits per symbol.
In fact, it's possible to use 1.75 bits per symbol, on average. Can you see how?
Decision Trees: Slide 8

Here's one way:
  Symbol   Encoding
  a        000
  c        001
  e        01
  g        1
Average number of bits per symbol = 1/8 * 3 + 1/8 * 3 + 1/4 * 2 + 1/2 * 1 = 1.75
Information theory: an optimal-length code assigns log2(1/p) = -log2(p) bits to a message having probability p.
Decision Trees: Slide 9

Entropy
Given a distribution D over a finite set, where <p_1, p_2, ..., p_n> are the corresponding probabilities, define the entropy of D by
  H(D) = - sum_i p_i log2 p_i
For example, the entropy of the distribution we just examined, <1/8, 1/8, 1/4, 1/2>, is 1.75 (bits).
Entropy is also called information.
In general, entropy is higher the closer the distribution is to being uniform.
Decision Trees: Slide 10

Suppose there are just 2 values, so the distribution has the form <p, 1-p>. Here's what the entropy looks like as a function of p:
[Figure: plot of the entropy of <p, 1-p> as a function of p.]
Decision Trees: Slide 11

Back to decision trees - almost
Think of the input attribute vector as revealing some information about the value of the target attribute.
The input attributes are tested sequentially, so we'd like each test to reveal the maximal amount of information possible about the target attribute value.
This encourages shallower trees, we hope.
To formalize this, we need the notion of conditional entropy.
Decision Trees: Slide 12
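
The entropy formula on Slide 10 is easy to check numerically. Here is a minimal Python sketch (the helper name entropy is my own) that reproduces the 1.75-bit figure from the encoding example and traces the binary entropy curve of Slide 11:

```python
import math

def entropy(probs):
    """H(D) = -sum_i p_i log2 p_i, in bits (terms with p = 0 contribute nothing)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The symbol distribution <1/8, 1/8, 1/4, 1/2> from the encoding example
print(entropy([1/8, 1/8, 1/4, 1/2]))   # 1.75

# Binary entropy <p, 1-p>: highest (1 bit) at p = 1/2, falling to 0 at p = 0 or 1
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(p, round(entropy([p, 1 - p]), 3))
```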

Return to our symbol encoding example:
  Symbol       a     c     e     g
  Probability  1/8   1/8   1/4   1/2
Suppose we're given the identity of the next symbol received in 2 stages:
  we're first told whether the symbol is a vowel or a consonant
  then we learn its actual identity
We'll analyze this 2 different ways.
Decision Trees: Slide 13

First consider the second stage: conveying the identity of the symbol given prior knowledge that it's a vowel or a consonant.
For this we use the conditional distribution of D given that the symbol is a vowel:
  Symbol       a     e
  Probability  1/3   2/3
and the conditional distribution of D given that the symbol is a consonant:
  Symbol       c     g
  Probability  1/5   4/5
Decision Trees: Slide 14

We can compute the entropy of each of these conditional distributions:
  H(D | Vowel) = - 1/3 log2 1/3 - 2/3 log2 2/3 = 0.918
  H(D | Consonant) = - 1/5 log2 1/5 - 4/5 log2 4/5 = 0.722
We then compute the expected value of this as
  3/8 * 0.918 + 5/8 * 0.722 = 0.796
Decision Trees: Slide 15

H(D | Vowel) = 0.918 represents the expected number of bits needed to convey the symbol given that it's a vowel.
H(D | Consonant) = 0.722 represents the expected number of bits needed to convey the symbol given that it's a consonant.
Then the weighted average, 0.796, is the expected number of bits needed to convey the symbol given whichever is true about it: that it's a vowel or that it's a consonant.
Decision Trees: Slide 16

Information Gain
Thus, while it requires an average of 1.75 bits to convey the identity of each symbol, once it's known whether it's a vowel or a consonant, it only requires 0.796 bits, on average, to convey its actual identity.
The difference, 1.75 - 0.796 = 0.954, is the number of bits of information that are gained, on average, by knowing whether the symbol is a vowel or a consonant. This is called the information gain.
The way we computed this corresponds to the way we'll apply it to identify good split nodes in decision trees.
Decision Trees: Slide 17

But it's instructive to see another way: consider the first stage, specifying whether the symbol is a vowel or a consonant. The probabilities look like this:
               Vowel   Consonant
  Probability  3/8     5/8
The entropy of this is - 3/8 * log2 3/8 - 5/8 * log2 5/8 = 0.954
Decision Trees: Slide 18
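
Both routes to the 0.954-bit gain can be checked in a few lines of Python. This is a minimal sketch (variable names are my own, and it re-declares the entropy helper from the previous sketch):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Distribution D over {a, c, e, g} and the vowel/consonant partition
p = {"a": 1/8, "c": 1/8, "e": 1/4, "g": 1/2}
vowels, consonants = {"a", "e"}, {"c", "g"}

p_vowel = sum(p[s] for s in vowels)           # 3/8
p_consonant = sum(p[s] for s in consonants)   # 5/8

# Stage 2: conditional entropies H(D | Vowel) and H(D | Consonant)
h_vowel = entropy([p[s] / p_vowel for s in vowels])              # ~0.918
h_consonant = entropy([p[s] / p_consonant for s in consonants])  # ~0.722

# Expected bits once the vowel/consonant answer is known, and the resulting gain
h_cond = p_vowel * h_vowel + p_consonant * h_consonant           # ~0.796
gain = entropy(list(p.values())) - h_cond                        # ~0.954

# Stage 1 view: entropy of the vowel/consonant split itself gives the same number
print(round(gain, 3), round(entropy([p_vowel, p_consonant]), 3))
```

The two numbers agree here because the symbol's identity completely determines whether it is a vowel or a consonant, so all of the entropy of the vowel/consonant question counts as gain.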

Now back to decision trees for real
We'll illustrate using our PlayTennis data.
The key idea will be to select, as the test for the root of each subtree, the one that gives maximum information gain for predicting the target attribute value.
Since we don't know the actual probabilities involved, we instead use the obvious frequency estimates from the training data.
Decision Trees: Slide 19

Here's our training data again:
Training Data
[Table of the 14 training examples D1-D14; not reproduced in this transcript.]
Decision Trees: Slide 20

Which test at the root?
We can place at the root of the tree a test for the values of one of the 4 possible attributes: Outlook, Temperature, Humidity, or Wind. We need to consider each in turn.
But first let's compute the entropy of the overall distribution of the target PlayTennis values. There are 5 No's and 9 Yes's, so the entropy is
  - 5/14 * log2 5/14 - 9/14 * log2 9/14 = 0.940
Decision Trees: Slide 21

Consider the Wind test again:
  Wind = Strong: D2, D6, D7, D11, D12, D14 (PlayTennis: 3 Yes, 3 No)
  Wind = Weak: D1, D3, D4, D5, D8, D9, D10, D13 (PlayTennis: 6 Yes, 2 No)
  H(PlayTennis | Wind=Strong) = - 3/6 * log2 3/6 - 3/6 * log2 3/6 = 1
  H(PlayTennis | Wind=Weak) = - 6/8 * log2 6/8 - 2/8 * log2 2/8 = 0.811
So the expected value is 6/14 * 1 + 8/14 * 0.811 = 0.892.
Therefore, the information gain after the Wind test is applied is 0.940 - 0.892 = 0.048.
Decision Trees: Slide 22

Doing this for all 4 possible attribute tests yields:
  Attribute tested at root   Information Gain
  Outlook                    0.246
  Temperature                0.029
  Humidity                   0.151
  Wind                       0.048
Therefore the root should test for the value of Outlook.
Decision Trees: Slide 23

Partially formed tree
[Figure: partially formed tree with Outlook tested at the root. Annotations on the figure:]
  This node is a leaf, since all its data agree on the same value.
  Entropy here is - 2/5 log2 2/5 - 3/5 log2 3/5 = 0.971.
  The correct test here, among Temperature, Humidity, and Wind, is the one giving the highest information gain with respect to these 5 examples only.
Decision Trees: Slide 24
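
The gain computation for Wind can be reproduced from nothing but the Yes/No counts quoted on the slides. Here is a minimal Python sketch (helper names are my own); under ID3 you would run the same computation for all four candidate attributes and keep the largest gain:

```python
import math

def entropy_counts(counts):
    """Entropy (in bits) of a label distribution given raw counts, e.g. [Yes, No]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, child_counts_list):
    """Information gain of a split: parent entropy minus the weighted child entropies."""
    n = sum(parent_counts)
    remainder = sum(sum(c) / n * entropy_counts(c) for c in child_counts_list)
    return entropy_counts(parent_counts) - remainder

# (Yes, No) counts from the slides: overall 9/5; Wind=Strong 3/3; Wind=Weak 6/2
print(f"H(PlayTennis) = {entropy_counts([9, 5]):.3f}")               # 0.940
print(f"Gain(Wind)    = {info_gain([9, 5], [[3, 3], [6, 2]]):.3f}")  # 0.048
```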

Extensions
Continuous input attributes: sort the data on any such attribute and try to identify a high-information-gain threshold, forming a binary split (a sketch of this threshold search follows these slides).
Continuous target attribute: this is called a regression tree; we won't deal with it here.
Avoiding overfitting (more on this later):
  use a separate validation set
  use tree post-pruning based on statistical tests
Decision Trees: Slide 25

Extensions (continued)
Inconsistent training data (the same attribute vector classified more than one way): store more information in each leaf.
Missing values of some attributes in the training data: we won't deal with this here.
Missing values of some attributes in a new attribute vector to be classified (or missing branches in the induced tree): send the new vector down multiple branches corresponding to all values of that attribute, then let all the leaves reached contribute to the result.
Decision Trees: Slide 26
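
For the continuous-attribute extension, here is a minimal sketch of that threshold search. The helper names and the small temperature example are my own, not from the slides: sort on the attribute, consider midpoints between consecutive distinct values as candidate binary splits, and keep the threshold with the highest information gain.

```python
import math

def entropy_counts(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split of a continuous attribute."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    parent = [labels.count(c) for c in classes]
    n = len(pairs)
    best_thr, best_gain = None, -1.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [sum(1 for _, y in pairs[:i] if y == c) for c in classes]
        right = [p - lc for p, lc in zip(parent, left)]
        remainder = (sum(left) / n) * entropy_counts(left) + (sum(right) / n) * entropy_counts(right)
        gain = entropy_counts(parent) - remainder
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain

# Made-up temperatures with PlayTennis-style labels, purely for illustration
temps = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, play))   # best split falls between 48 and 60, i.e. threshold 54.0
```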