Decision Tree for Playing Tennis


Decision Tree

Decision Tree for Playing Tennis

(outlook=sunny, wind=strong, humidity=normal,? )

DT for predicting C-section risk

Characteristics of Decision Trees

Decision trees have many appealing properties:
- Similar to the human decision process, and easy to understand
- Deal with both discrete and continuous features
- Highly flexible hypothesis space: as the number of nodes (or the depth) of the tree increases, a decision tree can represent increasingly complex decision boundaries

DT can represent arbitrarily complex decision boundaries

[Figure: a tree of yes/no tests recursively splitting the input space]

If needed, the tree can keep on growing until all examples are correctly classified, although that may not be the best idea.

How to learn decision trees?

- Possible goal: find a decision tree h that achieves minimum error on the training data. This is trivially achievable if we use a large enough tree.
- Another possibility: find the smallest decision tree that achieves minimum training error. This is NP-hard.

Greedy Learning for DT

We will study a top-down, greedy search approach: instead of trying to optimize the whole tree at once, we find one test at a time.

Basic idea (assuming discrete features; we relax this later):
1. Choose the best attribute to test at the root of the tree.
2. Create a descendant node for each possible outcome of the test.
3. Send the training examples in S to the appropriate descendant node.
4. Recursively apply the algorithm at each descendant node to select the best attribute to test, using its associated training examples.

If all examples in a node belong to the same class, turn it into a leaf node, labeled with the majority class. (A sketch of this recursion is given below.)
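A minimal Python sketch of this top-down recursion, assuming a dataset of (feature dict, label) pairs; choose_best_attribute is a hypothetical scoring function (e.g., the information gain defined later):

```python
from collections import Counter

def majority_class(examples):
    """Most common label among (features, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def build_tree(examples, attributes, choose_best_attribute):
    """Greedy top-down decision-tree induction for discrete features.

    examples: list of (features: dict, label) pairs
    attributes: attribute names still available for testing
    choose_best_attribute: scoring function, e.g. information gain
    """
    labels = {label for _, label in examples}
    # Base cases: pure node, or nothing left to test on.
    if len(labels) == 1 or not attributes:
        return {"leaf": majority_class(examples)}

    best = choose_best_attribute(examples, attributes)
    node = {"test": best, "children": {}}
    remaining = [a for a in attributes if a != best]

    # One branch per observed outcome of the chosen test.
    for value in {f[best] for f, _ in examples}:
        subset = [(f, y) for f, y in examples if f[best] == value]
        node["children"][value] = build_tree(subset, remaining, choose_best_attribute)
    return node
```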

One possible question: is x < 0.5?

[Figure: the root node holds [13, 15] examples; the test x < 0.5 sends [8, 0] to one branch and [5, 15] to the other]

Continue

[Figure: the [5, 15] branch is split again with y < 0.5, giving [4, 0] and [1, 15]]

This could keep on going until all examples are correctly classified.

Choosing the best test

[Figure: starting from 25 positive and 14 negative examples, test X1 splits them into [20+, 8-] (T) and [5+, 6-] (F), while test X2 splits them into [17+, 3-] (T) and [8+, 11-] (F)]

Which one is better?

Choosing the Best Test: A General View

- S: the current set of training examples (e.g., 25 positive, 14 negative)
- A test creates m branches, one for each possible outcome, splitting S into m subsets S1, ..., Sm
- Compare the uncertainty of the class label in S against the total expected remaining uncertainty after the test

Uncertainty Measure: Entropy

$$H(y) = \sum_{i=1}^{k} p_i \log_2 \frac{1}{p_i} = -\sum_{i=1}^{k} p_i \log_2 p_i$$
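A small Python helper (a sketch) that computes this entropy from class counts:

```python
import math

def entropy(counts):
    """Entropy H(y) in bits, given class counts, e.g. entropy([9, 5])."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # treat 0*log(0) as 0
    return -sum(p * math.log2(p) for p in probs)

# Example: a node with 9 positive and 5 negative examples
print(entropy([9, 5]))  # ~0.940 bits
```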

Entropy is a concave (downward) function

[Figure: H(y) plotted against P(y=0)]

Minimum uncertainty occurs when p0 = 0 or 1; uncertainty is largest when the classes are equally likely.

The Information Gain Approach

Measuring uncertainty using entropy:

[Figure: a test t splits 26 positive and 7 negative examples into [21+, 3-] (T) and [5+, 4-] (F)]

Mutual Information

By measuring the reduction of entropy, we are measuring the mutual information between the feature we test on and the class label:

$$I(x; y) = H(y) - H(y \mid x)$$

where

$$H(y \mid x) = \sum_{v} P(x = v)\, H(y \mid x = v)$$

This is also called the information gain criterion.
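A sketch of the information gain computation in Python, building on the entropy helper above; the split_counts argument (one list of class counts per branch) is an assumed representation:

```python
def information_gain(parent_counts, split_counts):
    """I(x; y) = H(y) - H(y | x), where H(y | x) is the weighted average
    of branch entropies.  split_counts: class counts per branch,
    e.g. [[20, 8], [5, 6]] for the X1 test above."""
    total = sum(parent_counts)
    remaining = sum(sum(branch) / total * entropy(branch) for branch in split_counts)
    return entropy(parent_counts) - remaining

# Comparing the two candidate tests from the earlier slide
print(information_gain([25, 14], [[20, 8], [5, 6]]))   # X1
print(information_gain([25, 14], [[17, 3], [8, 11]]))  # X2
```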

Choosing the Best Feature: Summary

Compare the original uncertainty with the total expected remaining uncertainty after the test.

Measures of uncertainty:
- Error (misclassification rate)
- Entropy
- Gini index

Example

Selecting the root test using information gain

[Figure: on the 9+/5- tennis data, Humidity splits the examples into High [3+, 4-] and Normal [6+, 1-], while Outlook splits them into Sunny [2+, 3-], Overcast [4+, 0-], and Rain [3+, 2-]]
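Using the helpers sketched above, the two candidate root tests can be compared directly; the class counts are read off the figure:

```python
# Humidity: High [3+, 4-], Normal [6+, 1-]
print(information_gain([9, 5], [[3, 4], [6, 1]]))          # ~0.151 bits

# Outlook: Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.247 bits, so Outlook wins
```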

Continue building the tree

[Figure: with Outlook at the root of the 9+/5- data, the Overcast branch becomes a Yes leaf; the Sunny branch holds [2+, 3-] and the Rain branch [3+, 2-]. Which test should be placed at the Sunny branch? Humidity splits its [2+, 3-] examples into High [0+, 3-] and Normal [2+, 0-], both pure]

Issues with Multinomial Features

Multinomial features have more than 2 possible values. Consider two features, one binary and one with 100 possible values: which one do you expect to have higher information gain? The conditional entropy of y given the 100-valued feature will be low, because each of its many branches receives only a few examples and therefore looks nearly pure. This bias prefers multinomial features over binary features. Three ways to avoid it:

- Method 1: rescale the information gain (choose the feature maximizing the gain ratio):

$$\arg\max_j \frac{H(y) - H(y \mid x_j)}{H(x_j)}$$

- Method 2: test for one value versus all of the others.
- Method 3: group the values into two disjoint sets and test one set against the other.
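A sketch of the gain-ratio rescaling (Method 1), reusing the earlier helpers; split_counts is again one list of class counts per branch:

```python
def gain_ratio(parent_counts, split_counts):
    """Information gain divided by the entropy of the test itself,
    H(x_j), which penalizes tests with many small branches."""
    branch_sizes = [sum(branch) for branch in split_counts]
    split_entropy = entropy(branch_sizes)  # H(x_j)
    if split_entropy == 0:
        return 0.0
    return information_gain(parent_counts, split_counts) / split_entropy
```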

Dealing with Continuous Features

Test against a threshold. How do we compute the best threshold for a feature?
- Sort the examples according to the feature's value.
- Move the threshold from the smallest to the largest value.
- Select the threshold that gives the best information gain.
- Trick: we only need to compute information gain at points where the class label changes.

Note that continuous features can be tested multiple times on the same path in a DT.
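A minimal sketch of the threshold search for one continuous feature, assuming parallel lists of values and labels and reusing information_gain; candidate thresholds are midpoints between adjacent sorted values:

```python
def best_threshold(values, labels, classes):
    """Return (threshold, gain) maximizing information gain for
    a test of the form value < threshold."""
    pairs = sorted(zip(values, labels))
    parent = [sum(1 for y in labels if y == c) for c in classes]
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        # Trick: only cut points where the class label changes matter.
        if pairs[i - 1][1] == pairs[i][1]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [sum(1 for v, y in pairs[:i] if y == c) for c in classes]
        right = [sum(1 for v, y in pairs[i:] if y == c) for c in classes]
        gain = information_gain(parent, [left, right])
        if gain > best[1]:
            best = (threshold, gain)
    return best
```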

Considering both discrete and continuous features

If a data set contains both types of features, do we need special handling? No, we simply consider all possible splits at every step of the tree-building process and choose the one that gives the highest information gain. This includes all possible (meaningful) thresholds for the continuous features.

Issue of Over-fitting

Decision trees have a very flexible hypothesis space: as the number of nodes increases, we can represent arbitrarily complex decision boundaries. This can lead to over-fitting.

[Figure: a few stray examples, possibly just noise, force extra splits; the tree is grown larger to capture them]

Over-fitting

Avoiding Overfitting

Early stopping:
- Stop growing the tree when a data split does not offer a large benefit (e.g., compare the information gain to a threshold, or perform a statistical test to decide whether the gain is significant).

Post-pruning (a sketch follows below):
- Separate the training data into a training set and a validation set.
- Evaluate the impact on the validation set of pruning each possible node.
- Greedily prune the node that most improves validation-set performance.
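A sketch of reduced-error post-pruning on the dictionary-based tree from the earlier build_tree sketch, reusing majority_class; it assumes every feature value seen in the validation set also appears in the tree, and it collapses a subtree whenever a single majority leaf does at least as well on the validation examples that reach it:

```python
def predict(tree, features):
    """Follow tests until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["children"][features[tree["test"]]]
    return tree["leaf"]

def route(node, examples):
    """Partition examples by the outcome of this node's test."""
    buckets = {}
    for f, y in examples:
        buckets.setdefault(f[node["test"]], []).append((f, y))
    return buckets

def prune(tree, train_examples, val_examples):
    """Bottom-up reduced-error pruning.  Returns the (possibly pruned) tree."""
    if "leaf" in tree or not train_examples:
        return tree
    train_split = route(tree, train_examples)
    val_split = route(tree, val_examples)
    # Prune the children first, passing down the examples that reach them.
    for value, child in tree["children"].items():
        tree["children"][value] = prune(child, train_split.get(value, []),
                                        val_split.get(value, []))
    # Would a single majority leaf do at least as well on validation data?
    leaf_label = majority_class(train_examples)
    leaf_correct = sum(y == leaf_label for _, y in val_examples)
    subtree_correct = sum(predict(tree, f) == y for f, y in val_examples)
    if leaf_correct >= subtree_correct:  # prefer the smaller tree on ties
        return {"leaf": leaf_label}
    return tree
```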

Effect of Pruning

Regression Tree

Similar ideas can be applied to regression problems:
- The prediction is the average of the target values of all examples in the leaf node.
- Uncertainty is measured by the sum of squared errors.
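A sketch of the regression-tree analogue of the uncertainty measures above, where the sum of squared errors around the leaf mean plays the role of entropy; targets are assumed to be lists of numeric values per branch:

```python
def sse(targets):
    """Sum of squared errors around the mean prediction of a leaf."""
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets)

def sse_reduction(parent_targets, split_targets):
    """Analogue of information gain: drop in SSE achieved by a split.
    split_targets: one list of target values per branch."""
    return sse(parent_targets) - sum(sse(branch) for branch in split_targets)
```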

Example Regression Tree Predicting MPG of a car given its # of cylinders, horsepower, weight, and model year

Summary

The decision tree is a very flexible classifier:
- It can model arbitrarily complex decision boundaries; by changing the depth of the tree (or the number of nodes), we can increase or decrease the model complexity.
- It handles both continuous and discrete features.
- It handles both classification and regression problems.

Learning the decision tree:
- Greedy top-down induction.
- Not guaranteed to find an optimal decision tree.
- A DT can overfit to noise and outliers; this can be controlled by early stopping or post-pruning.