Decision Tree For Playing Tennis

Decision Tree For Playing Tennis Anatomy of the tree: a root node, branches, internal nodes, and leaf nodes. The tree represents a disjunction of conjunctions of attribute tests.
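For example, a play-tennis tree of this kind can be read off as one conjunction per root-to-leaf path ending in a "Play" leaf, e.g. (Outlook = Overcast) OR (Outlook = Sunny AND Humidity = Normal) OR (Outlook = Rain AND Wind = Weak) implies PlayTennis = Yes. (The exact attribute values here follow the standard textbook example rather than the figure on this slide.)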

Another Perspective of a Decision Tree Model The tree partitions the (Age, Income) feature space into axis-parallel regions labelled Default or NoDefault. [Scatter plot of Age against Income with example cases: A. 30, $110K, Default; B. 50, $110K, NoDefault; C. 45, $90K, NoDefault; and A. 32, $105K, Default; B. 49, $82K, NoDefault; C. 29, $50K, NoDefault.]

Top-Down Tree Induction
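A minimal Python sketch of greedy, ID3-style top-down induction: pick the attribute with the highest information gain (see the following slides), split the data on it, and recurse. The data format (a list of dicts plus the name of the target column) is an illustrative assumption, not taken from the slides.

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, attributes, target):
    """Pick the attribute whose split gives the largest information gain."""
    def remainder(attr):
        counts = Counter(r[attr] for r in rows)
        return sum((cnt / len(rows)) *
                   entropy([r[target] for r in rows if r[attr] == v])
                   for v, cnt in counts.items())
    base = entropy([r[target] for r in rows])
    return max(attributes, key=lambda a: base - remainder(a))

def induce_tree(rows, attributes, target):
    """Greedy top-down induction: split on the best attribute, recurse."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                      # pure node -> leaf
        return labels[0]
    if not attributes:                             # nothing left to split on -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    return {attr: {v: induce_tree([r for r in rows if r[attr] == v],
                                  [a for a in attributes if a != attr], target)
                   for v in set(r[attr] for r in rows)}}

Calling induce_tree(rows, ["Outlook", "Humidity", "Wind"], "Play") on play-tennis-style records returns a nested dict whose keys are attributes and attribute values and whose leaves are class labels.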

Which Column and Split Point? A multitude of techniques: entropy / information gain; the chi-square test of independence (CHAID); the Gini index.

Information Gain

Entropy

Data Set

Choosing the Next Attribute - 1

Choosing the Next Attribute - 2

Representational and Search Bias

Occam's Razor William of Occam was a 14th-century Franciscan friar. The principle states that "Entities should not be multiplied unnecessarily." Occam's Razor has often been reinvented; Newton: "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances." To most scientists the razor means: "when you have two competing theories which make exactly the same predictions, the one that is simpler is the better."

Review of Choosing a Split Entropy = -Σ p · log2(p). Entropy of the population = 1; entropy after splitting on Length = 0.42; entropy after splitting on Thread = 0.85.
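As a quick worked check of these numbers: an entropy of 1 for the population means the two classes are balanced, since -(0.5·log2(0.5) + 0.5·log2(0.5)) = 1. The information gain of splitting on Length is then 1 - 0.42 = 0.58, versus 1 - 0.85 = 0.15 for Thread, so Length is the better split.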

Stopping Criteria What type of tree will perfectly classify the training data (i.e., 100% training-set accuracy)? Is this a bad thing? Why? What does this tell you about the relationship between the dependent and independent attributes? Stop growing the tree when: a certain tree depth is reached; the number of records at a node falls below some threshold; or all potential splits are insignificant.

How Do We Know When We've Overfitted the Training Data? Is there any other way?

Training Set Error Should Approximately Equal Test Set Error

Trimming/Pruning Trees Stopping criteria can be somewhat arbitrary. Automatic pruning of trees instead asks the data how far we should keep splitting. Two general approaches: use part of the training set as a validation set, or use the entire training set (usually an MDL approach).

Using Pruning To Prevent Overfitting

Reduced Error Pruning

Reduced Error Pruning

Results of Reduced Error Pruning Consider that the purpose of learning a tree is to make predictions. What is the fundamental assumption that this learning algorithm is making?

Rule Post-Pruning

X-Fold Cross Validation Used to estimate the accuracy of the learner, and for feature selection for other supervised learning algorithms. [Diagram: the data split into Fold 1 through Fold 5.]
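A minimal sketch of estimating a decision tree's accuracy with 5-fold cross-validation; the use of scikit-learn (rather than C4.5) and the iris data set are illustrative assumptions.

from sklearn.datasets import load_iris              # stand-in dataset, not from the slides
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, test on the held-out fold, repeat 5 times.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("fold accuracies:", scores)
print("estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))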

MDL-Based Pruning Minimize the overall message length: MessLen(Model, Data) = MessLen(Model) + MessLen(Data | Model). Encode the model using a node encoding; encode the data in terms of the model's classification errors. Remove a node if doing so reduces the total cost.
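A toy illustration of the trade-off, assuming a fixed bit cost per node and per misclassified example; these costs are made up for illustration and are not the encoding of any particular MDL pruner.

# Toy MDL comparison: keep a subtree or replace it with a leaf?
# Assumed encoding costs (illustrative only): each node costs NODE_BITS,
# each misclassified training example costs ERROR_BITS.
NODE_BITS, ERROR_BITS = 4.0, 2.0

def message_length(n_nodes, n_errors):
    # MessLen(Model) + MessLen(Data | Model)
    return NODE_BITS * n_nodes + ERROR_BITS * n_errors

subtree = message_length(n_nodes=5, n_errors=1)   # bigger model, fewer errors
leaf    = message_length(n_nodes=1, n_errors=6)   # smaller model, more errors
print(subtree, leaf)   # 22.0 vs 16.0 -> pruning to a leaf gives the shorter message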

Ensemble of Decision Trees Why stop at one decision tree? Adopt the committee-of-experts approach: build multiple decision trees, let each vote on the classification, and the class with the highest vote wins. What problem will we run up against?

Why Does It Work? Breiman: it works because decision tree learners are unstable. Friedman: it reduces the variance of the learner without reducing its bias. Domingos: the underlying learner's bias towards simplicity is too great, and bagging corrects that bias.

C4.5 - Quinlan Go to http://www.cse.unsw.edu.au/~quinlan/ and download C4.5 Release 8. Untar it (use tar xvf). In R8/Src type "make all", which builds the c4.5 executable; you may need to remove the contents of the getopt.c file. Use "nroff doc/c4.5.1 | more" to read the documentation. See me during office hours if you have any problems.

Building a Model Using C4.5 Options
c4.5 [-f filestem] [-u] [-s] [-p] [-v verb] [-t trials] [-w wsize] [-i incr] [-g] [-m minobjs] [-c cf]
Example: c4.5 -f golf -m 2
outlook = overcast: Play (4.0)
outlook = sunny:
    humidity <= 75 : Play (2.0)
    humidity > 75 : Don't Play (3.0)
outlook = rain:
    windy = true: Don't Play (2.0)
    windy = false: Play (3.0)
Size    Errors
   8    0 ( 0.0%)

Building and Applying a Model Using C4.5 Many data sets in the Data directory are split into .data (training set) and .test (test set) files. Use "c4.5 -f <name> -u" to build a model and then evaluate it on the test set (try the labor-neg or vote datasets).

Model Uncertainty What's wrong with making predictions from one model? We may have two or more equally accurate models that give different predictions, or two models that are quite fundamentally different.

Ensemble of Models Techniques Bayesian Model Averaging: Pr(c | x, D, H) = Σ_{h ∈ H} Pr(c | x, h) · Pr(h | D), i.e., weight each model's prediction by how good the model is. Can this approach be applied to C4.5 decision trees? Bagging (Bootstrap Aggregation), 1996: improves accuracy; the seminal paper reports accuracy improvements of about 4% on 19 of 26 data sets.
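A minimal sketch of the weighted-vote idea behind Bayesian model averaging: each model's class probabilities are weighted by a posterior weight Pr(h | D). The three models, their probabilities, and the posterior weights are made-up numbers for illustration.

import numpy as np

def bma_predict(prob_per_model, posterior_weights):
    """prob_per_model: (n_models, n_classes) class probabilities Pr(c | x, h).
    posterior_weights: (n_models,) weights Pr(h | D), summing to 1.
    Returns Pr(c | x, D, H) = sum_h Pr(c | x, h) * Pr(h | D)."""
    return posterior_weights @ prob_per_model

# Three hypothetical models, two classes (Play / Don't Play):
probs = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.2, 0.8]])
weights = np.array([0.5, 0.3, 0.2])        # assumed posteriors Pr(h | D)
print(bma_predict(probs, weights))         # [0.67 0.33] -> predict class 0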

Bagging Take a number of bootstrap samples of the training set and build a decision tree from each. When predicting the category for a test-set instance, each tree gets to vote on the decision; ties are resolved by choosing the most populous class. Empirical evidence shows that you get consistently better results on most data sets.

The Bagging Algorithm
Building the models:
  For i = 1 to k                  // k is the number of bags
    T_i = BootStrap(D)            // D is the training set
    Build model M_i from T_i      // i.e., induce the tree
  End
Applying the models to make a prediction for a test-set example x:
  For i = 1 to k
    C_i = M_i(x)
  End
  The prediction is the class with the most votes.
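A minimal Python sketch of this algorithm, with scikit-learn trees as the base learner; the iris data set, the number of bags, and the random seed are illustrative assumptions.

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris           # stand-in dataset, not from the slides

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
k = 25                                           # number of bags (assumed)

# Build the models: one tree per bootstrap sample (sampling with replacement).
models = []
for _ in range(k):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample T_i of D
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(x):
    """Each tree votes; the most common class wins."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return Counter(votes).most_common(1)[0][0]

print(bagged_predict(X[0]), y[0])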

Take a Bootstrap Sample Sample with replacement from the training set. Bootstrapping and model building can easily be parallelized.

Bagging - Results

Example of Bagging [Figure: the problem, the solution found by a single decision tree, and the bagging solution built from 100 decision trees.]

Boosting The idea: take weak learners (marginally better than random guessing) and make them stronger. Freund and Schapire, 1995: AdaBoost. The AdaBoost premise: each training instance starts with equal weight; build the first model from the training instances; training instances that are classified incorrectly are given more weight; build another model with the re-weighted instances, and so on.

Boosting Pseudo Code
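The slide's pseudocode itself is not reproduced in this transcription; below is a minimal Python sketch of AdaBoost using depth-1 decision stumps as the weak learner. The synthetic data, the 20 rounds, and the ±1 label encoding are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification     # stand-in dataset

X, y = make_classification(n_samples=200, random_state=0)
y = 2 * y - 1                                         # relabel classes as -1 / +1
w = np.full(len(X), 1.0 / len(X))                     # start with equal instance weights

stumps, alphas = [], []
for _ in range(20):                                   # 20 boosting rounds (assumed)
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)   # weighted training error
    alpha = 0.5 * np.log((1 - err) / err)             # model weight: lower error -> bigger vote
    w *= np.exp(-alpha * y * pred)                    # up-weight misclassified instances
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X_new):
    """Weighted vote of all weak learners: sign of sum alpha_j * h_j(x)."""
    votes = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.sign(votes)

print("training accuracy:", np.mean(boosted_predict(X) == y))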

Some Implementation Comments Boosting is difficult to parallelize. Instance weights must be factored into decision tree induction. Each tree's vote is weighted inversely to its error; the boosting is adaptive (hence AdaBoost) according to the tree's error. A free, scaled-down version of C5.0 that incorporates boosting is available at http://www.rulequest.com/download.html

Toy Example (Freund, COLT '99) Round 1

Round 2 + 3

Final Hypothesis Demo at http://www.cs.huji.ac.il/~yoavf/adaboost/index.html

Some Insights into Boosting The final aggregate model will have no training error (given some conditions). It seems like it should over-fit, yet it reduces test-set error. Larger margins on the training set correspond to better generalization: Margin(x) = y · Σ_j α_j h_j(x) / Σ_j α_j.
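A small follow-on sketch computing these margins for the boosted model in the AdaBoost sketch above (it assumes the alphas, stumps, X, and ±1 labels y defined there):

def margins(X_data, y_data):
    """Margin(x) = y * sum_j alpha_j * h_j(x) / sum_j alpha_j, in [-1, 1]."""
    votes = sum(a * s.predict(X_data) for a, s in zip(alphas, stumps))
    return y_data * votes / sum(alphas)

m = margins(X, y)
print("min margin:", m.min(), "mean margin:", m.mean())   # larger margins -> better generalization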

The Performance of Models and Learners Error of the hypothesis vs. error of the learning algorithm? If we know the training- and test-set error, is that a good estimate of the learner's performance? Learner's error = noise + bias² + variance. How we calculate bias and variance for a learner: draw training sets T_1, ..., T_n randomly from the population; bias is the difference between the error over all training sets and the true error; variance is the variability of that error. Why would a decision tree be biased? Why would it have high variance?
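A minimal sketch of measuring a learner's behaviour over many random training sets drawn from a fixed population, in the spirit of the procedure above; the synthetic population, training-set size, and number of repeats are illustrative assumptions, and only the average error and its spread (a variance-style quantity) are reported rather than a full noise/bias/variance decomposition.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification     # stand-in "population"

# A large synthetic pool plays the role of the population; a held-out chunk is the test set.
X_pool, y_pool = make_classification(n_samples=5000, flip_y=0.05, random_state=0)
X_test, y_test = X_pool[4000:], y_pool[4000:]
rng = np.random.default_rng(0)

errors = []
for _ in range(30):                                   # T_1 .. T_n: random training sets
    idx = rng.choice(4000, size=300, replace=False)
    tree = DecisionTreeClassifier().fit(X_pool[idx], y_pool[idx])
    errors.append(np.mean(tree.predict(X_test) != y_test))

errors = np.array(errors)
print("mean error over training sets:", errors.mean())   # relates to noise + bias
print("spread of the error (variance-like):", errors.std())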

Errors

Bias and Variance

Retrospective on Decision Trees Representation and search: do bagging and boosting change the model representation space? Do they change the search preference? The order in which the data is presented does not matter.