Decision Tree Instability and Active Learning

Decision Tree Instability and Active Learning
Kenneth Dwyer and Robert Holte
University of Alberta
November 14, 2007

Outline:
1. Instability and Decision Tree Induction
2. Quantifying Stability
3. Instability in Active Learning
4. Experiments
5. Results
6. Conclusions and Future Work

What is Learner Instability?
Definition: A learning algorithm is said to be unstable if it is sensitive to small changes in the training data.
Problems caused by instability:
- Estimates of predictive accuracy can exhibit high variance
- It is difficult to extract knowledge from the model, or the knowledge that is obtained may be unreliable

What is Learner Instability? Example: understanding low yield in a manufacturing process.
"The engineers frequently have good reasons for believing that the causes of low yield are relatively constant over time. Therefore the engineers are disturbed when different batches of data from the same process result in radically different decision trees. The engineers lose confidence in the decision trees, even when we can demonstrate that the trees have high predictive accuracy." [Turney, 1995]

Review: Decision Tree Induction
Using the C4.5 decision tree software [Quinlan, 1996].
Task: Given a collection of labelled examples, build a decision tree that accurately predicts the class labels of unseen examples.

Type     Colour  DriverAge  Risk
Sport    Silver  24         High
Sport    Red     37         High
Economy  Black   19         High
Economy  Silver  21         High
Sport    Black   39         High
Sport    Silver  46         Low
Economy  Black   62         Low
Economy  Red     26         Low

[Tree construction, shown incrementally on the slides: candidate splits on DriverAge <= 24, Colour, and Type are compared at the root; DriverAge <= 24 is chosen, its True branch becomes a High leaf, and its False branch is then split on Type, with Sport predicting High and Economy predicting Low.]

The resulting tree:

DriverAge <= 24?
  True  -> High
  False -> Type?
             Sport   -> High
             Economy -> Low

Classify an unseen example: DriverAge=32, Type=Economy, Colour=Black. DriverAge > 24 and Type = Economy, so the tree predicts Risk = Low.

Decision Tree Splitting Criteria
- The best attribute and split at a given node are determined by a splitting criterion
- Each criterion is defined by an impurity function f(p+, p-), where p+ and p- are the probabilities of each class within a given subset of examples formed by the split
- C4.5 uses an entropy-based criterion (i.e. gain ratio): f(p+, p-) = -(p+) log2(p+) - (p-) log2(p-)
- Another impurity function, called DKM, was proposed by Dietterich, Kearns, and Mansour [Dietterich et al., 1996]: f(p+, p-) = 2 sqrt((p+)(p-))
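The two impurity functions above can be sketched in a few lines of Python. This is a minimal illustration of the formulas, not the C4.5 implementation; in particular, the gain-ratio normalisation that C4.5 applies on top of the entropy measure is omitted here.

```python
import math

def entropy_impurity(p_pos: float, p_neg: float) -> float:
    """Entropy impurity: f(p+, p-) = -(p+) log2(p+) - (p-) log2(p-).
    Zero-probability terms contribute nothing (0 * log 0 := 0)."""
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

def dkm_impurity(p_pos: float, p_neg: float) -> float:
    """DKM impurity [Dietterich et al., 1996]: f(p+, p-) = 2 * sqrt(p+ * p-)."""
    return 2 * math.sqrt(p_pos * p_neg)

# Both functions peak at a 50/50 class mix and vanish for a pure node.
print(entropy_impurity(0.5, 0.5), dkm_impurity(0.5, 0.5))  # 1.0 1.0
```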

Decision Tree Instability (C4.5 algorithm)
UCI Lymphography dataset (attributes renamed).
[Figure: the tree grown from 106 training examples tests only a few attributes (A, B, C, D); adding a single example (107 in total) yields a structurally very different tree involving tests on attributes A through H.]

Quantifying Stability

Types of Stability
We distinguish between two types of stability: semantic and structural. Given similar data samples, a decision tree learning algorithm is:
- semantically stable if it produces trees that make similar predictions
- structurally stable if it produces trees that are syntactically similar

Quantifying Stability
Semantic stability:
- Measure the expected agreement between two decision trees, defined as the probability that the two trees predict the same class label for a randomly chosen example [Turney, 1995]
- Estimate the agreement of two trees by having them classify a set of randomly chosen unlabelled examples
Structural stability:
- No widely accepted measure exists for decision trees
- We propose a novel measure, called region stability: compare the decision regions (or leaves) in one tree with those of another

Semantic Stability (Example)
[Figure: Tree 1 splits on x <= 5 and then y <= 3; Tree 2 splits on y <= 3 and then x <= 5. Both partition the plane into the same four regions.]
Semantic stability: the probability that the two trees assign the same class label to an unseen example.
Classify unlabelled examples:
1. x=1, y=1 (same label)
2. x=6, y=4 (same label)
3. x=9, y=2 (same label)
4. x=8, y=8 (same label)
Score = 4/4 = 1
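The agreement estimate is easy to operationalise as a Monte Carlo average. Below is a minimal sketch: the two trees are hand-coded stand-ins for the slides' example, and the per-region labels are illustrative assumptions (the slides only state that all four sampled points agree).

```python
import random

def semantic_stability(tree1, tree2, draw, n=10_000, seed=0):
    """Estimate the probability that two classifiers (callables from an
    example to a label) predict the same label for a randomly drawn example."""
    rng = random.Random(seed)
    same = sum(tree1(x) == tree2(x) for x in (draw(rng) for _ in range(n)))
    return same / n

# Tree 1 tests x <= 5 first, then y <= 3; Tree 2 tests them in the other
# order. They are structurally different but make identical predictions.
def tree1(p):
    x, y = p
    if x <= 5:
        return "+" if y <= 3 else "-"
    return "-" if y <= 3 else "+"

def tree2(p):
    x, y = p
    if y <= 3:
        return "+" if x <= 5 else "-"
    return "-" if x <= 5 else "+"

draw = lambda rng: (rng.uniform(0, 10), rng.uniform(0, 10))
print(semantic_stability(tree1, tree2, draw))  # 1.0: perfect semantic agreement
```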

Region Stability
- Each leaf in a decision tree is a decision region, defined by the unordered set of tests along the path from the root to the leaf
- Two decision regions are equivalent if they perform the same set of tests and predict the same class label
- We estimate the region stability of two trees by having them classify a set of randomly chosen unlabelled examples

Region Stability (Example)
[Figure: the same two trees as before. Tree 1 splits on x <= 5 then y <= 3; Tree 2 splits on y <= 3 then x <= 5.]
Region stability: the probability that the two trees classify an unseen example in equivalent decision regions.
Classify unlabelled examples:
1. x=1, y=1 (different)
2. x=6, y=4 (equivalent)
3. x=9, y=2 (different)
4. x=8, y=8 (equivalent)
Score = 2/4 = 0.5

Region Stability: Continuous Attributes
[Figure: the true decision boundary lies at 0.6; Tree 1 places its threshold at 0.55 and Tree 2 at 0.5.]
Specify a value ε ∈ [0, 100]%: thresholds that are within this range of one another are considered to be equal.
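One way to implement the region-equivalence test with the ε tolerance is sketched below. The region encoding (a predicted label plus a mapping from (attribute, operator) tests to thresholds) is our assumption; the talk defines regions only as unordered sets of path tests, and we normalise thresholds by an assumed attribute range of 1.0.

```python
def thresholds_equal(t1: float, t2: float, eps_pct: float, value_range: float = 1.0) -> bool:
    """Two numeric split thresholds count as equal if they differ by at most
    eps_pct percent of the attribute's value range (assumed 1.0 here)."""
    return abs(t1 - t2) <= (eps_pct / 100.0) * value_range

def regions_equivalent(region1, region2, eps_pct=10.0):
    """A region is (predicted_label, {(attribute, operator): threshold}).
    Regions are equivalent iff the labels match, the sets of tests match,
    and every pair of corresponding thresholds agrees within eps_pct."""
    label1, tests1 = region1
    label2, tests2 = region2
    if label1 != label2 or tests1.keys() != tests2.keys():
        return False
    return all(thresholds_equal(tests1[k], tests2[k], eps_pct) for k in tests1)

# The continuous-attribute example: Tree 1 splits at 0.55, Tree 2 at 0.50.
r1 = ("+", {("x", "<="): 0.55})
r2 = ("+", {("x", "<="): 0.50})
print(regions_equivalent(r1, r2, eps_pct=10.0))  # True: thresholds within 10%
print(regions_equivalent(r1, r2, eps_pct=1.0))   # False
```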

Instability in Active Learning

C4.5 Instability Example (recap)
UCI Lymphography dataset (attributes renamed). [Figure: the two radically different trees grown from 106 vs. 107 training examples, shown again.] Active learning grows the training set in exactly this fashion, a few examples at a time.

Active Learning
- In a passive learning setting, the learner is provided with a set of training examples (typically drawn at random)
- In active learning [Cohn et al., 1992], the learner controls the examples that it uses to train a classifier
- Three main active learning paradigms: 1. Pool-based, 2. Stream-based, 3. Membership queries
- We focus on pool-based active learning, or selective sampling
- Active learning methods have been shown to make more efficient use of unlabelled data; yet, no attention has been given to their stability

Selective Sampling
Given: a pool of unlabelled data U and some labelled data L.
Repeat until some stopping criterion is met:
1. Train a classifier on the labelled data L
2. Select a batch of m examples from the pool U, obtain their labels, and add them to the training set L

We empirically studied four selective sampling methods that can use C4.5 as a base learner:
1. Uncertainty sampling [Lewis and Catlett, 1994]
2. Query-by-bagging [Abe and Mamitsuka, 1998]
3. Query-by-boosting [Abe and Mamitsuka, 1998]
4. Bootstrap-LV [Saar-Tsechansky and Provost, 2004]
Random sampling served as a baseline comparison.
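The selective sampling loop can be sketched generically. Everything below the function is a toy one-dimensional demo of our own devising: a midpoint-threshold learner stands in for C4.5, and distance from the learned boundary stands in for prediction confidence (so this demo behaves like uncertainty sampling).

```python
def selective_sampling(labelled, pool, oracle, train, confidence, rounds=3, batch=2):
    """Pool-based selective sampling: repeatedly (1) train a classifier on
    the labelled set L, then (2) query the oracle for the labels of the
    `batch` pool examples whose predictions are least confident, and add
    them to L."""
    labelled, pool = list(labelled), list(pool)
    for _ in range(rounds):
        model = train(labelled)
        pool.sort(key=lambda x: confidence(model, x))  # least confident first
        queried, pool = pool[:batch], pool[batch:]
        labelled += [(x, oracle(x)) for x in queried]
    return train(labelled), labelled

# Toy demo (hypothetical data): the true class boundary is x = 5. The
# learner places its threshold midway between the largest x labelled 0 and
# the smallest x labelled 1; queries concentrate around the boundary.
def train(data):
    lo = max(x for x, y in data if y == 0)
    hi = min(x for x, y in data if y == 1)
    t = (lo + hi) / 2
    return (lambda x: int(x > t)), t

def confidence(model, x):
    _, t = model
    return abs(x - t)  # far from the boundary = more confident

oracle = lambda x: int(x > 5)
labelled = [(1.0, 0), (9.0, 1)]
pool = [2.0, 4.0, 4.5, 5.5, 6.0, 8.0]
(classify, threshold), labelled = selective_sampling(labelled, pool, oracle, train, confidence)
print(threshold, len(labelled))  # 5.0 8
```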

Uncertainty Sampling (Example)
Sampling strategy: select the examples for which the current prediction is least confident.
[Figure: the current tree's decision regions with four pool examples plotted.]
Unlabelled data (the pool):
1. x=1, y=1 (Conf: 6/10 = 0.6)
2. x=3, y=4 (Conf: 6/10 = 0.6)
3. x=9, y=2 (Conf: 2/4 = 0.5)
4. x=8, y=8 (Conf: 7/7 = 1)
Request the label for example 3.

Query-by-Bagging (Example)
Sampling strategy: build a committee of trees from the labelled data; select the examples for which the committee vote is most evenly split.
[Figure: the decision regions of two committee members, with four pool examples plotted.]
Unlabelled data (the pool):
1. x=1, y=1 (Disagree: +, -)
2. x=3, y=4 (Agree: +, +)
3. x=9, y=2 (Disagree: +, -)
4. x=8, y=8 (Agree: -, -)

Other Sampling Methods
Query-by-Boosting:
- The committee is formed using the AdaBoost.M1 algorithm [Freund and Schapire, 1996]
- Committee member t_i has voting weight β_i = ε_i / (1 − ε_i), where ε_i is the weighted error rate of t_i
Bootstrap-LV (Local Variance):
- Like bagging, but examples are selected by sampling (without replacement) from a distribution D(x), x ∈ U
- D_i(x) is inversely proportional to the variance in the class probability estimates (CPEs) for example x_i
Direct selection versus weight sampling.
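Weight sampling, as opposed to direct selection, can be sketched like this: a generic weighted draw without replacement. In Bootstrap-LV the weights would be derived from the variance in the class probability estimates; the weights in the demo below are made up for illustration.

```python
import random

def weight_sample(pool, weights, m, seed=0):
    """Draw m distinct pool examples without replacement, each draw made
    with probability proportional to the remaining examples' weights."""
    rng = random.Random(seed)
    pool, weights = list(pool), list(weights)
    chosen = []
    for _ in range(min(m, len(pool))):
        r = rng.uniform(0, sum(weights))
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        chosen.append(pool.pop(i))
        weights.pop(i)
    return chosen

# An example with zero weight (no uncertainty about it) is never selected.
picked = weight_sample(["a", "b", "c", "d"], [0.5, 0.0, 0.3, 0.2], m=2)
print(len(picked), "b" in picked)  # 2 False
```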

Committee-based Selective Sampling
[Diagram: the labelled data L is fed to bagging or boosting over C4.5 to build a committee; the committee's votes select examples from the pool U; stability, accuracy, etc. are then measured.]

Experiments

Experiments: Questions Being Addressed
- Do certain selective sampling methods grow more stable decision trees than others?
- Are committee-based sampling methods effective at selecting examples for training a single decision tree?
- Can changing C4.5's splitting criterion improve stability?

Experimental Procedure
- 16 UCI datasets [Newman et al., 1998]: only datasets that contained at least 500 examples; multi-class problems converted to two-class; missing values removed
- Each dataset was partitioned as follows: Initial 15% / Unlabelled (pool) 52% / Evaluation 33%
- Other parameters: learning stopped once 2/3 of the pool examples were labelled; committees consisted of 10 classifiers; region stability computed using ε ∈ {0, 5, 10}%; results averaged over 25 runs (with different initial training data)

Experimental Procedure (Continued)

We measured three types of active learning stability. Tree i was compared with:
- the tree grown on iteration i-1 (previous tree)
- the tree grown on iteration n (final tree)
- the trees grown on iteration i when given different initial training data

Trees t_{r,i} are indexed by run r = 1..25 and iteration i = 1..n:

  Run 1:   t_{1,1}   t_{1,2}   t_{1,3}   ...  t_{1,n}
  Run 2:   t_{2,1}   t_{2,2}   t_{2,3}   ...  t_{2,n}
  ...
  Run 25:  t_{25,1}  t_{25,2}  t_{25,3}  ...  t_{25,n}

These are called PrevStab, FinalStab, and RunStab, respectively.
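The three comparison schemes can be made concrete by enumerating which pairs of trees each statistic compares. This is an illustrative sketch, not the thesis code; (r, i) denotes the tree from run r, iteration i.

```python
def comparison_pairs(n_runs, n_iters, kind):
    """Enumerate (tree, reference-tree) index pairs for the three
    stability statistics: 'prev' compares each tree with its
    predecessor, 'final' with the last tree of its run, and 'run'
    with same-iteration trees from other runs."""
    pairs = []
    for r in range(n_runs):
        for i in range(n_iters):
            if kind == "prev" and i > 0:
                pairs.append(((r, i), (r, i - 1)))
            elif kind == "final" and i < n_iters - 1:
                pairs.append(((r, i), (r, n_iters - 1)))
            elif kind == "run":
                for r2 in range(r + 1, n_runs):
                    pairs.append(((r, i), (r2, i)))
    return pairs

# With 25 runs and 10 iterations per run:
print(len(comparison_pairs(25, 10, "prev")))   # 225 = 25 * 9
print(len(comparison_pairs(25, 10, "final")))  # 225 = 25 * 9
print(len(comparison_pairs(25, 10, "run")))    # 3000 = 10 * C(25, 2)
```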

Evaluation

Statistical significance was assessed by comparing the average ranks of the sampling methods, the recommended procedure for comparing multiple learning methods [Demšar, 2006].

Example:

             Method 1  Method 2  Method 3  Method 4
  Dataset 1     1         4         2         3
  Dataset 2     2         3         1         4
  Dataset 3     1         4        2.5       2.5
  Avg. Rank   1.333     3.667     1.833     3.167
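The ranking step, including the averaging of tied ranks visible in Dataset 3, can be sketched with the standard library. The raw scores below are made up so that they reproduce the example table.

```python
def rank_scores(scores):
    """Rank a list of scores (lower = better), averaging tied ranks."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# Rows = datasets, columns = methods. The first two rows use the ranks
# themselves as scores; the third has a tie, which yields 2.5 and 2.5.
table = [[1, 4, 2, 3],
         [2, 3, 1, 4],
         [0.1, 0.4, 0.25, 0.25]]
per_dataset = [rank_scores(row) for row in table]
avg = [sum(col) / len(col) for col in zip(*per_dataset)]
print([round(a, 3) for a in avg])  # [1.333, 3.667, 1.833, 3.167]
```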

Evaluation (Continued)

For a given {statistic, sampling method, splitting criterion, dataset} tuple, we get a sequence of scores. How do we rank the sampling methods?

[Figure: mean error rate on the australian dataset vs. fraction of pool examples labelled, for Random, QBag, QBoost, BootLV, and Uncertainty sampling]

Averaging Scores

Summary statistic: sequence of scores → single number
1. Compute the average score s_i at each iteration i (i.e. over the 25 runs)
2. The overall score is the weighted average Σ_{i=1}^{n} w_i s_i, where w_i = 2i / (n(n+1)) (the weights sum to 1)

The weight increases linearly as a function of i. We argue that stability and accuracy are most important in the later stages of active learning; e.g., stability in early rounds is of little value if stability deteriorates in later rounds.
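A minimal sketch of this summary statistic, assuming the s_i have already been averaged over the 25 runs:

```python
def summarize(scores_per_iteration):
    """Collapse per-iteration averages s_1..s_n into one number using
    linearly increasing weights w_i = 2i / (n(n+1)), which sum to 1
    and emphasize the later stages of active learning."""
    n = len(scores_per_iteration)
    weights = [2 * i / (n * (n + 1)) for i in range(1, n + 1)]
    return sum(w * s for w, s in zip(weights, scores_per_iteration))

# A run whose stability improves over time scores higher than one that
# degrades, even though both sequences have the same plain mean (0.5).
print(round(summarize([0.2, 0.4, 0.6, 0.8]), 3))  # 0.6
print(round(summarize([0.8, 0.6, 0.4, 0.2]), 3))  # 0.4
```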

Example: Averaging Scores and Ranking

[Figure: mean structural FinalStab score (ɛ = 0) on kr-vs-kp vs. fraction of pool examples labelled, for Random, QBag, BootLV, and Uncertainty sampling]

Ranks/Scores:
1. QBag (.953)
2. Random (.858)
3. BootLV (.644)
4. Uncert (.638)

Statistical Significance [Demšar, 2006]

  Dataset        Random (R)  QBag (G)    QBoost (T)  BootLV (L)  Uncert (U)
  anneal         .144 (4)    .121 (1)    .135 (3)    .125 (2)    .150 (5)
  australian     .129 (1.5)  .129 (1.5)  .131 (5)    .130 (3.5)  .130 (3.5)
  car            .090 (5)    .077 (1)    .082 (4)    .078 (2)    .081 (3)
  german         .293 (5)    .274 (1)    .285 (2)    .290 (4)    .289 (3)
  hypothyroid    .006 (5)    .002 (2)    .002 (2)    .002 (2)    .004 (4)
  kr-vs-kp       .014 (5)    .007 (1.5)  .008 (3)    .007 (1.5)  .010 (4)
  letter         .015 (5)    .011 (2)    .011 (2)    .011 (2)    .013 (4)
  nursery        .056 (5)    .038 (1.5)  .039 (3)    .038 (1.5)  .044 (4)
  pendigits      .016 (5)    .010 (1.5)  .010 (1.5)  .012 (4)    .011 (3)
  pima-indians   .286 (5)    .283 (2)    .280 (1)    .284 (3)    .285 (4)
  segment        .020 (5)    .011 (1)    .012 (2.5)  .012 (2.5)  .019 (4)
  tic-tac-toe    .217 (5)    .197 (1)    .201 (2)    .207 (3)    .211 (4)
  vehicle        .227 (1)    .231 (5)    .229 (3.5)  .228 (2)    .229 (3.5)
  vowel          .056 (5)    .033 (1)    .036 (2)    .037 (3)    .049 (4)
  wdbc           .073 (4)    .068 (2)    .067 (1)    .069 (3)    .076 (5)
  yeast          .256 (4.5)  .250 (1)    .253 (2.5)  .256 (4.5)  .253 (2.5)
  Avg. rank      (4.375)     (1.625) R,U (2.500) R   (2.719) R   (3.781)

Apply the Friedman and Nemenyi significance tests; e.g., at α = .05, the critical difference is 1.527.
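The critical difference quoted above follows from the standard Nemenyi formula; the q value for k = 5 methods at α = .05 comes from the tables in Demšar (2006). Small rounding in q accounts for the slide's 1.527 versus the 1.525 computed here.

```python
import math

def nemenyi_cd(q_alpha, k, n_datasets):
    """Nemenyi post-hoc critical difference: two methods differ
    significantly if their average ranks differ by more than
    CD = q_alpha * sqrt(k(k+1) / (6N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

# k = 5 sampling methods, N = 16 datasets, q_0.05 ≈ 2.728 for k = 5.
cd = nemenyi_cd(2.728, 5, 16)
print(round(cd, 3))  # 1.525, matching the slide's 1.527 up to rounding of q
```

For example, QBag (1.625) vs. Random (4.375) differ by 2.750 > CD, a significant difference, while QBag vs. QBoost (2.500) do not.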

Outline:
- Instability and Decision Tree Induction
- Quantifying Stability
- Instability in Active Learning
- Experiments
- Results
- Conclusions and Future Work

Error Rates

The committee-based sampling methods achieved lower error rates than did Uncertainty or Random sampling. At first glance, this might not appear to be a novel or interesting result.

Important difference from previous active learning studies: here, a committee of C4.5 trees selected examples that were used to train a single C4.5 tree, which was then evaluated. In prior research, e.g., Query-by-bagging selected examples for training a bagged ensemble of trees.

When trained on the same data sample, a committee of trees is likely to be more accurate than a single tree. Yet, a committee of trees is no longer interpretable [Breiman, 1996].
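To make the committee-selects, single-tree-learns setup concrete, here is a generic query-by-committee selection step using vote entropy. This is an illustrative sketch, not the thesis code: Query-by-bagging additionally trains its committee on bootstrap replicates [Abe and Mamitsuka, 1998], and the toy threshold classifiers below stand in for C4.5 trees.

```python
import math
from collections import Counter

def vote_entropy_select(committee, pool, batch_size=1):
    """Pick the pool examples on which the committee's predicted labels
    disagree most, measured by the entropy of the vote distribution.
    `committee` is a list of prediction functions."""
    def entropy(x):
        votes = Counter(clf(x) for clf in committee)
        total = sum(votes.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in votes.values())
    return sorted(pool, key=entropy, reverse=True)[:batch_size]

# Toy committee of three 1-D threshold classifiers: they agree on 0.1
# and 0.9 (entropy 0) but split their votes on 0.4 and 0.6.
committee = [lambda x: x > 0.3, lambda x: x > 0.5, lambda x: x > 0.7]
pool = [0.1, 0.4, 0.6, 0.9]
print(vote_entropy_select(committee, pool))  # [0.4] (ties broken by pool order)
```

The selected example would then be labelled and added to the training set of the single tree under evaluation.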

Error Rates (Continued)

We typically observed a "banana" shape, indicating efficient use of unlabelled data.

[Figure: mean error rate on kr-vs-kp vs. fraction of pool examples labelled, for Random, QBag, QBoost, BootLV, and Uncertainty sampling]

Tree Size

The selective sampling methods consistently yielded larger trees than did Random sampling.

[Figure: mean number of leaf nodes on vowel vs. fraction of pool examples labelled, for Random, QBag, QBoost, BootLV, and Uncertainty sampling]

Tree Size and Intelligibility

Trees grown using Query-by-bagging (QBag) contained 38 percent more leaves, on average, than those grown with Random sampling. Yet, we argue that this did not usually result in a loss of intelligibility.

There is no agreed-upon criterion for distinguishing between a tree that is interpretable and a tree that is not. Let's consider one simple criterion: there might exist a threshold t such that any tree containing more than t leaves is uninterpretable. On a given dataset, if QBag's leaf count is greater than t while Random's is at most t, then QBag has sacrificed intelligibility.

Tree Size and Intelligibility (Continued)

[Figure: scatter plot of QBag tree size vs. Random tree size, one point per dataset; the threshold t splits the plane into four regions: both intelligible, both unintelligible, QBag more complex, Random more complex]

We examined all integer values of t between 1 and 25, and found QBag to be more complex on at most 5 datasets (t = 13).
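The threshold sweep can be sketched as below. The leaf counts are illustrative values, not the thesis data.

```python
def complexity_losses(qbag_sizes, random_sizes, t_values):
    """For each threshold t, count the datasets where QBag's tree
    exceeds t leaves while Random's does not, i.e. where QBag alone
    crosses the hypothetical interpretability threshold."""
    return {t: sum(1 for q, r in zip(qbag_sizes, random_sizes)
                   if q > t >= r)
            for t in t_values}

# Illustrative mean leaf counts for 6 datasets (QBag vs. Random).
qbag = [14, 9, 20, 11, 16, 7]
rand = [10, 8, 13, 12, 11, 6]
print(complexity_losses(qbag, rand, [13])[13])  # 3
```

Sweeping t over 1..25 and taking the maximum of these counts mirrors the "at most 5 datasets" analysis on the slide.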

Stability

Query-by-bagging (QBag) grew the most semantically and structurally stable trees. Its stability gains across runs were highly significant.

[Figure: mean structural RunStab score (ɛ = 0.05) on letter vs. fraction of pool examples labelled, for Random, QBag, QBoost, BootLV, and Uncertainty sampling]

Avg. Ranks (RunStab, ɛ = .05):
1. QBag (1.66)
2. QBoost (2.19)
3. BootLV (2.59)
4. Random (4.19)
5. Uncert (4.38)

Possible explanations: direct selection vs. weight sampling; committee of trees vs. single tree.

Splitting Criteria: Entropy vs. DKM

We employed the Wilcoxon signed-ranks test. DKM was more structurally stable and more accurate than entropy:
- Structural stability of all 5 sampling methods improved when using DKM
- The best method, QBag, exhibited even better performance when paired with DKM
- Differences in semantic stability and tree size were, for the most part, insignificant
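For reference, the two criteria differ only in the impurity curve: entropy is -p log2 p - (1-p) log2 (1-p), while DKM [Dietterich et al., 1996] is 2*sqrt(p(1-p)). A small sketch comparing the gain each assigns to the same candidate split; the split fractions are illustrative.

```python
import math

def entropy(p):
    """Binary entropy impurity; 0 at p in {0, 1}, maximal (1) at p = 0.5."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def dkm(p):
    """DKM impurity 2*sqrt(p(1-p)); same endpoints and maximum as
    entropy, but a flatter curve away from p = 0.5."""
    return 2 * math.sqrt(p * (1 - p))

def gain(impurity, p_parent, children):
    """Impurity reduction of a split; `children` lists
    (weight, positive_fraction) pairs for the child nodes."""
    return impurity(p_parent) - sum(w * impurity(p) for w, p in children)

# The same split of a balanced node into 20%- and 80%-positive children:
split = [(0.5, 0.2), (0.5, 0.8)]
print(round(gain(entropy, 0.5, split), 3))  # 0.278
print(round(gain(dkm, 0.5, split), 3))      # 0.2
```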

Main Contributions

1. How should decision tree (in)stability be measured? We proposed a novel structural stability measure for decision trees, called region stability, along with active learning versions.
2. How stable are some well-known active learning methods that use the C4.5 decision tree learner? Query-by-bagging was found to be more stable and more accurate than its competitors.
3. Can stability be improved in this setting by changing C4.5's splitting criterion? The DKM splitting criterion was shown to improve the stability and accuracy of C4.5 in active learning.

Future Work

Incremental Tree Induction [Utgoff et al., 1997]:
- The tree is restructured when new training data arrive
- On average, this requires less computation than growing a new tree from scratch
- Error-correction mode: only add a new example if the existing tree would misclassify it
- Alternatively, we could add all new examples, but only update the tree if an example is misclassified
- These "good enough" trees might be more stable

Future Work (Continued)

Learning under Covariate Shift [Bickel et al., 2007]:
- Active learning constructs a training set whose distribution may differ arbitrarily from the original; it could be the case that p_train(x) ≠ p_test(x)
- The expected loss is minimized when training examples are weighted by p_test(x) / p_train(x)
- Is such a correction beneficial in active learning?
- Are techniques for dealing with class imbalance more appropriate?
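The reweighting idea can be sketched on a toy discrete domain; the densities below are made up for illustration.

```python
def importance_weights(examples, p_test, p_train):
    """Covariate-shift correction: weight each training example by
    p_test(x) / p_train(x), so the weighted training loss is an
    unbiased estimate of the test loss (assuming p_train(x) > 0
    wherever p_test(x) > 0)."""
    return [p_test(x) / p_train(x) for x in examples]

# Toy example: the actively-selected training set over-represents x = 1
# relative to the uniform test distribution.
p_train = {0: 0.2, 1: 0.8}.get
p_test = {0: 0.5, 1: 0.5}.get

weights = importance_weights([0, 1, 1, 1], p_test, p_train)
print(weights)  # approximately [2.5, 0.625, 0.625, 0.625]
```

Up-weighting the under-sampled region (x = 0) and down-weighting the over-sampled one is exactly the correction the slide asks about.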

Conclusions

When training a single C4.5 tree in an active learning setting, one should use the DKM splitting criterion and select examples with Query-by-bagging. This combination yields the most stable and accurate decision trees.

We should be aware of the potential instability of machine learning algorithms, particularly when attempting to extract knowledge from a classifier.

Thank You! Questions?

Selected References

- Abe, N. and Mamitsuka, H. (1998). Query learning strategies using boosting and bagging. In Proc. ICML '98, pages 1-9.
- Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123-140.
- Cohn, D. A., Atlas, L. E., and Ladner, R. E. (1992). Improving generalization with active learning. Machine Learning, 15(2):201-221.
- Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. JMLR, 7:1-30.
- Dietterich, T. G., Kearns, M., and Mansour, Y. (1996). Applying the weak learning framework to understand and improve C4.5. In Proc. ICML '96, pages 96-104.
- Lewis, D. D. and Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In Proc. ICML '94, pages 148-156.
- Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. JAIR, 4:77-90.
- Saar-Tsechansky, M. and Provost, F. (2004). Active sampling for class probability estimation and ranking. Machine Learning, 54(2):153-178.
- Turney, P. D. (1995). Bias and the quantification of stability. Machine Learning, 20(1-2):23-33.