A practical way of handling missing values in combination with tree-based learners


V.J.J. Kusters, S.J.J. Leemans, D.M.M. Schunselaar, F. Staals
Eindhoven University of Technology, Den Dolech 2, 5612 AZ Eindhoven

Abstract

In this age of information, techniques for analysing the huge amounts of data that are generated every day are becoming increasingly important. We would often like to use simple tree-based learners on these data sets. Unfortunately, the performance of these algorithms decreases rapidly as more missing values, or nulls, are present in the data set. In this paper we present a practical way of imputing missing values before learning a tree-based model. This method is then applied to the Census-Income (KDD) data set with various amounts of missing values. Additionally, to confirm our expectation that imputation is less useful with a noise-tolerant learner, the same procedure is executed with a Naive Bayes learner. The proposed method is shown to greatly increase the accuracy of tree-based learners on a data set with many missing values.

1 Introduction

As our civilization continues to rely more and more on computer systems, the amount of information that is collected and generated increases vastly. Since data sets quickly become too large to analyse by hand, computer tools are required to classify data and find relationships. Various methods exist for the classification of tuples into categories, but tree-based learners are among the easiest to use. They match the way a human would tackle a classification problem, making them easy to understand. Consequently, tree-based learners are widely used in industry to solve classification problems.

Many real-world data sets contain missing values. Unfortunately, the presence of large amounts of missing values severely impacts the accuracy of tree-based learners. In this paper we present a practical method of imputing missing values prior to learning a model. Our hypothesis is:

Hypothesis 1.1. A method that imputes missing values before applying a tree-based learner produces a more accurate model than a method that applies a tree-based learner directly.

We test our hypothesis on the Census-Income (KDD) data set provided by Asuncion and Newman [1]. This data set contains census data from the U.S. Department of Commerce. Every tuple represents a person and records his or her age, education level, race, et cetera. We learn classification models on this data set to categorise persons by their yearly income: a person belongs to the 50 000- class if their income is at most $50,000, and to the 50 000+ class otherwise. Furthermore, we examine the effects of missing value imputation on the results of a Naive Bayes learner.

In our experiment we compare two methods of learning a classifier on our data set. One method imputes the missing values before learning a classification model, while the other learns the classification model directly. A detailed description of both methods and the required preprocessing steps is given in Section 2. The results of our experiment show that the method which imputes missing values produces a significantly more accurate model when used with a decision tree. With a Naive Bayes learner, there is no significant change in accuracy. Detailed results are presented in Section 3.

2 Approach

To test our hypothesis, we set up our experiment as depicted in Figure 1. We apply a learner to the data set directly (the normal method) and apply the same learner after imputing missing values (the imputation method), and compare the results. Both methods learn a model on the training set. Given a tuple, these models classify the tuple into the 50 000+ or the 50 000- class.

The normal method immediately learns a classification model on the training set. Validation of the results is done by applying the model to the test set. This method is shown on the right side of Figure 1.

The imputation method learns the missing value models and imputes the missing values in the training set before learning the classification model. We use the approach by Quinlan [6] to impute the missing values: a model is learned for every attribute of the data set. The models learned on the training set are then applied to the test set to validate the results. This method is shown on the left side of Figure 1.

We use a decision tree learner to compute the classification model. The learner is the default decision tree provided by RapidMiner [9] and is similar to the C4.5 decision tree presented by Quinlan [8]. To learn the missing value models we use the same decision tree learner that computes the classification model. Note that decision trees are sensitive to missing values; this choice is intentional: to achieve good results, we want to use a learner that works well on a particular data set. If a learner that is not sensitive to missing values produced good results on our data set, we could simply use that learner directly; there would be no need to impute missing values in the first place.

We have focused our research on imputing nominal attributes. Various other authors, such as Fujikawa and Ho [3], have researched the imputation of numerical attributes, but this is beyond the scope of our paper.

[Figure 1: A model of our test setup. Left: the imputation method (impute missing values, learn model, validate). Right: the normal method (learn model, validate).]

Finally, we run the same procedure again, but replace the tree-based learner with a Naive Bayes learner and compare the results.
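To make the imputation step concrete, the sketch below shows the per-attribute idea in Python with pandas and scikit-learn. This is a hypothetical reimplementation, not the RapidMiner operator we actually used: for each nominal attribute with missing values, a decision tree is trained on the rows where that attribute is present, using all other attributes as predictors, and its predictions fill the nulls. The encoding strategy and the min_samples_leaf setting are assumptions.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    def impute_nominal(df: pd.DataFrame) -> pd.DataFrame:
        """Fill missing nominal values one attribute at a time by predicting
        them from the other attributes with a decision tree, in the spirit of
        Quinlan [6]. Hypothetical sketch, not the RapidMiner operator."""
        out = df.copy()
        for col in df.columns:
            mask = df[col].isna()
            if not mask.any():
                continue
            # Predictors: all other attributes, ordinally encoded; their own
            # nulls become the sentinel code -1 so the tree can still split.
            X = df.drop(columns=[col]).apply(lambda s: pd.factorize(s)[0])
            tree = DecisionTreeClassifier(min_samples_leaf=5)  # assumed setting
            tree.fit(X[~mask], df.loc[~mask, col])
            out.loc[mask, col] = tree.predict(X[mask])
        return out

In the actual experiment the per-attribute models are learned on the training set and then applied to the test set; the sketch fits and applies on a single frame for brevity.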

Before we can execute the experiment in Figure 1, some preprocessing steps have to be executed, in the order in which they are listed below.

Sampling. To increase the speed of computation, we consider only the first 15,000 rows of the data set and remove the rest. Since the original data set is not ordered in any particular way, this is equivalent to taking a random sample.

Missing Value Introduction. We introduce extra missing values to investigate the effect of increasing amounts of nulls on missing value imputation. For every experiment, we choose a probability p; each value in the data set is then converted to a null with probability p. Magnani refers to this as the Missing Completely At Random method [5]. p is chosen from the set {0, 0.05, 0.10, 0.15, 0.20, 0.25}.

Duplication. Since our 50 000+ class contains only 6% of the rows in the data set, a naive classification approach would classify everything as 50 000-. This would already yield an accuracy of 94%, and with such a classifier it would be impossible to assess the results of the missing value imputation. To prevent the learner from classifying everything as 50 000-, we duplicate the rows of the 50 000+ class 15 times, after which the two classes are of roughly equal size. After duplicating the rows, we shuffle the data set to prevent clustering of the 50 000+ rows.

Attribute Filtering. One of the attributes in our data set is the instance weight attribute. According to the attribute description accompanying the data set, this attribute should not be used for classification; hence it was removed.

Validation. We split our data set into two disjoint sets: a training set, on which we learn our model, and a test set, on which we validate it. Since we randomised the data set after duplicating the 50 000+ rows, we can simply take the first n rows as the training set and the last n rows as the test set. To make sure our samples are of sufficient size, we tested with n = 2000 and n = 7500.

Implementation. The sampling, missing value introduction, and duplication operations described above were implemented in a custom Python script. The attribute filtering and validation steps were implemented with various operators in RapidMiner [9]. After these steps, we use the RapidMiner MissingValueImputation operator to learn, apply, and store the missing value models.
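For concreteness, the preprocessing could look as follows in Python. Our original script is not reproduced here, so this is a sketch under assumptions: the file name, the label column and its values, the instance weight column name, and the random seed are all placeholders.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)  # assumed seed

    # Sampling: keep the first 15,000 rows. The file is not ordered in any
    # particular way, so this behaves like a random sample. The data set
    # marks native missing values with "?".
    df = pd.read_csv("census-income.csv", na_values="?",
                     skipinitialspace=True).head(15_000)

    # Missing Completely At Random: every feature value is independently
    # turned into a null with probability p (the class label is excluded).
    p = 0.10
    features = df.columns.drop("income")            # assumed label column
    df[features] = df[features].mask(rng.random(df[features].shape) < p)

    # Duplication: replicate the 50 000+ rows 15 times so the classes are of
    # roughly equal size, then shuffle to avoid clustering the duplicates.
    minority = df[df["income"] == "50000+."]        # assumed label value
    df = pd.concat([df] + [minority] * 15, ignore_index=True)
    df = df.sample(frac=1, random_state=0).reset_index(drop=True)

    # Attribute Filtering: drop the instance weight attribute.
    df = df.drop(columns=["instance weight"])       # assumed column name

    # Validation: first n rows train, last n rows test (n = 2000 or 7500).
    n = 7500
    train, test = df.iloc[:n], df.iloc[-n:]

In our setup the attribute filtering and validation steps were actually performed with RapidMiner operators; they appear here only to make the sketch self-contained.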
3 Experimental Validation

Our experiment supports the following claims.

Claim 3.1. The accuracy of a decision tree learner declines significantly as the number of missing values in a data set increases.

Claim 3.2. When using a decision tree learner, missing value imputation grows more effective as the number of missing values in a data set increases.

Claim 3.3. The accuracy of a Naive Bayes learner does not decline significantly as the number of missing values in a data set increases.

Claim 3.4. Missing value imputation has no significant effect on the accuracy of a Naive Bayes learner.

Applying the procedure described in the previous section for varying amounts of missing values and both learners yields the results in Table 1.

Figure 2a shows that the accuracy of the decision tree declines as the number of missing values in the data set increases, which supports Claim 3.1. Additionally, the difference between the accuracies of the two methods widens as the data set contains more missing values, which supports Claim 3.2. The difference is especially clear for the larger sample of 7500 tuples: initially the difference is less than 1%, but at 25% extra missing values it grows to 18%. The smaller sample of 2000 tuples exhibits the same, though less extreme, behaviour.

Figure 2b shows radically different behaviour: there is almost no difference between the imputation method and the normal method. Furthermore, the accuracy of the Naive Bayes learner shows no significant decrease as the number of missing values increases; the difference between the accuracies of the 0% and 25% experiments is less than one percent for the larger sample. This supports Claims 3.3 and 3.4.
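As a rough end-to-end illustration, the whole experiment reduces to a loop over the missingness levels and the two methods. The sketch below reuses the hypothetical helpers from the earlier snippets (preprocess is an assumed wrapper around the preprocessing sketch) and uses scikit-learn's CART-style decision tree as a stand-in for the RapidMiner learner, so its numbers would not match Table 1 exactly; the encoder settings are assumptions.

    from sklearn.metrics import accuracy_score
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.tree import DecisionTreeClassifier

    for p in (0.0, 0.05, 0.10, 0.15, 0.20, 0.25):
        for impute in (False, True):
            train, test = preprocess(p)   # assumed wrapper, see sketch above
            if impute:
                train = impute_nominal(train)
                test = impute_nominal(test)
            # Shared ordinal encoding; unseen and missing categories map to -1.
            enc = OrdinalEncoder(handle_unknown="use_encoded_value",
                                 unknown_value=-1, encoded_missing_value=-1)
            X_tr = enc.fit_transform(train.drop(columns=["income"]))
            X_te = enc.transform(test.drop(columns=["income"]))
            model = DecisionTreeClassifier().fit(X_tr, train["income"])
            acc = accuracy_score(test["income"], model.predict(X_te))
            print(f"p={p:.2f}  imputation={impute}  accuracy={acc:.2%}")

Swapping DecisionTreeClassifier for a Naive Bayes learner (for instance scikit-learn's CategoricalNB, after shifting the codes to be non-negative) gives the second half of the experiment.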

Learner         Extra nulls  MVI  Accuracy (n = 2000)  Accuracy (n = 7500)
decision tree    0%          no   83.75%               90.69%
decision tree    0%          yes  83.65%               90.51%
decision tree    5%          no   77.50%               73.03%
decision tree    5%          yes  79.40%               78.15%
decision tree   10%          no   73.45%               78.76%
decision tree   10%          yes  79.40%               85.87%
decision tree   15%          no   71.15%               76.51%
decision tree   15%          yes  74.35%               83.11%
decision tree   20%          no   73.00%               67.08%
decision tree   20%          yes  80.05%               84.31%
decision tree   25%          no   70.40%               66.68%
decision tree   25%          yes  79.95%               84.92%
Naive Bayes      0%          no   85.85%               84.99%
Naive Bayes      0%          yes  86.10%               84.92%
Naive Bayes      5%          no   83.55%               83.96%
Naive Bayes      5%          yes  83.55%               84.25%
Naive Bayes     10%          no   83.25%               84.85%
Naive Bayes     10%          yes  83.10%               85.23%
Naive Bayes     15%          no   85.80%               84.25%
Naive Bayes     15%          yes  85.10%               84.27%
Naive Bayes     20%          no   85.60%               83.73%
Naive Bayes     20%          yes  85.50%               83.05%
Naive Bayes     25%          no   84.05%               84.11%
Naive Bayes     25%          yes  83.55%               84.01%

Table 1: Decision tree and Naive Bayes accuracies for various amounts of introduced missing values, without (MVI = no) and with (MVI = yes) missing value imputation.

[Figure 2: Results on the Census-Income data set. (a) Decision tree results; (b) Naive Bayes results. Both panels plot accuracy (65-95%) against the percentage of extra missing values introduced (0-25%), for sample sizes 2000 and 7500, each with and without missing value imputation.]

4 Related Work

Several authors have done research related to decision trees and missing values. Magnani and Zamboni [5] state that there are three types of missing value introduction: missing completely at random, missing at random, and not missing at random. In our research the missing completely at random method was used.

Furthermore, other authors have investigated the effect of replacing missing values. Buck [2] describes a method that replaces missing values of an attribute A with the mean value of attribute A. Several articles address the actual imputation of missing values: a method using clustering is given by Fujikawa and Ho [3]; Zhang et al. [11] describe a method using non-parametric regression; and a method that replaces missing values with the literal string "unknown" is presented by Quinlan [7]. Schoier [10] distinguishes hot-deck imputation, in which the imputation is based only on information in the data set itself, from cold-deck imputation, in which information from other data sets is also used to impute the missing values.

A completely different way of handling missing values is presented by Gorodetsky et al. [4]. Their work presents a method that lets classification learners work around missing values instead of imputing them.

5 Conclusions and Future Work

This paper supports the claim that imputing missing values prior to computing a decision tree results in a more accurate classification model. It shows that accuracy improves significantly if the data set contains a substantial amount of missing values. As we expected, imputing missing values does not improve the accuracy of Naive Bayes, which is less sensitive to missing values.

Our work only presents the effect of missing value imputation when used with a decision tree and a Naive Bayes learner. Further research with other classification learners, other missing value imputation learners, other data sets, and larger amounts of missing values is needed to better understand the advantages and limitations of missing value imputation.

References

[1] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.

[2] S.F. Buck. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. Journal of the Royal Statistical Society, Series B (Methodological), pages 302-306, 1960.

[3] Y. Fujikawa and T.B. Ho. Cluster-based algorithms for dealing with missing values. Lecture Notes in Computer Science, pages 549-554, 2002.

[4] V. Gorodetsky, O. Karsaev, and V. Samoilov. Direct mining of rules from data with missing values. In T.Y. Lin, S. Ohsuga, C.J. Liau, X.T. Hu, and S. Tsumoto (Eds.), Foundations of Data Mining and Knowledge Discovery, Studies in Computational Intelligence, Springer, pages 233-264, 2005.

[5] M. Magnani and M.A. Zamboni. Techniques for dealing with missing data in knowledge discovery tasks. http://magnanim.web.cs.unibo.it/index.html, 2004.

[6] J.R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.

[7] J.R. Quinlan. Unknown attribute values in induction. In Proceedings of the Sixth International Workshop on Machine Learning, pages 164-168. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1989.

[8] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[9] RapidMiner. http://rapid-i.com/, visited on 19-06-2009.

[10] G. Schoier. On partial nonresponse situations: the hot deck imputation method. Retrieved May 14, 2004.

[11] C. Zhang, X. Zhu, J. Zhang, Y. Qin, and S. Zhang. GBKII: An Imputation Method for Missing Values. Lecture Notes in Computer Science, 4426:1080, 2007.