A practical way of handling missing values in combination with tree-based learners


V.J.J. Kusters, S.J.J. Leemans, D.M.M. Schunselaar, F. Staals
Eindhoven University of Technology, Den Dolech 2, 5612 AZ Eindhoven

Abstract

In this age of information, techniques for analysing the huge amounts of data that are generated every day are becoming increasingly important. We would often like to use simple tree-based learners on these data sets. Unfortunately, the performance of these algorithms decreases rapidly as more missing values, or nulls, are present in the data set. In this paper we present a practical way of imputing missing values before learning a tree-based model. This method is then applied to the Census-Income (KDD) data set with various amounts of missing values. Additionally, to confirm our expectation that imputation is less useful with a noise-tolerant learner, the same procedure is executed with a Naive Bayes learner. The proposed method is shown to greatly increase the accuracy of tree-based learners on a data set with many missing values.

1 Introduction

As our civilization continues to rely more and more on computer systems, the amount of information that is collected and generated increases vastly. Since data sets quickly become too large to analyse by hand, computer tools are required to classify data and find relationships. Various methods exist for the classification of tuples into categories, but tree-based learners are among the easiest to use. They match the way a human would tackle a classification problem, making them easy to understand. Consequently, tree-based learners are widely used in industry to solve classification problems.

Many real-world data sets contain missing values. Unfortunately, the presence of large amounts of missing values severely impacts the accuracy of tree-based learners. In this paper we present a practical method of imputing missing values prior to learning a model. Our hypothesis is:

Hypothesis 1.1. A method that imputes missing values before applying a tree-based learner produces a more accurate model than a method that applies a tree-based learner directly.

We test our hypothesis on the Census-Income (KDD) data set provided by Asuncion and Newman [1]. This data set contains census data from the U.S. Department of Commerce. Every tuple represents a person and records his or her age, education level, race, et cetera. We learn classification models on this data set to categorise persons by their yearly income: a person belongs to the 50 000- class if their income is at most $50,000, and to the 50 000+ class otherwise. Furthermore, we examine the effects of missing value imputation on the results of a Naive Bayes learner.

In our experiment we compare two methods of learning a classifier on our data set. One method imputes the missing values before learning a classification model, while the other learns the classification model directly. A detailed description of both methods and the required preprocessing steps is given in Section 2. The results of our experiment show that the method which imputes missing values produces a significantly more accurate model when used with a decision tree. With a Naive Bayes learner, there is no significant change in accuracy. Detailed results are presented in Section 3.

2 Approach

To test our hypothesis, we set up our experiment as depicted in Figure 1. We apply a learner to the data set directly (the normal method) and apply the same learner after imputing missing values (the imputation method), and compare the results. Both methods learn a model on the training set. Given a tuple, these models classify the tuple into the 50 000+ or the 50 000- class.

The normal method immediately learns a classification model on the training set. Validation of the results is done by applying the model to the test set. This method is shown on the right side of Figure 1.

The imputation method learns the missing value models and imputes the missing values in the training set before learning the classification model. We use the approach by Quinlan [6] to impute the missing values: a model is learned for every attribute of the data set. The models learned on the training set are then applied to the test set to validate the results. This method is shown on the left side of Figure 1.

We use a decision tree learner to compute the classification model. The learner is the default decision tree provided by RapidMiner [9] and is similar to the C4.5 decision tree presented by Quinlan [8]. To learn the missing value models we use the same decision tree learner that computes the classification model. Note that decision trees are sensitive to missing values; this choice is intentional: to achieve good results, we want to use a learner that works well on a particular data set. If a learner that is not sensitive to missing values produced good results on our data set, we could simply use that learner directly; there would be no need to impute missing values in the first place.

We have focused our research on imputing nominal attributes. Various other authors, such as Fujikawa and Ho [3], have researched the imputation of numerical attributes, but this is beyond the scope of our paper.

[Figure 1: A model of our test setup. Left: the imputation method (impute missing values, learn model, validate). Right: the normal method (learn model, validate).]

Finally, we run the same procedure again, but replace the tree-based learner with a Naive Bayes learner and compare the results.
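To make the imputation step concrete, the sketch below shows the per-attribute idea in Python with pandas and scikit-learn. This is a hypothetical reimplementation, not the RapidMiner operator we actually used: for each nominal attribute with missing values, a decision tree is trained on the rows where that attribute is present, using all other attributes as predictors, and its predictions fill the nulls. The encoding strategy and the min_samples_leaf setting are assumptions.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    def impute_nominal(df: pd.DataFrame) -> pd.DataFrame:
        """Fill missing nominal values one attribute at a time by predicting
        them from the other attributes with a decision tree, in the spirit of
        Quinlan [6]. Hypothetical sketch, not the RapidMiner operator."""
        out = df.copy()
        for col in df.columns:
            mask = df[col].isna()
            if not mask.any():
                continue
            # Predictors: all other attributes, ordinally encoded; their own
            # nulls become the sentinel code -1 so the tree can still split.
            X = df.drop(columns=[col]).apply(lambda s: pd.factorize(s)[0])
            tree = DecisionTreeClassifier(min_samples_leaf=5)  # assumed setting
            tree.fit(X[~mask], df.loc[~mask, col])
            out.loc[mask, col] = tree.predict(X[mask])
        return out

In the actual experiment the per-attribute models are learned on the training set and then applied to the test set; the sketch fits and applies on a single frame for brevity.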

Before we can execute the experiment in Figure 1, some preprocessing steps have to be executed, in the order in which they are listed below.

Sampling. To increase the speed of computation, we consider only the first 15,000 rows of the data set and remove the rest. Since the original data set is not ordered in any particular way, this is equivalent to taking a random sample.

Missing Value Introduction. We introduce extra missing values to investigate the effect of increasing amounts of nulls on missing value imputation. For every experiment, we choose a probability p; each value in the data set is then converted to a null with probability p. Magnani refers to this as the Missing Completely At Random method [5]. p is chosen from the set {0, 0.05, 0.10, 0.15, 0.20, 0.25}.

Duplication. Since our 50 000+ class contains only 6% of the rows in the data set, a naive classification approach would classify everything as 50 000-. This would already yield an accuracy of 94%, and with such a classifier it would be impossible to assess the results of the missing value imputation. To prevent the learner from classifying everything as 50 000-, we duplicate the rows of the 50 000+ class 15 times, after which the two classes are of roughly equal size. After duplicating the rows, we shuffle the data set to prevent clustering of the 50 000+ rows.

Attribute Filtering. One of the attributes in our data set is the instance weight attribute. According to the attribute description accompanying the data set, this attribute should not be used for classification; hence it was removed.

Validation. We split our data set into two disjoint sets: a training set, on which we learn our model, and a test set, on which we validate it. Since we randomised the data set after duplicating the 50 000+ rows, we can simply take the first n rows as the training set and the last n rows as the test set. To make sure our samples are of sufficient size, we tested with n = 2000 and n = 7500.

Implementation. The sampling, missing value introduction, and duplication operations described above were implemented in a custom Python script. The attribute filtering and validation steps were implemented with various operators in RapidMiner [9]. After these steps, we use the RapidMiner MissingValueImputation operator to learn, apply, and store the missing value models.
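For concreteness, the preprocessing could look as follows in Python. Our original script is not reproduced here, so this is a sketch under assumptions: the file name, the label column and its values, the instance weight column name, and the random seed are all placeholders.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)  # assumed seed

    # Sampling: keep the first 15,000 rows. The file is not ordered in any
    # particular way, so this behaves like a random sample. The data set
    # marks native missing values with "?".
    df = pd.read_csv("census-income.csv", na_values="?",
                     skipinitialspace=True).head(15_000)

    # Missing Completely At Random: every feature value is independently
    # turned into a null with probability p (the class label is excluded).
    p = 0.10
    features = df.columns.drop("income")            # assumed label column
    df[features] = df[features].mask(rng.random(df[features].shape) < p)

    # Duplication: replicate the 50 000+ rows 15 times so the classes are of
    # roughly equal size, then shuffle to avoid clustering the duplicates.
    minority = df[df["income"] == "50000+."]        # assumed label value
    df = pd.concat([df] + [minority] * 15, ignore_index=True)
    df = df.sample(frac=1, random_state=0).reset_index(drop=True)

    # Attribute Filtering: drop the instance weight attribute.
    df = df.drop(columns=["instance weight"])       # assumed column name

    # Validation: first n rows train, last n rows test (n = 2000 or 7500).
    n = 7500
    train, test = df.iloc[:n], df.iloc[-n:]

In our setup the attribute filtering and validation steps were actually performed with RapidMiner operators; they appear here only to make the sketch self-contained.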
3 Experimental Validation

Our experiment supports the following claims.

Claim 3.1. The accuracy of a decision tree learner declines significantly as the number of missing values in a data set increases.

Claim 3.2. When using a decision tree learner, missing value imputation grows more effective as the number of missing values in a data set increases.

Claim 3.3. The accuracy of a Naive Bayes learner does not decline significantly as the number of missing values in a data set increases.

Claim 3.4. Missing value imputation has no significant effect on the accuracy of a Naive Bayes learner.

Applying the procedure described in the previous section for varying amounts of missing values and both learners yields the results in Table 1.

Figure 2a shows that the accuracy of the decision tree declines as the number of missing values in the data set increases, which supports Claim 3.1. Additionally, the difference between the accuracies of the two methods widens as the data set contains more missing values, which supports Claim 3.2. The difference is especially clear for the larger sample of 7500 tuples: initially the difference is less than 1%, but at 25% extra missing values it grows to 18%. The smaller sample of 2000 tuples exhibits the same, though less extreme, behaviour.

Figure 2b shows radically different behaviour: there is almost no difference between the imputation method and the normal method. Furthermore, the accuracy of the Naive Bayes learner shows no significant decrease as the number of missing values increases; the difference between the accuracies of the 0% and 25% experiments is less than one percent for the larger sample. This supports Claims 3.3 and 3.4.
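As a rough end-to-end illustration, the whole experiment reduces to a loop over the missingness levels and the two methods. The sketch below reuses the hypothetical helpers from the earlier snippets (preprocess is an assumed wrapper around the preprocessing sketch) and uses scikit-learn's CART-style decision tree as a stand-in for the RapidMiner learner, so its numbers would not match Table 1 exactly; the encoder settings are assumptions.

    from sklearn.metrics import accuracy_score
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.tree import DecisionTreeClassifier

    for p in (0.0, 0.05, 0.10, 0.15, 0.20, 0.25):
        for impute in (False, True):
            train, test = preprocess(p)   # assumed wrapper, see sketch above
            if impute:
                train = impute_nominal(train)
                test = impute_nominal(test)
            # Shared ordinal encoding; unseen and missing categories map to -1.
            enc = OrdinalEncoder(handle_unknown="use_encoded_value",
                                 unknown_value=-1, encoded_missing_value=-1)
            X_tr = enc.fit_transform(train.drop(columns=["income"]))
            X_te = enc.transform(test.drop(columns=["income"]))
            model = DecisionTreeClassifier().fit(X_tr, train["income"])
            acc = accuracy_score(test["income"], model.predict(X_te))
            print(f"p={p:.2f}  imputation={impute}  accuracy={acc:.2%}")

Swapping DecisionTreeClassifier for a Naive Bayes learner (for instance scikit-learn's CategoricalNB, after shifting the codes to be non-negative) gives the second half of the experiment.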

Learner         Extra nulls  MVI  Accuracy (n = 2000)  Accuracy (n = 7500)
decision tree    0%          no   83.75%               90.69%
decision tree    0%          yes  83.65%               90.51%
decision tree    5%          no   77.50%               73.03%
decision tree    5%          yes  79.40%               78.15%
decision tree   10%          no   73.45%               78.76%
decision tree   10%          yes  79.40%               85.87%
decision tree   15%          no   71.15%               76.51%
decision tree   15%          yes  74.35%               83.11%
decision tree   20%          no   73.00%               67.08%
decision tree   20%          yes  80.05%               84.31%
decision tree   25%          no   70.40%               66.68%
decision tree   25%          yes  79.95%               84.92%
Naive Bayes      0%          no   85.85%               84.99%
Naive Bayes      0%          yes  86.10%               84.92%
Naive Bayes      5%          no   83.55%               83.96%
Naive Bayes      5%          yes  83.55%               84.25%
Naive Bayes     10%          no   83.25%               84.85%
Naive Bayes     10%          yes  83.10%               85.23%
Naive Bayes     15%          no   85.80%               84.25%
Naive Bayes     15%          yes  85.10%               84.27%
Naive Bayes     20%          no   85.60%               83.73%
Naive Bayes     20%          yes  85.50%               83.05%
Naive Bayes     25%          no   84.05%               84.11%
Naive Bayes     25%          yes  83.55%               84.01%

Table 1: Decision tree and Naive Bayes accuracies for various amounts of introduced missing values, without (MVI = no) and with (MVI = yes) missing value imputation.

[Figure 2: Results on the Census-Income data set. (a) Decision tree results; (b) Naive Bayes results. Both panels plot accuracy (65-95%) against the percentage of extra missing values introduced (0-25%), for sample sizes 2000 and 7500, each with and without missing value imputation.]

4 Related Work

Several authors have done research related to decision trees and missing values. Magnani and Zamboni [5] state that there are three types of missing value introduction: missing completely at random, missing at random, and not missing at random. In our research the missing completely at random method was used.

Furthermore, other authors have investigated the effect of replacing missing values. Buck [2] describes a method that replaces missing values of an attribute A with the mean value of attribute A. Several articles address the actual imputation of missing values: a method using clustering is given by Fujikawa and Ho [3]; Zhang et al. [11] describe a method using non-parametric regression; and a method that replaces missing values with the literal string "unknown" is presented by Quinlan [7]. Schoier [10] distinguishes hot-deck imputation, in which the imputation is based only on information in the data set itself, from cold-deck imputation, in which information from other data sets is also used to impute the missing values.

A completely different way of handling missing values is presented by Gorodetsky et al. [4]. Their work presents a method that lets classification learners work around missing values instead of imputing them.

5 Conclusions and Future Work

This paper supports the claim that imputing missing values prior to computing a decision tree results in a more accurate classification model. It shows that accuracy improves significantly if the data set contains a substantial amount of missing values. As we expected, imputing missing values does not improve the accuracy of Naive Bayes, which is less sensitive to missing values.

Our work only presents the effect of missing value imputation when used with a decision tree and a Naive Bayes learner. Further research with other classification learners, other missing value imputation learners, other data sets, and larger amounts of missing values is needed to better understand the advantages and limitations of missing value imputation.

References

[1] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.

[2] S.F. Buck. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. Journal of the Royal Statistical Society, Series B (Methodological), pages 302-306, 1960.

[3] Y. Fujikawa and T.B. Ho. Cluster-based algorithms for dealing with missing values. Lecture Notes in Computer Science, pages 549-554, 2002.

[4] V. Gorodetsky, O. Karsaev, and V. Samoilov. Direct mining of rules from data with missing values. In T.Y. Lin, S. Ohsuga, C.J. Liau, X.T. Hu, and S. Tsumoto (Eds.), Foundations of Data Mining and Knowledge Discovery, Studies in Computational Intelligence, Springer, pages 233-264, 2005.

[5] M. Magnani and M.A. Zamboni. Techniques for dealing with missing data in knowledge discovery tasks. http://magnanim.web.cs.unibo.it/index.html, 2004.

[6] J.R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.

[7] J.R. Quinlan. Unknown attribute values in induction. In Proceedings of the Sixth International Workshop on Machine Learning, pages 164-168. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1989.

[8] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[9] RapidMiner. http://rapid-i.com/, visited on 19-06-2009.

[10] G. Schoier. On partial nonresponse situations: the hot deck imputation method. Retrieved May 14, 2004.

[11] C. Zhang, X. Zhu, J. Zhang, Y. Qin, and S. Zhang. GBKII: An Imputation Method for Missing Values. Lecture Notes in Computer Science, 4426:1080, 2007.