Investigation of Property Valuation Models Based on Decision Tree Ensembles Built over Noised Data

Tadeusz Lasota 1, Tomasz Łuczak 2, Michał Niemczyk 2, Michał Olszewski 2, Bogdan Trawiński 2

1 Wrocław University of Environmental and Life Sciences, Dept. of Spatial Management, ul. Norwida 25/27, 50-375 Wrocław, Poland
2 Wrocław University of Technology, Institute of Informatics, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
tadeusz.lasota@up.wroc.pl, {tomasz.luczak, bogdan.trawinski}@pwr.wroc.pl, {michal.niemczyk, michal.olszewski}@student.pwr.wroc.pl

Abstract. Ensemble machine learning methods incorporating bagging, random subspace, random forest, and rotation forest, employing decision trees, i.e. Pruned Model Trees, as base learning algorithms, were developed in the WEKA environment. The methods were applied to the real-world regression problem of predicting the prices of residential premises based on historical data of sales/purchase transactions. The accuracy of the ensembles generated by these methods was compared for several levels of noise injected into an attribute, the output, and both the attribute and the output. Ensembles built using rotation forest outperformed all other models. In turn, the random subspace method resulted in the models most resistant to noised data.

Keywords: pruned model trees, bagging, random subspaces, random forest, rotation forest, cross-validation, property valuation, noised data

1 Introduction

Dealing with noisy data is one of the key aspects of supervised machine learning when creating reliable data-driven models. Noisy data may strongly affect the accuracy of the resulting models and can degrade system performance in terms of predictive accuracy, processing efficiency, and the size of the learner. Several works on the impact of noise, mainly in the context of classification problems and class noise, have been published.
In [1] increasing the size of the training set by adding noise to the training objects was explored for different amounts and directions of noise injection. It was shown theoretically and empirically that k-nearest neighbors directed noise injection was preferable to Gaussian spherical noise injection when using multilayer perceptrons. In [2] noise was injected into both input attributes and output classes. The results varied depending on the noise type and the specific data set being processed. Naïve Bayes turned out to be the most robust algorithm, and SMO (support vector machine) the least. In [3] it was observed that attribute noise was less harmful than class noise. Moreover, the higher the correlation between an attribute and the class, the more negative impact the attribute noise may
have. The authors recommend handling noisy instances before a learner is generated. In [4] two different class noise types were applied to training sets. Fuzzy Rule Based Classification Systems revealed a good tolerance to class noise in comparison to the C4.5 crisp algorithm, which is considered resistant to noise. In [5] the performance of several ensemble models learned from imbalanced and noisy binary-class data was compared. As a result, a clear preference for bagging over boosting was shown. We have recently studied the impact of noised data on the performance of ensemble models for a regression problem [6]. We injected noise into output values and showed that the random subspace and random forest techniques, where the diversity of component models is achieved by manipulating features, were more resistant to noise than classic resampling techniques such as bagging, repeated holdout, and repeated cross-validation.

For a few years we have been investigating techniques for developing an intelligent system to assist with real estate appraisal, devoted to a broad spectrum of users interested in premises management. The outline of the system, to be exploited on a cloud computing platform, is presented in Fig. 1. Public registers and cadastral systems constitute a complex data source for the intelligent system of the real estate market. The core of the system consists of valuation models, including models constructed according to professional standards as well as data-driven models generated using machine learning algorithms. So far, we have investigated several methods of constructing ensembles of regression models to be incorporated into the system, including various resampling techniques, random subspaces, random forests, and rotation forests. As base learning algorithms, weak learners such as evolutionary fuzzy systems, neural networks, and decision trees were employed [7], [8], [9], [10], [11], [12], [13].

Fig. 1.
Outline of the intelligent system of the real estate market

The first goal of the investigation presented in this paper is to compare empirically ensemble machine learning methods incorporating bagging, random subspace, random forest, and rotation forest employing decision trees as base learners. Bagging, which stands for bootstrap aggregating, devised by Breiman [14], is one of the most intuitive and simplest ensemble algorithms, providing good performance. Another approach to ensemble learning, called random subspaces and also known as attribute bagging, seeks learner diversity through feature-space subsampling [15]. The method called random forest, which merges these two approaches, was worked out by
Breiman [16]. Random forest uses bootstrap selection to supply each individual learner with training data and limits the feature space by random selection. Rodríguez et al. [17] proposed in 2006 a new classifier ensemble method, called rotation forest, which applies Principal Component Analysis (PCA) to rotate the original feature axes in order to obtain different training sets for learning base classifiers.

The second goal is to examine the performance of the ensemble methods when dealing with noisy data. The noise was artificially injected into an attribute, the output, and both the attribute and the output. The susceptibility to noised data can be an important criterion for the selection of appropriate machine learning methods for our automated valuation system. We do not know the purpose of a property valuation. For example, the prices estimated to secure loans may differ substantially from the prices appraised to calculate taxes. We do not know what sorts of properties and locations were in vogue at the moment of sale. Moreover, market instability and uncertainty cause investors to take irrational sales/purchase decisions. Hence, we may assume that the historical data we use to create real estate valuation models contain much noise.

2 Methods Used and Experimental Setup

We conducted a series of experiments to compare bagging (Bag), random subspace (RaS), random forest (RaF), and rotation forest (RtF) models with single models (Sgl) with respect to their predictive accuracy, using cadastral data on sales/purchase transactions of residential premises. All tests were accomplished using WEKA (Waikato Environment for Knowledge Analysis), a non-commercial and open source data mining system [18]. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.
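The three resampling-based schemes introduced above differ only in how each member's training rows and feature subset are drawn. The following minimal Python sketch illustrates this (it is not the WEKA setup used in our experiments; the least-squares base learner is an assumed stand-in for M5P, chosen only to keep the example self-contained):

```python
import numpy as np

def fit_linear(X, y):
    # Ordinary least squares with an intercept; a simple stand-in
    # for the M5P base learner (an assumption for brevity).
    Xb = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_linear(w, X):
    return np.column_stack([X, np.ones(len(X))]) @ w

def build_ensemble(X, y, n_estimators=50, bootstrap=True,
                   feature_frac=1.0, seed=0):
    # Bag: bootstrap=True,  feature_frac=1.0
    # RaS: bootstrap=False, feature_frac=0.75
    # RaF: bootstrap=True,  feature_frac=0.75
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = max(1, int(round(feature_frac * d)))
    members = []
    for _ in range(n_estimators):
        rows = rng.integers(0, n, n) if bootstrap else np.arange(n)
        cols = rng.choice(d, size=k, replace=False)
        members.append((cols, fit_linear(X[np.ix_(rows, cols)], y[rows])))
    return members

def predict_ensemble(members, X):
    # Arithmetic mean aggregation of member predictions.
    return np.mean([predict_linear(w, X[:, cols])
                    for cols, w in members], axis=0)
```

Rotation forest is omitted from the sketch, since its PCA-based rotation of feature groups does not reduce to a choice of rows and columns.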
The WEKA decision tree algorithm very often used for building and exploring ensemble models, namely the Pruned Model Tree (M5P), was employed to carry out the experiments. M5P implements routines for generating M5 model trees. The algorithm is based on decision trees; however, instead of having values at the tree's nodes, it contains a multivariate linear regression model at each node. The input space is divided into cells using the training data and their outcomes, and then a regression model is built in each cell as a leaf of the tree.

The real-world dataset used in the experiments was drawn from an unrefined dataset containing above 100 000 records referring to residential premises transactions accomplished in one big Polish city with a population of 640 000 within the 14 years from 1998 to 2011. The final dataset comprised 9795 samples. The following four attributes were pointed out as main price drivers by professional appraisers: usable area of a flat (Area), age of the building construction (Age), number of storeys in the building (Storeys), and the distance of the building from the city centre (Centre); in turn, the price of premises (Price) was the output variable. For the random subspace, random forest, and rotation forest approaches four more features were employed: number of rooms in the flat including the kitchen (Rooms), geodetic coordinates of the building (Xc and Yc), and its distance from the nearest shopping centre (Shopping).

Due to the fact that the prices of premises change substantially in the course of time, the whole 14-year dataset cannot be used to create data-driven models using
machine learning. Therefore it was split into subsets covering individual years, and we might assume that within one year the prices of premises with similar attributes were roughly comparable. Starting from the beginning of 1998, the prices were updated for the last day of subsequent years using trends modelled by polynomials of degree four. We might assume that the one-year datasets differed from each other and might constitute different observation points for comparing the accuracy of ensemble models in our study and carrying out statistical tests. The sizes of the 14 one-year datasets are given in Table 1.

Table 1. Number of instances in one-year datasets
Year:      1998 1999 2000 2001 2002 2003 2004
Instances:  446  646  554  626  573  790  774
Year:      2005 2006 2007 2008 2009 2010 2011
Instances:  740  776  442  734  821 1296  577

The following methods were applied in the experiments:
Sgl: M5P algorithm with the number of features equal to 4. In this case single models were built, therefore there was only one iteration of the algorithm.
Bag: Bagging with the M5P algorithm; the size of each bag was set to 100% of the training set, and the number of bagging iterations was set to 50.
RaS: Random subspace with the M5P algorithm; the size of each subspace was set to 75% of all attributes, and the number of random subspace iterations was set to 50.
RaF: Random forest, i.e. bagging with M5P as the Filtered Classifier and Random Subset as a filter. The size of each bag was set to 100% of the training set, the number of bagging iterations was set to 50, and the number of attributes in the random subset was set to 75% of all attributes.
RtF: Rotation forest with the M5P algorithm set as a classifier. The maximum and minimum number of groups was set to 4, the percentage of instances to be removed was set to 20%, and as a projection filter Principal Components with default parameters was used. The number of rotation forest iterations was set to 50.

Fig. 2. Outline of the experiment with the random forest method within the 10cv frame. The procedure is repeated 10 times according to the 10cv schema.
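The degree-four trend update of prices described earlier in this section can be sketched as follows. This is an illustrative reconstruction: the paper states only that trends were modelled by polynomials of degree four, so the multiplicative correction toward the year-end trend value is an assumption.

```python
import numpy as np

def update_prices_to_year_end(t, price, t_end, degree=4):
    # Fit a polynomial market trend (degree 4, as in the paper) and
    # rescale each transaction price to the trend level at the end of
    # the year; the multiplicative correction is an assumption.
    coeffs = np.polyfit(t, price, degree)        # market trend model
    trend_at_sale = np.polyval(coeffs, t)        # trend at sale dates
    trend_at_end = np.polyval(coeffs, t_end)     # trend at year end
    return price * (trend_at_end / trend_at_sale)
```

After such an update, transactions from different months of the same year become comparable at a common valuation date.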
For each method, 10-fold cross-validation repeated ten times was used as a result generator. The schema of an experiment using RaF within the WEKA 10cv frame is shown in
Fig. 2. As the performance function the root mean square error (RMSE) was used, and as the aggregation function of the ensembles the arithmetic mean was employed.

During our research we analyzed the impact of data noise on the performance of the described ensemble methods. During the first run of the experiment no values were changed. Next, we replaced 1%, 5%, 10%, 20%, 30%, 40%, and 50% of randomly selected input values (Area) in the training and testing sets with noised values. Then, we applied the same processing to the output value (Price). Finally, we replaced in the same way both the input values (Area) and the output values (Price) simultaneously. The noised values were generated randomly from the range [Q1 - 1.5 x IQR, Q3 + 1.5 x IQR], where Q1 and Q3 denote the values of the first and third quartiles, and IQR stands for the interquartile range. This assured that the numbers replacing the original values were not outliers. The schemata illustrating the three modes of noise injection are given in Figures 3, 4, and 5.

Fig. 3. Schema illustrating injection of noise into the input variable (A)
Fig. 4. Schema illustrating injection of noise into the output variable (O)
Fig. 5. Schema illustrating injection of noise into both the input and output variables (AO)

3 Results of Experiments

The accuracy of the Sgl, Bag, RaS, RaF, and RtF models created using M5P for non-noised data and for data with 10% noise injected into the attribute Area (A), the output Price (O), and both the attribute and the output (AO) is shown in Figures 6-9, respectively. In the charts it is clearly seen that the RtF ensembles reveal the best performance, whereas the
biggest values of RMSE are provided by the Sgl and RaF models. Moreover, noise injected into the output results in a higher error rate than noise introduced into the attribute. The Friedman tests performed in respect of the RMSE values of all models built over the 14 one-year datasets showed that there are significant differences among the models for each noise injection mode considered. The average rank positions, determined by the Friedman test, of the single and ensemble models for different levels of noise injected into the attribute (A), the output (O), and both the attribute and the output (AO) are shown in Tables 2, 3, and 4, respectively. In all tables, the lower the rank value, the better the model. In each table the RtF models are in the first place, and the Sgl and RaF ones occupy the last positions. The further Wilcoxon paired tests indicated that there were no statistically significant differences between the RtF and Bag models for (O), between the RtF and RaS models for (AO), or between the RaF and Sgl models for any of the noise injection modes.

Fig. 6. Performance of single and ensemble models for non-noised data
Fig. 7. Performance of single and ensemble models for 10% noise injected into attribute (A)
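The noise used in these comparisons was generated as described in Section 2. A minimal sketch of that quartile-bounded replacement is given below; drawing uniformly within the bounds is an assumption, since the paper specifies only the range [Q1 - 1.5 x IQR, Q3 + 1.5 x IQR]:

```python
import numpy as np

def inject_noise(values, fraction, rng=None):
    # Replace `fraction` of the values with random numbers drawn from
    # [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; the uniform draw is an assumption,
    # as the paper states only the range. The bounds ensure that no
    # injected value is an outlier.
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(values, dtype=float).copy()
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    idx = rng.choice(v.size, size=int(round(fraction * v.size)),
                     replace=False)
    v[idx] = rng.uniform(lo, hi, size=idx.size)
    return v
```

Applied to a column such as Area or Price with fraction values from 0.01 up to 0.50, this reproduces the noise levels examined in the experiments.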
Fig. 8. Performance of single and ensemble models for 10% noise injected into output (O)
Fig. 9. Performance of models for 10% noise injected into both attribute and output (AO)

Table 2. Average rank positions of single and ensemble models for different levels of noise injected into attribute (A) determined during the Friedman test
Noise  1st         2nd         3rd         4th         5th
0%     RtF (1.29)  Bag (2.36)  RaS (3.36)  Sgl (3.79)  RaF (4.21)
5%     RtF (1.14)  RaS (2.07)  Bag (3.00)  RaF (4.36)  Sgl (4.43)
10%    RtF (1.07)  RaS (1.93)  Bag (3.21)  RaF (4.14)  Sgl (4.64)
20%    RtF (1.21)  RaS (1.79)  Bag (3.29)  RaF (4.00)  Sgl (4.71)
30%    RtF (1.14)  RaS (1.86)  Bag (3.36)  RaF (3.79)  Sgl (4.86)
40%    RtF (1.14)  RaS (1.86)  Bag (3.29)  RaF (3.86)  Sgl (4.86)
50%    RtF (1.00)  RaS (2.00)  Bag (3.29)  RaF (3.93)  Sgl (4.79)

Table 3. Average rank positions of single and ensemble models for different levels of noise injected into output (O) determined during the Friedman test
Noise  1st         2nd         3rd         4th         5th
0%     RtF (1.29)  Bag (2.36)  RaS (3.36)  Sgl (3.79)  RaF (4.21)
5%     RtF (1.43)  Bag (2.21)  RaS (3.43)  Sgl (3.79)  RaF (4.14)
10%    RtF (1.43)  Bag (2.00)  RaS (3.50)  Sgl (3.50)  RaF (4.57)
20%    RtF (1.57)  Bag (2.43)  Sgl (3.36)  RaS (3.43)  RaF (4.21)
30%    RtF (1.79)  Bag (2.36)  Sgl (3.21)  RaS (3.50)  RaF (4.14)
40%    RtF (2.00)  Bag (2.07)  Sgl (3.14)  RaS (3.50)  RaF (4.29)
50%    RtF (2.00)  Bag (2.14)  Sgl (3.07)  RaS (3.64)  RaF (4.14)
Table 4. Average rank positions of single and ensemble models for different levels of noise injected into both attribute and output (AO) determined during the Friedman test
Noise  1st         2nd         3rd         4th         5th
0%     RtF (1.29)  Bag (2.36)  RaS (3.36)  Sgl (3.79)  RaF (4.21)
5%     RtF (1.07)  RaS (2.50)  Bag (2.86)  RaF (4.14)  Sgl (4.43)
10%    RtF (1.29)  RaS (1.86)  Bag (3.07)  RaF (4.29)  Sgl (4.50)
20%    RtF (1.36)  RaS (1.64)  Bag (3.57)  RaF (3.86)  Sgl (4.57)
30%    RtF (1.14)  RaS (1.86)  Bag (3.43)  RaF (4.07)  Sgl (4.50)
40%    RtF (1.36)  RaS (1.64)  Bag (3.43)  RaF (3.79)  Sgl (4.79)
50%    RtF (1.14)  RaS (1.86)  Bag (3.21)  RaF (3.86)  Sgl (4.93)

Table 5. Median percentage loss of performance for data with noise vs non-noised data for different levels of noise injected into attribute (A)
Noise  Sgl    Bag    RaS    RaF    RtF
1%     4.7%   4.7%   3.6%   3.9%   4.3%
5%     13.8%  14.2%  8.3%   10.3%  10.9%
10%    22.4%  22.9%  12.7%  18.1%  16.0%
20%    35.8%  37.0%  17.7%  30.5%  22.1%
30%    44.0%  44.8%  22.6%  39.1%  27.3%
40%    54.4%  52.4%  27.1%  46.8%  29.0%
50%    54.8%  57.9%  28.4%  49.5%  33.5%

Table 6. Median percentage loss of performance for data with noise vs non-noised data for different levels of noise injected into output (O)
Noise  Sgl    Bag    RaS    RaF    RtF
1%     2.6%   2.7%   2.8%   2.8%   2.9%
5%     16.4%  15.6%  14.8%  13.8%  16.4%
10%    23.6%  25.4%  28.6%  26.4%  32.5%
20%    45.3%  45.7%  41.4%  41.8%  46.7%
30%    61.8%  64.2%  61.4%  58.3%  67.1%
40%    72.9%  75.7%  72.1%  68.3%  80.3%
50%    84.6%  86.2%  82.8%  79.4%  88.9%

Table 7. Median percentage loss of performance for data with noise vs non-noised data for different levels of noise injected into both attribute and output (AO)
Noise  Sgl    Bag    RaS    RaF    RtF
1%     7.2%   6.0%   5.9%   5.9%   6.4%
5%     22.9%  22.7%  21.3%  21.6%  23.4%
10%    41.1%  41.9%  37.4%  39.8%  40.1%
20%    64.7%  68.1%  55.9%  60.4%  62.1%
30%    79.6%  80.1%  68.1%  71.9%  75.6%
40%    90.6%  91.5%  80.1%  84.5%  86.2%
50%    93.7%  90.7%  82.5%  89.9%  90.5%

As for the susceptibility to noise of the individual ensemble methods, the general outcome is as follows.
Injecting subsequent levels of noise results in progressively worse accuracy. The percentage loss of performance for data with 1%, 5%, 10%, 20%, 30%, 40%, and 50% noise versus non-noised data was computed for each one-year dataset. The aggregate results, in terms of the median over all datasets, are presented in Tables 5, 6, and 7. The amount of loss differs for individual datasets and increases with the percentage of noise. The most important observation is that in each case the
average loss of accuracy for RaS is lower than for the other models. We obtained similar results in our previous research into the susceptibility to noise of ensemble models built with genetic fuzzy systems as base learning methods [6].

4 Conclusions and Future Work

A series of experiments aimed at comparing ensemble machine learning methods encompassing bagging, random subspace, random forest, and rotation forest was conducted. The ensemble models were created using a decision tree algorithm over real-world data taken from a cadastral system. Moreover, the susceptibility to noise of these ensemble methods was examined. The noise was injected into an attribute, the output, and both the attribute and the output by replacing the original values with numbers randomly drawn from the range of values excluding outliers. The overall results of our investigation were as follows. Ensembles built using rotation forest outperformed all other models. On the other hand, single models and ensembles created with random forests revealed the worst performance. In turn, the random subspace method resulted in the models most resistant to noised data. We intend to continue our research into the resilience to noise of regression algorithms employing other machine learning techniques such as neural networks and support vector regression. We also plan to noise data using different probability distributions.

Acknowledgments. This paper was partially supported by the Polish National Science Centre under grant no. N N516 483840.

References

1. Skurichina, M., Raudys, S., Duin, R.P.W.: K-Nearest Neighbors Directed Noise Injection in Multilayer Perceptron Training. IEEE Transactions on Neural Networks 11(2), 504--511 (2000)
2. Nettleton, D.F., Orriols-Puig, A., Fornells, A.: A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial Intelligence Review 33(4), 275--306 (2010)
3. Zhu, X., Wu, X.: Class Noise vs.
Attribute Noise: A Quantitative Study of Their Impacts. Artificial Intelligence Review 22, 177--210 (2004)
4. Sáez, J.A., Luengo, J., Herrera, F.: Fuzzy Rule Based Classification Systems versus Crisp Robust Learners Trained in Presence of Class Noise's Effects: a Case of Study. 11th International Conference on Intelligent Systems Design and Applications (ISDA 2011), Córdoba, Spain, pp. 1229--1234 (2011)
5. Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A.: Comparing Boosting and Bagging Techniques With Noisy and Imbalanced Data. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans 41(3), 552--568 (2011)
6. Lasota, T., Telec, Z., Trawiński, B., Trawiński, G.: Investigation of Random Subspace and Random Forest Regression Models Using Data with Injected Noise. In M. Graña et al. (Eds.): KES 2012, LNAI 7828, pp. 1--10. Springer, Heidelberg (2013)
7. Graczyk, M., Lasota, T., Trawiński, B., Trawiński, K.: Comparison of Bagging, Boosting and Stacking Ensembles Applied to Real Estate Appraisal. In N.T. Nguyen, M.T. Le, J. Świątek (Eds.): ACIIDS 2010, LNAI 5991, pp. 340--350. Springer, Heidelberg (2010)
8. Kempa, O., Lasota, T., Telec, Z., Trawiński, B.: Investigation of bagging ensembles of genetic neural networks and fuzzy systems for real estate appraisal. In N.T. Nguyen, C.-G. Kim, A. Janiak (Eds.): ACIIDS 2011, LNAI 6592, pp. 323--332. Springer, Heidelberg (2011)
9. Lasota, T., Telec, Z., Trawiński, G., Trawiński, B.: Empirical Comparison of Resampling Methods Using Genetic Fuzzy Systems for a Regression Problem. In H. Yin et al. (Eds.): IDEAL 2011, LNCS 6936, pp. 17--24. Springer, Heidelberg (2011)
10. Lasota, T., Telec, Z., Trawiński, G., Trawiński, B.: Empirical Comparison of Resampling Methods Using Genetic Neural Networks for a Regression Problem. In E. Corchado et al. (Eds.): HAIS 2011, LNAI 6679, pp. 213--220. Springer, Heidelberg (2011)
11. Lasota, T., Łuczak, T., Trawiński, B.: Investigation of Random Subspace and Random Forest Methods Applied to Property Valuation Data. In P. Jędrzejowicz et al. (Eds.): ICCCI 2011, Part I, LNCS 6922, pp. 142--151. Springer, Heidelberg (2011)
12. Lasota, T., Telec, Z., Trawiński, B., Trawiński, G.: Investigation of Rotation Forest Ensemble Method Using Genetic Fuzzy Systems for a Regression Problem. In J.-S. Pan, S.-M. Chen, N.T. Nguyen (Eds.): ACIIDS 2012, LNAI 7196, pp. 393--402. Springer, Heidelberg (2012)
13. Lasota, T., Łuczak, T., Trawiński, B.: Investigation of Rotation Forest Method Applied to Property Price Prediction. In L. Rutkowski et al. (Eds.): ICAISC 2012, Part I, LNCS 7267, pp. 403--411. Springer, Heidelberg (2012)
14. Breiman, L.: Bagging Predictors. Machine Learning 24(2), 123--140 (1996)
15. Ho, T.K.: The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8), 832--844 (1998)
16.
Breiman, L.: Random Forests. Machine Learning 45(1), 5--32 (2001)
17. Rodríguez, J.J., Kuncheva, L.I., Alonso, C.J.: Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(10), 1619--1630 (2006)
18. Witten, I.H., Frank, E., Hall, M.A.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition. Morgan Kaufmann, San Francisco (2011)