EUS SVMs: Ensemble of Under-Sampled SVMs for Data Imbalance Problems

Pilsung Kang and Sungzoon Cho

Seoul National University, San 56-1, Shillim-dong, Kwanak-gu, 151-744, Seoul, Korea
{xfeel80,zoon}@snu.ac.kr
http://dmlab.snu.ac.kr

Abstract. Data imbalance occurs when the number of patterns in one class is much larger than that in the other class. It often degrades classification performance. In this paper, we propose an Ensemble of Under-Sampled SVMs, or EUS SVMs. We applied the proposed method to two synthetic and six real data sets and found that it outperformed other methods, especially when the number of patterns belonging to the minority class is very small.

1 Introduction

In classification, data imbalance occurs when the number of patterns in one class is much larger than that in the other class. Most classification algorithms are trained under the assumption that the classes are roughly equal in size. In real classification tasks, however, this assumption is often violated. Fraud detection [1], for instance, is the task of identifying, within a company's customer database, the customers who are likely to commit fraud. In this task, the number of fraudulent customers is much smaller than the number of normal customers, so data imbalance occurs. Data imbalance is also reported in a wide range of other classification tasks, such as oil spill detection [2], response modeling [3], remote sensing [4], and scene classification [5].

Data imbalance is one of the causes that degrade the performance of machine learning algorithms, including Support Vector Machines (SVMs), in classification tasks. This degradation has two major causes. First, simple accuracy, the objective function used in most classification tasks, is inadequate for tasks with data imbalance.
For example, consider a two-class problem in which 1% of the patterns belong to the minority class and 99% belong to the majority class. A classifier that assigns every pattern to the majority class achieves 99% accuracy. This looks like good performance in terms of simple accuracy, but the classifier is useless because it captures no information about the patterns of the minority class. The second cause comes from the distribution of the classes. Since the majority class patterns far outnumber the minority class patterns, the majority class is likely to invade the territory of the minority class, so the class boundary is easily distorted.

To address the inadequacy of simple accuracy in data imbalance problems, other objective functions have been proposed in previous work [6][7][8]. Although their formulations differ, they all take into account the accuracy of both the majority class and the minority class.

To address the problem caused by the skewed data distribution, three methods are commonly used. First, the under-sampling method [9][10] balances the class ratio by sampling a small number of patterns from the majority class. Under-sampling can both improve classification performance and reduce training time, since only a small number of majority class patterns are used. However, under-sampling also has the potential disadvantage of distorting the distribution of the majority class. If the patterns sampled from the majority class do not represent the original distribution, classification performance may degrade. This drawback materializes when the number of minority class patterns is very small. Second, the over-sampling method [8][11] balances the class ratio by copying patterns from the minority class. Since over-sampling discards no patterns, it can achieve relatively high performance. However, the time required to train the classifier increases, since the number of training patterns is much larger than the number of original patterns. Third, the modifying-cost method [4] assigns a larger penalty to misclassified patterns from the minority class than to misclassified patterns from the majority class. The modifying-cost method can handle data imbalance without changing the original data distribution.
When data is highly imbalanced, however, the modifying-cost method improves classification performance less than under-sampling or over-sampling does.

In this paper, we propose an Ensemble of Under-Sampled SVMs (EUS SVMs). Although SVMs show good generalization ability in many pattern classification tasks, their performance can be boosted by adopting an ensemble scheme [12]. In addition, the ensemble scheme lowers the variance of the individual classifiers, so the performance of the combined classifier is more stable. Thus, EUS SVMs combine the strengths of both SVMs and the ensemble scheme. EUS SVMs build multiple different training sets by sampling patterns from the majority class and combining them with the minority class patterns. Each training set is used to train an individual SVM classifier. The output of the ensemble is produced by aggregating the outputs of all individual classifiers. By adopting the ensemble technique, EUS SVMs not only compensate for the sampling dependency of the under-sampling method but also achieve a reasonable time complexity compared to the over-sampling method. We apply EUS SVMs to two synthetic and six real data sets, and the results show that EUS SVMs outperform the other methods.

The rest of this paper is structured as follows. In Section 2, we demonstrate the effect of data imbalance with synthetic data sets and compare the performance of the three existing approaches. In Section 3, we introduce the proposed method, EUS SVMs. In Section 4, we present the experimental settings and analyze the results. In Section 5, we conclude and discuss future work.

2 The Effect of Data Imbalance

Before we start, let us consider a performance measure appropriate for imbalanced data sets. Suppose that positive patterns are the patterns belonging to the minority class and that negative patterns are the patterns belonging to the majority class. Usual classification tasks use simple accuracy, computed as $\frac{TP + TN}{TP + FN + FP + TN}$, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. However, as mentioned in Section 1, simple accuracy depends far more on TN than on TP when data is imbalanced. Thus, a classifier tends to classify most patterns as negative to achieve high simple accuracy. To prevent this, other performance measures have been considered [6][7][8]. In this paper, we adopt the geometric mean, which weighs the accuracies of the minority class and the majority class equally. A+, the accuracy of the minority class, is computed as $\frac{TP}{TP + FN}$. A-, the accuracy of the majority class, is computed as $\frac{TN}{FP + TN}$. The geometric mean is then computed as $\sqrt{A^{+} \times A^{-}}$.

Fig. 1. The class boundary of 4×4 checkerboard data sets (set A) with SVM base classifier

A synthetic data set was generated to understand the effect of data imbalance on the SVM classifier when the number of minority class patterns is not small in absolute terms. Six 4×4 checkerboard data sets (set A) were generated. The number of minority class patterns is 320 for all data sets, and the class ratio varies from 1:1 to 1:50. The class boundary of each data set, using an SVM as the base classifier, is shown in Fig. 1. The solid line represents the class boundary determined by the SVM classifier. When the numbers of patterns in the two classes do not differ much (Fig.
1(a)), the generated class boundary represents the original class boundary well. As the degree of imbalance increases (Fig. 1(b)), however, the boundary of the majority class invades the area of the minority class. When data imbalance is extreme (Fig. 1(c)), the majority class pushes out the minority class, so the area that the classifier assigns to the minority class is very small. The performance in this experiment is shown in Fig. 2.

Fig. 2. The performance of the SVMs with various imbalance ratios (A+, A-, simple accuracy, and geometric mean versus class ratio, 1:1 to 1:50)

As the degree of imbalance increases, the accuracy of the minority class (A+) decreases rapidly, and so does the geometric mean. Simple accuracy, however, tends to increase despite the decrease in A+. This is mainly because the accuracy of the majority class (A-) affects simple accuracy much more than A+ does when the degree of imbalance is high. This clearly shows that simple accuracy is inappropriate as a performance measure under data imbalance.

Fig. 3 shows the performance and the elapsed time of the existing methods: under-sampling, over-sampling, and modifying cost. Since set A1, whose class ratio is 1:1, is perfectly balanced, we evaluated its geometric mean and elapsed time with no sampling only. The modifying-cost method, as shown in Fig. 3, does little to improve the geometric mean compared to no sampling. It also takes much longer to train the classifier than no sampling does when the degree of imbalance is very high. Both under-sampling and over-sampling cope with the difficulties caused by data imbalance in terms of geometric mean, with over-sampling achieving the highest values in all cases. In terms of time complexity, however, over-sampling is very sensitive to the number of patterns while under-sampling is robust to it. Since over-sampling increases the number of minority class patterns until it equals the number of majority class patterns, the total number of training patterns becomes twice the number of majority class patterns. Under-sampling, on the other hand, decreases the number of majority class patterns until it equals the number of minority class patterns.
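As a quick check of this argument, the training-set size implied by each strategy can be computed directly. The sketch below is illustrative only: the function name `training_set_sizes` and its dictionary keys are our own, and it assumes the majority class is exactly `ratio` times as large as the minority class.

```python
def training_set_sizes(n_minority, ratio):
    """Number of training patterns under each strategy for a 1:ratio class ratio.

    Assumes (for illustration) that the majority class has exactly
    n_minority * ratio patterns.
    """
    n_majority = n_minority * ratio
    return {
        # all original patterns are used
        "no_sampling": n_minority + n_majority,
        # majority class is reduced to the size of the minority class
        "under_sampling": 2 * n_minority,
        # minority class is copied up to the size of the majority class
        "over_sampling": 2 * n_majority,
    }


if __name__ == "__main__":
    # Set A uses 320 minority class patterns at ratios from 1:1 to 1:50.
    for ratio in (1, 3, 5, 10, 30, 50):
        print(ratio, training_set_sizes(320, ratio))
```

For set A at a ratio of 1:50, this gives 16,320 patterns with no sampling, 640 with under-sampling, and 32,000 with over-sampling; the under-sampling size stays at 640 for every ratio.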
Therefore, as the degree of imbalance increases, the training time of under-sampling does not increase. When the number of minority class patterns is insufficient, however, the data sampled from the majority class may not represent the entire distribution of the majority class. Under-sampling may therefore perform badly when there are only a few minority class patterns.

Fig. 3. Geometric mean and elapsed time of existing methods: (a) geometric mean of each method and (b) elapsed time of each method (no sampling, under-sampling, over-sampling, modifying cost) versus class ratio (minority class:majority class)

We therefore generated another synthetic data set to see what happens to under-sampling when the number of minority class patterns is small in absolute terms: five 4×4 checkerboard data sets (set B) with only 80 minority class patterns. Since 1:1 and 1:3 were found not to be seriously imbalanced, we removed these ratios and added a new ratio, 1:100. The classification results of the under-sampling method on the two data sets, set A (sufficient minority class patterns) and set B (insufficient minority class patterns), are shown in Fig. 4.

Fig. 4. A+, A-, and geometric mean of under-sampling method with (a) sufficient minority class patterns and (b) insufficient minority class patterns, versus class ratio (minority class:majority class)

The under-sampling method achieved good geometric means on both data sets regardless of the degree of imbalance. Note, however, that the high geometric means on set B were achieved by sacrificing the accuracy of the majority class, whereas the accuracies of the majority class and the minority class do not differ much on set A. Since the patterns sampled from the majority class were too few to represent its whole distribution, the minority class invaded the areas of the majority class from which no patterns were selected. Thus, the classifier overestimated the area of the minority class, resulting in high A+ and low A-. This phenomenon usually occurs when a high degree of imbalance coincides with a small number of minority class patterns.
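The measures used throughout this section can be sketched in a few lines. This is a minimal illustration rather than the code used in the experiments: the helper name `imbalance_metrics` is hypothetical, and returning 0 for a class with no patterns is our own convention.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Simple accuracy, A+, A-, and geometric mean from confusion-matrix counts.

    Positive = minority class, negative = majority class.
    """
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    # A+ : accuracy on the minority class (0.0 if the class is empty, by assumption)
    a_plus = tp / (tp + fn) if (tp + fn) else 0.0
    # A- : accuracy on the majority class (0.0 if the class is empty, by assumption)
    a_minus = tn / (fp + tn) if (fp + tn) else 0.0
    g_mean = math.sqrt(a_plus * a_minus)
    return {"accuracy": accuracy, "a_plus": a_plus, "a_minus": a_minus, "g_mean": g_mean}


if __name__ == "__main__":
    # The 1% / 99% example from Section 1: every pattern is labeled as majority.
    print(imbalance_metrics(tp=0, fn=10, fp=0, tn=990))
```

For the classifier that labels every pattern as majority, simple accuracy is 0.99 while the geometric mean is 0, which is why the geometric mean is the measure reported in the rest of the paper.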
3 EUS SVMs: Ensemble of Under-Sampled SVMs

Under-sampling uses only one training set, consisting of the sampled majority class patterns and all minority class patterns. In this case, the boundary between the two classes is sensitive to which majority class patterns happen to be selected, which results in low and unstable performance. If we employ multiple training sets, majority class patterns have a better chance of being included in training. The more patterns the training sets include, the less likely the data distribution is to be distorted. Thus, we propose an ensemble approach.

Given the two data sets of the minority class and the majority class, the majority class patterns are sampled without repetition to construct a subset of the majority class. The number of patterns in the majority subset is equal to the number of minority class patterns (see Fig. 5 and Fig. 6). The sampling is repeated until a predetermined number (N) of majority class subsets are built. Note that each majority subset is sampled from the entire set of majority class patterns. Each majority subset is then combined with the minority class patterns to construct a training data subset, which is perfectly balanced. Each training data subset is used to construct an individual classifier. Finally, the outputs of all individual classifiers are aggregated to produce the output of the ensemble.

Fig. 5. The procedure of EUS SVMs

    partition the training data into majority and minority class
    for i = 1 to N (the number of ensemble members)
        build the majority subset by random sampling from the majority class,
            whose size is equal to that of the minority class
        construct the training subset by combining the majority subset and the minority class
        train an SVM with the training subset
    end
    combine the N outputs of the ensemble by a pre-determined rule

Fig. 6. EUS SVMs Algorithm
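The procedure above can be sketched as follows. To keep the example free of external dependencies, a nearest-centroid rule stands in for the SVM base classifier; this substitution, the function names, the `seed` parameter, and the label convention (+1 for the minority class, -1 for the majority class) are all assumptions made for illustration. The two combination rules shown are only examples of a pre-determined rule.

```python
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def train_nearest_centroid(X, y):
    """Stand-in base learner (NOT an SVM). Returns a real-valued decision
    function that is positive when x is closer to the minority centroid."""
    c_pos = centroid([x for x, t in zip(X, y) if t == 1])
    c_neg = centroid([x for x, t in zip(X, y) if t == -1])

    def decision(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        return d_neg - d_pos

    return decision


def train_eus(X, y, n_members, train_fn=train_nearest_centroid, seed=0):
    """Train n_members classifiers, each on a balanced training subset."""
    rng = random.Random(seed)
    minority = [x for x, t in zip(X, y) if t == 1]
    majority = [x for x, t in zip(X, y) if t == -1]
    members = []
    for _ in range(n_members):
        # Sample without repetition from the ENTIRE majority class each time;
        # requires len(majority) >= len(minority).
        subset = rng.sample(majority, len(minority))
        X_sub = minority + subset
        y_sub = [1] * len(minority) + [-1] * len(subset)
        members.append(train_fn(X_sub, y_sub))
    return members


def predict_majority_vote(members, x):
    """Each member casts a +1/-1 vote; a tie falls back to the majority class."""
    votes = sum(1 if f(x) > 0 else -1 for f in members)
    return 1 if votes > 0 else -1


def predict_function_sum(members, x):
    """Add the real-valued outputs of all members before thresholding."""
    return 1 if sum(f(x) for f in members) > 0 else -1


if __name__ == "__main__":
    minority = [(2.0, 2.0), (2.1, 1.9), (1.9, 2.1), (2.2, 2.0), (1.8, 2.0)]
    majority = [(i * 0.1, j * 0.1) for i in range(-5, 5) for j in range(-2, 3)]
    X = minority + majority
    y = [1] * len(minority) + [-1] * len(majority)
    members = train_eus(X, y, n_members=5)
    print(predict_majority_vote(members, (2.0, 2.0)), predict_function_sum(members, (0.0, 0.0)))
```

Passing a different `train_fn` (for example, a wrapper around an SVM library that returns a decision function) would leave the sampling and aggregation logic unchanged.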

Fig. 7. Class boundaries determined by no-sampling method with SVMs in 4×4 checkerboard data set (set B) [(a)-(e)] and spiral data set [(f)-(j)]

Table 1. Description of real data sets

Data set          Minority patterns   Majority patterns   Total patterns   Minority ratio
Vehicle 2                       212                 634              846          25.06 %
Vehicle 3                       217                 629              846          25.65 %
Ann-thyroid 13                   93               3,488            3,581           2.60 %
Ann-thyroid 23                  191               3,488            3,679           5.19 %
Sick-euthyroid                  238               1,774            2,012          11.83 %
Mammography                     260              10,923           11,183           2.32 %

4 Experimental Settings and Results

4.1 Data

We used two synthetic problems and six real data sets to verify the effectiveness of EUS SVMs. Five 4×4 checkerboard data sets (set B) and five spiral data sets were generated. Fig. 7 shows the class boundary of each set when a single SVM classifier is trained with no sampling.

Six real data sets with class imbalance were selected from the UCI Machine Learning Repository [13]. Many of the data sets have more than two classes. Since our objective is to deal with imbalance, we converted them into two-class problems. Vehicle 2(3) refers to a problem where only class 2(3) is treated as the minority class while the rest are treated as the majority class. Similarly, Ann-thyroid 13(23) refers to a problem where class 1(2) is the minority class while class 3 is treated as the majority class. Since the sick-euthyroid and mammography data sets originally consist of two classes, we used them without any class modification¹. Only the Ann-thyroid data set comes with separate training and test sets. The other data sets were evaluated using 5-fold cross validation. The overall description of the real data sets is shown in Table 1.

¹ We would like to thank professor Nitesh V. Chawla for providing us with mammography data set.
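The two-class conversions described in Section 4.1 can be sketched as below. The function names are hypothetical, labels are encoded as +1 (minority) and -1 (majority) by our own convention, and for the Ann-thyroid case we assume that patterns of the remaining class are simply left out, as the totals in Table 1 suggest.

```python
def one_vs_rest(labels, minority_class):
    """Vehicle-style conversion: the chosen class becomes the minority (+1),
    every other class is merged into the majority (-1)."""
    return [1 if c == minority_class else -1 for c in labels]


def one_vs_one(patterns, labels, minority_class, majority_class):
    """Ann-thyroid-style conversion: keep only the two named classes.
    Patterns of any other class are dropped (an assumption, see lead-in)."""
    kept_patterns, kept_labels = [], []
    for x, c in zip(patterns, labels):
        if c == minority_class:
            kept_patterns.append(x)
            kept_labels.append(1)
        elif c == majority_class:
            kept_patterns.append(x)
            kept_labels.append(-1)
    return kept_patterns, kept_labels


def minority_ratio(binary_labels):
    """Share of minority (+1) patterns, in percent."""
    return 100.0 * binary_labels.count(1) / len(binary_labels)


if __name__ == "__main__":
    # Class counts taken from Table 1; the patterns themselves are placeholders.
    vehicle = [2] * 212 + ["other"] * 634
    print(round(minority_ratio(one_vs_rest(vehicle, 2)), 2))

    thyroid = [1] * 93 + [2] * 191 + [3] * 3488
    placeholders = list(range(len(thyroid)))
    _, y13 = one_vs_one(placeholders, thyroid, minority_class=1, majority_class=3)
    print(len(y13), round(minority_ratio(y13), 2))
```

With the class counts from Table 1, the sketch reproduces the listed totals and minority ratios for Vehicle 2 and Ann-thyroid 13.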

4.2 Ensemble Aggregation Methods

The output of EUS SVMs can differ depending on the aggregation method of the ensemble. In our experiment, we employed three aggregation methods to determine the output of EUS SVMs. The first is majority voting. Each individual classifier votes for one of the candidate outputs, and the candidate output that receives the most votes becomes the output of the ensemble. The second is weighted voting. Once all individual classifiers have finished training, each has its own training error. The output of an individual classifier with a small training error contributes more to the output of the ensemble than that of an individual classifier with a large training error. The third is function value aggregation. As the SVM was originally designed for two-class classification, it has a binary output. The binary output comes from the sign of the objective function of the SVM. When the objective function value is converted into the binary value, important information about the pattern is lost, such as how far the pattern is from the class boundary. The larger the absolute value, the farther the pattern is from the class boundary. Therefore, in function value aggregation, the output of the ensemble is determined by adding the objective function values of all individual classifiers.

4.3 Experimental Results

Note that the geometric means of no sampling on the 4×4 checkerboard data sets (set B) are 0.732, 0.663, 0.498, 0.335, and 0.316 for the imbalance ratios of 1:5, 1:10, 1:30, 1:50, and 1:100, respectively. The geometric means of no sampling on the spiral data sets are 0.756, 0.724, 0.700, 0.568, and 0.524 for the same imbalance ratios. The experimental results of the under-sampling method and the three EUS SVMs on the synthetic data sets and the real data sets are shown in Fig. 8 and Table 2, respectively. The results can be summarized in three points.
First, both under-sampling and EUS SVMs are effective in dealing with data imbalance. They significantly outperform no sampling, especially as the degree of imbalance increases. Second, although both under-sampling and EUS SVMs work well on the imbalanced training sets, EUS SVMs outperform under-sampling in all cases, especially when the original class boundary is very complicated and the degree of imbalance is high, as with the spiral data sets. Third, there is no significant difference between the ensemble aggregation methods.

Fig. 8. Geometric means of (a) 4×4 checkerboard data set (set B) and (b) spiral data set for under-sampling and the three EUS SVMs at class ratios 1:5 to 1:100 (MV: Majority Voting, WV: Weighted Voting, FVA: Function Value Aggregation)

Table 2. Geometric means of no-sampling, under-sampling and three EUS SVMs with real data sets (MV: Majority Voting, WV: Weighted Voting, FVA: Function Value Aggregation)

Data set          No sampling   Under-sampling   EUS SVMs (MV)   EUS SVMs (WV)   EUS SVMs (FVA)
Vehicle 2              0.7896              225             367             370              390
Vehicle 3                 161              503             672             643              663
Ann-thyroid 13         0.9201           0.9664          0.9668          0.9671           0.9701
Ann-thyroid 23         0.9375           0.9589          0.9699          0.9712           0.9684
Sick-euthyroid            599              935          0.9066          0.9079           0.9087
Mammography            0.7487           0.9001          0.9079          0.9080           0.9110

Two implicit characteristics of EUS SVMs result in better classification performance. First, EUS SVMs use multiple training sets with balanced patterns. This reduces the chance that sampling distorts the data distribution, so the classifier is prevented from over-fitting to the minority class. Second, the ensemble pursues diversity to increase generalization ability by employing a number of individual classifiers. Thus, the classification performance can be better than that of a single classifier.

5 Conclusion

Data imbalance is one of the issues that have been widely researched in the pattern recognition and machine learning fields. In this paper, we investigated the effect of data imbalance on the performance of the classifier using 2-dimensional synthetic data sets. Among the under-sampling, over-sampling, and modifying-cost methods, under-sampling was found to be the best method when both classification performance and time complexity are considered.
Under-sampling, however, is likely to distort the data distribution when there are very few minority class patterns in a highly imbalanced data set. To overcome this drawback of under-sampling, we proposed an Ensemble of Under-Sampled SVMs (EUS SVMs). On two synthetic and six real data sets with various degrees of imbalance, EUS SVMs outperformed under-sampling in all cases in terms of geometric mean.

Our work has some limitations, which lead us to future work. First, we randomly sampled patterns from the majority class to build the ensemble training sets. More sophisticated sampling methods can be considered to represent the data distribution better. Second, we did not focus on the minority class, since it contains only a small number of patterns. To boost classification performance, some over-sampling methods, such as noise addition, can be applied when constructing the ensemble training sets.

Acknowledgement

This work was supported by grant No. R01-2005-000-103900-0 from the Basic Research Program of the Korea Science and Engineering Foundation, the Brain Korea 21 program in 2006 and partially supported by Engineering Research Institute of SNU.

References

1. Fawcett, T., Provost, F.: Adaptive Fraud Detection. Data Mining and Knowledge Discovery 1(3). (1997) 291-316
2. Kubat, M., Holte, R., Matwin, S.: Machine Learning for the detection of oil spills in satellite radar images. Machine Learning 30(2). (1998) 195-215
3. Shin, H.J., Cho, S.Z.: Response Modeling with Support Vector Machine. Expert Systems with Applications 30(4). (2006) 746-760
4. Bruzzone, L., Serpico, S.B.: Classification of imbalanced remote-sensing data by neural networks. Pattern Recognition Letters 18(11-13). (1997) 1323-1328
5. Yan, R., Liu, Y., Jin, R., Hauptmann, A.: On Predicting Rare Classes with SVM Ensembles in Scene Classification. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'03) (2003)
6. Kubat, M., Holte, R., Matwin, S.: Learning when Negative Examples Abound. In Proceedings of the 9th European Conference on Machine Learning (ECML'97) (1997)
7. Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive Learning Algorithms and Representations for Text Categorization. In Proceedings of the Seventh International Conference on Information and Knowledge Management (1998)
8. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16. (2002) 321-357
9. Kubat, M., Matwin, S.: Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. In Proceedings of the Fourteenth International Conference on Machine Learning. (1997) 179-186
10. Drummond, C., Holte, R.C.: C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling. In Proceedings of the International Conference on Machine Learning (ICML 2003) Workshop on Learning from Imbalanced Data Sets II. (2003)
11. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.: SMOTEBoost: Improving Prediction of the Minority Class in Boosting. 7th European Conference on Principles and Practice of Knowledge Discovery in Databases. (2003) 107-119
12. Kim, H.C., Pang, S., Je, H.M., Kim, D.J., Bang, S.Y.: Constructing Support Vector Machine Ensemble. Pattern Recognition 36. (2003) 2757-2767
13. UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/mlrepository.html