Refine Decision Boundaries of a Statistical Ensemble by Active Learning

Dingsheng Luo (a) and Ke Chen (b)*

(a) National Laboratory on Machine Perception and Center for Information Science, Peking University, Beijing 100871, China
(b) School of Computer Science, The University of Birmingham, Edgbaston, Birmingham B15 2TT, United Kingdom

Abstract

For pattern classification, decision boundaries are gradually constructed in a statistical ensemble through a divide-and-conquer procedure based on resampling techniques. Hence, a resampling criterion critically governs the process of forming the final decision boundaries. Motivated by ideas from active learning, we propose in this paper an alternative resampling criterion based on the zero-one loss measure, in which all the patterns in the training set are ranked in terms of their difficulty for classification, no matter whether a pattern has been incorrectly classified or not. Our resampling criterion, incorporated into AdaBoost, has been applied to benchmark handwritten digit recognition and text-independent speaker identification tasks. Comparative results demonstrate that our method refines decision boundaries and therefore yields better generalization performance.

I. INTRODUCTION

Recent studies show that statistical ensemble learning is an effective way of improving the generalization capability of a learning system. For pattern classification, a statistical ensemble method gradually constructs the decision boundaries by a divide-and-conquer procedure. In this procedure, less accurate decision boundaries are first constructed to roughly classify patterns so that informative patterns can be found. By means of the informative patterns, the rough decision boundaries are gradually improved as the ensemble grows. The final decision boundaries are not fixed until error-free performance is obtained.
From the perspective of statistical learning, the process of constructing decision boundaries in statistical ensemble learning can be interpreted as exploring decision boundaries of large margins [1], which leads to good generalization performance. For growing a statistical ensemble, a resampling criterion plays a crucial role in selecting the data used to construct decision boundaries. In general, most existing resampling criteria are based on traditional error-based measures with respect to a distribution over examples, and only the misclassified portion of the training patterns is considered in the subsequent resampling. Unlike the aforementioned resampling criteria, a so-called pseudo-loss error measure has been proposed for data selection in AdaBoost, where all the patterns are considered during resampling [2]. Since the pseudo-loss measure not only focuses on the hardest patterns, which are misclassified, but also considers the other patterns, which are correctly classified, better generalization performance has been obtained [2] due to the proper use of more information. Although all the patterns are considered for resampling under the pseudo-loss measure, the patterns classified correctly are treated as equally important. Our early studies in active learning indicate that those patterns may play different roles in the construction of decision boundaries even though all of them are correctly classified [3]. By further distinguishing between them with a learning algorithm, better generalization performance has been achieved. Motivated by the aforementioned work [2],[3], we propose a novel resampling criterion on the basis of the zero-one loss for minimum-error-rate classification [4]. The proposed criterion provides a unified measure for detecting informative patterns among all the patterns in the training set, no matter whether a pattern is misclassified or not.
In particular, the patterns classified correctly are also ranked in terms of their difficulty for classification, which leads to an active data selection procedure over all the patterns in comparison with traditional error-based resampling criteria. We have applied our resampling criterion to AdaBoost to tackle two real-world classification problems, optical character recognition and speaker identification, by means of benchmark databases. Comparative results demonstrate that our method refines the decision boundaries of AdaBoost and hence yields better generalization performance. The rest of the paper is organized as follows. Section II presents the motivation and our resampling criterion. Section III describes the system used for simulations and reports comparative results. Conclusions are drawn in the last section.

II. ACTIVE RESAMPLING CRITERION

In this section, we first present the motivation for the use of active learning in a resampling criterion and then propose an active resampling criterion for constructing statistical ensembles for pattern classification.

A. Motivation

A pattern classification problem can be described as follows. Given a training set of n examples S = {<x_1, C_1>, <x_2, C_2>, ..., <x_n, C_n>}, where x_i is an instance drawn from some instance space X and C_i ∈ C (C = {1, ..., M}) is the class label associated with x_i, the learning problem for classification is, based on the training set S, to find a classifier which is expected to make a maximum correct

* He is now with the Department of Computation, UMIST, Manchester M60 1QD, United Kingdom.
prediction for any instance x ∈ X. There are various kinds of classifiers that do not output the pure 1-of-M representation but instead offer a confidence for each class. Such classifiers can be converted into a probabilistic form by the following transformation:

  ŷ_i(x) = exp[(y_i + 1)/2] / Σ_{j=1}^{M} exp[(y_j + 1)/2],   i = 1, ..., M    (1)

where y_i is the ith output component of the classifier. From the probabilistic point of view, each ŷ_i can be interpreted as the probability that the input pattern (being tested) belongs to a specific class. For decision-making, the maximum a posteriori (MAP) rule is applied such that

  C* = arg max_{1 ≤ j ≤ M} ŷ_j(x)    (2)

By the MAP rule, traditional statistical ensemble methods, e.g. AdaBoost, divide the training patterns into two categories, easy and hard portions, based on whether the correct class label is assigned to a pattern. Apparently, the MAP decision-making rule is suitable for testing an unknown pattern and tends to be necessary there. When such a rule is used in the training stage, however, it incurs a loss of useful information. Fig. 1 depicts an example of this problem. For two patterns belonging to the same class (class 5), both have been correctly classified by a classifier. However, the two patterns convey unequal information. Apparently, the classifier is more likely to produce the correct label for the pattern shown in Fig. 1(a) than for the one shown in Fig. 1(b), in terms of the probabilistic justification. In other words, the pattern shown in Fig. 1(b) is more informative, given that it tends to be closer to the decision boundaries [3]. Unfortunately, previous resampling criteria in statistical ensemble methods merely focus on the misclassified patterns and fail to consider the distinction among patterns that have been correctly classified.
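As a rough illustration, the transformation in (1) and the MAP rule in (2) can be sketched as follows. This is a minimal sketch; the raw classifier outputs and the three-class problem are hypothetical examples, not data from the paper.

```python
import math

def to_probabilities(y):
    """Convert raw outputs into a probabilistic form, as in Eq. (1):
    a softmax over the shifted outputs (y_i + 1) / 2."""
    exps = [math.exp((yi + 1.0) / 2.0) for yi in y]
    total = sum(exps)
    return [e / total for e in exps]

def map_decision(y_hat):
    """MAP rule of Eq. (2): pick the class with the largest posterior."""
    return max(range(len(y_hat)), key=lambda j: y_hat[j])

# Hypothetical raw outputs for a 3-class problem.
y = [-0.8, 0.2, 0.9]
y_hat = to_probabilities(y)      # a valid probability vector summing to 1
print(map_decision(y_hat))       # prints 2, since y[2] is the largest output
```

Note that the transformation preserves the ordering of the raw outputs, so the MAP decision itself is unchanged; the gain is that the posterior vector ŷ carries confidence information that the hard 0/1 decision discards.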
Motivated by our previous work in active learning, where such information has turned out to be useful for refining decision boundaries [3], we propose a resampling criterion that finds all possible informative patterns for the construction of statistical ensemble classifiers, which is expected to provide a more active data selection procedure for refining the decision boundaries.

B. Active Difficulty Measure

According to Bayesian decision theory [4], the zero-one loss function provides a criterion for obtaining the minimum error rate, i.e., for making the maximum correct prediction. According to the zero-one loss criterion, an ideal classifier always outputs the correct 1-of-M representation, where the ith component corresponds to class C_i, such that for x ∈ C_i,

  d_j(x) = 1 if j = i; 0 if j ≠ i    (3)

In other words, the ith element of such an ideal output vector is one while all other elements are zero. Fig. 2 shows an example where a pattern belonging to class 5 is perfectly classified (cf. Fig. 1).

Fig. 1. The outputs of two patterns belonging to the same class.

Fig. 2. The ideal output of a probabilistic classifier for a pattern belonging to class 5.
Obviously, the ideal output vector plays a reference role in the detection of informative patterns. Naturally, we treat the divergence between a practical output vector, e.g. those in Fig. 1, and its corresponding ideal output vector as a difficulty measure for determining how difficult a pattern is to classify correctly. For convenience, we use the squared Euclidean distance in this paper, so that the divergence for a pattern x is defined as

  div(ŷ(x), d(x)) = Σ_{j=1}^{M} [d_j(x) − ŷ_j(x)]^2    (4)

Without difficulty, we can prove the following fact: 0 ≤ div(ŷ, d) ≤ 2. The divergence measure unifies two circumstances, i.e., the misclassified case and the confidence of the correctly classified case, in terms of how difficult a pattern is to classify. A misclassified pattern must be treated as the most difficult one, while a pattern classified correctly is assigned a probability, or confidence, that indicates its difficulty in an uncertain way. To do so, we define a probabilistic difficulty measure that carries out the above consideration as follows:

  P_difficulty = 1 if div(ŷ(x), d(x)) > 1; div(ŷ(x), d(x)) otherwise    (5)

As a consequence, the above difficulty measure provides an alternative resampling criterion: all the misclassified patterns can always be selected to form new training subsets, while a pattern classified correctly also has a chance (depending upon its divergence defined in (4)) to be added to those training subsets for the next round of training. In comparison with the existing difficulty measures used in statistical ensemble learning, our measure in (5) is more active in finding informative patterns. We therefore name it the active difficulty measure. When our measure is inserted into AdaBoost to replace the original one, we accordingly call the modified version active AdaBoost, to distinguish it from the original.
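The measures in (3)-(5) can be sketched in a few lines. This is a minimal sketch under two illustrative readings of the garbled original: the divergence in (4) is taken to be the squared Euclidean distance (consistent with the stated bound of 2), and class indices start at 0.

```python
def ideal_output(true_class, num_classes):
    """1-of-M target vector of Eq. (3): one at the true class, zero elsewhere."""
    return [1.0 if j == true_class else 0.0 for j in range(num_classes)]

def divergence(y_hat, d):
    """Divergence of Eq. (4): squared Euclidean distance between the
    practical output vector y_hat and the ideal output vector d.
    For a probability vector y_hat, this lies in [0, 2]."""
    return sum((dj - yj) ** 2 for dj, yj in zip(d, y_hat))

def difficulty(y_hat, true_class):
    """Active difficulty measure of Eq. (5): clip the divergence at 1,
    so hard (e.g. misclassified) patterns get the maximum difficulty."""
    div = divergence(y_hat, ideal_output(true_class, len(y_hat)))
    return 1.0 if div > 1.0 else div

# Confidently correct pattern (cf. Fig. 1(a)): small divergence, small difficulty.
print(difficulty([0.1, 0.2, 0.7], 2))   # 0.09 + 0.01 + 0.04 -> 0.14
# Misclassified pattern: divergence 1.34 > 1, so difficulty is capped at 1.
print(difficulty([0.7, 0.2, 0.1], 2))   # prints 1.0
```

A pattern correctly classified but near a decision boundary (e.g. ŷ = [0.45, 0.05, 0.5]) gets an intermediate difficulty, which is exactly the information that a hard MAP decision throws away.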
In addition, our active resampling criterion does not introduce a higher computational load, given that the divergence computation in (4) is similar to the computation for finding misclassified patterns with the MAP rule in (2).

III. SIMULATIONS

In order to evaluate the effectiveness of the proposed method, we have applied the active AdaBoost to two real-world pattern classification tasks: text-independent speaker identification and handwritten digit recognition. For comparison, we also apply the original AdaBoost [5] to the same problems. In this section, we first briefly introduce the two benchmark problems. Then we describe the VQ classification system used in our simulations. Finally, we report comparative results.

A. Text-Independent Speaker Identification

Speaker identification is a process that automatically establishes a person's identity based on his/her voice. By text-independent, it is meant that the identification process is carried out regardless of the linguistic content conveyed in the utterances. For simulations in speaker identification, the 10-session (S01-S10) KING speech corpus is adopted. The database of 51 speakers was collected partly in New Jersey and partly in San Diego, and each session was recorded over both a wide-band (WB) and a narrow-band (NB) channel. There is a significant difference between sessions S01-S05 and S06-S10. Thus, the long temporal span, resulting in voice aging, and the two distinct recording channels, leading to miscellaneous variations, make it a desirable corpus for studying the mismatch problem. In our simulations, all experiments are grouped into two categories in terms of the two sets corresponding to the different channels. Each category furthermore contains two groups of experiments. Thus, we have four groups of experiments in our simulations, denoted WB1, WB2, NB1 and NB2, respectively. Such an elaborate design is expected to introduce different mismatch conditions in the different groups of experiments such that WB2 < WB1 << NB2 < NB1.
In other words, there is the least mismatch in WB2 while the mismatch in NB1 is the severest. Note that such an elaborate design has been shown to introduce different mismatch conditions [6]. In our simulations, the standard spectral analysis is first performed and then Mel-scaled cepstral feature vectors are extracted for training a classifier.

B. Handwritten Digit Recognition

Handwritten digit recognition is the process of recognizing a handwritten digit from its image. Similar to utterances in speaker recognition, a handwritten digit may take hugely varied forms due to distinct writing styles. Therefore, there are also miscellaneous mismatches between training data and testing data. For simulations in handwritten digit recognition, we choose the benchmark database MNIST [7], which contains 60,000 examples for training and 10,000 examples for testing. Each digit instance is a two-dimensional binary image whose size in MNIST has been normalized to an image patch of 28x28 pixels without altering the aspect ratio. In order to reduce the dimension and overcome the curse of dimensionality, we first use wavelet techniques to obtain the low-frequency component of the image and then discard the pixels located around the image boundaries, given that those pixels are far less informative for classification. Finally, 12x12 images are obtained for the simulations, where each image is transformed into a 144-dimensional vector. Such a preprocessing procedure is highly consistent with the previous work [7].
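The digit preprocessing described above can be approximated as follows. This is a sketch under stated assumptions: the paper does not specify the wavelet or the crop width, so here one level of a Haar-style transform (2x2 block averaging, 28x28 -> 14x14) stands in for the low-frequency component, and a one-pixel border crop yields the 12x12 (144-dimensional) result.

```python
def haar_lowpass(img):
    """One-level Haar-style approximation: average each 2x2 block,
    halving each dimension (28x28 -> 14x14)."""
    n = len(img) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1] +
              img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(n)] for r in range(n)]

def crop_border(img, k=1):
    """Discard k boundary pixels on each side (14x14 -> 12x12 for k=1),
    since border pixels carry little class information."""
    return [row[k:len(row) - k] for row in img[k:len(img) - k]]

def to_feature_vector(img28):
    """28x28 image -> 144-dimensional feature vector."""
    low = haar_lowpass(img28)                   # 14x14 low-frequency component
    small = crop_border(low)                    # 12x12 after boundary removal
    return [v for row in small for v in row]    # flatten to 144 dimensions
```

For a 28x28 input, `to_feature_vector` returns a list of length 144, matching the dimensionality stated in the text.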
C. VQ Classification System and Its Ensemble

As a baseline system, the vector quantization (VQ) technique has been selected for pattern classification [8]. The idea underlying a VQ-based classifier is to create a codebook for the data of the same class in the training set. The codebook, consisting of several codewords, encodes inherent characteristics of that class of data. Thus, such a codebook is viewed as a model that characterizes the class of data. The training phase of a VQ-based classification system builds a codebook for every class by means of a clustering algorithm. In the testing phase, a VQ-based classification system works as follows. When an unknown pattern arrives, the distances between its feature vector and all the codewords belonging to the different classes are evaluated, using the same similarity measure as defined in the clustering algorithm used for codebook production. As a consequence, a decision is made by the similarity test, and the pattern is labeled with the class that has the shortest distance to the pattern. The VQ classification techniques have been widely applied in speaker recognition, where one speaker identity is characterized by a VQ codebook [9]. We also apply the VQ technique to the handwritten digit recognition task; similarly, the characteristics of each digit are modeled by a corresponding codebook. As a result, a VQ classifier is used both as a component classifier in an ensemble and as an individual baseline system for comparison. In our simulations, AdaBoost has been adopted with different reweighting techniques, including ours, to construct an ensemble VQ classifier. This constructive process is repeated until satisfactory performance is achieved on the training set. On the other hand, a combination strategy plays an important role in the integration of the classifiers trained on the generated training sets.
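A minimal sketch of the VQ classification scheme just described: one codebook per class, and a test pattern labeled by the class owning the nearest codeword. Plain k-means stands in here for the LBG algorithm (both minimize the same squared-error distortion), and the toy data are hypothetical.

```python
import random

def train_codebook(vectors, num_codewords, iters=20, seed=0):
    """k-means stand-in for LBG [8]: returns `num_codewords` centroids
    that encode the characteristics of one class of data."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, num_codewords)]
    for _ in range(iters):
        buckets = [[] for _ in codebook]
        for v in vectors:
            # Assign each vector to its nearest codeword (squared distance).
            i = min(range(len(codebook)),
                    key=lambda k: sum((a - b) ** 2 for a, b in zip(v, codebook[k])))
            buckets[i].append(v)
        for i, b in enumerate(buckets):
            if b:  # Move each codeword to the centroid of its bucket.
                codebook[i] = [sum(coord) / len(b) for coord in zip(*b)]
    return codebook

def classify(x, codebooks):
    """Label x with the class whose codebook holds the nearest codeword."""
    best_label, best_dist = None, float("inf")
    for label, cb in codebooks.items():
        for w in cb:
            d = sum((a - b) ** 2 for a, b in zip(x, w))
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label
```

A usage example with two well-separated toy classes:

```python
codebooks = {
    0: train_codebook([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], 2),
    1: train_codebook([[10.0, 10.0], [11.0, 10.0], [10.0, 11.0], [11.0, 11.0]], 2),
}
print(classify([0.5, 0.5], codebooks))    # prints 0
print(classify([10.5, 10.5], codebooks))  # prints 1
```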
In our simulations, we employ the arithmetic averaging rule as our strategy for combining component classifiers, as suggested by our previous empirical studies [6].

D. Simulation Results

In our simulations, VQ with the standard LBG algorithm [8] is employed to build the class models, where each VQ codebook consists of 64 codewords. In addition, an ensemble of eleven VQ classifiers always yields satisfactory results on the training sets for the two benchmark databases. Due to limited space, therefore, we report only the final generalization performance of the ensemble, although the evolving performance is available as the ensemble grows. Fig. 3 shows the overall generalization performance of speaker identification produced by the baseline system, the original AdaBoost and our active AdaBoost on the WB and the NB testing sets, respectively.

Fig. 3. The generalization performance on speaker identification. (a) Results on the WB sets. (b) Results on the NB sets.

It is evident from Fig. 3 that our active resampling criterion performs very well, given that the active AdaBoost system outperforms both the baseline system and the original AdaBoost system in all four experiments on the WB1, WB2, NB1 and NB2 sets, where different mismatch conditions are designed for testing the generalization performance. Further comparing ours with the other two methods, the error reduction rates in WB1 and NB1, where severer mismatch conditions are involved, are better than those of their counterparts in WB2 and NB2, respectively.
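The arithmetic averaging rule used above to combine component classifiers can be sketched in a few lines: average the posterior vectors produced by the component classifiers, then apply the MAP rule to the average. The posterior vectors below are hypothetical.

```python
def combine_average(outputs):
    """Arithmetic averaging combination: average the component classifiers'
    posterior vectors component-wise, then pick the class with the
    largest averaged posterior (MAP rule)."""
    n = len(outputs)
    avg = [sum(col) / n for col in zip(*outputs)]
    return max(range(len(avg)), key=lambda j: avg[j])

# Three component classifiers voting on a 2-class pattern:
# averaged posteriors are [0.367, 0.633], so class 1 is chosen.
print(combine_average([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]))  # prints 1
```

Averaging keeps the confidence information of every component classifier, in contrast with majority voting over the hard decisions.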
Thus, our simulations clearly demonstrate that, along with the use of the information carried in the misclassified patterns, the use of the additional information conveyed in the patterns classified correctly further refines the decision boundaries of an ensemble classifier against mismatch. Fig. 4 illustrates the results for handwritten digit recognition, corresponding to the digits from 0 to 9. Similarly, the active AdaBoost consistently outperforms the baseline system and the original AdaBoost for all ten digits. It is worth mentioning that for several digits the baseline system achieves error-free performance at an early stage of the ensemble's growth. Due to the use of error-based resampling criteria, the original AdaBoost does not grow the ensemble any further. In contrast, our active resampling criterion makes the ensemble grow further by taking advantage of the additional
Fig. 4. Comparative results for the handwritten digit recognition problem. (a)-(j) Results corresponding to the ten digits from 0 to 9.
information conveyed in the patterns classified correctly, which leads to the effect of refining decision boundaries, as shown in Fig. 4. The results also indicate that the idea underlying our method is highly consistent with the use of active data selection for refining the decision boundaries of a strong classifier [3].

IV. CONCLUSION

In this paper, we have presented an alternative resampling criterion for the active selection of informative patterns to construct a statistical ensemble classifier. In comparison with the existing error-based resampling criteria in statistical ensemble learning, our criterion makes better use of the information conveyed in the training patterns. Comparative results on two real-world problems based on AdaBoost, along with others with different statistical ensemble methods (e.g. [10],[11]) not reported here, demonstrate that for pattern classification our method yields better generalization performance by refining decision boundaries with the additional information conveyed in the training patterns.

REFERENCES

[1] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, "Boosting the margin: A new explanation for the effectiveness of voting methods," The Annals of Statistics, vol. 26, pp. 1651-1686, 1998.
[2] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," Proceedings of the International Conference on Machine Learning, pp. 148-156, 1996.
[3] L. Wang, K. Chen, and H. Chi, "Capture interspeaker information with a neural network for speaker identification," IEEE Transactions on Neural Networks, vol. 13, pp. 436-445, 2002.
[4] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd Edition), Wiley-Interscience, 2001.
[5] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, pp. 119-139, 1997.
[6] D. S. Luo and K. Chen, "A comparative study of statistical ensemble methods on mismatch conditions," Proceedings of the International Joint Conference on Neural Networks, pp. 59-64, 2002.
[7] The MNIST database: http://www.research.att.com/~yann/exdb/mnist/index.html.
[8] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. 28, pp. 84-95, 1980.
[9] F. K. Soong, A. E. Rosenberg, L. R. Rabiner, and B. H. Juang, "A vector quantization approach to speaker identification," Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 387-390, 1985.
[10] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, pp. 197-227, 1990.
[11] C. Y. Ji and S. Ma, "Combinations of weak classifiers," IEEE Transactions on Neural Networks, vol. 8, pp. 32-42, 1997.