Refine Decision Boundaries of a Statistical Ensemble by Active Learning

Refine Decision Boundaries of a Statistical Ensemble by Active Learning

Dingsheng Luo (a) and Ke Chen (b,*)
(a) National Laboratory on Machine Perception and Center for Information Science, Peking University, Beijing 100871, China
(b) School of Computer Science, The University of Birmingham, Edgbaston, Birmingham B15 2TT, United Kingdom
* He is now with the Department of Computation, UMIST, Manchester M60 1QD, United Kingdom.

Abstract - For pattern classification, the decision boundaries of a statistical ensemble are gradually constructed through a divide-and-conquer procedure based on resampling techniques. Hence the resampling criterion critically governs the process of forming the final decision boundaries. Motivated by ideas from active learning, in this paper we propose an alternative resampling criterion based on the zero-one loss measure, in which all patterns in the training set are ranked by their difficulty for classification, no matter whether a pattern has been classified correctly or not. Our resampling criterion, incorporated into Adaboost, has been applied to benchmark handwritten digit recognition and text-independent speaker identification tasks. Comparative results demonstrate that our method refines decision boundaries and therefore yields better generalization performance.

I. INTRODUCTION

Recent studies show that statistical ensemble learning is an effective way of improving the generalization capability of a learning system. For pattern classification, a statistical ensemble method gradually constructs the decision boundaries by a divide-and-conquer procedure: less accurate decision boundaries are first constructed to classify patterns roughly so that informative patterns can be found, and by means of these informative patterns the rough decision boundaries are gradually improved as the ensemble grows. The final decision boundaries are not fixed until error-free performance is obtained. From the perspective of statistical learning, the process of constructing decision boundaries in statistical ensemble learning can be interpreted as exploring decision boundaries of large margins [1], which leads to good generalization performance.

For growing a statistical ensemble, the resampling criterion plays a crucial role in selecting the data used to construct decision boundaries. In general, most existing resampling criteria are based on traditional error-based measures with respect to a distribution over examples, and only the misclassified portion of the training patterns is considered in the subsequent resampling. Unlike these criteria, a so-called pseudo-loss measure has been proposed for data selection in Adaboost, where all patterns are considered during resampling [2]. Since the pseudo-loss measure not only focuses on the hardest, misclassified patterns but also considers the patterns that are correctly classified, better generalization performance has been obtained [2] due to the proper use of more information. Although all patterns are considered for resampling under the pseudo-loss measure, the correctly classified patterns are treated as equally important. Our early studies in active learning indicate that those patterns may play different roles in the construction of decision boundaries even though all of them are correctly classified [3]; by further distinguishing between them with a learning algorithm, better generalization performance has been achieved.

Motivated by the aforementioned work [2],[3], we propose a novel resampling criterion on the basis of the zero-one loss for minimum-error-rate classification [4]. The proposed criterion provides a unified measure for detecting informative patterns among all patterns in the training set, no matter whether a pattern is misclassified or not. In particular, correctly classified patterns are also ranked in terms of their difficulty for classification, which leads to an active data selection procedure over all patterns, in contrast to traditional error-based resampling criteria. We have applied our resampling criterion to Adaboost to tackle two real-world classification problems, optical character recognition and speaker identification, by means of benchmark databases. Comparative results demonstrate that our method refines the decision boundaries of Adaboost and hence yields better generalization performance.

The rest of the paper is organized as follows. Section II presents the motivation and our resampling criterion. Section III describes the systems used in the simulations and reports comparative results. Conclusions are drawn in the last section.

II. ACTIVE RESAMPLING CRITERION

In this section, we first present the motivation for using active learning in a resampling criterion and then propose an active resampling criterion for constructing statistical ensembles for pattern classification.

A. Motivation

A pattern classification problem can be described as follows. Given a training set of n examples S = {<x_1, C_1>, <x_2, C_2>, ..., <x_n, C_n>}, where x_i is an instance drawn from some space X and C_i ∈ C (C = {1, ..., M}) is the class label associated with x_i, the learning problem for classification is to find, based on the training set S, a classifier that is expected to make a maximally correct

prediction for any instance x, x ∈ X. There are various kinds of classifiers that do not output a pure 1-of-M representation but instead offer a confidence for each class. Such classifiers can be converted into a probabilistic form by the following transformation:

  ŷ_i(x) = exp[(y_i + 1)/2] / Σ_{j=1}^{M} exp[(y_j + 1)/2],   i = 1, ..., M,   (1)

where y_i is the i-th output component of the classifier. From the probabilistic point of view, each ŷ_i can be interpreted as the probability that the input pattern being tested belongs to the corresponding class. For decision making, the maximum a posteriori (MAP) rule is applied, such that

  C* = arg max_{1≤j≤M} ŷ_j.   (2)

Under the MAP rule, traditional statistical ensemble methods, e.g. boosting, divide the training patterns into two categories, an easy and a hard portion, based on whether the correct class label is assigned to a pattern. Apparently, the MAP decision rule is suitable, and indeed necessary, for testing an unknown pattern. When such a rule is used in the training stage, however, it incurs a loss of useful information. Fig. 1 depicts an example of this problem. Two patterns belonging to the same class (class 5) have both been correctly classified by a classifier, yet the two patterns convey unequal information: in terms of the probabilistic justification, the classifier is more likely to produce the correct label for the pattern in Fig. 1(a) than for the one shown in Fig. 1(b). In other words, the pattern shown in Fig. 1(b) is more informative given that it tends to be closer to the decision boundaries [3]. Unfortunately, the previous resampling criteria in statistical ensemble methods merely focus on the misclassified patterns and fail to consider the distinction among patterns that have been correctly classified. Motivated by our previous work in active learning, where such information has turned out to be useful for refining decision boundaries [3], we propose a resampling criterion that finds all possible informative patterns for the construction of statistical ensemble classifiers, which is expected to provide a more active data selection procedure for refining the decision boundaries.

B. Active Difficulty Measure

According to Bayesian decision theory [4], the zero-one loss function provides the criterion for obtaining the minimum error rate or, equivalently, for making the maximum correct prediction. Under the zero-one loss criterion, an ideal classifier always outputs the correct 1-of-M representation, where the i-th component corresponds to class C_i, such that for x ∈ C_i,

  d_j(x) = 1 if j = i, and d_j(x) = 0 if j ≠ i.   (3)

In other words, only the i-th element of such an ideal output vector is one while all other elements are zero. Fig. 2 shows an example where a pattern belonging to class 5 is perfectly classified (cf. Fig. 1).

Fig. 1. The outputs of two patterns belonging to the same class.
Fig. 2. The ideal output of a probabilistic classifier for a pattern belonging to class 5.
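As a concrete illustration of equations (1)-(3), the short sketch below converts raw classifier outputs into the probabilistic form, applies the MAP rule, and builds the ideal 1-of-M target. This is only a minimal sketch: the function names are ours, and the exponent in (1) is assumed to act on raw outputs lying roughly in [-1, 1].

```python
import numpy as np

def to_probabilities(y):
    """Convert raw classifier outputs to a probabilistic form, as in Eq. (1):
    exp((y_i + 1) / 2) for each component, normalized over all M classes."""
    z = np.exp((np.asarray(y, dtype=float) + 1.0) / 2.0)
    return z / z.sum()

def map_decision(y_hat):
    """MAP rule of Eq. (2): pick the class with the largest posterior estimate."""
    return int(np.argmax(y_hat))

def ideal_output(true_class, num_classes):
    """Ideal 1-of-M target of Eq. (3): one at the true class, zero elsewhere."""
    d = np.zeros(num_classes)
    d[true_class] = 1.0
    return d

# Example: a 10-class output where class 5 is (correctly) the most likely one.
y_raw = np.array([-0.9, -0.8, -0.7, -0.9, -0.6, 0.8, -0.5, -0.9, -0.7, -0.8])
y_hat = to_probabilities(y_raw)
print(map_decision(y_hat), ideal_output(5, 10))
```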

Obviously, the ideal output vector plays a reference role in the detection of informative patterns. Naturally, we treat the divergence between a practical output vector, e.g. those in Fig. 1, and its corresponding ideal output vector as a measure of how difficult a pattern is to classify correctly. For convenience, we use the Euclidean distance in this paper, so that the divergence for a pattern x is defined as

  div(ŷ(x), d(x)) = sqrt( Σ_{j=1}^{M} [d_j - ŷ_j]^2 ).   (4)

Without difficulty, we can prove that 0 ≤ div(ŷ, d) ≤ √2. The divergence measure unifies two circumstances, namely the misclassified case and the confidence of the correctly classified case, in terms of how difficult a pattern is to classify. A misclassified pattern must be treated as a most difficult one, while a pattern that is classified correctly is assigned a probability, or confidence, indicating its difficulty in an uncertain way. To do so, we define a probabilistic difficulty measure that carries out the above consideration as follows:

  P_difficulty(x) = 1 if div(ŷ(x), d(x)) > 1, and P_difficulty(x) = div(ŷ(x), d(x)) otherwise.   (5)

As a consequence, the above difficulty measure provides an alternative resampling criterion: all misclassified patterns can always be selected to form the new training subsets, while a correctly classified pattern also has a chance, depending upon its divergence as defined in (4), to be added to those training subsets for the next round of training. In comparison with the existing difficulty measures used in statistical ensemble learning, our measure in (5) is more active in finding informative patterns; we therefore name it the active difficulty measure. When our measure is inserted into Adaboost to replace the original one, we accordingly call the modified version active Adaboost, to distinguish it from the original. In addition, our active resampling criterion does not introduce a higher computational load, given that the divergence computation in (4) is similar to the computation required to find misclassified patterns by the MAP rule in (2).

III. SIMULATIONS

In order to evaluate the effectiveness of the proposed method, we have applied the active Adaboost to two real-world pattern classification tasks, text-independent speaker identification and handwritten digit recognition. For comparison, we also apply the original Adaboost [5] to the same problems. In this section, we first briefly introduce the two benchmark problems. Then we describe the VQ classification system used in our simulations. Finally, we report comparative results.

A. Text-Independent Speaker Identification

Speaker identification is the process of automatically identifying a person based on his or her voice. By text-independent, it is meant that the identification process is carried out regardless of the linguistic content conveyed in the utterances. For the simulations in speaker identification, the 10-session (S01-S10) KING speech corpus is adopted. The database of 51 speakers was collected partly in New Jersey and partly in San Diego, and each session was recorded over both a wide-band (WB) and a narrow-band (NB) channel. There is a significant difference between sessions S01-S05 and S06-S10. Thus, the long temporal span, resulting in voice aging, and the two distinct recording channels, leading to miscellaneous variations, provide a desirable corpus for studying the mismatch problem. In our simulations, all experiments are grouped into two categories in terms of the two sets corresponding to the different channels.
Each category furthermore contains two groups of experiments. Thus, we have four groups of experiments in our simulations, denoted WB1, WB2, NB1 and NB2, respectively. This design is intended to introduce different degrees of mismatch into the different groups of experiments, such that WB2 < WB1 << NB2 < NB1; in other words, the mismatch is least in WB2 while the mismatch in NB1 is the most severe. Note that this elaborate design has been shown to introduce different mismatch conditions [6]. In our simulations, standard spectral analysis is first performed and then Mel-scaled cepstral feature vectors are extracted for training a classifier.

B. Handwritten Digit Recognition

Handwritten digit recognition is the process of recognizing a handwritten digit from its image. As with utterances in speaker recognition, a handwritten digit may take hugely varied forms owing to distinct writing styles, so there is likewise miscellaneous mismatch between training and testing data. For the simulations in handwritten digit recognition, we choose the benchmark MNIST database [7], which contains 60,000 examples for training and 10,000 examples for testing. Each digit instance is a two-dimensional binary image whose size in MNIST has been normalized to a patch of 28x28 pixels without altering its aspect ratio. In order to reduce the dimensionality and thereby alleviate the curse of dimensionality, we first use wavelet techniques to obtain the low-frequency component of the image and then discard the pixels located around the image boundaries, given that those pixels are far less informative for classification. Finally, 12x12 images are obtained for the simulations, each of which is transformed into a 144-dimensional vector. This preprocessing procedure is highly consistent with the previous work [7].
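The wavelet preprocessing above is only outlined; the sketch below shows one plausible reading of it, assuming a single-level Haar DWT (28x28 -> 14x14 low-frequency band) followed by discarding a one-pixel border (14x14 -> 12x12, i.e. 144 dimensions). Both the Haar wavelet and the one-pixel crop are assumptions for illustration, not details taken from the paper.

```python
import numpy as np
import pywt

def preprocess_digit(img28):
    """Reduce a 28x28 digit image to a 144-dimensional feature vector
    (one plausible reading of the preprocessing in Section III-B)."""
    # Low-frequency (approximation) band of a single-level 2-D DWT: 28x28 -> 14x14.
    low, _ = pywt.dwt2(np.asarray(img28, dtype=float), 'haar')
    # Discard a one-pixel border, assumed to carry little class information: 14x14 -> 12x12.
    cropped = low[1:-1, 1:-1]
    return cropped.reshape(-1)  # 144-dimensional vector
```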
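Before turning to the classification system, the following minimal sketch makes the active difficulty measure of Section II-B concrete: it computes the divergence of equation (4), the difficulty of equation (5), and turns per-pattern difficulties into a resampling distribution. The normalization into a distribution is our own illustrative choice and is not AdaBoost's exact weight update [5].

```python
import numpy as np

def divergence(y_hat, d):
    """Eq. (4): Euclidean distance between the probabilistic output and the
    ideal 1-of-M target; it always lies in [0, sqrt(2)]."""
    return float(np.linalg.norm(np.asarray(d) - np.asarray(y_hat)))

def active_difficulty(y_hat, d):
    """Eq. (5) as reconstructed above: patterns with divergence > 1 get the
    maximal difficulty 1; otherwise the divergence itself is the difficulty."""
    div = divergence(y_hat, d)
    return 1.0 if div > 1.0 else div

def resampling_distribution(outputs, targets):
    """Turn per-pattern difficulties into a sampling distribution for the next
    training round (illustrative normalization only)."""
    p = np.array([active_difficulty(y, d) for y, d in zip(outputs, targets)])
    total = p.sum()
    return p / total if total > 0 else np.full(len(p), 1.0 / len(p))
```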

C. VQ Classification System and Its Ensemble

As the baseline system, the vector quantization (VQ) technique has been selected for pattern classification [8]. The idea underlying a VQ-based classifier is to create a codebook for the data of each class in the training set. The codebook, consisting of several codewords, encodes the inherent characteristics of that class of data and can thus be viewed as a model characterizing the class. The training phase of a VQ-based classification system consists of building a codebook for every class by means of a clustering algorithm. In the testing phase, a VQ-based classification system works as follows: when an unknown pattern arrives, the distance between its feature vector and the codewords belonging to the different classes is evaluated using the same similarity measure as defined in the clustering algorithm used for codebook production. A decision is then made by this similarity test, and the pattern is labeled with the class whose codebook has the shortest distance to the pattern. VQ classification techniques have been widely applied in speaker recognition, where a speaker identity is characterized by a VQ codebook [9]. We also apply the VQ technique to the handwritten digit recognition task; similarly, the characteristics of each digit are modeled by a corresponding codebook. As a result, a VQ classifier is used both as the component classifier in an ensemble and as an individual baseline system for comparison. In our simulations, AdaBoost has been adopted with different reweighting techniques, including ours, to construct an ensemble VQ classifier. This constructive process is repeated until satisfactory performance is achieved on the training set. On the other hand, a combination strategy plays an important role in the integration of the classifiers trained on the generated training sets. In our simulations, we employ the arithmetic averaging rule as the strategy for combining component classifiers, as suggested by our previous empirical studies [6].

D. Simulation Results

In our simulations, VQ with the standard LBG algorithm [8] is employed to build the class models, where each VQ codebook consists of 64 codewords. In addition, an ensemble of eleven VQ classifiers always yields satisfactory results on the training sets for the two benchmark databases. Due to limited space, we therefore report only the final generalization performance of the ensemble, although the evolving performance is available as the ensemble grows. Fig. 3 shows the overall generalization performance on speaker identification produced by the baseline system, the original AdaBoost and our active AdaBoost on the WB and NB testing sets, respectively. It is evident from Fig. 3 that our active resampling criterion performs very well, given that the active AdaBoost system outperforms both the baseline system and the original AdaBoost system in all four experiments on the WB1, WB2, NB1 and NB2 sets, where different mismatch conditions are designed for testing the generalization performance.

[Fig. 3: bar charts of identification rate (%) for the Baseline, AdaBoost and Active AdaBoost systems on the WB and NB sets.]
Fig. 3. The generalization performance on speaker identification. (a) Results on the WB sets. (b) Results on the NB sets.
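For reference, here is a minimal sketch of the VQ component classifier and the arithmetic-averaging combination described in Section III-C. It substitutes scikit-learn's k-means for the LBG algorithm and derives class scores from codebook distances; both choices, and all names, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

class VQClassifier:
    """Per-class codebooks; k-means stands in for LBG, 64 codewords per class."""
    def __init__(self, num_codewords=64):
        self.num_codewords = num_codewords
        self.codebooks = {}

    def fit(self, X, y):
        # One codebook per class; a resampled (reweighted) training subset
        # would simply be passed in as X, y for each boosting round.
        for c in np.unique(y):
            Xc = X[y == c]
            k = min(self.num_codewords, len(Xc))
            self.codebooks[c] = KMeans(n_clusters=k, n_init=4).fit(Xc).cluster_centers_
        return self

    def distances(self, x):
        """Shortest distance from x to each class codebook."""
        return {c: float(np.min(np.linalg.norm(cb - x, axis=1)))
                for c, cb in self.codebooks.items()}

    def predict(self, x):
        d = self.distances(x)
        return min(d, key=d.get)  # class whose codebook is closest

def combine_by_averaging(classifiers, x):
    """Arithmetic-averaging combination: average per-class scores over the
    component classifiers (here, scores are negated codebook distances,
    an illustrative stand-in for the probabilistic outputs of Eq. (1))."""
    classes = sorted(classifiers[0].codebooks)
    scores = np.zeros(len(classes))
    for clf in classifiers:
        d = clf.distances(x)
        scores += np.array([-d[c] for c in classes])
    return classes[int(np.argmax(scores / len(classifiers)))]
```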
By further comparing ours with the other two methods, the error reduction rates in WB1 and NB1, where the severer mismatch conditions are involved, are better than those of their counterparts in WB2 and NB2, respectively. Thus, our simulations clearly demonstrate that, along with the information carried in the misclassified patterns, the use of the additional information conveyed in the correctly classified patterns further refines the decision boundaries of an ensemble classifier against mismatch. Fig. 4 illustrates the results for handwritten digit recognition, corresponding to the digits from 0 to 9. Again, the active Adaboost consistently outperforms the baseline system and the original Adaboost for all ten digits. It is worth mentioning that for several digits the baseline system achieves error-free training performance at an early stage of the ensemble growing; due to its error-based resampling criterion, the original Adaboost then does not grow the ensemble any further. In contrast, our active resampling criterion makes the ensemble grow further by taking advantage of the additional information conveyed in the correctly classified patterns, which leads to the effect of refining the decision boundaries shown in Fig. 4.

[Fig. 4: bar charts of recognition rate (%) for the Baseline, AdaBoost and Active AdaBoost systems; panels (a)-(j) give the results for each digit.]
Fig. 4. Comparative results for the handwritten digit recognition problem. (a)-(j) Results corresponding to the ten digits from 0 to 9.

The results also indicate that the idea underlying our method is highly consistent with the use of active data selection for refining the decision boundaries of a strong classifier [3].

IV. CONCLUSION

In this paper, we have presented an alternative resampling criterion for the active selection of informative patterns in constructing a statistical ensemble classifier. In comparison with the existing error-based resampling criteria in statistical ensemble learning, our criterion makes better use of the information conveyed in the training patterns. Comparative results on two real-world problems based on Adaboost, along with results obtained with other statistical ensemble methods (e.g. [10],[11]) that are not reported here, demonstrate that for pattern classification our method yields better generalization performance by refining decision boundaries with the additional information conveyed in the training patterns.

REFERENCES

[1] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, "Boosting the margin: A new explanation for the effectiveness of voting methods," The Annals of Statistics, vol. 26, pp. 1651-1686, 1998.
[2] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," Proceedings of the International Conference on Machine Learning, pp. 148-156, 1996.
[3] L. Wang, K. Chen, and H. Chi, "Capture interspeaker information with a neural network for speaker identification," IEEE Transactions on Neural Networks, vol. 13, pp. 436-445, 2002.
[4] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd Edition), Wiley-Interscience, 2001.
[5] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, pp. 119-139, 1997.
[6] D. S. Luo and K. Chen, "A comparative study of statistical ensemble methods on mismatch conditions," Proceedings of the International Joint Conference on Neural Networks, pp. 59-64, 2002.
[7] The MNIST database of handwritten digits. URL: http://www.research.att.com/~yann/exdb/mnist/index.html.
[8] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. 28, pp. 84-95, 1980.
[9] F. K. Soong, A. E. Rosenberg, L. R. Rabiner, and B. H. Juang, "A vector quantization approach to speaker identification," Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 387-390, 1985.
[10] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, pp. 197-227, 1990.
[11] C. Y. Ji and S. Ma, "Combinations of weak classifiers," IEEE Transactions on Neural Networks, vol. 8, pp. 32-42, 1997.