Dimensionality Reduction for Active Learning with Nearest Neighbour Classifier in Text Categorisation Problems


Michael Davy
Artificial Intelligence Group, Department of Computer Science, Trinity College Dublin
Michael.Davy@cs.tcd.ie

Saturnino Luz
Artificial Intelligence Group, Department of Computer Science, Trinity College Dublin
Saturnino.Luz@cs.tcd.ie

Abstract

Dimensionality reduction techniques are commonly used in text categorisation problems to improve training and classification efficiency and to avoid overfitting. The best performing dimensionality reduction techniques for text categorisation are supervised, and hence make use of the label information in the training data. Active learning is used to reduce the number of labelled training examples for problems where obtaining label information is expensive. Since the vast majority of the data supplied to active learning is unlabelled, supervised dimensionality reduction techniques cannot be readily employed. For this reason, dimensionality reduction is typically not performed in active learning for text categorisation, which restricts the choice of classifier. In this paper we investigate unsupervised dimensionality reduction techniques in active learning for text categorisation problems. Two unsupervised techniques are investigated, namely Document Frequency (DF) and Principal Components Analysis (PCA). We show empirically that active learning with a k-Nearest Neighbour classifier performs significantly better when dimensionality reduction is applied using these unsupervised techniques.

1 Introduction

Text categorisation is the task of assigning documents to a set of predefined categories [10]. Automated solutions to text categorisation have been developed using supervised learning, where a classifier is induced from a large number of labelled examples. Supervised learning assumes an abundance of labelled examples; however, this assumption does not hold in many domains. While labelled examples can be scarce, unlabelled examples are usually naturally abundant.

Active learning is a technique for constructing accurate classifiers from very small amounts of training data. It reduces the number of labelled examples required by controlling the construction of the training data and populating it only with highly informative examples. Conversely, supervised learning has no control over the training data and hence requires far more data to ensure there are sufficient numbers of informative training examples. Order-of-magnitude reductions in labelling requirements have been achieved by applying active learning to text categorisation problems [5].

In this paper we explore the difficulties arising from performing dimensionality reduction in active learning for text categorisation problems. The most successful dimensionality reduction techniques for text categorisation are supervised feature selection methods [13]. However, supervised feature selection is problematic for active learning tasks since the majority of the supplied training data is unlabelled. As text data is naturally high dimensional, the choice of classifier used in active learning is therefore limited to those which do not suffer from the curse of dimensionality [7]. We investigate the application of unsupervised dimensionality reduction to active learning on text categorisation problems. Reducing the dimensionality while retaining the discriminative features allows greater flexibility in the choice of classifier used in active learning.
To the best of our knowledge this is the first analysis of the use of unsupervised dimensionality reduction in the context of active learning for text categorisation problems. Empirical evaluations were conducted on the effect of dimensionality reduction on the performance of active learning using the k-Nearest Neighbour (kNN) algorithm. Two well-established unsupervised dimensionality reduction techniques were considered for use in active learning problems: feature selection is performed using Document Frequency with a global policy (DF), while feature extraction is performed using Principal Components Analysis (PCA).

Both techniques offer significant reductions in the size of the input data, with DF and PCA reducing dimensionality by up to 90% and 98% respectively. We demonstrate that preprocessing the data using these unsupervised dimensionality reduction techniques can significantly increase the performance of active learning using the kNN, making it more competitive with state-of-the-art classifiers such as Support Vector Machines. A brief description of active learning, in particular pool-based active learning, is given in Section 2. The unsupervised dimensionality reduction techniques are reviewed in Section 3. An empirical evaluation on real-world text corpora is presented and discussed in Section 4. Finally, conclusions and future work are given in Section 5.

2 Active Learning

The goal of active learning is to produce an accurate classifier (Φ) from as few training examples as possible. This is advantageous in domains where labelled training examples are scarce and the task of labelling is expensive. Typically, training data for supervised learning are chosen randomly prior to induction. This is referred to as passive learning, since the learner has no control over which examples constitute the training data. Conversely, active learning allows the learner to construct its own training data. Starting from a small number of labelled seed examples, an active learner will iteratively select unlabelled examples, acquire their correct labels and update the training data.

Certain examples contain more information about the problem than others. Passive learning can potentially label a large number of uninformative examples, whereas active learning attempts to select (and label) only those examples which contain the most information. Active learning can therefore significantly reduce the number of labelled examples required compared to passive learning.

2.1 Pool-Based Active Learning

In this paper we use pool-based active learning [5, 6], where the learner is supplied with a pool of unlabelled examples from which it selects queries. Algorithm 1 gives the outline of a pool-based active learner.

Algorithm 1: Pool-Based Active Learner
    Input: tr - training data (seeded with labelled examples)
    Input: ul - unlabelled examples
    for i = 0 until stopping criteria met do
        Φ_i = Induce(tr)             // Induce
        q = QuerySelect(ul, Φ_i)     // Select
        ul ← ul \ {q}                // Remove
        l = Oracle(q)                // Label
        tr ← tr ∪ {(q, l)}           // Update
    end for
    Output: Φ_F = Induce(tr)

The active learner is given a pool of unlabelled examples (ul) and training data (tr) seeded with a small number of labelled examples. In each iteration a classifier (Φ_i) is constructed from all the known labelled training data using a classification induction algorithm. The classifier can then be used by the query selection function to help select informative examples by providing predictions on unlabelled data. A query example (q) is selected using the query selection function and removed from the unlabelled pool. The true label (l) of the selected example is obtained from the oracle, an external entity assumed to be human and considered infallible. Once the true label is known, the labelled example is added to the training data, and classifiers induced in subsequent iterations will incorporate the new information. Common stopping criteria in active learning are a limit on the number of examples the oracle is willing to label, or exhaustion of the unlabelled pool. Once stopped, the output of active learning is a classifier (Φ_F) trained on all the known labelled data.
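As an illustration, the following is a minimal Python sketch of Algorithm 1. It assumes a scikit-learn style classifier, a query_select function such as the uncertainty sampling of Section 2.1.1 below, and an oracle callable returning the true label of a pool example; all names are illustrative rather than taken from the paper.

    import numpy as np

    def pool_based_active_learning(clf, X_seed, y_seed, X_pool, oracle,
                                   query_select, max_queries=250):
        # tr: labelled training data, seeded with a few examples
        X_tr, y_tr = X_seed.copy(), list(y_seed)
        pool_idx = list(range(X_pool.shape[0]))       # ul: indices still unlabelled
        for _ in range(max_queries):                  # stopping criterion: label budget
            if not pool_idx:                          # alternative: pool exhausted
                break
            clf.fit(X_tr, y_tr)                       # Induce
            q = query_select(clf, X_pool[pool_idx])   # Select (position within pool_idx)
            chosen = pool_idx.pop(q)                  # Remove
            label = oracle(chosen)                    # Label (oracle assumed infallible)
            X_tr = np.vstack([X_tr, X_pool[chosen]])  # Update training data
            y_tr.append(label)
        clf.fit(X_tr, y_tr)
        return clf                                    # Φ_F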
2.1.1 Query Selection

The query selection function is a crucial component of active learning and is responsible for selecting informative examples from the pool. A number of selection strategies have emerged in the literature [6, 1]. In this paper we use Uncertainty Sampling (US) [5] as the query selection function. US selects the example about which the current classifier (Φ_i) is most uncertain. Uncertainty is defined in terms of the confidence the classifier has in a prediction: for a probabilistic classifier, a prediction close to 0.0 or 1.0 indicates a confident prediction, while a prediction close to 0.5 indicates an uncertain one. Unlabelled examples in the pool are ranked according to their prediction uncertainty and the most uncertain example is selected as the query, as shown in Equation 1.

s = argmin_{x ∈ ul} |Φ_i(x) − 0.5|    (1)
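A possible implementation of Equation 1 for a binary task, assuming the classifier exposes a scikit-learn style predict_proba whose second column holds P(positive); this is a sketch, not the paper's code.

    import numpy as np

    def uncertainty_sampling(clf, X_candidates):
        # Select the candidate whose predicted P(positive) is closest to 0.5
        p = clf.predict_proba(X_candidates)[:, 1]
        return int(np.argmin(np.abs(p - 0.5)))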

3 Dimensionality Reduction for Active Learning

While high performance supervised feature selection techniques [13] can be applied in supervised text categorisation problems, the same techniques cannot be readily employed in active learning, since the majority of the training data supplied is unlabelled. The use of benchmark corpora can allow the use of supervised feature selection [4]; however, in real-world applications the label information is not available, which limits the applicability of this kind of approach. In general, dimensionality reduction is not performed for active learning in text categorisation problems. To compensate, classifiers capable of handling high-dimensional data are preferred, restricting the choice of classifier used in active learning experiments. In this paper we explore an alternative approach which is suitable for realistic active learning in text categorisation problems: two well-established unsupervised dimensionality reduction techniques are considered for use in conjunction with active learning.

3.1 Document Frequency Global (DF)

Document frequency [10] is a feature selection technique in which features are chosen based on the number of documents in which they occur. Rare features which occur in only a small number of documents are removed, and only the features which occur in a large number of documents are retained. Despite its simplicity, the performance of document frequency is comparable to that of the best performing feature selection methods [13], such as Information Gain. It is worth noting that stopwords are removed before dimensionality reduction is performed. Document frequency can be applied using either a local or a global policy. Local dimensionality reduction selects a set of terms for each category (context-sensitive), which obviously requires knowledge of the label information. Conversely, a global policy selects a set of the most frequent terms regardless of category (context-free), and hence does not require label information. We use document frequency applied globally as an unsupervised feature selection technique.
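A minimal sketch of global document-frequency selection as described above, assuming a dense document-term count matrix and the 10% retention rate used later in the experiments; names are illustrative.

    import numpy as np

    def df_global_select(X, keep_fraction=0.10):
        # Document frequency of each term: number of documents containing it
        df = np.count_nonzero(X > 0, axis=0)
        k = max(1, int(keep_fraction * X.shape[1]))
        return np.argsort(df)[::-1][:k]     # indices of the k most frequent terms

    # Usage: cols = df_global_select(X_train); X_train_red = X_train[:, cols]
    # The same column indices are then applied to the pool and test matrices.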
3.2 Principal Components Analysis (PCA)

Principal Components Analysis is a method for projecting high-dimensional data into a new low-dimensional space with minimal loss of information. It is an unsupervised feature extraction technique which discovers the directions of maximal variance in the data. The coordinate system of the original data is orthogonally transformed, and the new coordinates are called the principal components (sometimes called principal axes). Principal components can be found by performing eigenvalue decomposition of the covariance matrix constructed from the training data. The solution is a set of eigenvectors with associated eigenvalues: the eigenvectors are the principal components of the data, while each eigenvalue gives the amount of variance accounted for by its principal component. Principal components are sorted by their eigenvalues, so that the first principal component accounts for the largest amount of variance, the second for the second largest amount, and so on.

3.2.1 PCA for Text Categorisation

Given a set of l examples, PCA first centres the data by computing the mean µ (Equation 2) and subtracting it from each example. Centering the data is not essential but can remove irrelevant variance, as it reduces the overall sum of the eigenvalues.

µ = (1/l) Σ_{i=1}^{l} x_i    (2)

The covariance matrix C is constructed as the dot product of the centered examples, as given in Equation 3 (here centering is incorporated into the construction of the covariance matrix).

C = (1/l) Σ_{i=1}^{l} (x_i − µ)(x_i − µ)^T    (3)

The eigenvalue problem (Equation 4) is solved by performing eigenvalue decomposition on C. The solution is a set of eigenvectors (v) and their associated eigenvalues (λ).

Cv = λv    (4)

The d largest eigenvalues are sorted in descending order (λ_1 ≥ λ_2 ≥ ... ≥ λ_d) and their associated eigenvectors stacked to form the transformation matrix W = [v_1, v_2, v_3, ..., v_d]. A given example x is transformed into the reduced space by Equation 5.

y = W^T x    (5)

The value of d is an important factor in the success of PCA. Since the eigenvalues correspond to the amount of variance accounted for by their associated eigenvectors, the proportion of variance accounted for by the first d of the N eigenvectors can be calculated as:

(λ_1 + λ_2 + ... + λ_d) / (λ_1 + λ_2 + ... + λ_N)

In this paper we choose the leading d components which account for 90% of the variance in the data.
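The steps above can be sketched as follows; this assumes dense data and a plain eigendecomposition of the covariance matrix, whereas in practice one would typically use sklearn.decomposition.PCA(n_components=0.9), which selects d by the same cumulative-variance rule.

    import numpy as np

    def pca_fit(X, variance=0.90):
        mu = X.mean(axis=0)                       # Equation (2)
        Xc = X - mu
        C = (Xc.T @ Xc) / X.shape[0]              # Equation (3)
        eigvals, eigvecs = np.linalg.eigh(C)      # Equation (4); eigh since C is symmetric
        order = np.argsort(eigvals)[::-1]         # sort eigenvalues in descending order
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        ratio = np.cumsum(eigvals) / eigvals.sum()
        d = int(np.searchsorted(ratio, variance)) + 1   # smallest d reaching the threshold
        return mu, eigvecs[:, :d]                 # W = [v_1, ..., v_d]

    def pca_transform(X, mu, W):
        return (X - mu) @ W                       # Equation (5), applied row-wise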

4 Empirical Evaluation

4.1 Experimental Setup

Experiments were conducted to examine the effect of the proposed unsupervised dimensionality reduction techniques on the performance of active learning. Two standard benchmark corpora previously used in active learning research [11, 8] were employed, namely the Reuters-21578 corpus and a subset of the 20 Newsgroups corpus. The original feature set was obtained by preprocessing the corpora to remove stopwords and punctuation; stemming was performed using the Porter stemming algorithm. Reduced feature sets were constructed by applying the two unsupervised dimensionality reduction techniques to the unlabelled and seed data. DF retained only the 10% most frequent features, while PCA transformed the original data onto a d-dimensional space, where d was chosen as the number of principal components accounting for 90% of the variance in the data. Both the training and test sets were re-expressed in the reduced feature representation.

The kNN is a high-performance classifier for text categorisation [12]; however, it is sensitive to high-dimensional data. While it is not commonly used in active learning for text categorisation, we chose the kNN because it benefits greatly from dimensionality reduction. The output of the kNN was transformed into a class-membership probability estimate, where the distribution is based on the distances from the query example to its k nearest neighbours. This estimate was then used as the measure of uncertainty (as discussed in Section 2.1.1). The value of k was fixed at 3 in our experiments. The optimal value for k is typically found using validation data, which is not available in active learning; a low value of k is also important in the early iterations of active learning, when the number of training examples can be very small.

Comparisons are made between a baseline kNN using the full feature set (Full), the kNN using the dimensionality-reduced data (DF and PCA), and a top-line Support Vector Machine trained on the full feature set (SVM). The Spider toolbox for Matlab (www.kyb.tuebingen.mpg.de/bs/people/spider/) was used to perform the experiments, with the 'andre' optimiser selected for the SVM. Active learning was seeded with 4 positive and 4 negative examples, and one query example was selected per iteration. Once started, active learning was stopped only when all unlabelled examples had been selected from the pool. The performance of active learning was measured by evaluating the classifier induced in each iteration (Φ_i) on a test set. Each experiment was run ten times and the results averaged. Within each trial the same seed examples were supplied to each of the techniques.
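The distance-based kNN probability estimate described above can be realised, for example, with scikit-learn's inverse-distance neighbour weighting; the paper does not specify the exact weighting scheme, so this sketch with toy data is one plausible reading.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Toy data standing in for the labelled seed set
    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array([0, 0, 1, 1])

    # k = 3, with neighbours weighted by inverse distance
    knn = KNeighborsClassifier(n_neighbors=3, weights='distance')
    knn.fit(X_train, y_train)
    p_pos = knn.predict_proba(np.array([[0.5, 0.5]]))[:, 1]
    # p_pos feeds directly into uncertainty_sampling from Section 2.1.1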
4.2 Reuters-21578 (R10)

We used the R10 subset [2], comprising the ten most frequent categories of the ModApte split. One-versus-rest experiments were constructed for each individual category. To reduce the computational overhead of performing active learning, a pool of 1,000 documents was randomly selected from the 9,603 training documents, as in previous active learning research [11]. PCA selected, on average, the leading 306 principal components, a 98.5% reduction in dimensionality; DF retained only the top 1,987 (10%) features.

Due to the unbalanced class distribution, the F1 combination of precision (π) and recall (ρ), F1 = 2πρ / (π + ρ), was chosen as the performance metric. F1 was calculated using both macro-averaged and micro-averaged variants of precision and recall.
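For reference, both averaging variants are available directly in scikit-learn; the labels below are illustrative placeholders.

    from sklearn.metrics import f1_score

    y_true = [1, 0, 1, 1, 0, 0]   # illustrative gold labels
    y_pred = [1, 0, 0, 1, 0, 1]   # illustrative predictions
    macro_f1 = f1_score(y_true, y_pred, average='macro')
    micro_f1 = f1_score(y_true, y_pred, average='micro')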

Figure 1. Performance of active learning for R10: (a) Macro F1, (b) Micro F1. The number of active learning iterations is given on the X axis and F1 on the Y axis.

Performance of active learning on the R10 data is given in Figure 1. DF and PCA can be seen to lift the performance of active learning closer to that achieved by the top-line SVM. Of the two unsupervised dimensionality reduction techniques, PCA achieves both a greater reduction in dimensionality and a higher performance increase.

The number of iterations of active learning required to produce a classifier (Φ_i) with performance equal to that of a classifier constructed by supervised learning on all training data using the full feature set is given in Table 1. Increasing the performance of active learning correspondingly reduces the labelling effort. Both DF and PCA increase performance, resulting in reductions in the number of required labelled examples. Again PCA is seen to outperform DF.

Table 1. Iterations of active learning required to achieve supervised learning performance for R10 (percentage of pool labelled).

              Full         DF           PCA
  Macro F1    454 (46%)    324 (33%)    243 (25%)
  Micro F1    923 (93%)    444 (45%)    385 (39%)

Given the high cost of labelling, it is useful to consider halting active learning after a limited number of labels has been acquired. Stopping at 250 iterations, the increase in F1 of PCA over Full is 0.0496 (Macro) and 0.0561 (Micro), while the increase of DF over Full is 0.0219 (Macro) and 0.082 (Micro); statistical significance was assessed at α = 0.05.

4.2.1 Random Feature Selection

The observed improvements could conceivably be due simply to the positive effect that reducing the number of features has on the classifier, irrespective of the quality of the reduced set. To test that possibility we compared the performance of the baseline to random feature selection [3].

Figure 2. Performance of Rand compared to Full: (a) Macro F1, (b) Micro F1. Iterations of active learning are given on the X axis and F1 on the Y axis.

Figure 2 plots the performance of random feature selection (Rand) with respect to the original feature set (Full) on the R10 dataset (Rand was not run on the 20 Newsgroups dataset for the sake of brevity). The performance of Rand is significantly worse, which shows that the features selected by the unsupervised techniques are discriminative.

4.3 20 Newsgroups Subset (20NG)

Four 1v1 problems were constructed from the 20 Newsgroups corpus [8]: Atheism-Religion (A-R), Graphics-X (G-X), Windows-Hardware (W-H) and Baseball-Cryptography (B-C). The problems range in difficulty from easy to hard. Ten 50%/50% training/testing splits of the data were constructed and the results averaged. The average numbers of principal components chosen by PCA were: (A-R) 194, (G-X) 230, (W-H) 232 and (B-C) 210, reducing dimensionality by approximately 81% on average; DF reduced dimensionality by 90%. Due to the balanced nature of the 1v1 problems, error rate was used as the performance metric.

Figure 3. Performance of active learning for 20NG: (a) Atheism-Religion (A-R), (b) Graphics-X (G-X), (c) Windows-Hardware (W-H), (d) Baseball-Cryptography (B-C). Iterations of active learning are given on the X axis and error rate on the Y axis.

Figure 3 plots the rate of active learning for the four sub-problems. Both of the unsupervised dimensionality reduction techniques (DF and PCA) increase the performance of active learning. PCA again offers greater reductions in dimensionality and also outperforms DF on all four problems.

We compared the number of iterations required to produce a classifier (Φ_i) with error rate equal to that produced by supervised learning on all training data using the full feature set. Table 2 shows a significant reduction in the labelling effort when the dimensionality reduction techniques are employed.

Table 2. Iterations of active learning required to achieve the performance of supervised learning for 20NG (percentage of pool labelled).

         Full         DF           PCA
  A-R    672 (95%)    597 (85%)    416 (59%)
  G-X    773 (80%)    661 (68%)    580 (60%)
  W-H    616 (64%)    553 (57%)    342 (35%)
  B-C    408 (42%)    428 (44%)    210 (21%)

Stopping after just 250 iterations, the reduction in error rate of PCA compared to Full is: (A-R) 0.074, (G-X) 0.0883, (W-H) 0.0643, (B-C) 0.0517, while the reduction of DF compared to Full is: (A-R) 0.0255, (G-X) 0.028, (W-H) 0.0316, (B-C) 0.0386; statistical significance was assessed at α = 0.05.

4.4 Discussion

The empirical evaluation shows that employing unsupervised dimensionality reduction increases the performance of active learning using a kNN. DF offered some performance increase compared with the baseline (Full). This increase was shown to result from the selection of discriminative features, since random feature selection failed to achieve any increase in performance.

PCA outperformed DF in all of the experiments conducted. There are some noticeable differences between the two techniques which may account for the increased performance. While DF statically reduced the dimensionality of the data, PCA dynamically reduced the dimensionality until the majority of the variance in the data was accounted for. In the 20NG experiments, for instance, dimensionality was reduced to just 230 features in the G-X sub-problem. Classification in the reduced feature set was consequently considerably easier, leading to higher performance of active learning and a large reduction in the labelling effort (580 iterations compared to the baseline of 773). While PCA was the best performing technique, its computational expense is far greater, which limits its applicability to very large datasets; DF offers some increased performance at much lower computational expense.

5 Conclusions and Future Work

Supervised dimensionality reduction techniques cannot be readily employed in active learning scenarios since the majority of the training data is unlabelled. The choice of classifier used in active learning is therefore limited to those which do not suffer from the curse of dimensionality. This paper investigated the use of well-established unsupervised dimensionality reduction techniques in active learning on text categorisation problems, to increase performance and allow greater flexibility in the choice of classification algorithm. Empirical evaluations on two benchmark corpora show that both Document Frequency applied globally (DF) and Principal Components Analysis (PCA) significantly increased the performance of active learning when using a kNN. In both sets of experiments PCA was found to outperform DF; however, this increased performance comes with the higher computational overhead associated with conducting PCA. We plan to continue this research by looking at Kernel Principal Components Analysis (KPCA) [9], which allows non-linear principal components to be found.

Acknowledgements

This research is funded by the Irish Research Council for Science, Engineering and Technology (IRCSET).

References

[1] M. Davy and S. Luz. Active learning with history-based query selection for text categorisation. Proceedings of the 29th European Conference on Information Retrieval Research (ECIR 2007), LNCS 4425, page 695, 2007.

[2] F. Debole and F. Sebastiani. An analysis of the relative hardness of Reuters-21578 subsets. Journal of the American Society for Information Science and Technology, 56(6):584-596, 2005.

[3] G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289-1305, 2003.

[4] S. Hoi, R. Jin, and M. Lyu. Large-scale text categorization by batch mode active learning. Proceedings of the 15th International Conference on World Wide Web, 2006.

[5] D. Lewis and W. Gale. A sequential algorithm for training text classifiers. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3-12, 1994.
[6] A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. Proceedings of the 15th International Conference on Machine Learning, pages 350-358, 1998.

[7] T. Mitchell. Machine Learning. McGraw-Hill, 1997.

[8] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. Proceedings of the 17th International Conference on Machine Learning, pages 839-846, 2000.

[9] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. Advances in Kernel Methods - Support Vector Learning, pages 327-352, 1999.

[10] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002.

[11] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45-66, 2001.

[12] Y. Yang and X. Liu. A re-examination of text categorization methods. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 42-49, 1999.

[13] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. Proceedings of the 14th International Conference on Machine Learning, 1997.