Semi-Supervised Learning in Diagnosing the Unilateral Loss of Vestibular Functions

Final Report for COMP150-05, 2011 Spring

Mengfei Cao
Dept. of Computer Science, Tufts University
161 College Ave., Medford, MA, U.S.
mcao01@cs.tufts.edu

ABSTRACT

Although various vestibular tests have been conducted on subjects in order to diagnose loss of vestibular function, subjects with certain test results cannot be diagnosed. These subjects are referred to as the non-definite group and make up 26.80% of our dataset. This paper presents an approach to classifying subjects with any possible test results based on a semi-supervised learning strategy. In addition, the redundancy of the various vestibular tests is explored. The approach uses a modified co-training framework; moreover, a particular missing-values problem in the vestibular test data, which we call the Missing Sets (MS) problem, is handled naturally by the co-training embedding, in which the whole feature space is split into several sub-test subspaces. The MS problem differs from the usual missing-values problem in that the missing values are concentrated within individual sub-tests rather than spread randomly: either none or all of a test's results are missing. Owing to the nature of the modified co-training method, the distinguishability of the different vestibular tests can be compared and the MS problem can be solved. The work is evaluated on a clinical dataset of 9102 subjects. The experiments illustrate that the modified co-training method provides a stable, well-performing learner, and that a great deal of overlapping information exists among the various vestibular tests.

Keywords

Semi-supervised learning, minority class problem, missing sets problem, vestibular function diagnosis
1. INTRODUCTION

(Comp-150, Machine Learning in Predictive Medicine, by Prof. Schmid and Prof. Brodley.)

Patients who see a specialist in vestibular or balance disorders are asked to undergo seven different physical tests to determine whether they have unilateral loss of vestibular function (balance) or are normal; each test produces 7-22 different measures. We are interested in whether machine learning methods can be applied to this data to diagnose patients as normal or abnormal (i.e., unilateral loss). One challenge in this data is that patients often choose not to undergo one or more of the tests because the tests seem intimidating, or they may quit halfway through due to excessive dizziness and nausea. Thus patients may be missing one or more sets of features. We call this the missing set problem, and in this paper we provide a method for applying machine learning in this scenario.

A second challenge is that our data is imbalanced. In the population of patients who visit an Otoneurology department, XX% of patients are abnormal. In a dataset collected by Dr. Lewis and associates at Mass Eye and Ear, 11% of the cases are abnormal. A difficulty in creating a classifier for a dataset with a skewed class distribution is that sensitivity often suffers while specificity is boosted [3].

Finally, we were provided with both labeled and unlabeled data. What makes this a particularly interesting challenge is that the data is unlabeled because the doctors are unsure of the label given the tests. Note that there are other, non-vestibular tests that can help them make a differential diagnosis in these cases, but they are interested in whether these cases can be classified accurately based on the vestibular data alone. A related issue of interest is whether any subset of the seven vestibular tests is redundant with respect to prediction performance; in that case considerable time and patient discomfort could be spared.
In the remainder of this paper we first review the related work in the Otoneurology literature. We next present our approaches to the missing set problem and the class imbalance problem, and we present a method for applying co-training to utilize the unlabeled data so that we can classify patients that the physicians were unable to. We then present results of an empirical evaluation showing that the co-training strategy, which learns from both labeled and unlabeled data, performs at least as well as a learner trained on labeled data alone, while also being able to classify the unlabeled data; in addition, the method for dealing with the missing set problem improves prediction performance compared to learning without splitting. We conclude that the co-training semi-supervised learning method fits the vestibular diagnosis setting and that the splitting strategy helps solve the missing set problem.

2. CLASSIFYING VESTIBULAR FUNCTION

Currently, there are no automated methods that can effectively diagnose all patients; specifically, for some patients' test results, past strategies make conflicting diagnoses. For example, 26.80% of the patients in our data are non-definite. However, obtaining a diagnosis in these cases would help inform decisions on medication or surgical intervention. Previous work on diagnosing vestibular loss uses 1) a linearized threshold parameter, noted as paresis, originating from the electronystagmogram test (ENG test) [?][13], or 2) a linearized threshold parameter, noted as time constant, originating from the sinusoidal vertical-axis rotation test (SVAR test) [5]. However, these two parameters do not cover the whole space of subjects' test results, and they make conflicting predictions on the non-definite group of subjects (see Figure 1).

3. A SEMI-SUPERVISED LEARNING APPROACH

Because we are given both labeled and unlabeled data, we can formulate the problem as a semi-supervised learning problem. Namely, we are given a dataset D and a vector Y ∈ {0, 1}^m of labels, where |D| = n and m < n; each element s ∈ D represents a subject's whole set of vestibular test values and can be noted as a 97-dimensional vector. The values in the vector can be nominal or numerical, and the 97 values for each subject come from the full battery of seven vestibular tests. For the label vector Y, we use 0 to indicate normal and 1 abnormal; without loss of generality, a subject s is normal if Y(s) = 0 and abnormal if Y(s) = 1. All subjects that have a label form a subset L of D, and the unlabeled dataset is U = D \ L; the sizes of L and U are m and q, so n = m + q. The purpose of the proposed semi-supervised learning algorithm is to take advantage of both L and U, together with Y, to train a classifier C such that for any subject s in the feature space, C outputs a definite label for s.
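As a concrete illustration of this setup, the following sketch builds random stand-in data with the paper's dimensions. Only the 97-feature dimensionality and the n = m + q split of D into labeled L and unlabeled U come from the text; the sizes and values here are made up for illustration.

```python
import numpy as np

# Hypothetical sizes for illustration; only the 97 features are from the paper.
N_FEATURES = 97
rng = np.random.default_rng(0)
n, m = 100, 60                        # |D| = n subjects, m of them labeled
D = rng.normal(size=(n, N_FEATURES))  # each row: one subject's test values
Y = rng.integers(0, 2, size=m)        # 0 = normal, 1 = abnormal
L, U = D[:m], D[m:]                   # labeled subset L, unlabeled U = D \ L
q = len(U)                            # so that n = m + q
```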
In the vestibular function diagnosis problem, there are two issues to address: the missing set issue and the imbalanced data issue. We next elaborate our approaches to each, followed by the semi-supervised framework that embeds the whole work.

Figure 1: Distribution of the Instances w.r.t. the Two Parameters

Before moving to our approach we first enumerate the seven vestibular tests for which we have data [13]. These seven tests are the standard battery given to patients with balance function issues:

1. POSIT test (positional test)
2. Caloric test
3. Gaze test
4. Saccade test
5. SVAR test (sinusoidal vertical-axis rotation test)
6. VVI test (visual-vestibular interaction test)
7. EQUI test (motor control test and sensory organization test)

3.1 Missing Set

During the full vestibular test battery, a subject can choose not to participate in one or more of the seven vestibular tests. In addition, it frequently happens that a subject quits a specific sub-test halfway through due to dizziness, in which case all of the data for that test are discarded. Thus, we have sets of missing values: either all of the values for a test are present or all of them are missing. Note that this situation differs from values missing at random; we call this specific missing-values problem the Missing Sets (MS) problem.

Classic methods for handling missing data are maximum likelihood estimation and Bayesian multiple imputation [11]; these methods are applicable to data missing at random. In our case we have missing sets, so we explore an alternative method. In particular, we first split the whole feature space into k subspaces, one for each of the k sets of features, for both the labeled data L and the unlabeled data U (for the vestibular data this results in 2 × 7 sets). We can then treat the split labeled data as k training sets.
Note that each feature is in one and only one of these k sets. Because for some data points one or more subspaces consist entirely of missing values (a missing set), the sizes of the split sets need not equal the size of the original set. We create the k sets of training data as follows: for each instance p in L, we put its q (≤ k) available sets of features into the k sets, giving L1, L2, ..., Lk. In the experimental stage, stratified sampling is applied to extract training and test data; this sampling routine is applied to all k sets, yielding k sets of training data. After forming these k sets, we train k classifiers, one for each dataset Di, i = 1, 2, ..., k.
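The splitting step can be sketched as follows. This is a minimal illustration under two assumptions that are not from the paper: missing values are encoded as NaN, and the k feature groups are given as lists of column indices.

```python
import numpy as np

def split_by_test(X, y, groups):
    """Split feature matrix X into one dataset per feature group,
    dropping rows whose entire group is missing (a 'missing set')."""
    datasets = []
    for cols in groups:
        sub = X[:, cols]
        present = ~np.isnan(sub).all(axis=1)  # keep rows where the test was taken
        datasets.append((sub[present], y[present]))
    return datasets

# Toy example with 2 tests: the second subject skipped test B entirely.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, np.nan]])
y = np.array([0, 1])
groups = [[0], [1, 2]]                # test A: column 0; test B: columns 1-2
sets_ = split_by_test(X, y, groups)   # test A keeps both rows, test B keeps one
```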

Whenever a new instance arrives, we combine the predictions of the l ≤ k classifiers, where l is the number of sets for which the instance has values, i.e., the number of tests the patient participated in. The proposed strategy has three advantages: first, it guarantees that at each learning stage the learner can take advantage of all the data that exist, avoiding the effect of other patients' missing values; second, it fits naturally into a semi-supervised learning framework that combines information from both labeled and unlabeled subjects; third, it allows us to consider each subspace separately, which is significant for problems such as the vestibular project, where the relationship between the different vestibular tests remains to be explored. This problem has not been discussed before and is thus still an open issue to be optimized.

In Table 1, three methods are compared: the first spreads instances with missing values fractionally along the branches of the decision tree; the second fills missing values with the average value within the same class; the last row represents the proposed approach. The method of filling with average values tends to construct a single-node tree that classifies all the data as normal, yielding extremely low sensitivity (0.27%); compared to the method of spreading instances with missing values, our approach has higher sensitivity, at 95.47%. Although the first method has the highest specificity, at 98.61%, our approach still keeps specificity at 94.92% with a nearly 13% gain in sensitivity. Thus our method provides a more balanced and still relatively good trade-off between sensitivity and specificity.

3.2 Class Imbalance

The minority class problem is common in disease statistics. In the clinical vestibular data available, the percentage of abnormal subjects is as low as 11%.
This causes some machine learning techniques to degrade heavily [3]. For instance, when classifiers are trained using the C4.5 decision tree method on the EQUI, GAZE, and POSIT data, all of the resulting decision trees contain only a single node, Normal, serving as both root and leaf, and thus classify all subjects as normal; yet they still achieve high accuracy.

In order to avoid the effect of the minority class problem and improve the classifiers' descriptive capacity, two sampling methods, oversampling and undersampling [9][8][3], are compared with learning without sampling. Oversampling replicates minority-class data so that during the learning procedure the minority class is weighted more heavily; after duplication the size of the tree increases, but noise in the data is also amplified. Undersampling samples from the majority class a number of examples equal to the size of the minority class to form the training data. These two methods therefore give different results, one biasing sensitivity and the other specificity. According to the results in Figure 2, both sampling techniques boost sensitivity while degrading specificity. By choosing an appropriate threshold for voting among the seven tests, both sampling methods can achieve at least one best measurement. In particular, when a false negative is more costly than a false positive, the sampling techniques should be used. Since it is hard to quantify the cost of a false positive versus a false negative, these two methods, together with the method without sampling, add flexibility for clinicians to choose.

Figure 2: Comparison of Different Sampling Methods w.r.t. Different Votes to that Without Splitting

3.3 Co-training

Traditional supervised learning strategies give only L to the learner, obtaining a classifier that is totally independent of U; such a classifier may not be fully descriptive over the whole feature space, and it wastes the unlabeled data. Most seriously, this classifier cannot deal with the non-definite group, because no data from that group appears in the training data. Therefore, a semi-supervised framework is more suitable in this situation [2]. Previously, generative models [7] were applied to text classification; self-training [14] is a commonly used method for semi-supervised learning; the transductive support vector machine [12] and graph-based methods [1] have also been used in these fields; and the co-training method introduced by [2] provides a nice intuition for data that naturally split into subsets. These methods work well in fields such as text classification, image categorization, and object detection, yet issues such as local maxima and high model complexity remain to be solved. Given the characteristics of the vestibular data, a method is required that can deal with the minority class issue and the MS problem, and that can easily embed a comparison of the distinguishability of the different vestibular tests.
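The two sampling schemes of Section 3.2 can be sketched as follows. This is a generic illustration (the helper names and the label-1 minority convention are our own), not the exact procedure used in the experiments.

```python
import random

def oversample(data, labels, minority=1):
    """Duplicate minority-class examples until both classes are equal in size."""
    minor = [i for i, y in enumerate(labels) if y == minority]
    major = [i for i, y in enumerate(labels) if y != minority]
    extra = [random.choice(minor) for _ in range(len(major) - len(minor))]
    idx = major + minor + extra
    return [data[i] for i in idx], [labels[i] for i in idx]

def undersample(data, labels, minority=1):
    """Keep the minority class and sample an equal number from the majority."""
    minor = [i for i, y in enumerate(labels) if y == minority]
    major = [i for i, y in enumerate(labels) if y != minority]
    idx = minor + random.sample(major, len(minor))
    return [data[i] for i in idx], [labels[i] for i in idx]
```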

Table 1: Comparison results (with std) among different methods of handling missing values (MVs), based on the J48 decision tree algorithm

Method                      Accuracy          Specificity       Sensitivity
Spread MVs Along Branches   96.80% ± 0.60%    98.61% ± 0.44%    82.53% ± 3.95%
Replace Missing Values      88.74% ± 0.09%    99.98% ± 0.05%    0.27% ± 0.56%
Proposed Method             94.92% ± 0.74%    94.85% ± 0.91%    95.47% ± 3.34%

We propose an approach based on a modified co-training framework, combining both labeled and unlabeled data and embedding the C4.5 algorithm [10][6] in each of the 7 vestibular tests separately, resulting in a well-performing learner that is applicable to data with the MS problem and the minority class issue. In addition, the proposed co-training-based method allows different supervised learning methods to be embedded in the subspaces, making it adaptive to any specific data.

As stated before, the purpose is to combine the labeled data L and the unlabeled data U and train a binary classifier that classifies any subject from their vestibular test results. We propose a framework based on co-training. First, we train a set C1 of classifiers on the labeled data L1, ..., Lk and their label vector Y; since the feature space has been split into k subspaces, this stage produces k classifiers. Second, we apply these k classifiers to the unlabeled data U1, ..., Uk, obtaining a label vector Y_Ui for each Ui, i = 1, ..., k. Third, we derive the label vector Y_U for U by majority vote over Y_U1, ..., Y_Uk, and merge Y and Y_U into Y_all. Then we train a second set C2 of classifiers on the data L1 + U1, ..., Lk + Uk with the label vector Y_all. Finally, we organize the classifiers in C2 to obtain the final classifier C, which is constructed over the whole feature space. These separate classifiers can be organized simply by voting or, more sophisticatedly, by training on their predictions over the training data.
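The two-run framework above can be sketched as follows. This is a minimal illustration, not the paper's WEKA/C4.5 implementation: a tiny decision stump stands in for J48, and the sketch assumes every unlabeled subject took all k sub-tests (i.e., it ignores missing sets for brevity).

```python
import numpy as np

class Stump:
    """Tiny one-split decision stump, a stand-in for the paper's C4.5/J48."""
    def fit(self, X, y):
        best = (np.inf, 0, 0.0, 0, 0)
        for f in range(X.shape[1]):
            for t in np.unique(X[:, f]):
                left, right = y[X[:, f] <= t], y[X[:, f] > t]
                for ll in (0, 1):
                    for rl in (0, 1):
                        err = (left != ll).sum() + (right != rl).sum()
                        if err < best[0]:
                            best = (err, f, t, ll, rl)
        _, self.f, self.t, self.ll, self.rl = best
        return self
    def predict(self, X):
        return np.where(X[:, self.f] <= self.t, self.ll, self.rl)

def cotrain(L_sets, y, U_sets, learner=Stump):
    """L_sets/U_sets: one (rows x features) array per sub-test subspace."""
    # Run 1: one classifier per subspace, trained on labeled data only.
    C1 = [learner().fit(Li, y) for Li in L_sets]
    # Majority vote of the k run-1 classifiers labels the unlabeled data.
    votes = np.array([c.predict(Ui) for c, Ui in zip(C1, U_sets)])
    yU = (votes.mean(axis=0) >= 0.5).astype(int)
    # Run 2: retrain each classifier on labeled + newly labeled data.
    y_all = np.concatenate([y, yU])
    C2 = [learner().fit(np.vstack([Li, Ui]), y_all)
          for Li, Ui in zip(L_sets, U_sets)]
    return C2, yU
```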
The three characteristics of this classifier are: first, it is trained from both labeled and unlabeled data; second, it is applicable to subjects with any vestibular test results, instead of the partial coverage of the earlier work based on the two parameters; third, the intermediate result C2 can be used to explore the relationship between the different vestibular tests.

3.4 Evaluation Methodology

The evaluation is based on comparing the classifiers. We apply modified 10-fold stratified cross-validation: after sampling the labeled data into 10 groups, we run the learning process 10 times; in each run, 9 groups serve as training data and the remaining group as test data. Since it is not easy to tell whether a false positive or a false negative is more unforgivable, the better and more common way to quantify the performance of a diagnostic classifier is to report the accuracy, sensitivity, and specificity, defined as follows:

accuracy = #{correctly classified subjects} / #{all subjects}

sensitivity = #{subjects correctly classified as abnormal} / #{abnormal subjects}

specificity = #{subjects correctly classified as normal} / #{normal subjects}

Figure 3: The framework of the proposed approach

3.5 Exploration of the Relationship between Vestibular Tests

Based on the work above, the relationship between the seven vestibular tests can be explored by processing the separate classifiers derived from these seven tests, namely C2. In order to capture the redundancy of the vestibular tests, we calculate a redundancy matrix, where each cell is the percentage of the data for which the two corresponding classifiers make identical decisions; each classifier is trained on a different vestibular test. For the vestibular project we thus construct a 7 × 7 symmetric redundancy matrix whose diagonal values are 1.
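The redundancy matrix just described can be computed as in this sketch; the interface (a list of each sub-test classifier's 0/1 decisions on a common set of subjects) is an assumption made for illustration.

```python
import numpy as np

def redundancy_matrix(pred_lists):
    """pred_lists[i]: the i-th sub-test classifier's 0/1 decisions on the
    same subjects; cell (i, j) is the fraction of identical decisions,
    so the diagonal is 1 by construction."""
    preds = [np.asarray(p) for p in pred_lists]
    k = len(preds)
    R = np.ones((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            R[i, j] = R[j, i] = float((preds[i] == preds[j]).mean())
    return R
```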
Although ideally each percentage would trivially be 100% for each pair of tests, since all the classifiers aim to make correct classifications, in fact they are not perfect at diagnosing all the subjects in the whole feature space. The resulting percentage, which reflects both correct and incorrect classifications, indicates how much overlapping information each pair of vestibular tests provides.

4. EXPERIMENTS

4.1 Data

The experiments are conducted on the clinical data provided by Prof. Lewis. After preprocessing to remove bilateral subjects and other irrelevant cases, there are 9102 subjects' vestibular test results, of which 6663 are labeled as normal or abnormal and 2439 are potentially diagnosed as unilateral loss patients or normal. 10-fold cross-validation is applied, and all results are presented as the average over the 10 folds. As indicated above, each subject's test results are split into seven parts according to the seven sub-tests (Table 2).

Table 2: The size of the feature vector for each test

Test      Size of features
GAZE      20
EQUI      22
SACC      6
SVAR      21
VVI       12
CALORIC   6
POSIT     10
Total     97

4.2 Results & Analyses

4.2.1 Comparison of Different Methods to Deal with Missing Values

In Table 1, three methods are compared. The first classifier is learned using the default setting of the C4.5 decision tree algorithm [10][4], in which an instance with missing values is split and spread fractionally along a node's children in the tree. The second method replaces missing values with the median within the same class. Our method gives the most balanced performance between specificity and sensitivity, while the other two both yield relatively low sensitivity.

4.2.2 Comparison of the Classifiers Before and After Semi-supervised Learning

Table 3 shows the results of classifiers trained only on the training data (Run 1) and classifiers trained after co-training (Run 2). For each of the sub-tests, the supervised algorithm is C4.5 from WEKA [10][4]. The overall results come from voting among the seven test classifiers with voting threshold 1. The test set consists of labeled data only; in other words, it does not include any non-definite data. Thus, if the non-definite data seen during training affected the classifiers negatively, the output classifier would degrade substantially. However, in our results (Table 3), the accuracy, specificity, and sensitivity do not change much (they differ by less than 3%).
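The accuracy, sensitivity, and specificity figures reported in this section follow the definitions of Section 3.4; a minimal sketch of their computation, treating abnormal = 1 as the positive class:

```python
def diagnostic_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity as defined in Section 3.4
    (abnormal = 1 is the positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # abnormal found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # normal found
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / pos if pos else float("nan"),
        "specificity": tn / neg if neg else float("nan"),
    }
```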
What is worth noticing is that the classifiers output after the second run are trained on both the original labeled data and the non-definite data; therefore, they can also describe the non-definite data and are thus applicable to the non-definite cases that could not be handled in the prior literature.

4.2.3 Exploration

Table 4 gives the redundancy matrix on the unlabeled data. Each entry is the percentage of data on which the pair of tests makes identical decisions; whether the classifiers' decisions are correct is unknown. What is quite surprising is that the classifiers trained on the GAZE and EQUI tests share exactly the same decisions on all the unlabeled data. Assuming the experiments are valid and the test data is sufficiently large, we can conclude that at least one of these two tests is redundant.

5. CONCLUSIONS

Using the modified co-training framework, we obtain a classifier whose training is robust to the minority class problem and the MS problem. The redundancy of the different vestibular tests can also be quantitatively represented, providing clinical scientists with useful information about the value and characteristics of these tests. Further work centers on improving the semi-supervised framework and on exploring the relationship between the vestibular tests more thoroughly. For the former, attempts to organize the classifications from the different tests, such as stacking the trees, and attempts to use more sophisticated and suitable methods for the minority class problem, may be good starting points.

6. ACKNOWLEDGMENTS

I would like to thank Prof. Schmid, Prof. Lewis and Prof. Brodley for the valuable opportunity and much advice on this work. I am grateful for Dr. Small's and Dr. Navdeep's advice and help!

7. REFERENCES

[1] R. W. Baloh and J. M. R. Furman. Modern vestibular function testing. Medical Progress, 1989.
[2] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 19-26, 2001.
[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT '98, pages 92-100, 1998.
[4] N. V. Chawla. C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure. In Proceedings of the ICML'03 Workshop on Class Imbalances, 2003.
[5] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. SIGKDD Explorations Newsletter, 11, 2009.
[6] D. Merfeld, R. Lewis, et al. Potential solutions to several vestibular challenges facing clinicians. Journal of Vestibular Research, 2010.
[7] T. M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
[8] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. In Machine Learning, pages 103-134, 1999.
[9] T. Oates and D. Jensen. The effects of training set size on decision tree complexity. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 254-262. Morgan Kaufmann, 1997.
[10] T. Oates and D. Jensen. Toward a theoretical understanding of why and when decision tree pruning algorithms fail. In Proc. 16th National Conference on Artificial Intelligence, pages 372-378. AAAI Press, 1999.

Table 3: Comparison of the classifiers before and after semi-supervised learning

                      GAZE     EQUI     SACC     SVAR     VVI      CALORIC  POSIT    Overall
Accuracy of Run1      88.72%   88.72%   88.72%   95.37%   93.28%   96.65%   88.72%   95.23%
Accuracy of Run2      88.72%   88.72%   88.69%   94.66%   93.08%   95.70%   88.72%   93.23%
Specificity of Run1   100.00%  100.00%  100.00%  97.86%   97.80%   98.90%   100.00%  95.19%
Specificity of Run2   100.00%  100.00%  99.97%   96.69%   96.75%   98.02%   100.00%  92.97%
Sensitivity of Run1   0.00%    0.00%    0.00%    75.73%   57.73%   78.93%   0.00%    95.60%
Sensitivity of Run2   0.00%    0.00%    0.00%    78.67%   64.27%   77.47%   0.00%    95.33%

Table 4: The redundancy matrix on the unlabeled data

Test      GAZE     EQUI     SACC     SVAR     VVI      CALORIC  POSIT
GAZE      100.00%  100.00%  99.87%   60.36%   67.85%   66.39%   100.00%
EQUI      100.00%  100.00%  99.87%   60.36%   67.85%   66.39%   100.00%
SACC      99.87%   99.87%   100.00%  60.49%   67.87%   66.30%   99.87%
SVAR      60.36%   60.36%   60.49%   100.00%  85.63%   38.81%   60.36%
VVI       67.85%   67.85%   67.87%   85.63%   100.00%  43.48%   67.85%
CALORIC   66.39%   66.39%   66.30%   38.81%   43.48%   100.00%  66.39%
POSIT     100.00%  100.00%  99.87%   60.36%   67.85%   66.39%   100.00%

[11] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
[12] J. L. Schafer and J. W. Graham. Missing data: our view of the state of the art. Psychological Methods, 2002.
[13] M. Seeger. Learning with labeled and unlabeled data. Technical report, Institute for Adaptive and Neural Computation, University of Edinburgh, 2002.
[14] C. Wall, R. F. Lewis, and S. D. Rauch. Surgery of the Ear and Temporal Bone, Chapter 5: Evaluation of the Vestibular System. Lippincott Williams & Wilkins, 2004.
[15] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189-196, 1995.