Gene-Expression Microarrays Classification using Feature Selection and Support Vector Machines

Darcy Davis, Allison Hanuschak, Alina Lazar
Department of Computer Science and Information Systems
Youngstown State University

1. Introduction

Every living organism contains genetic material inside its cells, which is transmitted from one generation to the next. The genetic material encoded in each cell is composed of nucleic acid (DNA). The DNA molecule is organized into segments called genes. An organism has the same genes in all its cells, but the genes can be in different stages of activity at different times. The genetic information stored in DNA may be transcribed into complementary RNA molecules, which in turn may be translated into proteins. Many complex human diseases, especially cancer, are correlated with abnormal functionality at this level.

After 1996, a new technology called DNA microarrays gave researchers the possibility to capture a global picture of gene activity in the cell. The study of gene-expression microarrays is a reasonably new development in biology that allows thousands of genes to be studied simultaneously. Each microarray is a silicon chip on which gene probes are aligned in a grid pattern. Measurements are made using fluorescent detection. The fact that today one microarray can be used to measure all the human genes has led to advances in the diagnosis and prognosis of diseases and also in drug discovery [1]. However, the amount of data in each microarray is too large for manual analysis, since a single sample often contains measurements for around 10,000 genes. Because of this volume of information, producing results efficiently requires automated, computer-based analysis of the data. By using machine learning techniques [2, 3], the computer can be trained to recognize patterns that classify the microarrays biologically.
Assuming that half of the instances are from healthy patients and half are from patients who have a disease such as cancer, machine learning algorithms can find gene combinations that distinguish and separate the healthy patients from the sick ones.

The main data analysis techniques [4] currently used in biomedical applications related to microarrays are classification, clustering and gene selection. In contrast with data sets from other fields, a typical microarray data set has a large number of genes (~10,000) and a small number of samples (~100). However, we can expect that not all the genes carry relevant information for a particular classification task. The process of selecting only the important components is called gene or feature selection [5]. Supervised learning, or classification, algorithms can be used to classify and predict disease outcomes. Unlike classification, clustering does not use a tissue annotation as a decision label, and it is used to discover new biological classes.

Several machine learning algorithms [6] have previously been used to classify microarray data sets, including decision trees, Fisher linear discriminant analysis, nearest neighbor, neural networks, Bayesian networks and support vector machines. The supervised machine learning method known as the support vector machine (SVM) [7, 8] has been used to analyze preexisting microarray data sets and diagnose cancer. As it is unreasonable to expect perfect diagnosis with limited knowledge of cancer, the goal is to optimize the correctness of the diagnosis by employing different methods for using and training the SVM. Applying machine learning algorithms to DNA microarray data sets is of great importance for future medical research related to gene expression analysis for disease classification and to genotyping for diagnosis and drug discovery.

2. Specific Questions

The goal of our proposed research is to use supervised learning to classify and predict cancer or other diseases, based on the gene expressions collected from microarrays. These microarrays give us information about the rate at which a certain gene is expressed, in other words, the rate at which its DNA is transcribed into RNA and then translated into the corresponding protein. Today, there are many freely
available public microarray data sets to analyze and use in our research. Table 1 summarizes the data sets that will be used in the present research.

Table 1. Publicly Available Microarray Datasets

Name                                                     URL
National Center for Biotechnology Information            http://www.ncbi.nlm.nih.gov/geo/
Stanford Microarray Database                             http://genome-www5.stanford.edu/
University of Pittsburgh Microarray Dataset Collection   http://bioinformatics.upmc.edu/help/upittged.html
Kent Ridge Bio-medical Data Set Repository               http://sdmc.lit.org.sg/gedatasets/datasets.html

Known sets of data will be used to train the machine learning procedures to categorize cancer patients according to their prognosis. The accuracy of the routines developed will then be tested against a separate set of known data. The outcome of this study will provide information about the effectiveness of machine learning techniques, in particular SVM methods, in discovering patterns related to genetic disorders, and will also allow the identification of relevant types of gene expression. These could be abnormal expression rates for a particular gene, the presence or absence of a particular gene or sequence of genes, or a pattern of unusual expression across a gene subset. Subsequently, SVM methods with different parameters will be applied to identify the best ones in terms of accuracy, efficiency and fewest false positives [9]. It is envisioned that this would help guide physicians in determining the best treatment for a patient, for example how aggressive a course of treatment should be.

3. Methods

Two of the most important and difficult problems in microarray data analysis relate to the dimensionality of the data and to noise. Because many data analysis techniques involve exhaustive search over the object space, their running time is very sensitive to the size of the data.
In the case of microarrays, the solution is to reduce the search space vertically (in terms of genes) by using a feature selection method. The other problem is that errors occur during data collection; these are referred to as noise in the data.

Supervised learning methods based on statistical learning theory, for classification and regression, provide good generalization and classification accuracy on real data. However, their inherent trade-off is their computational expense. Recently, support vector machines (SVM) [10] have become a popular learning tool since they map the input data into a higher-dimensional feature space where the instances are linearly separable, while the kernel avoids computing that mapping explicitly, which keeps the computation efficient. In the SVM methods, a kernel, which can be considered a similarity measure, is used to recode the input data. The kernel is accompanied by a map function Φ: it computes the inner product of two instances after Φ has mapped them into the feature space. Even though the mathematics behind the SVM is straightforward, finding the best choices for the kernel function and parameters can be challenging when applied to real data sets. We will use the LIBSVM library developed by Chang and Lin [11]. Usually, the recommended kernel function [12] for nonlinear problems is the Gaussian radial basis function, because it resembles the sigmoid kernel for certain parameters and it requires fewer parameters than a polynomial kernel. The kernel function parameter γ and the parameter C, which controls the trade-off between the complexity of the decision function and the minimization of the training error, can be determined by running a two-dimensional grid search, which means that values for pairs of parameters (C, γ) are generated in a predefined interval with a fixed step. The performance of each combination is computed and used to determine the best pair of parameters.

The non-sparse property of the solution leads to a very slow evaluation process. Thus, for the microarray data sets, a data reduction [13] can be done in terms of the genes or features of the data set considered.
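
The two-dimensional grid search over (C, γ) described above can be sketched as follows. This is only an illustrative outline, not the project's implementation: the function names (`rbf_kernel`, `exponent_grid`, `grid_search`, `toy_score`) are hypothetical, the grid is assumed to use a fixed step in the exponent of powers of two, and `toy_score` is a made-up stand-in for the cross-validated accuracy that an SVM library would report for each pair.

```python
import itertools
import math


def rbf_kernel(x, y, gamma):
    """Gaussian radial basis function: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def exponent_grid(low, high, step):
    """Exponents low, low + step, ... up to and including high."""
    count = int((high - low) // step) + 1
    return [low + i * step for i in range(count)]


def grid_search(score, c_exponents, gamma_exponents):
    """Score every pair (C, gamma) = (2**i, 2**j) and return the best one.

    `score` is any callable (C, gamma) -> float where higher is better,
    for example a cross-validated accuracy (assumption: not specified
    in the text above).
    """
    best_pair, best_score = None, -math.inf
    for i, j in itertools.product(c_exponents, gamma_exponents):
        c, gamma = 2.0 ** i, 2.0 ** j
        value = score(c, gamma)
        if value > best_score:
            best_pair, best_score = (c, gamma), value
    return best_pair, best_score


def toy_score(c, gamma):
    """Hypothetical stand-in for accuracy; peaks at C = 2**3, gamma = 2**-5."""
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 5) ** 2)


if __name__ == "__main__":
    c_exps = exponent_grid(-5, 15, 2)      # assumed interval for C
    gamma_exps = exponent_grid(-15, 3, 2)  # assumed interval for gamma
    print(grid_search(toy_score, c_exps, gamma_exps))
```

In an actual experiment, `toy_score` would be replaced by a function that trains an SVM with the given C and γ and returns its validation accuracy; the cost of the search grows with the product of the two grid sizes, which is one reason the data reduction discussed here matters.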
Redundant or highly correlated features can be replaced with a smaller number of uncorrelated features that capture most of the information. This is done by applying a method called Principal Component Analysis (PCA) before using the SVM algorithm. The method is performed by solving an eigenvector problem or by using iterative algorithms, and the result is a set of orthogonal vectors called principal components. The mapping of the larger set into the new smaller set is done by projecting the initial instances onto the principal components. The first principal component is defined
as the direction of the best straight-line fit through the input data, in the sense of minimizing the orthogonal distances from the points to the line. This direction holds the maximum variance in the input data. The second component is orthogonal to the first, uncorrelated with it, and is defined to maximize the remaining variance. This procedure is repeated until the last vector is obtained.

The envisioned research will follow the main steps of knowledge discovery processes:

- Gene selection - the irrelevant attributes (genes) are removed and the selected data is represented as a two-dimensional table.
- Preprocessing - if the selected table contains missing values or empty cell entries, the table must be preprocessed in order to remove some of the incompleteness. Statistics should be run to obtain more information about the data.
- Training and validation sample - the initial table is divided into at least two tables by using a cross-validation procedure. One will be used in the training step, the other in the validation or testing step.
- Interpretation and evaluation - the validation or test data set is then used to test the classification performance of the methods in terms of efficiency and accuracy.

A time projection for the project is given in the next table.

Table 2. Project Time Table

Task Name                                    2004       2005
                                             S O N D    J F M A M J J A
- Literature review and research about gene expression data, support vector machine techniques and feature selection algorithms.
- Developing programs that automatically test machine learning algorithms for classification and prediction.
- Full-scale integration of the successful algorithms with large gene expression data sets.
- Dissemination of results through papers and communications at specific conferences.
- Evaluation of the applicability of the developed algorithms to other data sets.

4. References

[1] R. Burbidge, M. Trotter, B. Buxton, and S. Holden, Drug design by machine learning: support vector machines for pharmaceutical data analysis. Computers and Chemistry 26:5-14, 2001.
[2] M. Molla, M. Waddell, D. Page, and J. Shavlik, Using Machine Learning to Design and Interpret Gene-Expression Microarrays. AI Magazine 25:23-44, 2004.
[3] Z. Wang, Y. Wang, J. Lu, S. Kung, J. Zhang, R. Lee, J. Xuan, et al., Discriminatory Mining of Gene Expression Microarray Data. The Journal of VLSI Signal Processing 35:255-272, 2003.
[4] W. Dubitzky, M. Granzow, and D. Berrar, Data Mining and Machine Learning Methods for Microarray Analysis. In: S. M. Lin and K. F. Johnson (eds.), Methods of Microarray Data Analysis - Papers from CAMDA 2000. Kluwer Academic Publishers, Boston, 2001.
[5] P. S. Bradley and O. L. Mangasarian, Feature Selection via Concave Minimization and Support Vector Machines. In J. Shavlik, editor, Machine Learning: Proceedings of the Fifteenth International Conference (ICML '98), Morgan Kaufmann, San Francisco, California, 82-90, 1998.
[6] S. Cho and H. Won, Machine Learning in DNA Microarray Analysis for Cancer Classification. APBC 2003:189-198, 2003.
[7] B. Schölkopf and A. Smola, Learning with Kernels. MIT Press, Cambridge, Massachusetts, 2002.
[8] V. N. Vapnik, The Nature of Statistical Learning Theory, 2nd edition, Springer-Verlag, New York, NY, 1999.
[9] J. B. Tobler, M. N. Molla, E. F. Nuwaysir, R. D. Green, and J. W. Shavlik, Evaluating machine learning approaches for aiding probe selection for gene-expression arrays. Bioinformatics 18:164-171, 2002.
[10] T. Joachims, Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pp. 169-184, MIT Press, Cambridge, MA, 1999.
[11] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[12] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, England, 2000.
[13] Y.-J. Lee and O. L. Mangasarian, RSVM: Reduced Support Vector Machines, Proc. of the First SIAM International Conference on Data Mining, Chicago, April 5-7, 2001.

5. Impact on the Goal of CREU

The foremost goal of the CREU project is to encourage women and minorities to pursue graduate work and study in the field of computer science. This project will provide a realistic research experience for the two female undergraduates through active involvement in the planning, execution and interpretation of scientific research. Well-developed research projects can significantly enrich the educational experience of undergraduate students. Working on this research project, the students will be able to enhance their computer and programming skills, apply those skills to investigate scientific problems, learn how to formulate questions and problems, and participate in the discovery of new knowledge. A good research experience can foster an enthusiasm for lifelong learning and a desire to continue education beyond the baccalaureate.
Successful scientific instruction should develop in students a sense of wonder and curiosity about the world. The students will be exposed to both sides of scientific investigation: hypothesis testing and the development of theoretical explanations of observations. No science education is complete without research-related activities, technical writing and oral presentations.

Darcy Davis feels that this project will certainly support the goals of CREU. As a female undergraduate who intends to pursue graduate work in computer science, she will gain from this project a useful introduction to the practical applications of her studies for research, focusing on artificial intelligence. The project can potentially be a foundation for her senior thesis, and the mathematical concepts will be a good basis for the presentations she intends to make at this year's mathematics conferences.

Allison Hanuschak believes that this research project will introduce her to the world of graduate research and will be an exceptional opportunity to gain valuable research experience. In addition, she thinks that completing the CREU project will give her a distinct advantage for admission to the graduate school of her choice. She also hopes to encourage fellow female students by setting an example for them and being a positive role model for continuing study in computer science. All in all, this experience will be beneficial for her and will aid her in pursuing graduate study.

Both students intend to present this project at the 2005 YSU QUEST conference.

6. Student Activity and Responsibilities

Specific tasks for the two participating students will include: literature search and review, reading and discussing research articles, designing and implementing data mining and machine learning algorithms, data processing, data analysis and interpretation, summarizing and preparing results for presentations and publications, participation in YSU QUEST 2005, and writing the final report. The primary responsibility of the two students is to participate in all phases of the project: proposal, development, experiments, and dissemination. The students will be required to do weekly independent work and to
schedule team meetings. It is important that they work together as a team. The faculty advisor will meet with the students every other week. Email will be used for questions, announcements and document exchange.

7. Faculty Activity and Responsibilities

As faculty advisor for the proposed project, Dr. Alina Lazar will actively mentor the two students and continuously supervise their progress during the one-year period. She will meet with the students on a regular basis to guide their activities and answer their questions related to the project. Dr. Lazar has extensive experience in data mining, machine learning and artificial intelligence, and she has written several papers related to the subject of this proposal. Her knowledge will make this project an enjoyable research experience for the undergraduate students. The department will supply a small computer lab, and Dr. Lazar will provide the necessary software from funds previously obtained through the university. She guided the students on how to develop and write the present proposal, and she will help them with the final report and with the preparation of a conference paper. The overall guidance and mentoring will not be limited to this project; it will also provide insights about how to apply to and succeed in graduate school, about being a female computer scientist, and about the options available after graduate school.

8. Budget

For the proposed project we are requesting $2000 for the two participating female students. An additional $500 will be used to buy computer media, books and other materials necessary for the project. While working on the project, the students will be encouraged to apply for the Undergraduate Research Grant Award sponsored by Youngstown State University and for other scholarships.