Gene-Expression Microarrays Classification using Feature Selection and Support Vector Machines

Darcy Davis, Allison Hanuschak, Alina Lazar
Department of Computer Science and Information Systems
Youngstown State University

1. Introduction

Every living organism contains genetic material inside its cells, which is transmitted from one generation to the next. The genetic material in each cell is composed of nucleic acid (DNA). The DNA molecule is organized into segments called genes. An organism has the same genes in all its cells, but they can be in different states at different points in time. The genetic information stored in DNA may be transcribed into complementary RNA molecules, which in turn may be translated into proteins. Many complex human diseases, especially cancer, are correlated with abnormal functioning at this level.

After 1996, a new technology called DNA microarrays made it possible for researchers to assemble a global gene image of the cell. The study of gene-expression microarrays is a reasonably new development in biology that allows thousands of genes to be studied simultaneously. Each microarray is a silicon chip on which gene probes are aligned in a grid pattern. Measurements are made using fluorescent detection. The fact that a single microarray can now be used to measure all human genes has led to advances in the diagnosis and prognosis of diseases and in drug discovery [1].

However, the amount of data in each microarray is too large for manual analysis, since a single sample often contains measurements for around 10,000 genes. Because of this volume of information, producing results efficiently requires automated, computer-based analysis of the data. By using machine learning techniques [2, 3], the computer can be trained to recognize patterns that classify the microarrays biologically.
Assuming that half of the instances are from healthy patients and half are from patients who have a disease, especially cancer, machine learning algorithms can be used to find gene combinations that distinguish and separate the healthy patients from the sick ones. The main data analysis techniques [4] currently used in biomedical applications related to microarrays are classification, clustering and gene selection. In contrast with data sets from other fields, a typical microarray data set has a large number of genes (~10,000) and a small number of samples (~100). However, we can expect that not all genes carry relevant information for a particular classification task. The process of selecting only the important components is called gene selection or feature selection [5]. Supervised learning, or classification, algorithms can be used to classify and predict disease outcomes. Unlike classification, clustering does not use a tissue annotation as a decision attribute, and it is used to discover new biological classes.

Several machine learning algorithms [6] have previously been used to classify microarray data sets, including decision trees, Fisher linear discriminant analysis, nearest neighbor, neural networks, Bayesian networks and support vector machines. The supervised machine learning method known as the support vector machine (SVM) [7, 8] has been used to analyze preexisting microarray data sets and diagnose cancer. Since it is unreasonable to expect perfect diagnosis with limited knowledge of cancer, the goal is to optimize the correctness of the diagnosis by employing different methods for using and training the SVM. Applying machine learning algorithms to DNA microarray data sets is of major importance for future medical research related to gene expression analysis for disease classification and genotyping for diagnosis and drug discovery.

2. Specific Questions

The goal of our proposed research is to use supervised learning to classify and predict cancer or other diseases based on the gene expressions collected from microarrays. These microarrays give us information about the rate at which a certain gene is being expressed, in other words, the rate at which its DNA is being transcribed into RNA and then translated into the corresponding protein. Today, many public microarray data sets are freely available to analyze and use in our research. Table 1 summarizes the data sets that will be used in the present research.

Table 1. Publicly Available Microarray Datasets

Name                                                      URL
National Center for Biotechnology Information             http://www.ncbi.nlm.nih.gov/geo/
Stanford Microarray Database                              http://genome-www5.stanford.edu/
University of Pittsburgh Microarray Dataset Collection    http://bioinformatics.upmc.edu/help/upittged.html
Kent Ridge Bio-medical Data Set Repository                http://sdmc.lit.org.sg/gedatasets/datasets.html

Known sets of data will be used to train the machine learning protocols to categorize cancer patients according to their prognosis. The accuracy of the routines developed will then be tested against a separate set of known data. The outcome of this study will provide information about the efficiency of machine learning techniques, in particular SVM methods, in discovering patterns related to genetic disorders, and will also allow the identification of relevant types of gene expression. These could be abnormal expression rates for a particular gene, the presence or absence of a particular gene or sequence of genes, or a pattern of unusual expression across a gene subset. Subsequently, SVM methods with different parameters will be applied to identify the best ones in terms of accuracy, efficiency and fewest false positive outcomes [9]. It is envisioned that this will help guide physicians in determining the best treatment for a patient, for example the aggressiveness of the course of treatment.

3. Methods

Two of the most important and difficult problems in microarray data analysis relate to the dimensionality of the data and to noise. Because many data analysis techniques involve an exhaustive search over the object space, their running time is very sensitive to the size of the data.
In the case of microarrays, the solution is to reduce the search space vertically (in terms of genes) by using a feature selection method. The other problem is that errors occur during the actual data collection; these are referred to as noise in the data.

Supervised learning methods based on statistical learning theory, for classification and regression, provide good generalization and classification accuracy on real data. However, their inherent trade-off is their computational expense. Recently, support vector machines (SVMs) [10] have become a popular learning tool, since they map the input data into a higher-dimensional feature space in which the instances are linearly separable, thus increasing efficiency. In SVM methods, a kernel, which can be considered a similarity measure, is used to recode the input data. The kernel is associated with a map function Φ: its value for two instances equals the inner product of their images under Φ. Even though the mathematics behind the SVM is straightforward, finding the best choices for the kernel function and its parameters can be challenging when applied to real data sets. We will use the LIBSVM library developed by Chang and Lin [11].

The kernel function usually recommended [12] for nonlinear problems is the Gaussian radial basis function, because it resembles the sigmoid kernel for certain parameters and requires fewer parameters than a polynomial kernel. The kernel parameter γ and the parameter C, which controls the trade-off between the complexity of the decision function and the minimization of the training error, can be determined by running a two-dimensional grid search: values for the parameter pairs (C, γ) are generated over a predefined interval with a fixed step. The performance of each combination is computed and used to determine the best pair of parameters.

The non-sparse property of the solution leads to a very slow evaluation process. Thus, for microarray data sets, a data reduction [13] can be performed in terms of the genes, or features, of the data set considered.
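
As a minimal sketch of the two-dimensional grid search over (C, γ) described above, the following Python fragment uses scikit-learn, whose SVC class is built on LIBSVM. The synthetic data, the exponent ranges, the fixed step of 4 in the base-2 exponent, the use of cross-validated accuracy as the performance measure, and the helper name grid_search_rbf are all illustrative assumptions and are not taken from the proposal.

```python
import itertools

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def grid_search_rbf(X, y, log2_C, log2_gamma, folds=5, seed=0):
    """Score every (C, gamma) pair for an RBF-kernel SVM (hypothetical helper).

    Returns (best_C, best_gamma, best_accuracy, scores), where scores maps
    each (C, gamma) pair to its mean cross-validated accuracy.
    """
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    scores = {}
    for c_exp, g_exp in itertools.product(log2_C, log2_gamma):
        C, gamma = 2.0 ** c_exp, 2.0 ** g_exp
        # Scaling is refit inside each fold, so no test data leaks into training.
        model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C, gamma=gamma))
        scores[(C, gamma)] = cross_val_score(model, X, y, cv=cv).mean()
    (best_C, best_gamma), best_acc = max(scores.items(), key=lambda kv: kv[1])
    return best_C, best_gamma, best_acc, scores


# Synthetic stand-in for a microarray table: few samples, many features.
X, y = make_classification(
    n_samples=80, n_features=500, n_informative=20, random_state=0
)

log2_C = range(-5, 16, 4)      # exponents -5, -1, 3, 7, 11, 15 (assumed interval)
log2_gamma = range(-15, 4, 4)  # exponents -15, -11, -7, -3, 1 (assumed interval)

best_C, best_gamma, best_acc, scores = grid_search_rbf(X, y, log2_C, log2_gamma)
print(f"best C={best_C}, gamma={best_gamma}, CV accuracy={best_acc:.3f}")
```

With real microarray data, X would be the samples-by-genes table and y the tissue annotation. A second, finer grid around the best pair is a common follow-up, at the cost of more training runs.
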
Redundant or highly correlated features can be replaced with a smaller number of uncorrelated features that capture the information in the original set. This is done by applying principal component analysis (PCA) before running the SVM algorithm. The method is carried out by solving an eigenvector problem or by using iterative algorithms, and the result is a set of orthogonal vectors called principal components. The larger set is mapped into the new, smaller set by projecting the initial instances onto the principal components. The first principal component is defined as the direction of the best-fit line through the input data in the orthogonal least-squares sense; this direction holds the maximum variance in the input data. The second component is orthogonal to the first, uncorrelated with it, and defined to maximize the remaining variance. This procedure is repeated until the last vector is obtained.

The envisioned research will follow the main steps of the knowledge discovery process:

- Gene selection - the irrelevant attributes (genes) are removed and the selected data is represented as a two-dimensional table.
- Preprocessing - if the selected table contains missing values or empty cell entries, the table must be preprocessed in order to remove some of the incompleteness. Statistics should be run to obtain more information about the data.
- Training and validation sample - the initial table is divided into at least two tables by using a cross-validation procedure. One will be used in the training step, the other in the validation or testing step.
- Interpretation and evaluation - the validation or test data set is then used to test the classification performance of the methods in terms of efficiency and accuracy.

A time projection for the project is given in the next table.

Table 2. Project Time Table

Task Name    (schedule columns: 2004: S O N D; 2005: J F M A M J J A)
- Literature review and research about gene expression data, support vector machine techniques and feature selection algorithms.
- Developing programs that automatically test machine learning algorithms for classification and prediction.
- Full-scale integration of the successful algorithms with large gene expression data sets.
- Dissemination of results through papers and communications at specific conferences.
- Evaluation of the applicability of the developed algorithms to other data sets.

4. References

[1] R. Burbridge, M. Trotter, B. Buxton, and S. Holden, Drug design by machine learning; support vector machines for pharmaceutical data analysis. Computers and Chemistry 26:5-14, 2001.
[2] M. Molla, M. Waddell, D. Page, and J. Shavlik, Using Machine Learning to Design and Interpret Gene-Expression Microarrays. AI Magazine 25:23-44, 2004.
[3] Z. Wang, Y. Wang, J. Lu, S. Kung, J. Zhang, R. Lee, J. Xuan, et al., Discriminatory Mining of Gene Expression Microarray Data. The Journal of VLSI Signal Processing 35:255-272, 2003.
[4] W. Dubitzky, M. Granzow, and D. Berrar, Data Mining and Machine Learning Methods for Microarray Analysis. In: Lin, S.M., Johnson, K.F. (eds.) Methods of Microarray Data Analysis - Papers from CAMDA 2000, Boston. Kluwer Academic Publishers, 2001.
[5] P. S. Bradley and O. L. Mangasarian, Feature Selection via Concave Minimization and Support Vector Machines. In Machine Learning: Proceedings of the Fifteenth International Conference (ICML '98), J. Shavlik, editor, Morgan Kaufmann, San Francisco, California, 82-90, 1998.
[6] S. Cho and H. Won, Machine Learning in DNA Microarray Analysis for Cancer Classification. APBC 2003:189-198, 2003.
[7] B. Schölkopf and A. Smola, Learning with Kernels. MIT Press, Cambridge, Massachusetts, 2002.
[8] V. N. Vapnik, The Nature of Statistical Learning Theory, 2nd edition, Springer-Verlag, New York, NY, 1999.
[9] J.B. Tobler, M.N. Molla, E.F. Nuwaysir, R.D. Green, and J.W. Shavlik, Evaluating machine learning approaches for aiding probe selection for gene-expression arrays. Bioinformatics 18:164-171, 2002.
[10] T. Joachims, Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pp. 169-184, MIT Press, Cambridge, MA, 1999.
[11] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[12] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, England, 2000.
[13] Y.-J. Lee and O.L. Mangasarian, RSVM: Reduced Support Vector Machines, Proc. of the First SIAM International Conference on Data Mining, Chicago, April 5-7, 2001.

5. Impact on the Goal of CREU

The foremost goal of the CREU project is to encourage females and minorities to pursue graduate work and study in the field of computer science. This project will provide a realistic research experience for the two female undergraduates through active involvement in the planning, execution and interpretation of scientific research. Well-developed research projects can significantly enrich the educational experience of undergraduate students. Working on this research project, the students will be able to enhance their computer and programming skills, apply those skills to investigate scientific problems, learn how to formulate questions and problems, and participate in the discovery of new knowledge. A good research experience can foster an enthusiasm for lifelong learning and a desire to continue education beyond the baccalaureate.
Successful scientific instruction should develop in students a sense of wonder and curiosity about the world. The students will be exposed to both sides of scientific investigation: hypothesis testing and the development of theoretical explanations of observations. No science education is complete without research-related activities, technical writing and oral presentations.

Darcy Davis feels that this project will certainly support the goals of CREU. As an undergraduate female with intentions of pursuing graduate work in computer science, she will gain from this project a useful introduction to the practical applications of her studies for research, focusing on artificial intelligence. This project can potentially be a foundation for her senior thesis, and the mathematical concepts will be a good basis for the presentations she intends to make at this year's mathematics conferences.

Allison Hanuschak believes that this research project will introduce her to the world of graduate research and will be an exceptional opportunity to gain valuable research experience. In addition, she thinks that completing the CREU project will give her a distinct advantage for admission to the graduate school of her choice. She also hopes to encourage fellow female students by setting an example for them and being a positive role model for continuing study in computer science. All in all, this experience will be beneficial for her and will aid her in pursuing graduate study.

Both students intend to present this project at the 2005 YSU QUEST conference.

6. Student Activity and Responsibilities

Specific tasks for the two participating students will include: literature search and review, reading and discussing research articles, designing and implementing data mining and machine learning algorithms, data processing, data analysis and interpretation, summarizing and preparing results for presentations and publications, participating in YSU QUEST 2005, and writing the final report. The primary responsibility of the two students is to participate in all phases of the project: proposal, development, experiments and dissemination. The students will be required to do weekly independent work and to schedule team meetings. It is important that they work together as a team. The faculty advisor will meet with the students every other week. Email will be used for questions, announcements and document exchange.

7. Faculty Activity and Responsibilities

As faculty advisor for the proposed project, Dr. Alina Lazar will actively mentor the two students and continuously supervise their progress during the one-year period. She will meet with the students on a regular basis to guide their activities and answer their questions related to the project. Dr. Lazar has extensive experience in data mining, machine learning and artificial intelligence, and she has written several papers related to the subject of this proposal. Her knowledge will make this project an enjoyable research experience for the undergraduate students. The department will supply a small computer lab, and Dr. Lazar will provide the necessary software from funds previously obtained through the university. She guided the students on how to develop and write the present proposal, and she will help them with the final report and with the preparation of a conference paper. The overall guidance and mentoring will not be limited to this project; it will also provide insights about how to apply to and succeed in graduate school, about being a female computer scientist, and about the options available after graduate school.

8. Budget

For the proposed project we are requesting $2000 for the two participating female students. An additional $500 will be used to buy computer media, books and other materials necessary for the project. While working on the project, the students will be encouraged to apply for the Undergraduate Research Grant Award sponsored by Youngstown State University and for other scholarships.