A Framework for Clustering Cardiac Patient's Records Using Unsupervised Learning Techniques

Available online at www.sciencedirect.com
Procedia Computer Science 98 (2016) 368-373
The 6th International Conference on Current and Future Trends of Information and Communication Technologies in Healthcare (ICTH 2016)

Rao Muzamal Liaqat*, Bilal Mehboob, Nazar Abbas Saqib, Muazzam A Khan
{muzamal.liaqat14, bilal.mehboob14, nazar.abbas, muazzamak}@ce.ceme.edu.pk
National University of Sciences and Technology (NUST), H-12, Islamabad, Pakistan
* Corresponding author. Tel: +92-51-222-9561; fax: +92-51-927-8257. E-mail address: muzamal.liaqat14@ce.ceme.edu.pk

Abstract
Today we are surrounded by large volumes of data from patients' health reports. In this paper we introduce a methodology for extracting useful information (patterns) from raw data using different unsupervised learning techniques. These hidden patterns help the practitioner understand the hidden relations (dependencies) among the data. With useful clustering we can predict hidden trends in patients. We use a correlation matrix followed by K-Means (fast) to extract interesting patterns as well as the patient state, which helps the practitioner treat the patient wisely. According to the nature of the data, we categorize heart patients as normal, moderate, risk and critical. We apply different clustering algorithms and analyze the performance of each on the cardiac dataset. For this research we used a real dataset provided by AFIC (Armed Forces Institute of Cardiology). The dataset consists of 1500 records with 36 attributes.

1877-0509 © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the Program Chairs. doi:10.1016/j.procs.2016.09.056

Keywords: Clustering; data mining; Unsupervised Learning; K-Means (fast)

1. Introduction
It is common practice that a patient comes to the doctor and, after routine procedures and tests, the doctor examines the patient and makes a diagnosis. As a result, a large amount of data remains unexplored in hospitals, which raises a significant problem in the healthcare domain. Certain questions then arise: how can we get useful information from the data, and is there any hidden relation in the data that reveals a specific pattern to practitioners so that they can make wise decisions? These questions can be answered by using data mining and machine learning algorithms to indicate the unseen or hidden patterns [1]. Nowadays we are surrounded by large datasets related to patient history [2]. However, current patient databases are not informative enough to extract useful information or to track a patient's disease [3]. It is believed that, by using data mining techniques, much hidden information can be extracted by discovering hidden patterns and correlations among attributes.

Statistics is currently a very popular and commonly used technique for analyzing medical data. Researchers use different statistical tools and software to analyze the data and extract useful information [4]. In our work we use data mining algorithms, which we consider more reliable than statistical models; we also compare the performance of different algorithms. There are two basic types of algorithms used in data mining. The first is supervised learning, in which we have a labeled training dataset (e.g. SVM, Naïve Bayes). The second is unsupervised learning, in which we have no training dataset or label attribute (e.g. K-Means, DBSCAN).

The main focus of this paper is to extract hidden patterns and correlations among different attributes that will assist the practitioner in writing a wise and better prescription for a heart patient. We use unsupervised techniques, namely K-Means, K-Means (fast), DBSCAN and K-Medoids, to find the hidden clusters and patterns for heart patients. The remainder of the paper is organized as follows. Section 2 reviews the literature. Section 3 describes the methodology. A detailed analysis of the clusters and the performance results is presented in Section 4. The conclusion and future work are given in Section 5.

2. Literature Review
A lot of work in the literature has been carried out on medical data analysis to discover hidden patterns and extract useful information from large data by applying data mining techniques [5]. Conventional information extraction relied on professionals' manual methods, which are of little use when the dataset grows in volume as well as in dimension. To deal with such data we need computing technologies [6]. In the medical domain most of the work has been carried out on cardiac image segmentation, feature extraction, pattern recognition and correlation [7, 8]. The decision tree is a widely used algorithm for mining hidden information and backtracking the root cause in medical data. A decision tree has a root node and leaf nodes; the leaf nodes represent concrete knowledge according to the label attribute. Commonly used decision tree algorithms are ID3, CHAID, Random Forest and Decision Stump, which are mostly used for mining useful information [9]. Many intelligent systems have been developed to assist the practitioner in cardiac disease [10]. Researchers have used Naïve Bayes, ANN and decision trees to extract hidden patterns and correlations among attributes [11]. Our main focus is to process the data to obtain useful information and explore the hidden patterns. In this paper we use the dataset provided by AFIC (Armed Forces Institute of Cardiology). The preprocessing steps and the performance of different unsupervised learning algorithms are described in the methodology section.

3. Proposed Methodology
Our methodology for extracting hidden patterns and correlations among attributes in the context of cardiac data is shown in Fig 1.

Fig 1: Knowledge Discovery Process Model

The model is divided into 6 phases; each phase may involve certain inputs, outputs and operations. We explain each phase in detail.

3.1 Data Acquisition
Medical data is mostly available in the form of medical reports, lab reports and doctor reviews; all of these can be categorized as unstructured data [12]. We obtained the data in report form from the Armed Forces Institute of Cardiology (AFIC). The raw data consists of 1500 records with 50 attributes. We then derive the target data from the raw data by applying feature selection on the basis of attribute weights and expert opinion.

3.2 Target Data (Attribute Selection)
Target data is the data of interest, mined from the raw data. We select the target attributes from the raw data by assigning weights to the attributes using a correlation matrix and the consensus of experts. The correlation operator applied to the cardiac patient data is shown in Fig 2.

Fig 2: Correlation Matrix
Fig 3: Weight Assigned by Correlation Matrix

The correlation matrix assigns a different weight to each attribute; the weight of each attribute is shown in Fig 3. Using the weights assigned by the correlation matrix together with expert opinion, we selected 16 attributes. We then extract the hidden patterns among these attributes using different data mining algorithms.

3.3 Preprocessed Data
In this step we make the data compatible with machine learning algorithms by applying preprocessing steps. The data usually contains missing values; we filter these out so that more reliable results can be extracted from the data. Our work in this paper concerns clustering (K-Means, DBSCAN, K-Means (fast), K-Medoids). For this we must convert nominal and polynominal (multi-valued nominal) data to numeric, because K-Means does not work on such types of data.
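The missing-value filter described above and the correlation-based weighting of Section 3.2 can be sketched together as follows. This is a minimal illustration, not the authors' tool chain: it assumes that an attribute's weight is the absolute Pearson correlation with the report category, and the helper names, column names and toy records are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)

def drop_incomplete(records):
    """Keep only records in which no attribute value is missing (None)."""
    return [r for r in records if all(v is not None for v in r.values())]

def correlation_weights(records, label):
    """Assumed weighting: |Pearson r| of each attribute against the label."""
    ys = [r[label] for r in records]
    return {
        name: abs(pearson([r[name] for r in records], ys))
        for name in records[0] if name != label
    }

# Hypothetical toy records; the AFIC data is not reproduced here.
records = [
    {"lvef": 60, "age": 45, "report": 0},
    {"lvef": 45, "age": 50, "report": 1},
    {"lvef": None, "age": 63, "report": 2},  # dropped: LVEF is missing
    {"lvef": 35, "age": 58, "report": 2},
    {"lvef": 25, "age": 70, "report": 3},
]
clean = drop_incomplete(records)
weights = correlation_weights(clean, "report")
ranked = sorted(weights, key=weights.get, reverse=True)
```

In this sketch `ranked` lists the attributes from highest to lowest weight; in the paper the final choice also depends on expert opinion, which no script can stand in for.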
In the Report Category attribute we have the labels Normal, Moderate, Risk and Critical; these labels are replaced by the numeric values 0, 1, 2 and 3 respectively.

3.4 Transformed Data
Data transformation is carried out by running certain scripts on the data. It is closely related to data preprocessing steps such as data cleansing, in which we smooth the data by applying filtering to mitigate abrupt changes. Data reduction is also an important step in data transformation; it removes or excludes columns that show redundant behavior or have zero effect on the overall results, as shown in Fig 4.
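The label encoding of Section 3.3 and the column exclusion of Section 3.4 can be illustrated with a short sketch. The 0 to 3 mapping is the one stated in the text; treating a column that holds a single value as "redundant" is an assumption made here for illustration, and the column names and records are hypothetical.

```python
# Label-to-number mapping stated in the text.
REPORT_CODES = {"Normal": 0, "Moderate": 1, "Risk": 2, "Critical": 3}

def encode_report(records, column="report_category"):
    """Return copies of the records with the report label replaced by its code."""
    return [{**r, column: REPORT_CODES[r[column]]} for r in records]

def constant_columns(records):
    """Names of columns that hold one and the same value in every record."""
    return [name for name in records[0]
            if len({r[name] for r in records}) == 1]

def exclude_columns(records, names):
    """Return copies of the records without the named columns."""
    drop = set(names)
    return [{k: v for k, v in r.items() if k not in drop} for r in records]

# Hypothetical toy records; column names are illustrative only.
records = [
    {"bmi": 22.0, "hospital": "AFIC", "report_category": "Normal"},
    {"bmi": 31.5, "hospital": "AFIC", "report_category": "Critical"},
    {"bmi": 27.0, "hospital": "AFIC", "report_category": "Risk"},
]
encoded = encode_report(records)
reduced = exclude_columns(encoded, constant_columns(encoded))
```

Both helpers return new lists rather than modifying their input, so the raw records stay available for comparison with the transformed ones.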

Fig 4: Transform Data to Exclude Column

3.5 Patterns/Models
This phase describes the hidden patterns extracted from the data. We explain the hidden patterns in the Result and Discussion section; before that, we make some assumptions for better understanding and visualization of the results. These assumptions follow universal standards and expert recommendations. Our data contains a range of values in the BMI column. According to the standard, we categorize BMI into four groups: below 18 (Underweight), 18 to 24 (Normal Weight), 25 to 30 (Overweight) and 31 onward (Obesity). Following expert recommendations, we also divide the LVEF value into four groups for better understanding and visualization: below 30% is Very Critical, below 40% is Critical, below 50% is Risky and above 50% is Normal.

4. Result and Discussion
To extract the hidden information we apply K-Means (fast) clustering and connect it to the correlation matrix, followed by the data-to-similarity module, to understand the internal dependencies among different attributes, as shown in Fig 5.

Fig 5: K-Mean (Fast) Implementation
Fig 6: BMI VS Report Category

4.1 Hidden pattern BMI VS Report Category
In this cluster we extract the hidden relation between two important attributes, BMI and Report Category. We assigned the four labels Overweight, Normal Weight, Underweight and Obesity for better understanding and visualization, as discussed in the Patterns/Models part. All persons with an underweight BMI value are categorized as Normal patients. The graph shows that age is an important factor: all patients above 80 years are at risk, as shown in Fig 6.

Fig 7: BMI VS Report (Age)
Fig 8: Angiography VS Report
Fig 9: LV-Myocardium VS Report
Fig 10: LVEF VS Report

4.2 Performance Measurement of Different Algorithms

Table 1: Comparative Analysis of Algorithms

Criteria           K-Means      K-Means (fast)   K-Medoids     DBSCAN
Cluster Density    -6673.259    -6673.259        -46937.563    -91490.939
Cluster Distance   -2554.952    -2554.652        -78871.650    N/A
Davies Bouldin     -0.968       -2554.952        -6.810        N/A

Based on the results of the different algorithms shown in Table 1, we selected the K-Means (fast) algorithm. K-Means and K-Means (fast) show similar behavior on the cluster density and cluster distance criteria. DBSCAN performs very poorly: it has the lowest cluster density in Table 1 and produced no result (N/A) for the cluster distance and Davies-Bouldin criteria. Overall, K-Means (fast) gives better results than the other three algorithms on the selection criteria.
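To make the clustering step and the Davies-Bouldin criterion concrete, the sketch below runs a plain Lloyd-style k-means and computes the standard Davies-Bouldin index on toy points. It is an illustration under stated assumptions, not the K-Means (fast) variant or the evaluation tool used in the paper: seeding from the first k points, the toy data and k = 2 are all choices made here. The standard index is non-negative, so the sketch is not expected to reproduce the sign convention of Table 1.

```python
import math

def kmeans(points, k, iterations=100):
    """Plain Lloyd-style k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are final
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def davies_bouldin(points, labels, centroids):
    """Standard Davies-Bouldin index (lower is better).

    Assumes every cluster is non-empty and no two centroids coincide.
    """
    k = len(centroids)
    scatter = []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        scatter.append(sum(math.dist(p, centroids[c]) for p in members) / len(members))
    total = 0.0
    for i in range(k):
        # Worst-case ratio of within-cluster scatter to centroid separation.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        )
    return total / k

# Hypothetical two-dimensional toy points forming two well separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = kmeans(points, k=2)
db = davies_bouldin(points, labels, centroids)
```

A small value of `db` indicates compact clusters that lie far apart, which is the property the Davies-Bouldin row of Table 1 is meant to compare across algorithms.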

5. Conclusion
In this paper we applied the K-Means (fast) algorithm (with K = 5, decided in consultation with a practitioner) along with the correlation and data-to-similarity modules to extract hidden patterns among different attributes. With the help of the correlation matrix and expert opinion we chose four attributes (LVEF, gender, LV-Myocardium and report category) from the list of attributes. We then plotted graphs to understand the hidden relation of each selected attribute with the cardiac patient report category. Fig 6 reveals that patients above 80 years, regardless of their BMI value, are mostly at the Risk level for heart failure. Figs 7 and 8 show that the critical state is more dominant in male cardiac patients than in female ones. Fig 9 shows that, among moderate and critical cardiac patients, males are more affected than females. LV-Myocardium describes the state of the heart with respect to ischemic disease (a disease caused by inadequate blood supply to an organ of the body): when the value of LV-Myocardium is low, patients are categorized as normal, while higher values indicate risk and critical behavior in cardiac patients, as shown in Fig 9. LVEF indicates how much blood the left ventricle pumps out with each contraction. If LVEF > 50 the patient is normal; otherwise we categorize the patient as abnormal or affected, as shown in Fig 10.

Acknowledgement
I am grateful to AFIC, Pakistan, for providing the dataset for this research study. I am thankful to my HOD, Dr Shoab A Khan, for helping and guiding me during this work. I am also thankful to Dr Aqib Malik, RMO, EME College, for assisting me in this research.

References
1. K. Aziz, S. Aziz, Evaluation and Comparison of Coronary Heart Disease Risk Factor Profiles of Children in a Country with Developing Economy.
2. Abu Khousa, E., Campbell, P., "Predictive data mining to support clinical decisions: An overview of heart disease prediction systems," Innovations in Information Technology (IIT), 2012 International Conference on, pp. 267-272, 2012.
3. Rao, R. B., Krishnan, S., & Niculescu, R. S. (2006). Data mining for improved cardiac care. ACM SIGKDD Explorations Newsletter, 8(1), 3-10.
4. Kajabadi, A., Saraee, M. H., & Asgari, S. (2009, October). Data mining cardiovascular risk factors. In Application of Information and Communication Technologies, 2009. AICT 2009. International Conference on (pp. 1-5). IEEE.
5. Giudici, P., Applied Data Mining: Statistical Methods for Business and Industry, New York: John Wiley, 2003.
6. Wamiq M. Ahmed (2008), Knowledge representation and data mining for biological imaging, Purdue University Cytometry Laboratories, Bindley Bioscience Center, 1203 W. State Street, West Lafayette, IN 47907, USA.
7. J.J. Sychra, D.G. Pavel, E. Olea (1988), Classification Images Of Cardiac Wall Motion Abnormalities.
8. R. Bharat Rao, Glenn Fung, Balaji Krishnapuram (2010), Mining Medical Images.
9. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, USA, 2011. http://docs.rapidi.com/files/rapidminer/RapidMiner_OperatorReference_en.pdf
10. Palaniappan, S. & Awang, R., Intelligent heart disease prediction system using data mining technique. IJCSNS International Journal of Computer Science and Network Security, Vol. 8, No. 8, 2008.
11. Ms. Ishtake S.H., Prof. Sanap S.A., Intelligent Heart Disease Prediction System Using Data Mining Techniques, International J. of Healthcare & Biomedical Research, Volume 1, pp. 94-101, 2013.
12. Unstructured Data Mining: The Tools You Need to Dig the Deep Web, posted February 13, 2013 @ 3:41 pm by Scott Raspa, http://www.ikanow.com/blog/02/13/unstructured-data-mining-digthe-deep-web