ScienceDirect. A Framework for Clustering Cardiac Patient's Records Using Unsupervised Learning Techniques


Available online at www.sciencedirect.com

Procedia Computer Science 98 (2016) 368-373

The 6th International Conference on Current and Future Trends of Information and Communication Technologies in Healthcare (ICTH 2016)

A Framework for Clustering Cardiac Patient's Records Using Unsupervised Learning Techniques

Rao Muzamal Liaqat*, Bilal Mehboob^b, Nazar Abbas Saqib^c, Muazzam A Khan^d
{muzamal.liaqat14*, bilal.mehboob14^b, nazar.abbas^c, muazzamak^d}@ce.ceme.edu.pk
National University of Sciences and Technology (NUST), H-12, Islamabad, Pakistan

* Corresponding author. Tel: +92-51-222-9561; fax: +92-51-927-8257. E-mail address: muzamal.liaqat14@ce.ceme.edu.pk

Abstract

Today we are surrounded by large volumes of data from patients' health reports. In this paper we introduce a methodology for extracting useful information (patterns) from raw data using different unsupervised learning techniques. These hidden patterns help the practitioner understand the hidden relations (dependencies) among the data. With useful clustering we can predict hidden trends in patients. We use the correlation matrix followed by K-mean (fast) to extract interesting patterns as well as the patient state, which helps the practitioner treat the patient wisely. According to the nature of the data, heart patients can be categorized as normal, moderate, risk and critical. We apply different clustering algorithms and analyze the performance of each on a cardiac dataset. For this research we used a real dataset provided by AFIC (Armed Force Institute of Cardiology). The dataset consists of 1500 records with 36 attributes.

1877-0509 © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the Conference Program Chairs. doi:10.1016/j.procs.2016.09.056

Keywords: Clustering; data mining; Unsupervised Learning; K-Mean (fast)

1. Introduction

It is common practice that a patient comes to the doctor and, after routine procedures and tests, the doctor examines the patient and makes a diagnosis. As a result, a large amount of data remains unexplored in hospitals, which raises a significant problem in the healthcare domain. Certain questions then arise: how can we get useful information from the data, and is there any hidden relation in the data that reveals a specific pattern to practitioners so that they can make wise decisions? These questions can be answered by using data mining and machine learning algorithms to reveal the unseen or hidden patterns [1]. Nowadays we are surrounded by large datasets related to patient history [2]. However, the current databases of patients are not informative enough to extract useful information or to track a patient's disease [3]. It is believed that data mining techniques can extract a lot of hidden information by discovering hidden patterns and correlations among attributes.

Statistics is currently a very popular and commonly used technique for analyzing medical data. Researchers use different statistical tools and software to analyze the data and extract useful information [4]. In our work we use data mining algorithms, which we consider more reliable than statistical models; we also compute the performance of the different algorithms. Basically there are two types of algorithms used in data mining. The first is supervised learning (in which we have a labeled training dataset, e.g. SVM, Naïve Bayes). The second is unsupervised learning (in which we have no training dataset or label attribute, e.g. K-Mean, DBSCAN). The main focus of this paper is to extract hidden patterns and correlations among different attributes that will assist the practitioner in writing a wise and better prescription for heart patients. We use unsupervised techniques such as K-means, K-means (fast), DBSCAN and K-medoids to find hidden clusters and patterns for heart patients.

The remainder of the paper is organized as follows. Section 2 describes the literature review. Section 3 describes the methodology. A detailed analysis of the clusters and the performance of the results is carried out in Section 4. Conclusion and future work are detailed in Section 5.

2. Literature Review

In the literature a lot of work has been carried out on medical data analysis to discover hidden patterns and extract useful information from large data by applying data mining techniques [5]. Conventional information extraction relied on manual methods applied by professionals, which become impractical when the dataset grows in volume as well as in dimension. To deal with such data we need computing technologies [6]. In the medical domain most of the work has been carried out on cardiac image segmentation, feature extraction, pattern recognition and correlation [7, 8]. The decision tree is a widely used algorithm for mining hidden information and backtracking the root cause in medical data. A decision tree has a root node and leaf nodes; the leaf nodes represent concrete knowledge according to the label attribute. Commonly used decision tree algorithms for mining useful information are ID3, CHAID, Random Forest and Decision Stump [9]. Many intelligent systems have been developed to assist the practitioner in cardiac disease [10]. Researchers have used Naïve Bayes, ANN and decision trees to extract hidden patterns and correlations among attributes [11]. Our main focus is to process the data to get useful information and explore the hidden patterns. In this paper we use the dataset provided by AFIC (Armed Force Institute of Cardiology). The preprocessing steps and the performance of the different unsupervised learning algorithms are described in the methodology section.

3. Proposed Methodology

Our methodology for extracting hidden patterns and correlations among attributes in the context of cardiac data is shown in Fig 1.

Fig 1: Knowledge Discovery Process Model

The model is divided into 6 phases; each phase may involve certain inputs, outputs and operations. We explain each phase in detail.

3.1 Data Acquisition

Medical data is mostly available in the form of medical reports, lab reports and doctor reviews; all of these can be categorized as unstructured data [12]. We obtained the data in report form from the Armed Force Institute of Cardiology (AFIC). The raw data consists of 1500 records with 50 attributes. We then derive the target data from the raw data by applying feature selection on the basis of attribute weights and expert opinion.

3.2 Target Data (Attribute Selection)

Target data is the data of interest, mined from the raw data. We select the target attributes from the raw data by assigning weights to the attributes using the correlation matrix and the consensus of experts. The correlation operator applied to the cardiac patient data is shown in Fig 2.

Fig 2: Correlation Matrix

Fig 3: Weight Assigned by Correlation Matrix

The weight assigned to each attribute by the correlation matrix is shown in Fig 3. Using these weights and expert opinion we selected 16 attributes. We then extract the hidden patterns among these attributes using different data mining algorithms.

3.3 Preprocessed Data

In this step we make the data compatible with machine learning algorithms by applying preprocessing steps. The data usually contains missing values; we apply filtering to remove them so that more reliable results can be extracted. Our work in this paper concerns clustering (k-mean, DBSCAN, k-mean (fast), k-medoids). For this we have to convert nominal and polynomial data into numeric data, because k-mean does not work on such types of data. In the Report Category attribute we have the labels Normal, Moderate, Risk and Critical; these labels are replaced by the numeric values 0, 1, 2 and 3 respectively.

3.4 Transformed Data

Data transformation is carried out by running scripts on the data. It is closely related to data preprocessing steps such as data cleansing (in which we smooth the data by applying filtering to mitigate abrupt changes). Data reduction is also an important step in data transformation; it removes or excludes columns that are redundant or have zero effect on the overall results, as shown in Fig 4.
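The missing-value filtering and label replacement of Section 3.3, and the correlation-based weighting of Section 3.2, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the paper does not give the exact weighting formula, so the sketch assumes a weight equal to the absolute Pearson correlation of each attribute with the encoded Report Category. The column names, the helper functions and the toy records are hypothetical and are not taken from the AFIC dataset.

```python
from math import sqrt

# Label mapping stated in Section 3.3.
REPORT_CODES = {"Normal": 0, "Moderate": 1, "Risk": 2, "Critical": 3}


def encode_report(label):
    """Replace a Report Category label with its numeric code."""
    return REPORT_CODES[label]


def drop_missing(rows):
    """Filter out records that contain a missing (None) value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(vx * vy)


def correlation_weights(rows, target):
    """Assumed scheme: weight = |correlation| with the target column."""
    ys = [r[target] for r in rows]
    return {
        name: abs(pearson([r[name] for r in rows], ys))
        for name in rows[0]
        if name != target
    }


# Hypothetical toy records, for illustration only.
records = [
    {"BMI": 22.0, "LVEF": 60.0, "Report": "Normal"},
    {"BMI": 27.0, "LVEF": 45.0, "Report": "Moderate"},
    {"BMI": None, "LVEF": 50.0, "Report": "Moderate"},  # dropped: missing BMI
    {"BMI": 31.0, "LVEF": 35.0, "Report": "Risk"},
    {"BMI": 29.0, "LVEF": 25.0, "Report": "Critical"},
]

# Filter missing values, then encode the label as a number.
clean = [dict(r, Report=encode_report(r["Report"])) for r in drop_missing(records)]
weights = correlation_weights(clean, "Report")
for name, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {weight:.3f}")
```

In the paper the final choice of attributes also depends on expert opinion, so weights like these would only be one input to the selection.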

Fig 4: Transform Data to Exclude Column

3.5 Patterns/Models

This phase describes the hidden patterns extracted from the data. We explain the hidden patterns in the Result and Discussion section; before that, we make some assumptions for better understanding and visualization of the results. These assumptions follow universal standards and expert recommendations. Our data has a wide range of values in the BMI column. According to the standard we categorize BMI into four groups: below 18 (Underweight), 18 to 24 (Normal Weight), 25 to 30 (Overweight) and 31 onward (Obesity). Following expert recommendations we also divide the LVEF value into four groups for better understanding and visualization: below 30% is Very Critical, below 40% is Critical, below 50% is Risky and above 50% is categorized as Normal.

4. Result and Discussion

To extract the hidden information we apply K-mean (fast) clustering and connect it to the correlation matrix, followed by the data-to-similarity module, to understand the internal dependency among different attributes, as shown in Fig 5.

Fig 5: K-Mean (Fast) Implementation

Fig 6: BMI VS Report Category

4.1 Hidden pattern BMI VS Report Category

In this cluster we extract the hidden relation between two important attributes, BMI and Report Category. We assigned the four labels Overweight, Normal Weight, Underweight and Obesity for better understanding and visualization, as discussed in the Patterns/Models part. All persons with an underweight BMI value are categorized as Normal patients. The graph also shows that age is an important factor: all patients above 80 years are at risk, as shown in Fig 6.

Fig 7: BMI VS Report (Age)

Fig 8: Angiography VS Report

Fig 9: LV-Myocardium VS Report

Fig 10: LVEF VS Report

4.2 Performance Measurement of Different Algorithms

Table 1: Comparative Analysis of Algorithms

Criteria          K-Means     K-Means(fast)  K-Medoids    DBSCAN
Cluster Density   -6673.259   -6673.259      -46937.563   -91490.939
Cluster Distance  -2554.952   -2554.652      -78871.650   N/A
Davies Bouldin    -0.968      -2554.952      -6.810       N/A

From the results of the different algorithms shown in Table 1, we selected the K-Mean (fast) algorithm. K-Mean and K-Mean (fast) show similar behavior on the cluster density and cluster distance criteria. DBSCAN performs very poorly on the cluster distance and Davies Bouldin criteria. K-Means (fast) gives better results than the other three algorithms on the basis of the selection criteria.
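To make the Davies Bouldin criterion of Table 1 concrete, the following sketch clusters a small set of points with a plain Lloyd-style k-means and scores the result with the textbook Davies-Bouldin index. It is not the authors' tool chain or their K-Mean (fast) operator: the farthest-first seeding, the function names and the toy points are all assumptions made for illustration. The values in Table 1 are negative, which suggests the tool used by the authors reports these criteria under a different sign convention; in the textbook definition used here the index is non-negative and lower is better, so the output is not directly comparable with Table 1.

```python
from math import dist


def farthest_first(points, k):
    """Deterministic seeding (an assumed choice): start at points[0], then
    repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, iters=100):
    """Plain Lloyd-style k-means; returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # empty cluster keeps its centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


def davies_bouldin(points, centroids, labels):
    """Textbook Davies-Bouldin index; needs k >= 2 and distinct centroids."""
    k = len(centroids)
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        # Average distance of a cluster's members to its centroid.
        scatter.append(
            sum(dist(p, centroids[j]) for p in members) / len(members)
            if members else 0.0
        )
    worst = [
        max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


# Hypothetical toy points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
centroids, labels = kmeans(points, k=2)
score = davies_bouldin(points, centroids, labels)
print(labels, round(score, 4))
```

Running the same scoring function on the labels produced by each candidate algorithm gives one way to rank them on a common criterion, which is the kind of comparison Table 1 summarizes.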

5. Conclusion

In this paper we applied the K-mean (fast) algorithm (with K = 5, decided in consultation with a practitioner) along with the correlation and data-to-similarity modules to extract hidden patterns among different attributes. With the help of the correlation matrix and expert opinion we chose four attributes (LVEF, gender, LV_Myocardium and report category) from the list of attributes. We then plotted graphs to understand the hidden relation of each selected attribute with the cardiac patient report category. Fig 6 reveals that patients above 80 years, regardless of their BMI value, are mostly at the Risk level for heart failure. Figs 7 and 8 show that the critical situation in cardiac patients is dominant in males as compared to females. Fig 9 shows that, among moderate and critical cardiac patients, males are more affected than females. LV-Myocardium describes the state of the heart with respect to ischemic disease (which occurs due to inadequate blood supply to an organ of the body): when the value of LV-Myocardium is low the patient is categorized as normal, and a higher value indicates risk and critical behavior in cardiac patients, as shown in Fig 9. LVEF indicates how much blood the left ventricle pumps out with each contraction. If the value of LVEF is above 50 the patient is normal; otherwise we categorize the patient as abnormal or affected, as shown in Fig 10.

Acknowledgement

I am grateful to AFIC, Pakistan for providing the dataset for this research study. I am thankful to my HOD, Dr Shoab A Khan, for helping and guiding me during this work. I am also thankful to Dr Aqib Malik, RMO, EME College, for assisting me in this research.

References

1. K. Aziz, S. Aziz, Evaluation and Comparison of Coronary Heart Disease Risk Factor Profiles of Children in a Country with Developing Economy.
2. Abu Khousa, E.; Campbell, P., "Predictive data mining to support clinical decisions: An overview of heart disease prediction systems," Innovations in Information Technology (IIT), 2012 International Conference on, pp. 267-272, 2012.
3. Rao, R. B., Krishnan, S., & Niculescu, R. S. (2006). Data mining for improved cardiac care. ACM SIGKDD Explorations Newsletter, 8(1), 3-10.
4. Kajabadi, A., Saraee, M. H., & Asgari, S. (2009, October). Data mining cardiovascular risk factors. In Application of Information and Communication Technologies, 2009. AICT 2009. International Conference on (pp. 1-5). IEEE.
5. Giudici, P.: Applied Data Mining: Statistical Methods for Business and Industry, New York: John Wiley, 2003.
6. Wamiq M. Ahmed, (2008), Knowledge representation and data mining for biological imaging, Purdue University Cytometry Laboratories, Bindley Bioscience Center, 1203 W. State Street, West Lafayette, IN 47907, USA.
7. J.J. Sychra, D.G. Pavel, E. Olea, (1988), Classification Images Of Cardiac Wall Motion Abnormalities.
8. R. Bharat Rao, Glenn Fung, Balaji Krishnapuram, (2010), Mining Medical Images.
9. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, USA, 2011. http://docs.rapidi.com/files/rapidminer/RapidMiner_OperatorReference_en.pdf
10. Palaniappan, S., & Awang, R., Intelligent heart disease prediction system using data mining technique. IJCSNS International Journal of Computer Science and Network Security, Vol. 8, No. 8, 2008.
11. Ms. Ishtake S.H, Prof. Sanap S.A., Intelligent Heart Disease Prediction System Using Data Mining Techniques, International J. of Healthcare & Biomedical Research, Volume: 1, pp. 94-101, 2013.
12. Unstructured Data Mining: The Tools You Need to Dig the Deep Web, Posted February 13, 2013 @ 3:41 pm by Scott Raspa, http://www.ikanow.com/blog/02/13/unstructured-data-mining-digthe-deep-web