Childhood Obesity epidemic analysis using classification algorithms

Suguna. M
M.Phil. Scholar
Trichy, Tamilnadu, India
suguna15.9@gmail.com

Abstract

Obesity is one of the most serious public health challenges of the 21st century. Globally, there are more than one billion overweight adults, and the number is increasing threefold. In today's world, people are affected by many diseases, and one disease often leads to others. Obesity is such a condition. In this paper, several data mining classification algorithms are used to detect obesity and overweight conditions in children and to investigate their causes.

Keywords: Data mining, classification, childhood obesity, J48, CART, Naive Bayes

I. INTRODUCTION

Data mining is the process of extracting knowledge from large data sets. It is also called knowledge discovery in databases. It is performed as an iterative sequence of steps. Classification maps data into predefined groups or classes. It is often referred to as supervised learning because the classes are determined before the data are examined. A classification task begins with a data set in which the class assignments are known. Medical diagnosis is a classification process.

Fat is one of the main macronutrients. It plays an important role in protecting vital organs, helps maintain body temperature, and stores calories for future use. When stored in excess, however, it leads to obesity. Obesity is a serious worldwide health epidemic that affects one in four Americans. The phenomenon is global: about 30 million Indians are obese, and that number is expected to double in the next five years. Obesity is the condition that arises when body fat has accumulated to an extent that may have a negative effect on health. It also leads to many other diseases, such as diabetes, sleep apnea, heart disease, stroke, gout, gallbladder disease, high blood pressure, osteoarthritis, and even certain types of cancer.
Childhood obesity is one of the most serious public health challenges of the 21st century. Children and adolescents who are obese are likely to be obese as adults and are likely to have many health-related and psychological problems. Obesity has doubled in children and quadrupled in adolescents in past years.

The paper is organized as follows: Section II covers related works, Section III describes the data set, Section IV presents the proposed model, Section V reports the experimental results, and Section VI gives the conclusion.

II. RELATED WORKS

Adnan M.H.M. et al. [4] review various data mining methods and their use in predicting obesity, and point out the merits and demerits of particular data mining techniques. Adnan M.H.M. et al. [5] provide a framework for childhood obesity classification and prediction using NBTree. Abduh Elbanna et al. [6] apply data mining in gastroenterology and surgery, covering obesity-related gastrointestinal motility disorder (IBS) and surgical bariatric approaches. They used the Naive Bayes algorithm to obtain results, with the RapidMiner tool. Alex Aussem et al. [7] provide a framework for analyzing visceral obesity and its determinants in women using Bayesian networks. Archana Bhattarai et al. [8] use text classification to predict obesity and its co-morbidities. Various algorithms such as Naive Bayes, Support Vector Machine (SVM), and J48 were used, along with extraction. Seyed Taghi Heydari et al. [10] performed a comparative study on the detection of obesity using an Artificial Neural Network and Logistic Regression.
They found that the Artificial Neural Network gives better results than Logistic Regression. However, they noted that the neural network requires a customized classifier. Sivaranjani T. [11] provides a comparative study on predicting obesity with the KNN and ID3 algorithms. Using the RapidMiner tool, she found that the ID3 algorithm works efficiently and gives more accurate results than the KNN algorithm. Shaoyan Zhang et al. [12] compare logistic regression with six data mining techniques: decision trees (C4.5), association rules, Artificial Neural Networks (ANN), Naive Bayes, Bayesian networks, and Support Vector Machines (SVM). They concluded that SVM gives more accurate results than the other five data mining techniques. Sunita Soni et al. [13] discuss Case Based Reasoning (CBR). Three data mining techniques (Nearest Neighborhood, Decision Tree, and Bayesian Classification) were applied on distributed case bases for case retrieval and case adaptation. Their work discusses the use of data mining technology in [...]

RES Publication 2012 Page 22

III. DATA SET DESCRIPTION

The data were collected from the Child and Adolescent Health Measurement Initiative (CAHMI): DRC Indicator Dataset, 2007 National Survey of Children's Health, Data Resource Center for Child and Adolescent Health, www.childhealthdata.org. The objective of the data set is to analyze obesity among children aged 10 to 17. The data set consists of 12 attributes that are used to predict obesity. Table 3.1 describes these attributes together with the class (CLS) and Category fields.

Table 3.1 Data set description

No   Name of the attribute   Description
1    CID                     Children Id
2    Age                     Age of the children
3    Gender                  Gender of the children
4    Weight                  Weight (pounds)
5    Height                  Height (meters)
6    Sleep hour              Sleep time
7    Exercise time           Exercise time
8    Time on computer        Time spent on computer
9    Watching TV hours       Time spent watching TV
10   Depression              Depression level
11   Health status           Status of health
12   Body Mass Index         Value
13   CLS                     Class
14   Category                Category of the [...]

The attributes are either numerical or nominal. Category and Gender take nominal values, and the remaining attributes take numerical values.

IV. PROPOSED MODEL

The proposed method mainly uses decision trees to predict obesity from the instances of the given data set. The framework is shown in Table 4.1. In the proposed model, three decision tree algorithms (Simple Cart, J48, and NB Tree) are applied to the obesity data set in the WEKA tool, and their performance is measured.

Table 4.1: Proposed Framework

INPUT (COLLECT DB) -> DATA PREPROCESSING (DATA CLEANING) -> DECISION TREE ALGORITHM -> OUTPUT

4.1 Simple Cart

CART (Classification And Regression Tree) is a greedy algorithm used to build a binary decision tree: it chooses the locally best discriminatory feature at each stage of the process. As with ID3, entropy is used as a measure to choose the best splitting attribute and criterion. Unlike ID3, which creates a child for each subcategory, CART creates only two children. The split is made at what is determined to be the best split point. At each step, an exhaustive search is performed to determine the best split.
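The exhaustive search for the best binary split described above can be sketched as follows. This is a minimal illustration, not WEKA's Simple Cart code: the function names `entropy` and `best_binary_split` and the toy BMI values are assumptions made for the example. It scores splits with entropy to follow the text, although CART implementations commonly use the Gini index instead.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_binary_split(values, labels):
    """Try every candidate threshold on one numeric attribute.

    Returns (threshold, weighted_entropy) for the split `value <= threshold`
    whose two children have the lowest weighted entropy.
    """
    best_threshold, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical toy data, invented for illustration (not taken from the CAHMI data set).
bmi = [18.0, 20.0, 22.0, 26.0, 30.0, 32.0]
cls = ["Normal", "Normal", "Normal", "Obese", "Obese", "Obese"]
threshold, score = best_binary_split(bmi, cls)
print(threshold, score)
```

On this toy data the search settles on the midpoint between 22.0 and 26.0, because that threshold separates the two classes completely. A full tree builder would repeat the search over every attribute and then recurse on the two children.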
CART handles missing data by simply ignoring the affected record when calculating the goodness of a split on that attribute. The tree stops growing when no split will improve performance. CART also includes a pruning strategy.

4.2 J48 Algorithm

J48 builds decision trees from a set of labeled training data using the concept of information entropy. It uses the fact that each attribute of the data can be used to make a decision by splitting the data into smaller subsets. J48 examines the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. To make the decision, the attribute with the highest normalized information gain is used. The algorithm then recurses on the smaller subsets. The splitting procedure stops if all instances in a subset belong to the same class. A leaf node is then created in the decision tree that predicts that class. It can also happen that none of the features gives any information gain. In this case, J48 creates a decision node higher up in the tree using the expected value of the class. J48 can handle both continuous and discrete attributes, training data with missing attribute values, and attributes with differing costs. It also provides an option for pruning trees after creation.

4.3 NB Tree

NB Tree is a hybrid that builds a decision tree with Naive Bayes classifiers at the leaves. Naive Bayes classification is a type of Bayesian classification. It is a simple classification technique. It assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. It exhibits high accuracy and speed when applied to large databases. Its results are comparable to those of decision tree and neural network classifiers.

V. EXPERIMENTAL RESULTS

The three decision tree algorithms (Simple Cart, J48, and NB Tree) were applied to the obesity data set in WEKA. Their performance is reported in terms of the time taken to build the tree and the number of correctly classified instances.

Table 5.1 Time taken by the algorithms to build the decision tree

Name of the Algorithm   Time taken to build the decision tree
Simple Cart             0.03 seconds
J48                     0.01 seconds
NB Tree                 0.45 seconds

Fig 5.1 Performance of the Algorithms based on the time taken

Fig 5.2 Performance of the Algorithms based on the accuracy
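The normalized information gain that Section 4.2 attributes to J48 is usually called the gain ratio: the information gain of an attribute divided by the entropy of the attribute's own value distribution. The sketch below shows that calculation for a nominal attribute. It is an illustration under stated assumptions, not WEKA's J48 code: the helper names `entropy` and `gain_ratio` and the toy records are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(items)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(items).values())


def gain_ratio(attribute_values, labels):
    """Information gain of a nominal attribute divided by its split information."""
    total = len(labels)
    # Group the class labels by the attribute value of each record.
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    # Expected entropy of the class after splitting on the attribute.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    info_gain = entropy(labels) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = entropy(attribute_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records, invented for illustration.
gender = ["M", "M", "F", "F"]
exercises = ["yes", "no", "yes", "no"]
cls = ["Normal", "Obese", "Normal", "Obese"]

ratio_gender = gain_ratio(gender, cls)
ratio_exercise = gain_ratio(exercises, cls)
print(ratio_gender, ratio_exercise)
```

In this toy example the exercise attribute separates the classes perfectly while gender carries no information about the class, so a J48-style learner would split on exercise first.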

The decision tree model for the obesity data set is given by the tree structure in Fig 5.3.

Fig 5.3 Decision tree for obesity dataset (root attribute: Body Mass Index; leaves: Under weight, Normal, Over weight, Obese)

The decision tree algorithm predicts the class label. The final output is a set of patterns used to determine which class each child falls into: Underweight, Normal, Overweight, or Obese.

A confusion matrix is a useful visualization tool for analyzing classifier accuracy. Its structure is given in Table 5.2.

Table 5.2: Structure of the Confusion Matrix

TP   FP
FN   TN

where
TP is True Positive: obese children correctly diagnosed as obese.
FP is False Positive: normal children incorrectly identified as obese.
TN is True Negative: normal children correctly identified as healthy.
FN is False Negative: obese children incorrectly identified as healthy.

The confusion matrix obtained by running the decision tree classification algorithms (Simple Cart, NB Tree, and J48) in the WEKA tool is given in Table 5.3.

Table 5.3 Confusion matrix for Simple CART, NB Tree and J48

191    0    0    0
  0   34    0    0
  0    0   33    0
  0    0    0   18

From the time taken to build the decision tree, the accuracy, and the confusion matrix, we conclude that the J48 algorithm provides the best result.

VI. CONCLUSION

Data mining plays a vital role in mining large databases, and its use in uncovering hidden information is very valuable in medical mining. In the medical field, classification techniques have high utility. This experimental model was built as a test case on the training data set. The experiment was performed on the children's health training data set with several data mining classification algorithms, and the J48 algorithm was found to give the best performance, with 100% accuracy and the minimum time taken.
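The reported accuracy can be checked directly against the confusion matrix in Table 5.3. The sketch below is a minimal illustration: the helper names `accuracy` and `per_class_recall` are hypothetical, and it assumes that rows are actual classes and columns are predicted classes, which the source does not state.

```python
# Confusion matrix from Table 5.3.
# Assumption: rows are actual classes, columns are predicted classes.
confusion = [
    [191, 0, 0, 0],
    [0, 34, 0, 0],
    [0, 0, 33, 0],
    [0, 0, 0, 18],
]


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """For each actual class, the fraction of its instances predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


total_instances = sum(sum(row) for row in confusion)
overall = accuracy(confusion)
recalls = per_class_recall(confusion)
print(total_instances, overall, recalls)
```

Every off-diagonal cell is zero, so all 276 instances are classified correctly. Because the model was evaluated on the training data set, this figure describes fit to that data rather than performance on unseen children.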
It is believed that data mining can help obesity research and improve health care for people with overweight and obesity. The approach can also be implemented using several other classification techniques.

REFERENCE

[1] Han J., Kamber M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
[2] Margaret H. Dunham, Data Mining Techniques and Algorithms, Prentice Hall Publishers.
[3] Arun K. Pujari, Data Mining Techniques, University Press (India) Private Limited, 2001.
[4] Adnan, M.H.M., Husain, W., Rashid, N.A.A., "A survey on utilization of data mining for childhood obesity prediction," Information and Telecommunication Technologies (APSITT), 2010 8th Asia-Pacific Symposium on, pp. 1-6, 15-18 July 2010.
[5] Adnan, M.H.M., Husain, W., Rashid, N.A.A., "A framework for childhood obesity classifications and predictions using NBtree," Information Technology in Asia (CITA 11), 2011 7th International Conference, pp. 1-6, 12-13 July 2011.
[6] Abduh Elbanna, AbdElrazek M., Aly AbdElrazek, "Is Advanced Statistical Computing Technology a Clue in Applied Medicine? A Study using Data Mining as a Predictor Technology in Gastroenterology & Bariatric Surgery; Novel Elbanna Operations," Global Journal of Computer Science and Technology: Software & Data Engineering, Volume 13, Issue 12, Version 1.0, 2013.
[7] Aussem et al., "Analysis of lifestyle and metabolic predictors of visceral obesity with Bayesian Networks," BMC Bioinformatics, 2010, 11:487.
[8] Archana Bhattarai, Vasile Rus, Dipankar Dasgupta, "Classification of Clinical Conditions: A Case Study on Prediction of Obesity and its Co-morbidities," Department of Computer Science, The University of Memphis, 209 Dunn Hall.
[9] Lalitha Saroja Thota et al., "A Review on Information Technology in Obesity Epidemic: Prediction and Prevention," International Journal of Advanced Research in Computer Science and Software Engineering, 3(9), September 2013, pp. 775-780.
[10] Seyed Taghi Heydari et al., "Comparison of Artificial Neural Networks with Logistic Regression for Detection of Obesity," Springer, April 2011.
[11] T. Sivaranjani, "Comparative Study on Obesity Based on ID3 and KNN," International Journal of Advanced Research in Computer Science and Management Studies, vol. 2, Issue 9, pp. 389-396, September 2014.
[12] Shaoyan Zhang, C.T., Xiaojun Zeng, Hong Qiao, Iain Buchan, John Keane, "Comparing data mining methods with logistic regression in childhood obesity prediction," Information Systems Frontiers, 2009, 11(4): p. 51.
[13] Soni, S.; Pillai, J., "Usage of Nearest Neighborhood, Decision Tree and Bayesian Classification Techniques in Development of Weight Management Counseling System," Emerging Trends in Engineering and Technology, 2008. ICETET '08. First International Conference, pp. 691-694, 16-18 July 2008.

BIOGRAPHY

M. Suguna, M.Sc, B.Ed, M.Phil, completed the M.Phil at Bishop Heber College, Trichy (2014-2015). Paper published: "A Technical review on Obesity Analysis using classification algorithms," International Journal of Applied Engineering Research, ISSN 0973-4562, Vol. 10, No. 55 (2015).