PREDICTING PERFORMANCE OF CLASSIFICATION ALGORITHMS

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 6, Issue 2, February 2015, pp. 19-28
IAEME: www.iaeme.com/ijcet.asp, Journal Impact Factor (2015): 8.9958 (Calculated by GISI), www.jifactor.com

Firas Mohammed Ali (1), Dr. Prof. El-Bahlul Emhemed Fgee (2), Dr. Prof. Zakaria Suliman Zubi (3)
(1) B.Sc. IT Student, IT Department, Libyan Academy, Tripoli, Libya
(2) Supervisor, Computer Department, Libyan Academy, Tripoli, Libya
(3) External Guide, Sirte University, Sirte, Libya

ABSTRACT

Classification is the most commonly applied data mining method; it is used to develop models that can classify large amounts of data and predict the best performance. Identifying the best classification algorithm among all those available is a challenging task. This paper presents a comparative study of the performance of the most widely used classification algorithms, analyzed over different data sets. Three datasets from the University of California, Irvine (UCI) repository are compared across different classification techniques. Each technique is evaluated with respect to accuracy and execution time, and a performance evaluation is carried out with the selected classification algorithms. The WEKA machine learning tool is used to analyze the three data sets, applying the classification methods to the selected datasets and predicting the best performance results.

Keywords: Classification Algorithms, WEKA, LMT, Random Tree, Naive Bayes

I. INTRODUCTION

Nowadays a huge amount of data is being collected and stored in databases everywhere across the globe, and the tendency is for it to keep increasing year after year. It is not hard to find databases with terabytes of data (over 10^12 bytes) in enterprises and research facilities. Invaluable information and knowledge are hidden in such databases, and without automatic extraction methods it is practically impossible to mine them [1]. Over the years many algorithms have been created to extract what are called nuggets of knowledge from large sets of data. There are several different methodologies for approaching this problem; this paper focuses on classification. Classification is a data mining (machine learning) technique used to predict group membership for data instances. For example, you may wish to use classification to predict whether the weather on a particular day will be sunny, rainy or cloudy.

Popular classification techniques include decision trees and neural networks. Classification involves using a training set of data containing observations whose category membership is known, in order to identify which category a new observation should be placed in. Individual observations are analyzed into a set of explanatory variables, which may be categorical, ordinal, integer-valued or real-valued. Figure 1 shows the classification process.

II. PROBLEM DESCRIPTION

Classification consists of predicting a certain outcome based on a given input. In order to predict the outcome, the algorithm processes a training set containing a set of attributes and the respective outcome, usually called the goal or prediction attribute. The algorithm tries to discover relationships between the attributes that make it possible to predict the outcome. Next, the algorithm is given a data set it has not seen before, called the prediction set, which contains the same set of attributes except for the prediction attribute, which is not yet known. The algorithm analyses the input and produces a prediction, and the prediction accuracy defines how good the algorithm is. For example, in a medical database the training set would contain relevant patient information recorded previously, where the prediction attribute is whether or not the patient had a heart problem [2].

III. THE SELECTED CLASSIFICATION ALGORITHMS USED IN WEKA

These are the WEKA algorithms I chose to analyze, since they are implemented in the WEKA suite and ready to use directly. The decision to use the following algorithms was based on the efficiencies reported in the literature on data classification. I tried to pick at least one classifier from each of the major classifier groups, and ended up with the following; a short description of each is given below.

a) Naive Bayes
A very simple classifier that performs decently and is easy to implement regardless of the language used. The drawback is that it is not at the top when it comes to classifying instances correctly. This is not a big drawback, however, since it is quick both at constructing a classification model and at classifying data [3].

b) SMO
A sequential minimal optimization algorithm for training a support vector classifier. It has built-in support for handling multiple classes using pairwise classification [3].

c) KStar (K*)
Aha, Kibler and Albert describe instance-based learners of increasing sophistication: IB1 is an implementation of a nearest-neighbour algorithm with a specific distance function, and IB3 is a further extension that improves tolerance to noisy data by forgetting instances with a sufficiently bad classification history and keeping only instances with a good classification history. K* belongs to this instance-based (lazy) family, but uses an entropy-based distance function [4].

d) AdaBoostM1
A class for boosting a nominal-class classifier using the AdaBoost M1 method. Only nominal class problems can be tackled. It often dramatically improves performance, but sometimes overfits [4].

e) JRip
A decent classifier that performed acceptably, even though I had higher expectations of this rule learner given the reports in which it had been used. The drawback of this classifier is that it requires an extremely long time to construct a classification model for big data sets, to the point where it becomes impractical. For example, it required almost 54 hours to construct a classification model for data set B in chapter 4.2 of [3], while the Naive Bayes classifier managed to do the same in under 3 minutes [3].

f) OneR
A class for building and using a 1R classifier; in other words, it uses the minimum-error attribute for prediction, discretizing numeric attributes [4].

g) PART
A class for generating a PART decision list. PART uses the separate-and-conquer strategy: it builds a rule, removes the instances the rule covers, and continues creating rules recursively for the remaining instances. Where C4.5 and RIPPER perform global optimization to produce accurate rule sets, PART does not need to, and this added simplicity is its main advantage [4].

h) J48
An open-source implementation of the C4.5 algorithm that builds a decision tree using information entropy. When building the decision tree, C4.5 selects at each node the attribute that most successfully splits the set of samples, as measured by the difference in entropy (information gain) that the split generates [3].

i) LMT
A classifier for building logistic model trees, which are classification trees with logistic regression functions at the leaves. The algorithm can deal with binary and multi-class target variables, numeric and nominal attributes, and missing values [4].

j) Random Tree
A class for constructing a tree that considers K randomly chosen attributes at each node. It performs no pruning [4].

IV. DEVELOPMENT

This section discusses how a method to analyze data can be constructed and implemented, and how the different algorithms perform when classifying data. In theory, using a big data set to construct the classifier model will increase performance when classifying new data, since it is easier to construct a more general model and hence find a suitable match for our dataset. The optimal size of the data set used to construct the classifier model depends on a number of things, such as the size of the classification problem, the classifier algorithm used, and the quality of the data set. The goal was to see how well the different algorithms performed, not just by comparing the number of correct classifications, but also by looking into the time required to construct the classification model depending on the size of the input data and the number of features used, as well as the time required to classify a data set using the generated classification model. It would have been entirely possible to implement these algorithms as classifiers from scratch, since there is plenty of documentation describing them. The three data sets used in this study are taken from the UCI repository [5, 6]. A sketch of how such a comparison can be driven programmatically is given below.
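As an illustration only (not the code used in this study), the following minimal Java sketch builds one instance of each selected classifier through the WEKA API and measures model-construction time as described above; the ARFF file path is a placeholder, and default classifier settings are assumed.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.KStar;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.rules.JRip;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.LMT;
import weka.classifiers.trees.RandomTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BuildTimeComparison {
    public static void main(String[] args) throws Exception {
        // Placeholder path: any ARFF file whose class is the last attribute.
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // One instance of each classifier compared in this paper (defaults assumed).
        Classifier[] classifiers = {
            new NaiveBayes(), new SMO(), new KStar(), new AdaBoostM1(),
            new JRip(), new OneR(), new PART(), new J48(),
            new LMT(), new RandomTree()
        };

        for (Classifier c : classifiers) {
            long start = System.currentTimeMillis();
            c.buildClassifier(data);             // time to construct the model
            long buildMs = System.currentTimeMillis() - start;

            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(c, data);         // accuracy on the training set
            System.out.printf("%-12s  build: %6d ms  accuracy: %.2f%%%n",
                    c.getClass().getSimpleName(), buildMs, eval.pctCorrect());
        }
    }
}

With weka.jar on the classpath, this prints one line per algorithm, making the accuracy versus build-time trade-off discussed above directly visible.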

V. CLASSIFICATION USING WEKA: IMPLEMENTATION STEPS

Step 1. Open the WEKA application:
Start > All Programs > WEKA 3.7.11 > WEKA 3.7

Step 2. Load a dataset file:
Explorer > Open file > Local Disk (C:) > Program Files > Weka-3-7 > data > select the dataset file

Step 3. Build classifiers:
Classify > Choose > select the classifier name

Figure 1: Load a Dataset file

Figure 2: Building Classifiers

Step 4. Load the test option:
In the Classifier box just below the tabs, click the Choose button and select the C4.5 classifier: WEKA > Classifiers > Trees > J48.

Figure 3: Load the Test Option

VI. DATA SET INFORMATION

Three data sets are used here for predicting performance with the selected classification algorithms.

Table 1: German credit dataset information
Dataset      Instances   Attributes   Data Type
Credit-g     1000        21           String

Table 2: Ionosphere dataset information
Dataset      Instances   Attributes   Data Type
Ionosphere   351         35           Numeric

Table 3: Vote dataset information
Dataset      Instances   Attributes   Data Type
Vote         435         17           Nominal

VII. RESULTS AND DISCUSSIONS

To evaluate the performance of the selected tool on the given datasets, several experiments were conducted. For evaluation purposes, three test modes were used: training set mode, cross-validation mode and percentage split mode. At the end, the recorded measures were averaged. It is common to use 66% of the objects of the original database as a training set and the remaining objects as a test set; a sketch of these three test modes in code is given below.
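As a companion to the GUI steps above, the following sketch (an illustrative assumption, not the paper's original code; the file path and random seed are placeholders) drives the same three test modes from the WEKA Java API, using J48 as the example classifier.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TestModes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/credit-g.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // 1) Training set mode: evaluate on the same data used for training.
        J48 tree = new J48();
        tree.buildClassifier(data);
        Evaluation trainEval = new Evaluation(data);
        trainEval.evaluateModel(tree, data);
        System.out.printf("Training set:     %.2f%%%n", trainEval.pctCorrect());

        // 2) 10-fold cross-validation mode.
        Evaluation cvEval = new Evaluation(data);
        cvEval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.printf("Cross-validation: %.2f%%%n", cvEval.pctCorrect());

        // 3) Percentage split mode: 66% train / 34% test, as in this paper.
        Instances shuffled = new Instances(data);
        shuffled.randomize(new Random(1));
        int trainSize = (int) Math.round(shuffled.numInstances() * 0.66);
        Instances train = new Instances(shuffled, 0, trainSize);
        Instances test = new Instances(shuffled, trainSize,
                shuffled.numInstances() - trainSize);
        J48 splitTree = new J48();
        splitTree.buildClassifier(train);
        Evaluation splitEval = new Evaluation(train);
        splitEval.evaluateModel(splitTree, test);
        System.out.printf("66%% split:        %.2f%%%n", splitEval.pctCorrect());
    }
}

Note that evaluating on the training set typically yields optimistic accuracy relative to cross-validation or a held-out split, which is why the three modes are reported separately in the tables that follow.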

There are a few more variables to consider before making the final decision, but from the performance seen in the experiments, the proposed solution for researchers who need to classify structured data in their data sets is the Random Tree classifier. Random Tree is proposed instead of the other two candidates, AdaBoostM1 and LMT, because it managed to reach a positive classification percentage of 100% three times, whereas LMT's classification percentages were 75.90% and 77.06%. Some predictive performance accuracies are given as examples in Tables 4, 5 and 6, which show the best accuracy results (highlighted in red and blue in the original) for the percentage split, cross-validation and training set test modes on the three selected UCI data sets: the German credit, ionosphere and vote data sets [7].

Table 4: Comparison of classifiers using the German credit data set in percentage split mode

Table 5: Comparison of classifiers using the ionosphere data set in cross-validation mode

Table 6: Comparison of classifiers using the vote data set in training set mode

Table 7: Predictive performance on the credit-g dataset
Test Mode          Highest Accuracy
Training set       RandomTree
Cross-folds (10)   LMT
Percentage split   LMT

Table 8: Predictive performance on the ionosphere dataset
Test Mode          Highest Accuracy
Training set       RandomTree
Cross-folds (10)   LMT
Percentage split   AdaBoostM1

Table 9: Predictive performance on the vote dataset
Test Mode          Highest Accuracy
Training set       RandomTree
Cross-folds (10)   J48
Percentage split   AdaBoostM1

VIII. CONCLUSION

Figure 4: Tree analysis of the highest-performing algorithms

Classification is one of the data mining tasks applied in many areas, especially in medical applications. One reason for using this technique is to select the appropriate algorithm for each data type: there is no algorithm that is best for all classification domains. This paper's results offer a way to select the proper algorithm for a particular domain with respect to the test modes. On this basis, in my opinion, RandomTree and LMT are the best predictive-performance classifiers, coming out on top in this analysis. Future work will focus on combining the best classification techniques to improve performance further.

IX. ACKNOWLEDGEMENTS

I would like to thank my supervisor and external guide for their valuable suggestions and tips in writing this paper.

REFERENCES

1. http://www.tutorialspoint.com/data_mining/dm_classification_prediction.htm
2. Fabricio Voznika and Leonardo Viana, Data Mining Classification.
3. Lilla Gula and Robin Norberg, Information Data Management for the Future of Communication, 2013.
4. http://weka.sourceforge.net/

5. Ghazi Johnny, Interactive KDD System for Fast Mining Association Rules, Lecturer/Staff Developing Center, date of acceptance 8/6/2009.
6. Dr. Philip Gordon, Data Mining: Predicting Tipping Points, 2013.
7. Deepali Kharche, K. Rajeswari and Deepa Abin (SASTRA University), Comparison of Different Datasets Using Various Classification Techniques with WEKA, Vol. 3, Issue 4, April 2014.
8. Shravan Vishwanathan and Thirunavukkarasu K, Performance Analysis of Learning and Classification Algorithms, International Journal of Computer Engineering & Technology (IJCET), Volume 5, Issue 4, 2014, pp. 138-149, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
9. Prof. Sindhu P Menon and Dr. Nagaratna P Hegde, Research on Classification Algorithms and Its Impact on Web Mining, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 4, 2013, pp. 495-504, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
10. Nitin Mohan Sharma and Kunwar Pal, Implementation of Decision Tree Algorithm After Clustering Through Weka, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 358-363, ISSN Print: 0976-6367, ISSN Online: 0976-6375.

AUTHORS' DETAILS

Firas Mohammed Ali received his B.Sc. in Computer Science in 2010 from Sirte University. He is currently pursuing a Master's in Information Technology at the Libyan Academy. His research areas are data mining and artificial intelligence.

Dr. Prof. El-Bahlul Emhemed Fgee received his Ph.D. in Internetworking from the Department of Engineering Mathematics and Internetworking, Dalhousie University, Halifax, NS, in 2006. Dr. Fgee supervises students in network design and management. He worked as the Dean of the Gharyan High Institute of Vocational Studies from 2008 to 2012 and has published many research papers and technical reports in international journals and conference proceedings.

Dr. Prof. Zakaria Suliman Zubi received his Ph.D. in Computer Science in 2002 from Debrecen University in Hungary and has been an Associate Professor since 2010. Dr. Zubi has served his university in various administrative positions, including Head of the Computer Science Department (2003-2005). He has published, as author and co-author, many research papers and technical reports in local and international journals and conference proceedings.