AENSI Journals
Advances in Environmental Biology
ISSN-1995-0756  EISSN-1998-1066
Journal home page: http://www.aensiweb.com/aeb/

Using C4.5 Algorithm for Predicting Efficiency Score of DMUs in DEA

Babak Dalvand, Gholamreza Jahanshahloo, Farhad Hosseinzadeh Lotfi, Mohsen Rostami Malkhalife
Department of Mathematics, Science and Research Branch, Islamic Azad University, Tehran, Iran

ARTICLE INFO
Article history:
Received 26 September 2014
Received in revised form 20 November 2014
Accepted 25 December 2014
Available online 10 January 2015

Keywords: Data envelopment analysis; C4.5; Machine learning.

ABSTRACT
Data envelopment analysis (DEA), a non-parametric productivity analysis technique, has become an accepted approach for assessing efficiency in a wide range of fields. Despite its extensive applications, some features of DEA have remained unexploited. In this paper we aim to show that, by using the C4.5 algorithm, we can overcome one of DEA's shortcomings and predict the efficiency scores of DMUs.

2014 AENSI Publisher. All rights reserved.

To Cite This Article: Babak Dalvand, Gholamreza Jahanshahloo, Farhad Hosseinzadeh Lotfi, Mohsen Rostami Malkhalife, Using C4.5 Algorithm for Predicting Efficiency Score of DMUs in DEA. Adv. Environ. Biol., 8(22), 473-477, 2014.

Corresponding Author: Babak Dalvand, Department of Mathematics, Science and Research Branch, Islamic Azad University, Tehran, Iran. Tel: 00989166670133, E-mail: babak.dalvand.riazi@gmail.com

INTRODUCTION

Data envelopment analysis (DEA) is a linear programming based technique for measuring the relative performance of decision making units (DMUs) in the presence of multiple inputs and outputs. Initially introduced by Charnes et al. [1], DEA methods have subsequently been developed and extended [2]. In one of the many applications of the DEA method, the efficiency score has become a parameter for predicting financial failure. There are various financial failure prediction models in the related literature. Early research on financial failure prediction employed univariate approaches based on ratio analysis. Later, multivariate approaches (e.g., LMDA), multiple regression, and logistic regression were used to predict potential financial distress. Over roughly the last fifteen years, artificial intelligence and data mining techniques such as neural networks (NN) have been widely applied to corporate financial failure forecasting because of their universal approximation property and their ability to extract useful knowledge from large amounts of data and from domain experts.

Despite DEA's extensive applications, some of its features remain problematic. The most important shortcoming is that, although DEA is good at estimating the relative efficiency of a DMU, it only tells us how well we are doing compared with our peers, not compared with a theoretical maximum. Thus, in order to measure the efficiency of a new DMU, we have to build an entirely new DEA model with the data that have already been used; we cannot predict the efficiency score of the new DMU without another DEA analysis. In this paper we aim to show that, by using machine learning and in particular the C4.5 algorithm, we can predict the efficiency score of a new DMU.

The rest of the paper is structured as follows. Section 2 reviews the CCR model as a tool for measuring the efficiency scores of DMUs. Section 3 explains how a decision tree is built and how the C4.5 algorithm is used. Section 4 applies the C4.5 algorithm to predict efficiency scores for a real bank data set. Section 5 presents the conclusions of this research.
The classic DEA model:

Data envelopment analysis (DEA) is a linear programming-based method for evaluating the relative efficiency of a set of decision making units (DMUs). To find the efficiency scores of the DMUs we use the most common DEA model, the CCR model. Suppose that we have $n$ DMUs, and that DMU$_j$ ($j = 1, \dots, n$) uses $m$ inputs $x_{ij}$ ($i = 1, \dots, m$) to produce $s$ outputs $y_{rj}$ ($r = 1, \dots, s$). Then the (input-oriented) CCR model is given by the linear program

$$
\begin{aligned}
\max \quad & \sum_{r=1}^{s} u_r y_{ro} \\
\text{s.t.} \quad & \sum_{i=1}^{m} v_i x_{io} = 1, \\
& \sum_{r=1}^{s} u_r y_{rj} - \sum_{i=1}^{m} v_i x_{ij} \le 0, \qquad j = 1, \dots, n, \\
& u_r \ge 0, \quad v_i \ge 0, \qquad r = 1, \dots, s, \; i = 1, \dots, m,
\end{aligned}
\tag{1}
$$

where $x_{io}$ and $y_{ro}$ are, respectively, the $i$th input and the $r$th output of DMU$_o$, the unit under evaluation.
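As a rough illustration of how model (1) can be solved in practice, the following Python sketch evaluates the input-oriented CCR multiplier model with scipy.optimize.linprog. The data (five hypothetical DMUs with two inputs and one output) and the function name ccr_efficiency are our own assumptions for illustration; the paper itself provides no code.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: rows are DMUs, columns are inputs / outputs (not from the paper).
X = np.array([[20.0, 300], [30, 200], [40, 100], [20, 200], [10, 400]])  # m = 2 inputs
Y = np.array([[100.0], [80], [90], [60], [70]])                          # s = 1 output

def ccr_efficiency(X, Y, o):
    """Solve the input-oriented CCR multiplier model (1) for DMU o as an LP."""
    n, m = X.shape
    s = Y.shape[1]
    # Decision variables: [v_1, ..., v_m, u_1, ..., u_s]; linprog minimizes,
    # so we minimize -sum_r u_r y_ro.
    c = np.concatenate([np.zeros(m), -Y[o]])
    A_eq = np.concatenate([X[o], np.zeros(s)]).reshape(1, -1)
    b_eq = [1.0]                       # sum_i v_i x_io = 1
    A_ub = np.hstack([-X, Y])          # sum_r u_r y_rj - sum_i v_i x_ij <= 0, all j
    b_ub = np.zeros(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (m + s), method="highs")
    return -res.fun                    # efficiency score of DMU o

scores = [ccr_efficiency(X, Y, o) for o in range(len(X))]
print(np.round(scores, 3))
```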

To find the efficiency scores of all the DMUs we have to solve this LP exactly $n$ times. If a new DMU is added to the set of DMUs, at least one more LP must be solved to obtain its efficiency score. Solving an LP is burdensome and its implementation takes time. Now imagine an expansion plan in which, at each step, a new DMU is added to the previous DMUs. In this case, if one of the new DMUs turns out to be efficient, all of our previous calculations become useless, and to find the efficiency scores of the other DMUs we must compute the scores of all of them again. There are many applications in which a set of peer DMUs is frequently augmented with a new DMU, and the manager needs to know how much efficiency can be expected from the new unit rather than computing it exactly each time. In the next section we develop a new approach, based on the C4.5 algorithm, for predicting the efficiency score of a new DMU within a reasonable range. We applied our method to real data from 200 Iranian bank branches, and the results are satisfactory.

Decision trees and the C4.5 algorithm:

Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. A decision tree classifies instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Most algorithms that have been developed for learning decision trees are variations of a core algorithm that employs a top-down, greedy search through the space of possible decision trees. Ross Quinlan (1986) developed an algorithm of this kind, called ID3, and later improved several of its features, resulting in the C4.5 algorithm. In the pattern recognition setting, the goal is to learn how to classify objects by analyzing a set of instances whose classes are known. Since the classes of this instance set (the training set) are known, several algorithms can be used to discover how the attribute vector of an instance behaves, in order to estimate the classes of new instances. One way to do this is through decision trees. A decision tree is a directed graph showing the possible sequences of questions (tests), answers, and classifications. Each question splits the data set into two or more subsets. The method first chooses a subset of the training examples to form a decision tree. If, after construction, the tree does not give the correct answer for all the objects, a selection of the exceptions (incorrectly classified examples) is added to the examples and the process continues until a correct decision tree is found. The eventual outcome is a tree in which each leaf carries a class name, and each interior node specifies an attribute with a branch corresponding to each possible value of that attribute. A tree is thus either a leaf node labeled with a class, or a structure containing a test linked to two or more subtrees. To classify an instance, we take its attribute vector and apply it to the tree; the tests on these attributes lead to one leaf or another, completing the classification process, as shown in Fig. 1. We now explain how the C4.5 algorithm creates a decision tree. The C4.5 algorithm is based on the ID3 algorithm, which tries to find small decision trees.
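To make the classification process described above (Fig. 1) concrete, here is a minimal sketch of walking an instance from the root to a leaf. The toy tree, its attributes, and the classify helper are invented for illustration only and are not the tree produced in this paper.

```python
# Toy decision tree: interior nodes are (attribute, {value: subtree}) pairs,
# leaves are plain class labels. Attributes and classes are hypothetical.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rain": ("wind", {"strong": "no", "weak": "yes"}),
})

def classify(node, instance):
    """Walk from the root to a leaf, following at each interior node the
    branch that matches the instance's value for the tested attribute."""
    while isinstance(node, tuple):      # interior node: (attribute, branches)
        attribute, branches = node
        node = branches[instance[attribute]]
    return node                         # leaf: class label

print(classify(tree, {"outlook": "sunny", "humidity": "normal", "wind": "weak"}))  # -> "yes"
```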
C4.5 uses an information-theoretic approach that aims to minimize the expected number of tests needed to classify an object. The C4.5 algorithm uses the concepts of entropy and information gain to select the optimal split. Suppose that we have a variable $X$ whose possible values $x_1, x_2, \dots, x_k$ have probabilities $p_1, p_2, \dots, p_k$. What is the smallest number of bits, on average per symbol, needed to transmit a stream of symbols representing the observed values of $X$? The answer is called the entropy of $X$ and is defined as

$$H(X) = -\sum_{j=1}^{k} p_j \log_2(p_j). \tag{2}$$

Where does this formula for entropy come from? For an event with probability $p$, the amount of information, in bits, required to transmit the result is $-\log_2(p)$. For example, the result of a fair coin toss, with probability 0.5, can be transmitted using $-\log_2(0.5) = 1$ bit, which is a zero or a one depending on the result of the toss. For variables with several outcomes, we simply take a weighted sum of the quantities $-\log_2(p_j)$, with weights equal to the outcome probabilities, which yields formula (2). C4.5 uses this concept of entropy as follows. Suppose that we have a candidate split $S$, which partitions the training data set $T$ into several subsets $T_1, T_2, \dots, T_k$. The mean information requirement can then be calculated as the weighted sum of the entropies of the individual subsets:

$$H_S(T) = \sum_{i=1}^{k} P_i \, H_S(T_i), \tag{3}$$

where $P_i$ represents the proportion of the records of $T$ that fall in subset $T_i$. We may then define the information gain as $\mathrm{gain}(S) = H(T) - H_S(T)$, that is, the increase in information produced by partitioning the training data $T$ according to this candidate split $S$. At each decision node, C4.5 chooses the optimal split to be the split with the greatest information gain, $\mathrm{gain}(S)$.
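The following short Python functions are a direct transcription of the entropy formula (2) and of $\mathrm{gain}(S) = H(T) - H_S(T)$; the helper names and the tiny example labels are our own and only illustrate the calculation.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum_j p_j log2(p_j), estimated from the class labels in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """gain(S) = H(T) - sum_i P_i H(T_i), where `subsets` is the partition of
    `labels` induced by a candidate split S and P_i is the proportion in T_i."""
    n = len(labels)
    weighted = sum(len(t) / n * entropy(t) for t in subsets)
    return entropy(labels) - weighted

labels = ["yes", "yes", "no", "no", "yes", "no"]
split  = [["yes", "yes", "no"], ["no", "yes", "no"]]   # subsets T_1, T_2 for some split S
print(round(entropy(labels), 3), round(information_gain(labels, split), 3))
```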

Fig. 1: Simple example of a classification process.

Predicting the efficiency scores of bank branches using the C4.5 algorithm:

As we saw in Section 3, to construct a decision tree we need to ask a series of questions whose answers help us develop the tree. Each question represents one attribute, and the attributes represent characteristics of the problem. In this paper the goal attribute we want to predict is the efficiency score of the DMUs, and 200 bank branches are used as decision making units (DMUs). We apply the LP model (1) to find the efficiency score of each of these DMUs. As we know from the DEA literature, the efficiency score obtained from (1) is numeric and belongs to the interval $(0, 1]$. We use the Weka software [6] to apply the C4.5 algorithm and construct the decision tree. The C4.5 algorithm is basically designed for categorical attributes; for numeric attributes Weka performs the discretization itself (one difference between the ID3 and C4.5 algorithms), except for the goal attribute, which must be categorical, so a categorical attribute has to be introduced for it. In our study the goal attribute is the efficiency score, which is numeric, so we partition the efficiency scores into 10 categories and then apply C4.5: all DMUs whose efficiency score falls in the first interval belong to category 1, those in the second interval belong to category 2, and so on. For each DMU we then use its inputs, outputs, and efficiency score category as attributes. In our study there are three inputs and five outputs, so overall nine attributes are defined. Figure 3 shows a bank branch with its inputs and outputs. In the C4.5 algorithm, the information gain is first calculated for all of these attributes. The attribute with the largest information gain is placed at the root of the decision tree. The selected attribute is then removed from the list of attributes, and the information gain is recalculated for the remaining attributes, until the whole decision tree has been developed.

Fig. 3: A bank branch with its inputs and outputs.

The procedure is as follows:
Step 1: Use the LP (1) to find the efficiency scores of the 200 bank branches.
Step 2: Divide the efficiency scores into 10 groups (as described above).
Step 3: Use the inputs, outputs, and efficiency score group of the DMUs as the attributes for the C4.5 algorithm.
Step 4: Construct the decision tree and use it to predict the efficiency scores of other DMUs with the same attributes.

The results of our study show that a decision tree developed in this manner predicts the efficiency score categories of the DMUs correctly in most cases. After constructing the decision tree with the 200 bank branches (the training set), we can use it to predict the efficiency score category of any new bank branch that is added to the set of peer DMUs with the same inputs and outputs.
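The paper uses Weka's C4.5 implementation. As a hedged stand-in, the sketch below reproduces the workflow of Steps 1-4 with scikit-learn's DecisionTreeClassifier using the entropy criterion (a CART tree, not true C4.5) on purely synthetic data, so the resulting tree is only illustrative of the workflow; all array shapes, parameter values, and the random data are our assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Step 1 stand-in: synthetic (inputs, outputs) for 200 branches plus efficiency
# scores in (0, 1]. In the paper these scores come from solving LP (1).
features = rng.uniform(1, 100, size=(200, 8))     # 3 inputs + 5 outputs per branch
scores = rng.uniform(0.01, 1.0, size=200)         # efficiency score of each branch

# Step 2: divide the efficiency scores into 10 equal-width categories 1..10.
categories = np.clip(np.ceil(scores * 10).astype(int), 1, 10)

# Steps 3-4: train a decision tree (entropy criterion as a stand-in for C4.5)
# on the 200 branches, then predict the category of new branches.
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5)
tree.fit(features, categories)

new_branches = rng.uniform(1, 100, size=(10, 8))  # 10 new DMUs with the same attributes
print(tree.predict(new_branches))                 # predicted efficiency score categories
```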

We added 10 new bank branches and used the constructed decision tree to categorize the efficiency scores of these DMUs. The results obtained from the algorithm and the real values are illustrated in Fig. 2. In four cases the prediction is exactly correct, in five cases the gap between the predicted and the real category is acceptable, and only one case shows a large gap. All these results are quite acceptable in comparison with the direct method of calculating efficiency scores.

Fig. 2: Comparison of the predicted efficiency score category with the real one. The gray color shows the efficiency score categories of the 10 DMUs calculated by (1), and the black color shows the predicted efficiency score categories for the same DMUs obtained with the constructed decision tree.

Conclusions:

DEA is good at estimating the relative efficiency of a DMU: it can tell us how well we are doing compared with our peers, but not compared with a theoretical maximum. Thus, to measure the efficiency of a new DMU, the DEA analysis has to be repeated with the data of the previously evaluated DMUs, and the efficiency score of a new DMU cannot be predicted without doing so. In this paper we developed a method based on the C4.5 algorithm for predicting the efficiency score of a new DMU. Since the direct calculation of the efficiency score of a new DMU requires recomputing the efficiency scores of the previous DMUs, this research is worthwhile and will be useful for managers and practitioners in other fields who would like to estimate efficiency scores without actually solving any linear program. We hope that researchers will pay attention to this issue and apply other heuristic methods to data envelopment analysis.

REFERENCES

[1] Charnes, A., W.W. Cooper and E. Rhodes, 1978. Measuring the efficiency of decision making units. European Journal of Operational Research, 2: 429-444.
[2] Cooper, W.W., R.G. Thompson and R.M. Thrall, 1996. Introduction: extensions and new developments in DEA. Annals of Operations Research, 66: 3-45.
[3] Larose, D.T., 2005. Discovering Knowledge in Data: An Introduction to Data Mining. Wiley.
[4] Demyanyk, Y. and I. Hasan, 2009. Financial crises and bank failures: a review of prediction methods. Omega, doi:10.1016/j.omega.2009.09.007.
[5] Hong, H.K., S.H. Ha, C.K. Shin, S.C. Park and S.H. Kim, 1999. Evaluating the efficiency of system integration projects using data envelopment analysis (DEA) and machine learning. Expert Systems with Applications, 16: 283-296.
[6] WEKA, 2006. Weka data mining software. Available at http://www.cs.waikato.ac.nz/ml/weka/.
