Big Data Classification using Evolutionary Techniques: A Survey

Neha Khan (nehakhan.sami@gmail.com), Mohd Shahid Husain (mshahidhusain@ieee.org), Mohd Rizwan Beg (rizwanbeg@gmail.com)

Abstract
Data on the internet is increasing rapidly day by day. Automatically mining useful information from this massive data is a common concern for organizations that hold large datasets. To reduce future risk, valuable information can be extracted from the sentiment expressed in messages. Big Data has three main characteristics: Velocity, Volume and Variety. Such data can be classified using supervised, unsupervised and semi-supervised methods. Various algorithms and techniques have recently been proposed for clustering and classification of data and e-documents. In this report we discuss and compare widely used evolutionary techniques for big data classification.

Keywords: Big data, Genetic algorithm, Clustering, Neural networks, Swarm Intelligence, Co-evolutionary programming, Naïve Bayes, Decision trees.

I. INTRODUCTION
Data on the web has been increasing explosively over the past few decades. The ability to automatically mine useful information from massive data is a common concern for organizations that own large datasets. With the increasing availability of electronic documents and the rapid growth of the World Wide Web, automatic categorization of documents has become the key method for organizing information and for knowledge discovery. Proper classification of e-documents, online news, blogs, e-mails and digital libraries needs text mining, machine learning and natural language processing techniques to extract meaningful knowledge. Text mining studies have gained importance recently because of the increasing number of electronic documents available from a variety of sources.
The sources of unstructured and semi-structured information include the World Wide Web, governmental electronic repositories, news articles, biological databases, chat rooms, digital libraries, online forums, electronic mail and blog repositories. Therefore, proper classification and knowledge discovery from these resources is an important area of research.

II. CLASSIFICATION OF BIG DATA
Classification can be defined as the process of categorizing data into predefined categories. This can be achieved by various classification algorithms. Classification approaches for big data analysis include:

A. Binary Classification
This approach categorizes data into two categories, for example classifying the state of a machine as good or faulty.

B. Multi-class Classification
This approach categorizes data into more than two categories, for example the gene expression classification problem.

C. Document Classification
This is a type of multi-class classification in which the item to be classified is a text document.

III. BIG DATA CLASSIFICATION APPROACHES

A. Naive Bayes Algorithm
This is a probabilistic classification algorithm based on Bayes' theorem with a naïve assumption that the feature attributes are independent. If $C$ is the class variable and $F_1, F_2, \ldots, F_n$ are the feature variables, then the conditional probability is

$$P(C \mid F_1, \ldots, F_n) = \frac{P(F_1, \ldots, F_n \mid C)\, P(C)}{P(F_1, \ldots, F_n)}$$

Under the independence assumption the likelihood factorizes as $P(F_1, \ldots, F_n \mid C) = \prod_{i=1}^{n} P(F_i \mid C)$. Depending on the distribution assumed for the features, several variants of Naive Bayes are available, such as Multinomial Naive Bayes and Bernoulli Naive Bayes.

B. Decision Trees
Decision trees are a supervised learning method. The predictive model takes the form of a tree that predicts the value of a target variable from several attribute values. Each leaf in the decision tree represents a value of the target variable. The learning process involves recursively splitting on attributes until all the samples in a child node have the same value of the target variable. Information gain and the Gini index are the most popular metrics for choosing the best attribute to split on.

C. Random Forest
This is an ensemble learning method based on randomized decision trees. A random forest trains a number of decision trees and then takes a majority vote, that is, the mode of the classes predicted by the individual trees.

D. Support Vector Machine (SVM)
SVM is also a supervised machine learning approach, generally used for classification and regression. In its basic form it is a binary classifier: it determines the maximum-margin hyperplane that separates the two classes.

IV. CLUSTERING IN BIG DATA
Clustering and classification are fundamental tasks in data mining. Clustering is an unsupervised learning method, whereas classification is a supervised learning method. Clustering groups similar objects, so some measure is needed to determine whether two objects are similar or dissimilar. A valid distance measure should be symmetric and should attain its minimum value (usually zero) for identical objects. A distance measure that also satisfies the triangle inequality is called a metric. The distance between two objects can be measured in various ways, such as the Minkowski and Euclidean distance measures.

A. Minkowski Distance Measure
The distance between two data instances $x_i = (x_{i1}, \ldots, x_{in})$ and $x_j = (x_{j1}, \ldots, x_{jn})$ can be calculated using the Minkowski metric

$$d(x_i, x_j) = \left( |x_{i1} - x_{j1}|^g + |x_{i2} - x_{j2}|^g + \cdots + |x_{in} - x_{jn}|^g \right)^{1/g}$$

B. Euclidean Distance Measure
The Euclidean distance, the most commonly used measure of the distance between two objects, is obtained when $g = 2$. When $g = 1$, the sum of the absolute paraxial distances is obtained, and as $g \to \infty$ one gets the greatest of the paraxial distances. If each variable is assigned a weight according to its importance, a weighted distance should be used.

V. MODEL BASED CLUSTERING METHODS
These methods attempt to optimize the fit between the given data and some mathematical model. They also find a characteristic description for each group, where each group represents a concept or class.

Decision Trees: This method forms a hierarchical tree in which the data is represented by nodes and leaves. Each leaf refers to a concept and contains a probabilistic description of that concept. COBWEB is an algorithm that builds classification trees from unlabelled data. The goal is reached by taking decisions in yes/no form.

Neural Networks: This approach represents each cluster by a neuron. The model is inspired by the human brain. The input of the model is also represented by neurons, and each connection has a weight assigned to it. Learning takes place in a winner-takes-all fashion. The network has an input layer and an output layer, and between these two layers it has one or more hidden layers.

VI. EVOLUTIONARY TECHNIQUES IN BIG DATA

A. Swarm Intelligence
The first approach is swarm intelligence, which is inspired by the collective behaviour of insect swarms and bird flocks. A swarm is a group of agents that help each other to achieve some goal. The agents follow local rules to execute their actions, and the group as a whole achieves the objective. Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) are the most popular swarm intelligence algorithms.

Particle Swarm Optimization (PSO): PSO is a population-based search algorithm that starts with a population of random solutions, called particles. Every particle has a velocity and flies through the search space, with its velocity adjusted dynamically according to its past behaviour. Various neighbourhood topologies determine which particles of the swarm can influence an individual. The most common are gbest and lbest.
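As a minimal sketch of the particle update described above, the following Python code implements the gbest variant, in which every particle is attracted both to its own best position and to the best position found by the whole swarm. The objective function `sphere`, the inertia weight `w`, the acceleration coefficients `c1` and `c2`, and all other parameter values are illustrative assumptions; they are not taken from the surveyed work.

```python
import random


def sphere(x):
    """Toy objective (assumed for illustration): sum of squares, minimum at the origin."""
    return sum(v * v for v in x)


def pso_gbest(objective, dim=2, n_particles=20, iters=200,
              w=0.7, c1=1.5, c2=1.5, bound=5.0, seed=0):
    """Minimise `objective` with a gbest particle swarm (hypothetical parameter values)."""
    rng = random.Random(seed)
    # Start from a population of random solutions (particles) with zero velocity.
    pos = [[rng.uniform(-bound, bound) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia + pull towards own best + pull towards swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


best_pos, best_val = pso_gbest(sphere)
print(best_pos, best_val)
```

An lbest variant would replace `gbest` in the velocity update with the best position among each particle's immediate neighbours only.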
In the gbest swarm, the trajectory of each particle is influenced by the best particle in the entire swarm, whereas in lbest each particle is influenced by a smaller number of neighbours, typically two: one on the left and one on the right. Convergence in lbest is slower, but it has a greater chance of locating the global optimum. Various swarm topologies can be used depending on the problem to be solved.

Ant Colony Optimization (ACO): This algorithm is inspired by the behaviour of ant colonies. Ants are able to find the shortest path from their nest to a food source, and this idea is the basis of the algorithm. Initially the ants explore the area around the nest at random in search of food. While moving they deposit a chemical, called pheromone, on the ground, and they are guided by its smell. When an ant finds a food source, it evaluates the quantity and quality of the food and carries some of it back to the nest. The quantity of pheromone deposited during the return trip depends on the quantity of food the ant carries. This pheromone trail guides other ants as well. This indirect communication between the ants helps them find the shortest path from the nest to the food source.

Recommender systems and cascading classifiers are applications of PSO. The objective of this strategy is to reduce complexity and improve classification accuracy.

B. Genetic Programming
Genetic programming is a specific application of the genetic algorithm and is used to evolve computer programs. It selects the fittest members of a given population and then crosses and mutates them. Instead of representing solutions as fixed chromosomes, it represents them as strings that can grow arbitrarily long.

Genetic programming is basically an optimization method, that is, a search for the best solution among all the available solutions. It applies operators modelled on natural evolution to the generated programs until the best solution is found. Genetic programming is an evolutionary algorithm and is effective in the sense that it does not require the user to know the structure of the solution in advance; it solves the problem automatically. Genetic programming has been used to solve many classification problems, but its execution time is long because it is an iterative process. The MapReduce model, which is efficient and effective for big data computation, could be used for this algorithm. However, GP has not been implemented using this model so far. Genetic programming is considered an open-ended search technique that produces many different combinations of attributes. GP is very useful for classification and prediction tasks.

C. Genetic Algorithm
A genetic algorithm is an adaptive procedure based on Darwin's principle of survival of the fittest in nature. The genetic algorithm was proposed as a general model of adaptive processes. The aim is to find the optimum characteristic parameters using the mechanisms of genetic evolution and natural selection. It is a search algorithm used to generate solutions to optimization and search problems, and it can improve the accuracy of document classification. The basic idea is to maintain a population of chromosomes. Each chromosome has an associated fitness, which determines which chromosomes are used to form new ones in a competitive process called selection. The algorithm successfully solves problems for which an activation or reward score is available. Other genetic operators, such as mutation, are also applied to the offspring.
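The selection and mutation steps described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the method of any surveyed paper: chromosomes are bit strings, the fitness function simply counts the 1-bits (in a classification setting it would be replaced by something like validation accuracy), and the choice of tournament selection, single-point crossover, elitism and all parameter values is hypothetical.

```python
import random


def fitness(chrom):
    """Toy fitness (assumed): number of 1-bits in the chromosome."""
    return sum(chrom)


def tournament(pop, rng, k=3):
    """Selection: pick k random chromosomes and keep the fittest."""
    return max(rng.sample(pop, k), key=fitness)


def crossover(a, b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:]


def mutate(chrom, rng, rate=0.02):
    """Mutation: flip each bit independently with a small probability."""
    return [1 - g if rng.random() < rate else g for g in chrom]


def run_ga(length=20, pop_size=30, generations=60, seed=1):
    rng = random.Random(seed)
    # Maintain a population of chromosomes, initialised at random.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        elite = max(pop, key=fitness)
        children = [elite[:]]  # elitism: carry the current best over unchanged
        while len(children) < pop_size:
            child = crossover(tournament(pop, rng), tournament(pop, rng), rng)
            children.append(mutate(child, rng))
        pop = children
    return max(pop, key=fitness)


best = run_ga()
print(best, fitness(best))
```

Because the best chromosome is copied unchanged into each new generation, the best fitness in the population never decreases from one generation to the next.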
Applications of Genetic Algorithm
a) Genetic algorithms are an effective tool for pattern recognition in data mining.
b) Genetic algorithms have a wide range of uses in business; there are various domains in which GA can be applied.
c) Genetic algorithms can be used to automatically determine the optimal values of variables.
d) Genetic algorithms can be used in stock exchange data mining; one application is to find the best combination of values for each parameter.
e) Genetic algorithms are widely used for classification, clustering, feature selection and other data mining tasks.

D. Co-evolutionary Programming
Rule-based techniques are preferred over other classification techniques as they are more comprehensive. This algorithm is also used in discovering fuzzy classification rules. Two populations are evolved together, with a fitness function that involves the relationship with other individuals. The individuals of the two populations evolve either by competing against each other or by co-operating with each other. In the competitive approach, the fitness of an individual in one population is based on the fitness of individuals in the other population, whereas in the cooperative approach the fitness of an individual in one population reflects how well it cooperates with individuals in the other population.

E. Artificial Neural Network
Artificial neural networks are computational models that consist of a number of processing units that communicate with one another over a large network by sending signals. They are inspired by the human brain. In biological terms, a neuron collects signals from other neurons through dendrites. The most important feature of this approach is that it learns from examples, so explicit programming is not needed. ANN is a non-linear, data-driven, self-adaptive approach and a powerful tool for modelling.

Characteristics: This technique can be applied successfully to an extraordinarily wide range of problem domains. Artificial neural networks have the following characteristics:
a) They can be trained from examples.
b) They can predict new outputs from past ones.
c) They are robust systems.
d) They are designed to tolerate faults.
e) They can map input patterns to their output patterns.

Basics of Artificial Neural Network: A neural network consists of a number of neurons, each of which receives signals either from input cells or from other neurons. The network may be organized into many layers, in which the output of the preceding layer serves as the input to the subsequent layer. There are several types of artificial neural network architecture:

A) Feed Forward Network: Information flows in one direction, from the input layer to the output layer via the hidden layers. There is no loop in a feed-forward network.

B) Recurrent Network: In this architecture there is at least one feedback loop, that is, at least one feedback connection. There may also be neurons with self-feedback links.
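The difference between the two architectures can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the sigmoid activation, the layer sizes and every weight value below are hypothetical and chosen for illustration, and no training is performed.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer; weights[j] holds the incoming weights of neuron j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def feed_forward(x, layers):
    """Feed-forward pass: each layer's output is the next layer's input, with no loops."""
    for weights, biases in layers:
        x = dense(x, weights, biases)
    return x


def recurrent_step(x, h_prev, w_in, w_rec, biases):
    """Recurrent step: the new state depends on the input and on the fed-back previous state."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(w_in[j], x))
                    + sum(wr * hp for wr, hp in zip(w_rec[j], h_prev))
                    + biases[j])
            for j in range(len(biases))]


# Hypothetical weights: 2 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
out = feed_forward([1.0, 0.0], layers)

# A single recurrent neuron with a self-feedback link, fed the same input twice.
h1 = recurrent_step([1.0], [0.0], [[1.0]], [[1.0]], [0.0])
h2 = recurrent_step([1.0], h1, [[1.0]], [[1.0]], [0.0])
print(out, h1, h2)
```

With the same input presented twice, the recurrent neuron produces two different outputs because its previous state is fed back, whereas the feed-forward network always maps the same input to the same output.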

Types of Artificial Neural Network
There are various types of artificial neural network; the classes most important for solving real-world problems are discussed below.

a) Multilayer Perceptron
This is the most popular form of artificial neural network. It has any number of inputs and may have more than one hidden layer. It also has any number of outputs, with any activation function. Multilayer perceptrons are known as universal approximators. They are used when little is known about the relationship between inputs and outputs.

b) Radial Basis Function Neural Network
These are also feed-forward networks but, unlike the multilayer perceptron, they have only one hidden layer. They have any number of inputs and outputs. Radial basis function neural networks use exponential and softmax activation functions.

c) Kohonen Neural Network
These are also called Self-Organising Feature Maps (SOFM). They are quite different from the two networks above, because they are designed specifically for unsupervised learning. SOFM tries to learn the structure of the data and is used in exploratory data analysis and novelty detection. It has no hidden layer, only an input layer and an output layer.

VII. CONCLUSION
We have discussed big data and the concepts of classification and clustering in big data. Classification and clustering of big data play a vital role in efficiently accessing the data relevant to a user's needs. Evolutionary algorithms are attractive in data mining because their features and applications help overcome many drawbacks of conventional data mining techniques. We have also discussed the common evolutionary techniques used for classification and clustering of big data. After this brief description of the evolutionary techniques, we compare them on the basis of several attributes in Table 1.

VIII. FUTURE WORK
Big data is an emerging field. In this paper we have discussed various evolutionary techniques, their features and their applications. In future we may implement one of these techniques. The hope is to implement progressively better techniques that robustly resolve the drawbacks and find the best solution.

Table 1: Comparative analysis of the techniques

| Approach | Accuracy | Computation | Applications | Performance |
|---|---|---|---|---|
| Genetic Algorithm | Improved accuracy | Requires high computation | Used to generate useful solutions to optimization and search problems | Cannot effectively solve problems in which the only fitness measure is a single right/wrong measure |
| Genetic Programming | Accuracy of GP as a modelling approach depends on the evaluation measure used; it performs better on one evaluation measure at the cost of another | Large computation is required | Widely used to solve relatively simple problems and has produced outstanding results in many areas, such as quantum computing | Uses automatic induction of binary machine code to achieve better performance |
| Swarm Intelligence | Improved classification accuracy | High computation is required | Used in telecommunication networks, crowd simulation, etc. | Inefficient because of redundancy and the lack of central control |
| ANN | High degree of accuracy | Large computation is required | Used for developing predictive models in programming | Cannot reach optimum performance in several non-linear problems |
| Co-evolutionary | Robust prediction accuracy | High computation is required | Used for discovering fuzzy classification rules | Performance depends on selection and evolution and sometimes varies |
