Classification of chestnuts with feature selection by noise resilient classifiers

Elena Roglia, Rossella Cancelliere, Rosa Meo
Università di Torino - Dipartimento di Informatica, corso Svizzera 185 - Italy

Abstract. In this paper we address the problem of classifying chestnut plants according to their place of origin. We compare the results obtained by state-of-the-art classifiers, including MLP, RBF, SVM, the C4.5 decision tree and random forest. We determine which features are meaningful for the classification, the classification accuracy these classifier families can achieve with the available features, and how robust the classifiers are to noise. Among the obtained classifiers, neural networks show the greatest robustness to noise.

1 Introduction

One of the main activities of botanic science is the classification of plants. As a typical pattern-recognition problem, it raises some basic issues: (1) which attributes, called features, should be taken from the botanists' descriptions and used for classification; (2) which classifiers should be used in order to obtain, with the available features, a high classification accuracy; and (3) to what extent the classification accuracy degrades if the features are affected by noise. These issues are discussed in this article.

We face the problem of predicting the origin of chestnuts from their properties: this problem has many important industrial applications, such as the production and verification of certificates of product origin. At first we worked with a few features related only to fruit characteristics. We compared the classification accuracy obtained with these features by several state-of-the-art classifiers: a multi-layer perceptron (MLP), introduced by D. Rumelhart et al. [2], a radial basis function network (RBF), a support vector machine [8], a decision model induced by the C4.5 algorithm [3] and a random forest (RF) presented by L. Breiman [4].

The extremely poor classification performance obtained (see Section 3) led us to perform an initial selection of the classifiers and to seek further information. In fact, the initial features were presumed inadequate because of the excessive variability of their values from fruit to fruit. Thus, we added to the description of each chestnut instance some features related to the entire plant, with the idea that they could constitute more robust predictors. The larger data set so obtained (soon available on line) contains 1600 samples, described by 37 features taken from both chestnut plants and fruits.

(Work supported by Regione Piemonte in the context of the project: Realizzazione di modelli informatici per la valorizzazione della qualità e la tracciabilità delle produzioni in specie da frutto coltivate in Piemonte, Cipe 2004.)

These features provide all the information needed to discriminate among the different classes. The best subset of these features must nevertheless be chosen, bearing in mind that botanic features are extracted, collected and stored in a data set by human agents. This process is lengthy, costly and error-prone. As a consequence, the number of features should be reduced as much as possible, but this reduction should not affect the classification performance. In addition, it is very important to investigate how a classifier responds when noisy inputs are presented. We show that the selected classifiers, especially the neural networks, are robust to the presence of an amount of noise consistent with the small-perturbation assumption (maximum error = 5% of the value of each feature).

This paper is structured as follows. In Section 2 we overview the selected classifiers: C4.5 decision trees and the random forest (the MLP is a widely known method and will not be reviewed). In Section 3 we discuss the feature selection strategy and present the experimental results, in both the non-noisy and the noisy case.

2 Overview of C4.5 decision trees and Random Forest

For the sake of completeness we introduce some of the basic characteristics of the adopted learning models. A complete overview can be found in [3, 4, 5].

Decision trees. A decision tree is a structure whose internal nodes apply a test condition based on the value of some of the record attributes. Each outcome of the test condition leads either to another internal node or to a leaf node, which contains a class value. That class value is the prediction of the decision tree for the data records that reach that final node. Thus, the outcomes of the attribute tests separate data records with different characteristics into disjoint partitions that are homogeneous in the class value. The induction step in C4.5 follows a greedy strategy that grows a decision tree by progressively partitioning the training data into smaller partitions until each of them is homogeneous in the value of the class label. The C4.5 algorithm induces the form of the decision tree, i.e., it chooses the test condition at each node t of the tree, by the following rule. Let S denote the set of data records that reach node t. Given c class labels, let p(i, S) denote the fraction of records in S that belong to class i. Any attribute test at node t is evaluated by the entropy of the class value in S, denoted by E(S):

    E(S) = - \sum_{i=0}^{c-1} p(i, S) \log_2 p(i, S)

Entropy is a measure of the impurity of the class in S. The best attribute test at node t is the one that yields the largest difference between the entropy at the parent node t (before the test condition) and the weighted entropy at the children nodes (after the test condition).
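For concreteness, the following is a minimal Python sketch (not the authors' implementation) of the entropy measure above and of the information gain that C4.5 uses to choose a test condition; the toy arrays at the end are purely illustrative.

    import numpy as np

    def entropy(labels):
        # E(S) = -sum_i p(i, S) * log2 p(i, S) over the class labels in S
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(labels, outcome):
        # Difference between the parent entropy and the weighted entropy of the
        # children induced by a candidate attribute test; 'outcome' gives, for
        # each record, the branch of the test it falls into.
        parent = entropy(labels)
        n = len(labels)
        children = 0.0
        for branch in np.unique(outcome):
            child = labels[outcome == branch]
            children += (len(child) / n) * entropy(child)
        return parent - children

    # Toy example: a binary test splitting 8 records drawn from 3 classes
    y = np.array([0, 0, 1, 1, 1, 2, 2, 2])
    test_outcome = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    print(entropy(y), information_gain(y, test_outcome))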

Random forest. Significant improvements in classification accuracy have resulted from a family of methods called ensemble methods. They consist in generating multiple base classifiers from the training data and then combining their predictions at test time. The random forest is a special ensemble learner, which is also suitable for problems involving a large number of features. In a random forest a large number of decision trees is grown, where each tree depends on the values of a random vector sampled independently and with the same distribution for all trees. Random vectors are generated using an ensemble method (called bagging) which randomly selects N training records, with replacement, from the original training set. Each tree in a random forest is grown at least partially at random, in one of the following ways: (1) randomness is injected by growing each tree on a different random subsample of the training data; (2) randomness is injected into the attribute test selection process, so that the test condition at any node is determined partially at random. When multiple trees are generated, their predictions are combined so that the most popular class among them is predicted. The technique of majority voting is usually adopted (where the vote may be weighted by giving more weight to the more accurate trees).

3 Experimental results

In this section we describe in more detail the feature selection method, the generation of the training and test sets, and the results obtained for the task of classifying chestnuts according to eight places of origin. The initial data set consisted of 19 features describing 1600 samples taken from fruits and was analysed using 10-fold cross validation. We compare the classification results of an MLP, an RBF, a binary decision tree (C4.5), a random forest (RF) and John Platt's sequential minimal optimization algorithm for training a support vector classifier (SMO). We used the default settings of the Weka classification tools [8].

    MLP      RBF      C4.5     RF       SMO
    58.12%   47.94%   49.81%   55.06%   52.50%

Table 1: Percentage of instances correctly classified.

Table 1 shows that the classification accuracy was extremely poor. After some attempts to optimize the models, we decided to add to the dataset more descriptive features related to the entire plant, in the hope that the resulting high number of features (37) could afterwards be reduced by feature selection. The initial results also suggested that we reduce the number of classifiers in the further investigations to MLP (chosen as the best of the neural methods), C4.5 (the classical and widely used decision tree method) and random forest (because of its robustness to noise and its scalability).
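To make the bagging-and-majority-voting scheme described above concrete, here is a minimal sketch (illustrative Python, not the Weka implementation used in the paper); X and y stand in for any training set stored as NumPy arrays with non-negative integer class labels.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def fit_forest(X, y, n_trees=50, seed=0):
        # Grow each tree on a bootstrap sample (N records drawn with replacement)
        # and randomize the attribute tests by limiting the features tried per split.
        rng = np.random.default_rng(seed)
        forest, n = [], len(y)
        for t in range(n_trees):
            idx = rng.integers(0, n, size=n)
            tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed + t)
            tree.fit(X[idx], y[idx])
            forest.append(tree)
        return forest

    def predict_forest(forest, X):
        # Majority voting: predict the most popular class among the trees.
        votes = np.stack([tree.predict(X) for tree in forest]).astype(int)
        return np.array([np.bincount(col).argmax() for col in votes.T])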

3.1 Feature selection

In feature selection the goal is to find a subset of significant attributes able to correctly predict unseen data and to reduce both human measurement errors and the cost of extracting the data from plants and fruits. Ranking the features is possible since a large number of feature evaluation measures is available (see, for instance, [6, 7] for a survey of some of them). In our experiments we tested on the whole dataset some methods commonly used for classification purposes, available in Weka. Some of them are filter methods, which select features on the basis of measures of feature predictivity and redundancy, such as Symmetrical Uncertainty (SU), the Chi-square statistic, Gain Ratio and Information Gain. Others are wrapper methods, based on the accuracy that some learner is able to reach on the data with the selected set of features, such as an attribute selector based on instance-based learners, an attribute subset evaluator based on any learner, and the OneR method based on simple rule-based classifiers. We verified that all of these methods agree on the selection of a unique core of relevant features, determined by applying the above-cited feature selection methods as ranking methods over the whole set of features. The selected core consists of the features that appear within the first 10 positions of the rankings. We noticed that it is exactly the feature set selected by the entropy-based information gain criterion. This criterion is commonly used by decision tree algorithms when they select which attribute will become the test attribute in a given branch of the tree. Thus, the information gain criterion was finally used to select the core of 6 relevant features among the 37 initial ones. They are: number of chestnuts per kg, diameter of the trunk, number of female inflorescences per ament, ament length, length of the leaf limb and height of the plant. We verified, by comparing classification performances, that no information content was lost in this process; on the contrary, classification performance improved because of the reduced redundancy in the instance description.

3.2 Classification performances in non-noisy datasets

The training set is a list of T = 1120 instances randomly chosen from the original data set (70% of the overall data set). The test set includes the remaining 480 instances. The training set was used to optimize the three classifiers. The random forest was built as a forest of decision trees grown on the 6 selected features, each trained on a different training set obtained by random selection of samples with replacement (the first option described in Section 2). The MLP classifier has 6, 12 and 8 neurons in the input, hidden and output layers respectively (one output neuron for each geographic zone). We optimized the number of hidden neurons and the most relevant parameters with respect to the Weka defaults, because with the defaults its performance was lower than that obtained with C4.5 and RF. The training phase required 100 iterations. The decision tree and the random forest correctly classify all test instances, while the neural network correctly classifies 97.91% of them.
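As an illustration of the information-gain ranking of Section 3.1 (not the Weka procedure itself), the sketch below ranks the columns of a feature table by their mutual information with the class, which coincides with information gain for discrete data, and keeps the top 6; the file name and the "origin" column are hypothetical, and all features are assumed numeric.

    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif

    df = pd.read_csv("chestnuts.csv")                  # hypothetical file with 37 features
    X, y = df.drop(columns=["origin"]), df["origin"]   # "origin" is the class label

    scores = mutual_info_classif(X, y, random_state=0)
    ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
    core_features = ranking.head(6).index.tolist()     # retained core of 6 features
    print(ranking.head(10))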
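Similarly, a minimal sketch of the 70/30 split and of the 6-12-8 multilayer perceptron of Section 3.2, using a synthetic stand-in for the chestnut data (the actual experiments were run in Weka):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in: 1600 samples, 6 features, 8 classes (places of origin)
    X, y = make_classification(n_samples=1600, n_features=6, n_informative=6,
                               n_redundant=0, n_classes=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=1120,
                                                        random_state=0)

    # 6 inputs, one hidden layer of 12 neurons, 8 outputs (one per class);
    # max_iter=100 mirrors the 100 training iterations reported in the paper.
    mlp = make_pipeline(StandardScaler(),
                        MLPClassifier(hidden_layer_sizes=(12,), max_iter=100,
                                      random_state=0))
    mlp.fit(X_train, y_train)
    print(f"MLP test accuracy: {mlp.score(X_test, y_test):.2%}")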

3.3 Classification performances in noisy datasets

We also evaluate the sensitivity of the three models to noise. A noisy test set was created by perturbing separately each attribute of every instance according to the following equation:

    i'[A] = i[A] + 0.05 \, \eta \, i[A]

where i[A] is the value of the attribute A in the i-th instance and \eta is a random value with -1 \le \eta \le 1. The three classifiers were then run on the perturbed data set, i.e. on the noisy version of the test set.

[Fig. 1: Accuracies on the test set. Fig. 2: Accuracy decrease (percentage of correctly classified instances, 82-100%, versus the number of features affected by noise, 0-6).]

Figure 1 shows the classification accuracy obtained by the different classifiers on the non-noisy and noisy versions of the test data. It is clear that without noise the decision tree and the random forest reach a slightly higher accuracy rate than the neural network. On the contrary, on noisy data the neural network maintains its good performance, while the decision tree and the random forest seriously degrade their previous results. We also performed a paired, one-tailed t test of the statistical significance of the difference in accuracy of the classifiers, conducted on the 480 test samples. The null hypothesis is that one classifier has a classification accuracy lower than or equal to the other one (mean difference \le 0). The observed differences between MLP and RF lead to a critical value t_c = 1.435, so that the null hypothesis is rejected with a p-value of 0.75%. For the difference in accuracy between MLP and C4.5 the critical value is t_c = 5.185, while it is t_c = 4.969 for the difference between RF and C4.5, both corresponding to a p-value of 10^{-5}%.

We also examined closely the behaviour of the classifiers as the number of noisy features in the data set increases. Figure 2 shows how the performance decreases when the number of noisy features increases from 0 to 6 for the decision tree, the random forest and the multilayer perceptron. The results show that the neural network is quite stable, since its class predictions are only marginally affected by the presence of noise. On the contrary, the decision tree and the random forest are more sensitive: although they reach higher accuracy on clean test data, their classification accuracy is proportionally more affected by an increasing presence of noise.
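A minimal sketch (illustrative Python, not the authors' code) of the perturbation above, applied to a chosen subset of feature columns; the array names are hypothetical.

    import numpy as np

    def perturb(X_test, noisy_cols, seed=0):
        # i'[A] = i[A] + 0.05 * eta * i[A], with eta drawn uniformly from [-1, 1]
        # independently for each instance and each perturbed attribute.
        rng = np.random.default_rng(seed)
        X_noisy = X_test.astype(float).copy()
        for col in noisy_cols:
            eta = rng.uniform(-1.0, 1.0, size=len(X_noisy))
            X_noisy[:, col] += 0.05 * eta * X_noisy[:, col]
        return X_noisy

    # e.g. perturb only the first 3 of the 6 selected features:
    # X_noisy = perturb(X_test, noisy_cols=[0, 1, 2])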
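The paired, one-tailed t test can be reproduced along the following lines (a sketch assuming per-instance 0/1 correctness vectors for two classifiers on the same 480 test instances; the variable names and dummy data are hypothetical):

    import numpy as np
    from scipy import stats

    def paired_one_tailed_t(correct_a, correct_b):
        # Tests H1: classifier A is more accurate than classifier B on the same
        # paired test instances (null hypothesis: mean difference <= 0).
        t_stat, p_two_sided = stats.ttest_rel(correct_a, correct_b)
        p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
        return t_stat, p_one_sided

    # Example with dummy correctness vectors over 480 instances:
    rng = np.random.default_rng(0)
    acc_a = rng.integers(0, 2, size=480)   # 1 = correctly classified, 0 = misclassified
    acc_b = rng.integers(0, 2, size=480)
    print(paired_one_tailed_t(acc_a, acc_b))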

In conclusion, when decision trees and random forests are used as predictive models in this particular domain, they are less robust than neural networks to the presence of noise. This is an important issue for a learner employed in a real environment, in which some features are commonly affected by noise or human error. The interested reader can find more experimental results in [1].

4 Conclusions

In this paper we compared the accuracy of classifiers for the classification of chestnuts according to their place of origin. We used state-of-the-art learners: decision trees, random forests, multilayer perceptrons, radial basis function networks and support vector machines. The results, in this particular domain, confirm the robustness of neural network classification techniques and their reliability in treating noisy data. Even though decision trees and random forests reach higher accuracy rates on clean test data, when noise is present they turn out to be less robust and stable. In this study we have also shown the importance of feature selection for the classification of botanic species. We applied several feature ranking methods. All of them agree on the selection of a core of 6 features (only 16% of the initial ones) as the most predictive and least redundant ones, which still allow comparable classification results to be obtained.

5 Acknowledgements

The authors thank IPLA and A. Ferrara, E. Viotto and F. Tagliaferro for the dataset collected for the INTERREG Project funded by Regione Piemonte.

References

[1] Elena Roglia, Rossella Cancelliere, and Rosa Meo. Classification of chestnuts with experiments on feature selection and noise. Technical Report 100-2007, available from http://www.di.unito.it/ meo/pubblist/pubblisteng.html, November 2007.
[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, pages 318-362, 1986.
[3] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.
[4] Leo Breiman. Random forests. Machine Learning, 45:5-32, 2001.
[5] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2005.
[6] M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis, 1(3), 1997.
[7] Ken McGarry. A survey of interestingness measures for knowledge discovery. Knowledge Engineering Review, 20(1):39-61, 2005.
[8] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2nd edition, 2005.