5. Evaluation after training

5.1 Evaluating classification



The purpose of training (or learning) is to adjust a neural network, that is, its weight values, so that it achieves better results. An objective (error, loss or cost) function evaluates the results of the network and produces a score. Normally, lower scores (error values) are better, so the optimization task is to minimize the error values given by the objective function. The weights can be optimized with various algorithms. The traditional backpropagation algorithm, as well as others, will be viewed in the following section.

Classification is a task in which a classifier attempts to assign an input case to one of several classes. In multi-label classification, a case can even be assigned to more than one class. For example, an electronic document might be classified under both politics and history if its contents cover both topics. The simplest way of evaluating classification results is to count either the correctly or the incorrectly classified items and to express the count as a percentage or as a ratio on a scale from 0 to 1.

Binary classification

In binary classification an input case is assigned to one of two alternative classes, for instance A or B, yes or no, correct or incorrect, or healthy or diseased.

Let us consider the example of a credit card company. The classification method has to decide how to respond to a new potential customer: a credit card can be issued or declined. When only two classes exist and incorrect decisions are counted, the score of the objective function is made up of the number of false positive predictions and the number of false negative predictions. The aim is, of course, to minimise the number of bad decisions.

In the credit card example, issuing a credit card is the positive decision. A false positive occurs when a credit card is issued to someone who turns out to be a high credit risk. A false negative occurs when a credit card is declined to someone who would have been a low risk. Because only two kinds of error exist, one has to decide which of them is the more serious. For most banks, a false positive is worse than a false negative: declining a potentially good card holder is better than accepting a card holder who would cause the bank expensive collection activities or even losses. The choice reflects the level of risk the bank is willing to accept.
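The counting of bad decisions described above can be sketched in a few lines. This is a minimal illustration, not a method given in the text: the label lists and the cost weights `fp_cost` and `fn_cost` are invented assumptions, chosen only to show how a bank might weigh a false positive more heavily than a false negative.

```python
# Hypothetical data: 1 = positive (card issued / customer is low risk),
# 0 = negative (card declined / customer is high risk).

def count_errors(actual, predicted):
    """Return (false_positives, false_negatives) for binary labels."""
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return fp, fn

def weighted_cost(fp, fn, fp_cost=5.0, fn_cost=1.0):
    """Total cost of the errors; the weights are assumptions, not from the text."""
    return fp * fp_cost + fn * fn_cost

actual    = [1, 1, 0, 0, 1, 0, 1, 1]   # what the customer really was
predicted = [1, 0, 1, 0, 1, 0, 1, 0]   # what the classifier decided

fp, fn = count_errors(actual, predicted)
cost = weighted_cost(fp, fn)
print(f"false positives: {fp}, false negatives: {fn}, weighted cost: {cost}")
```

With equal weights the two error types count the same; raising `fp_cost` expresses the bank's view that issuing a card to a high-risk customer is the worse mistake.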

Another example is a medical one, in which a disease is suspected. If it is found that a patient has the disease, this is a positive finding or case (even if it is not pleasant for anyone). If the patient fortunately does not have the disease, this is a negative case. Thus, a false positive is a patient who is diagnosed with a disease that he or she does not actually have, and a false negative is a patient who is not diagnosed although he or she does have the disease. In this example, a false negative decision is worse than a false positive. After additional examinations, a false positive case can be found to be actually negative. For a false negative, the disease might remain fatally undetected or the correct decision might be postponed.

Table 5.1 Possible alternatives of binary classification outcomes. The smaller FP and FN, or the greater TP and TN, in relation to n, the better the results.

Reality \ Prediction   Positive             Negative             Total
Positive               True positive, TP    False negative, FN   TP+FN
Negative               False positive, FP   True negative, TN    FP+TN
Total                  TP+FP                FN+TN                n = TP+TN+FP+FN

Different measures can be computed from classification results. Two basic ones are the true positive rate TPR and the true negative rate TNR, also called sensitivity and specificity:

$$TPR = \frac{TP}{TP + FN}, \qquad TNR = \frac{TN}{FP + TN} \qquad (5.1)$$

Typically, these are multiplied by 100 to obtain percentages.

The UCI Machine Learning Repository contains the Auto MPG data set. Its cars could be classified according to whether they were built in the USA. The field named origin gives the location of the car's assembly. Thus, a single output node would give a value indicating the probability that the car was built in the USA.
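The four outcome counts of Table 5.1 and the two rates in Eq. (5.1) can be computed directly from lists of actual and predicted labels. The sketch below assumes labels coded as 1 (positive) and 0 (negative); the ten example cases are invented for illustration.

```python
def binary_counts(actual, predicted):
    """Return (TP, FN, FP, TN) in the row order of Table 5.1."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 1 and p == 0:
            fn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def sensitivity(tp, fn):
    """True positive rate, TPR = TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate, TNR = TN / (FP + TN)."""
    return tn / (fp + tn)

# Hypothetical example: 4 actual positives followed by 6 actual negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, fn, fp, tn = binary_counts(actual, predicted)
tpr = sensitivity(tp, fn)
tnr = specificity(tn, fp)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"sensitivity = {100 * tpr:.1f}%, specificity = {100 * tnr:.1f}%")
```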

To make a prediction, one can recode the origin field to the extreme values of the activation function's range: 1 and 0 for the sigmoid activation function, or 1 and -1 for the hyperbolic tangent activation function. The neural network then outputs a value that corresponds to the possibility (loosely speaking, the probability) that a car was produced in the USA. Outputs close to 1 indicate a car originating in the USA, and outputs close to 0 (or -1) a car from outside the USA. We have to select a reasonable cutoff value that divides the predictions into USA and non-USA. If USA is 1.0 and non-USA is 0.0, we could take 0.5 as the cutoff value. A car with an output of 0.6 would then be classified as USA and a car with 0.4 as non-USA.

No doubt, this simple neural network will also make errors as it classifies cars. A USA-produced car might yield an output below the cutoff value, in which case the network would not classify the car correctly. Since the network was designed to recognise USA-produced cars, this error is called a false negative: the network gave a negative result (non-USA), and the result was false because the car was actually from the USA. This is also known as a type-2 error. Similarly, the network might incorrectly classify a non-USA car as USA. This error is a false positive or a type-1 error.

[Figure: node output range from 0.0 (false) to 1.0 (true), annotated "Better specificity" and "Better sensitivity".]

By the definitions in Eq. (5.1), neural networks prone to false positives have lower specificity, and networks prone to false negatives have lower sensitivity. It is possible to make a neural network more sensitive or more specific by adjusting the cutoff value. If cutoff 1 is moved left in Fig. 5.1(a), more cases are predicted positive and the network becomes more sensitive but less specific; if cutoff 2 is moved right in Fig. 5.1(b), fewer cases are predicted positive and it becomes more specific but less sensitive. (Only one of cutoffs 1 and 2 exists at a time.) In the former case the false negative (FN) area shrinks, so sensitivity grows, while the true negative (TN) area also shrinks, so specificity falls. In the latter case the false positive (FP) area shrinks, so specificity grows, while the true positive (TP) area also shrinks, so sensitivity falls.

Fig. 5.1(a) Sensitivity vs. specificity. (The output axis runs from 0.0 to 1.0 with cutoff 1 marked; the areas are labelled TN, TP, FN and FP.)
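The effect of the cutoff can be checked numerically. The following sketch uses invented network outputs for eight cars (1 = USA, 0 = non-USA) and assumes that an output equal to the cutoff counts as positive; neither the data nor that tie rule comes from the text. With the definitions of Eq. (5.1), lowering the cutoff raises sensitivity and lowers specificity, and raising it does the opposite.

```python
def classify(outputs, cutoff=0.5):
    """Turn network outputs into 0/1 predictions (assumption: output >= cutoff is positive)."""
    return [1 if o >= cutoff else 0 for o in outputs]

def rates(actual, outputs, cutoff):
    """Return (sensitivity, specificity) at the given cutoff."""
    predicted = classify(outputs, cutoff)
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp / (tp + fn), tn / (fp + tn)

# Hypothetical outputs: four USA cars followed by four non-USA cars.
actual  = [1, 1, 1, 1, 0, 0, 0, 0]
outputs = [0.9, 0.7, 0.6, 0.4, 0.55, 0.3, 0.2, 0.1]

for cutoff in (0.25, 0.5, 0.65):
    sens, spec = rates(actual, outputs, cutoff)
    print(f"cutoff {cutoff:.2f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

For these invented outputs, no single cutoff gives both rates the value 1, because one non-USA car (0.55) scores higher than one USA car (0.4).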

Fig. 5.1(b) Sensitivity and specificity oppose each other: if one increases, the other decreases, and vice versa. EER is the equal error rate. (The output axis runs from 0.0 to 1.0 with cutoff 2 marked; the areas are labelled TN, TP, FN and FP.)

Reaching 100% specificity or sensitivity is not normally reasonable. A medical test can reach 100% specificity simply by predicting that nobody has the disease. Such a test will never commit a false positive error, since it never gives a positive answer. Naturally, this would be useless and meaningless. In general, a very high specificity close to 100% means low sensitivity, and the other way round. A measure that summarises overall effectiveness in a single number is needed. One such measure is accuracy, or total prediction rate, A:

$$A = \frac{TP + TN}{TP + TN + FP + FN} \cdot 100\% \qquad (5.2)$$

The effectiveness of binary classifiers can be visualised with a receiver operating characteristic (ROC) curve, as in Fig. 5.2, which shows three different ROC curves. All ROC curves start at the origin and end in the upper-right corner, where the true positive (TP) and false positive (FP) rates are both 100%. The vertical axis indicates percentages from 0 to 100. Moving upward from the origin, TP increases and so does sensitivity, but specificity falls. A ROC curve makes it possible to select an appropriate sensitivity, and it shows the false positive rate that must be accepted to reach it.

Fig. 5.2 Three ROC curves, labelled "No predictive power", "Somewhat good predictive power" and "High predictive power". The vertical axis is the true positive rate (sensitivity) and the horizontal axis the false positive rate (100% - specificity), both up to 100%. The dashed line represents 50% accuracy, that is, random guessing. To reach 100% TP there, FP must also be 100%, so that all actual negatives are classified wrongly.
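A ROC curve can be traced by sweeping the cutoff over the network outputs and recording one (false positive rate, true positive rate) pair per cutoff. The sketch below is one simple way to do this, not a procedure taken from the text; the outputs are invented, and an output equal to the cutoff is assumed to count as positive. The accuracy function follows Eq. (5.2).

```python
def roc_points(actual, outputs):
    """Return (FPR, TPR) pairs, starting at (0, 0), for every distinct cutoff."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]  # cutoff above every output: nothing predicted positive
    for cutoff in sorted(set(outputs), reverse=True):
        tp = sum(1 for a, o in zip(actual, outputs) if a == 1 and o >= cutoff)
        fp = sum(1 for a, o in zip(actual, outputs) if a == 0 and o >= cutoff)
        points.append((fp / negatives, tp / positives))
    return points

def accuracy(tp, tn, fp, fn):
    """Accuracy A in percent, Eq. (5.2)."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)

# Hypothetical outputs: four positive cases followed by four negative cases.
actual  = [1, 1, 1, 1, 0, 0, 0, 0]
outputs = [0.9, 0.7, 0.6, 0.4, 0.55, 0.3, 0.2, 0.1]

points = roc_points(actual, outputs)
for fpr, tpr in points:
    print(f"FPR {100 * fpr:5.1f}%  TPR {100 * tpr:5.1f}%")

# At cutoff 0.5 these outputs give TP=3, TN=3, FP=1, FN=1.
print(f"accuracy at cutoff 0.5: {accuracy(3, 3, 1, 1):.1f}%")
```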

While training a feedforward network, the aim is to improve its performance, which pulls the ROC curve toward the upper-left corner in Fig. 5.2. To evaluate the total effectiveness of the network, the measure called area under the curve (AUC) can be used. For 100% accuracy it would equal 1, the area of the square between points (0,0) and (1,1) in Fig. 5.2. In practical use of ROC curves, only their middle parts are usually of interest. Sometimes, depending on the application, a particular point of the curve may be interesting: the equal error rate (EER) (Fig. 5.1(b)), where the false positive rate and the false negative rate are equal. Let us look at an example of ROC curves (although for classifiers other than neural networks) in Fig. 5.3.

Fig. 5.3 ROC curves [3] presented with true positive rates (TPR) and false positive rates (FPR) in percent for support vector machines (SVM) with linear, quadratic, polynomial of degree 3 and radial basis function (RBF) kernels.

[3] Y. Zhang and M. Juhola: On applying signals of saccade eye movements for biometric verification of a subject, 8th International Conference on Mass Data Analysis of Images and Signals (MDA 2013), Springer, pp.

Multiclass classification

As seen above, one output node suffices for binary classification, but two output nodes can also be applied. Then the greater output value determines the predicted class. The same winner-takes-all principle is also valid for multiclass classification with more than two classes, when the number of output nodes is equal to the number of classes. Fig. 5.4 shows an example network for the iris data set.

Fig. 5.4 The structure of a neural network for the three-class iris data set with four measurements or input variables. Inputs: sepal length, sepal width, petal length and petal width; one or more hidden layers; outputs: Setosa, Versicolour and Virginica.
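Two ideas from this part can be sketched briefly: the area under a ROC curve and the winner-takes-all rule. The AUC function below uses the trapezoidal rule, which is one common way to approximate the area and is an assumption here, since the text does not say how the area is computed. The ROC points and the output values of the three iris nodes are invented for illustration.

```python
def auc(points):
    """Trapezoidal area under ROC points given as (FPR, TPR) pairs sorted by FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def winner_takes_all(outputs, class_names):
    """Return the class whose output node has the greatest value."""
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    return class_names[best]

# Rates as fractions of 1: random guessing, a perfect classifier, and a hypothetical curve.
diagonal = [(0.0, 0.0), (1.0, 1.0)]
perfect  = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
example  = [(0.0, 0.0), (0.0, 0.75), (0.25, 1.0), (1.0, 1.0)]

print(auc(diagonal), auc(perfect), auc(example))

# Hypothetical outputs of the three output nodes in Fig. 5.4.
classes = ["Setosa", "Versicolour", "Virginica"]
predicted = winner_takes_all([0.1, 0.7, 0.2], classes)
print(predicted)
```

The diagonal gives an area of 0.5 and the perfect curve an area of 1, which matches the square between (0,0) and (1,1) mentioned above.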

Evaluation of multiclass classification

For more than two classes, the view in Table 5.1 is extended to cover all classes with a tabular representation called a confusion matrix or contingency table. In the confusion matrix every row shows the numbers for a single actual class, and the row sum is equal to the class size. The columns show the classification results or predictions. The greater the numbers on the diagonal, the better the results. The values on the diagonal are the true positives TP_k, k = 1, ..., c (classes), and dividing their sum by the total number of cases n gives accuracy A:

$$A = \frac{\sum_{k=1}^{c} TP_k}{n} \cdot 100\% \qquad (5.3)$$

Example: On the classification of (biological) bugs

Table 5.2 shows an example of the classification of five bug classes. The data are from Henry Joutsijoki's thesis [4], where he studied the classification of benthic macroinvertebrates (from their microscope images), small animals living at the bottom of a river. The environmental application in the background is the investigation of freshwater basins. The existence, absence and distributions of different benthic macroinvertebrate species (classes) can be used as biomarkers that express properties of water quality. The animals were caught, prepared and screened, and their classes defined, by hydrobiologists (Finnish Environment Institute, SYKE, in the Jyväskylä region), who produced the image data and extracted the feature (variable) values.

[4] H. Joutsijoki: Variations on a theme: The classification of benthic macroinvertebrates, University of Tampere.

Table 5.2 is a confusion matrix of five classes with the classification results given as counts (percentages are also used). Classification was carried out with the leave-one-out method, where each training set includes n-1 cases from the entire data and each test set one case only. This is the extreme form of cross-validation, in which the size of the training set is maximal. It is used for small data sets. The aim is to use the data maximally for training by building n models. This may be very time consuming, depending on n and the classification methods used. Fig. 5.5 shows samples from the five bug classes.

Table 5.2 Confusion matrix of five classes. Row and column labels: BAE, DIU, HEP, PEL, SIL.

The rows show the actual classes and the columns the predicted classes. The classes are BAE = Baetis rhodani, DIU = Diura nanseni, HEP = Heptagenia sulphurea, PEL = Hydropsyche pellucidulla and SIL = Hydropsyche siltalai. Decision trees were used for classification, which was performed on the basis of 24 features of geometrical and intensity-based types.
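A confusion matrix and the accuracy of Eq. (5.3) can be built from lists of actual and predicted class labels, and the leave-one-out splits can be generated as index lists. The sketch below reuses the five class abbreviations, but the eight labelled cases are invented and are not the results reported in the thesis.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: k for k, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy_from_matrix(matrix):
    """Accuracy A in percent: sum of the diagonal divided by the total n, Eq. (5.3)."""
    n = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return 100.0 * correct / n

def leave_one_out_splits(n):
    """Yield (training_indices, test_index) for each of the n models."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i

classes = ["BAE", "DIU", "HEP", "PEL", "SIL"]
# Hypothetical labels, for illustration only.
actual    = ["BAE", "BAE", "DIU", "HEP", "HEP", "PEL", "SIL", "SIL"]
predicted = ["BAE", "DIU", "DIU", "HEP", "PEL", "PEL", "SIL", "PEL"]

matrix = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(name, row)
acc = accuracy_from_matrix(matrix)
print(f"accuracy = {acc:.1f}%")

splits = list(leave_one_out_splits(3))
print(splits)
```

Each row of the printed matrix sums to the size of that actual class, as described for the confusion matrix above.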


Fig. 5.5 (a) Baetis rhodani (BAE), (b) Heptagenia sulphurea (HEP), (c) Hydropsyche pellucidulla (PEL), (d) Diura nanseni (DIU) and (e) Hydropsyche siltalai (SIL).

5.2 Evaluating regression

Mean squared error (MSE) is the most extensively used measure for evaluating regression tasks in machine learning.

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (5.4)$$

In Eq. (5.4), $y_i$ is the ideal (correct) output, $\hat{y}_i$ is the calculated output and n is the number of cases. MSE is equal to the mean of the squared individual differences. Other possible measures are mean absolute error (MAE), root mean squared error (RMSE) and sum of squared errors (SSE).
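The four regression measures named above differ only in how the individual differences are combined. The sketch below computes them for a small invented set of ideal and calculated outputs; the function names are illustrative and the data are not from the text.

```python
import math

def sse(y, y_hat):
    """Sum of squared errors."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))

def mse(y, y_hat):
    """Mean squared error, Eq. (5.4)."""
    return sse(y, y_hat) / len(y)

def rmse(y, y_hat):
    """Root mean squared error, in the same units as the outputs."""
    return math.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

# Hypothetical ideal outputs y and calculated outputs y_hat.
y     = [3.0, 5.0, 2.0, 7.0]
y_hat = [2.0, 5.0, 4.0, 8.0]

print(f"SSE  = {sse(y, y_hat)}")
print(f"MSE  = {mse(y, y_hat)}")
print(f"RMSE = {rmse(y, y_hat):.4f}")
print(f"MAE  = {mae(y, y_hat)}")
```

Because the differences are squared, MSE and RMSE penalise the single large error (2.0) more heavily than MAE does.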
