Evaluating learning algorithms

AgroParisTech (based in part on Sebastian Thrun's CMU class and on Padraic Cunningham's tutorial at ECML-09)

Questions
Since induction is fallible, it is necessary to be able to assess its reliability. Typical questions:
- What is the true performance of my (learned) classification rule?
- Is my learning algorithm better than this other one?

Outline
1. Measuring the error rate
2. Confusion matrices and various performance criteria
3. The ROC curve

1. Estimating the true error rate
Evaluating classification rules

Various sets of data:
- the whole available data set;
- a large data sample;
- a very small data sample;
- an unlimited sample.

The available data are split into a learning set, a validation set and a test set.

Asymptotic behaviour (ideal case) vs. over-fitting (over-learning):
[Figure: error as a function of training time t. The training-set error ("erreur sur base d'apprentissage") keeps decreasing while the test-set error ("erreur sur base de test") reaches a minimum and rises again; the divergence signals over-fitting ("sur-apprentissage"), and the minimum marks the point at which to stop learning ("arrêt de l'apprentissage").]
Useful for very large data sets.
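As an illustration (not from the original slides), here is a minimal sketch of the three-way split, assuming scikit-learn; the toy data and the 60/20/20 ratios are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for "the whole available data set".
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# 60% learning set; the remaining 40% is split evenly into a
# validation set (for tuning) and a test set (for the final,
# unbiased error estimate).
X_learn, X_rest, y_learn, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
```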
Over-fitting (NNs)
[Figure: learning curves for 5,000 examples ("courbes pour 5 000 exemples").]

Why use a test set?
The control parameters of the learning algorithm (e.g., the number of hidden layers, the number of neurons, ...) are tuned so as to reduce the error on the validation set. Therefore, to obtain an estimate of the error that is not optimistically biased, one must measure it on an independent data set: the test set.

Evaluating the error rate
- True error (real risk):
  $e_D = \int \mathbf{1}[y \neq f(x,\alpha)]\, p(x,y)\, dx\, dy$, where $D$ is the true distribution.
- Test error (empirical risk):
  $\hat{e}_S = \frac{1}{m} \sum_{(x,y) \in T} \mathbf{1}[y \neq f(x,\alpha)]$, where $m$ is the number of test examples and $T$ the test data.
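The test error is just the misclassification frequency on T. A minimal sketch, assuming NumPy; the `empirical_error` helper and the 12-mistakes-out-of-40 toy arrays are illustrative:

```python
import numpy as np

def empirical_error(y_true, y_pred):
    """Empirical risk: fraction of misclassified test examples."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

# A test set of m = 40 examples on which the hypothesis makes 12 mistakes.
y_true = np.zeros(40, dtype=int)
y_pred = np.concatenate([np.ones(12, dtype=int), np.zeros(28, dtype=int)])
print(empirical_error(y_true, y_pred))  # 0.3
```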
Example: confidence intervals
- We want to estimate the true error, $\text{error}_D(h)$.
- The learned hypothesis incorrectly classifies 12 out of 40 examples in the test set T.
- Q: What is the true error rate?
- A: ??? (we can only bound it)

We estimate it using $\text{error}_T(h)$, which follows a binomial law with mean $\text{error}_D(h)$ and standard deviation $\sqrt{\text{error}_D(h)(1-\text{error}_D(h))/m}$. These are approximated using the normal law, with mean $\text{error}_T(h)$ and standard deviation $\sqrt{\text{error}_T(h)(1-\text{error}_T(h))/m}$.

Confidence intervals (the normal law)
With probability N%, the true error $\text{error}_D(h)$ lies in the interval
$$\text{error}_T(h) \pm z_N \sqrt{\frac{\text{error}_T(h)(1-\text{error}_T(h))}{m}}$$

N%  | 50%  | 68%  | 80%  | 90%  | 95%  | 98%  | 99%
z_N | 0.67 | 1.00 | 1.28 | 1.64 | 1.96 | 2.33 | 2.58
Confidence intervals (cf. Mitchell, 1997)
If T contains m examples sampled independently, with m ≥ 30, then with probability 95% the true error $e_D$ lies within
$$\hat{e}_S \pm 1.96 \sqrt{\frac{\hat{e}_S(1-\hat{e}_S)}{m}}$$

Example: the learned hypothesis incorrectly classifies 12 out of 40 test examples in T.
Q: What will be the true error on unseen examples?
A: Here m = 40, $\hat{e}_S = 12/40 = 0.3$, and $1.96\sqrt{\hat{e}_S(1-\hat{e}_S)/m} \approx 0.14$, so with 95% confidence the true error lies within [0.16; 0.44].

[Figure: performance curves with 95% confidence intervals — test error ("erreur de test") and training error ("erreur d'apprentissage").]
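The interval is easy to compute directly; a sketch assuming NumPy, with a hypothetical helper reproducing the 12-out-of-40 example:

```python
import numpy as np

def error_confidence_interval(n_errors, m, z=1.96):
    """Normal-approximation confidence interval for the true error rate."""
    e_hat = n_errors / m
    half_width = z * np.sqrt(e_hat * (1 - e_hat) / m)
    return e_hat - half_width, e_hat + half_width

print(error_confidence_interval(12, 40))  # ~(0.158, 0.442), i.e. [0.16; 0.44]
```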
Evaluating learned hypotheses: various sets
- With a lot of data: hold out a test set, learn on the rest, and measure the error on the test set.
- With a small data set: a dilemma. Every example set aside for testing is lost for learning, so a larger learning set yields a better hypothesis but a less reliable error estimate, and vice versa.
Cross-validation (k-fold)
- Split the data into k parts; for each fold i, learn on the other k−1 parts (yellow) and test on the held-out part (rose), giving error_i (a code sketch follows below).
- error = (1/k) Σ_i error_i

The leave-one-out procedure (k = number of examples)
- Low bias, high variance.
- Tends to underestimate the error if the data are not fully i.i.d. [Guyon & Elisseeff, JMLR, 2003]

The bootstrap estimate
- Learn on a resampled data set (yellow), test on the held-out examples (rose), giving an error; repeat and compute the mean.

Problem
- The calculation of the confidence interval assumes that the estimations are independent, but our estimations are not independent.

Estimating the true risk of the final hypothesis h: either the mean of the risks over the k test samples, or the mean of the risk over the whole data set.
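A sketch of both resampling estimators on a toy task, assuming scikit-learn; the data, the logistic-regression model and the counts (k = 10 folds, 100 bootstrap rounds) are illustrative choices, not from the slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# k-fold cross-validation: the error is the mean of the k fold errors.
fold_errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    h = LogisticRegression().fit(X[train_idx], y[train_idx])
    fold_errors.append(np.mean(h.predict(X[test_idx]) != y[test_idx]))
print("k-fold error:", np.mean(fold_errors))

# Bootstrap: learn on a resample (with replacement), test on the
# left-out ("out-of-bag") examples; repeat and average.
boot_errors = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))
    oob = np.setdiff1d(np.arange(len(X)), idx)
    h = LogisticRegression().fit(X[idx], y[idx])
    boot_errors.append(np.mean(h.predict(X[oob]) != y[oob]))
print("bootstrap error:", np.mean(boot_errors))
```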
2. Confusion matrices and various performance criteria

Types of performance criteria
[Figure.]

Confusion matrix
- Cell (i, j) counts the examples of true class i classified as class j; e.g., 14% of the butterflies are recognized as fishes.
- For two classes (the slides use the French labels Réel/Estimé and VP/FP/FN/VN for actual/predicted and TP/FP/FN/TN):

                Actual +   Actual −
  Predicted +      TP         FP
  Predicted −      FN         TN
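For instance (a sketch assuming scikit-learn, which the slides do not name), the matrix and its four cells can be read off as follows; note that scikit-learn puts actual classes in rows and predicted classes in columns, the transpose of the table above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

cm = confusion_matrix(y_true, y_pred)  # rows: actual, columns: predicted
tn, fp, fn, tp = cm.ravel()
print(cm)              # [[5 1]
                       #  [1 3]]
print(tp, fp, fn, tn)  # 3 1 1 5
```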
Performance measures
Defined from the confusion matrix above:
- Sensitivity = TP / (TP + FN)
- Recall = TP / (TP + FN) (the same quantity as sensitivity)
- Specificity = TN / (TN + FP)
- Precision = TP / (TP + FP)
- FN-rate = FN / (TP + FN)
- FP-rate = FP / (FP + TN)
- F-measure = (2 × recall × precision) / (recall + precision) = 2TP / (2TP + FP + FN)
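These formulas translate directly into code; a minimal plain-Python sketch (the helper name is hypothetical), reusing the counts from the confusion-matrix example above:

```python
def performance_measures(tp, fp, fn, tn):
    """Measures derived from the binary confusion matrix."""
    return {
        "sensitivity/recall": tp / (tp + fn),
        "specificity":        tn / (tn + fp),
        "precision":          tp / (tp + fp),
        "FN-rate":            fn / (tp + fn),
        "FP-rate":            fp / (fp + tn),
        "F-measure":          2 * tp / (2 * tp + fp + fn),
    }

print(performance_measures(tp=3, fp=1, fn=1, tn=5))
```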
Performance measures: worked example
[The original slide computes Precision("good") from a 2×2 table of actual vs. predicted outcomes for the classes "good" and "bad"; the figures did not survive extraction.]

3. The ROC curve
Types of errors
[Figure.]

The ROC curve
ROC = Receiver Operating Characteristic.
[Figure: the score distributions ("probabilité de la classe") of class '+' and class '−'. A decision threshold ("critère de décision") splits the '+' distribution into true positives and false negatives, and the '−' distribution into true negatives and false positives. Shown for class proportions of 10%/90% and 50%/50%.]
[Figures: the same score distributions with the decision threshold swept from "lenient" ("seuil laxiste") to "strict" ("seuil sévère"), and the resulting ROC curves: proportion of true positives vs. proportion of false positives (equivalently, false negatives vs. true negatives on the complementary axes). A ROC curve with discriminability ("pertinence") 0.90 is contrasted with the chance line ("ligne de hasard", pertinence = 0.5).]
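A sketch of how the curve arises by sweeping the decision threshold over classifier scores, assuming scikit-learn and two illustrative Gaussian score distributions (one per class):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

rng = np.random.default_rng(0)
# Scores for 100 negatives and 100 positives; the more the two
# distributions overlap, the closer the curve is to the chance line.
scores = np.concatenate([rng.normal(0.0, 1.0, 100),   # class '-'
                         rng.normal(1.5, 1.0, 100)])  # class '+'
labels = np.concatenate([np.zeros(100), np.ones(100)])

# Each threshold yields one (FP-rate, TP-rate) point on the curve.
fpr, tpr, thresholds = roc_curve(labels, scores)
print("AUC:", auc(fpr, tpr))
```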
Comparison of learning algorithms
- Comparison on a single data set: [Dietterich, 1998] recommends 5×2 cross-validation, the paired t-test, or the McNemar test on a validation set.
- Comparison on multiple (different) data sets: [Demsar, 2006] recommends the Wilcoxon signed-ranks test and the Friedman test (see the sketch below).

Summary
- Pay attention to your cost function: what actually matters for the performance measure?
- Finite data: compute the confidence intervals.
- Scarce data: pay attention to the split between learning and test data; use cross-validation.
- Do not forget the validation set!
- Evaluation matters a great deal: keep a critical mind, and convince yourself.

Specific problems
- The class distribution is very unbalanced (e.g., 1% or 1‰ for one of the two classes).
- Gray zone (uncertain labels).
- Multi-valued functions.
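To illustrate the multi-data-set comparison, a sketch assuming SciPy; the per-data-set accuracies below are made-up placeholders, not measured results:

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical accuracies of algorithms A, B, C on the same 10 data sets.
acc_a = np.array([0.81, 0.75, 0.90, 0.66, 0.84, 0.72, 0.93, 0.78, 0.80, 0.69])
acc_b = np.array([0.79, 0.71, 0.88, 0.67, 0.80, 0.70, 0.90, 0.74, 0.77, 0.66])
acc_c = np.array([0.80, 0.70, 0.85, 0.60, 0.78, 0.69, 0.88, 0.71, 0.75, 0.64])

# Wilcoxon signed-ranks test on the paired scores of two algorithms.
print(wilcoxon(acc_a, acc_b))

# Friedman test for three or more algorithms over the same data sets.
print(friedmanchisquare(acc_a, acc_b, acc_c))
```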
Other evaluation criteria
- Intelligibility of the learned decision function (e.g., SVMs or boosting fare poorly here).
- Generalization performance: often not correlated with the previous criterion.
- Various costs: data preparation, computational cost, cost of the ML expertise, cost of the domain expertise.

References
- Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895-1924.
- Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1-30.
- Japkowicz, N. & Shah, M. (2011). Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press. (An interesting book.)

The Weka ML toolkit
- http://www.cs.waikato.ac.nz/ml/weka/