BioSecure Signature Evaluation Campaign (ESRA 2011): Evaluating Systems on Quality-based categories of Skilled Forgeries


N. Houmani 1, S. Garcia-Salicetti 1, B. Dorizzi 1, J. Montalvão 2, J. C. Canuto 2, M. V. Andrade 3, Y. Qiao 4, X. Wang 4, T. Scheidat 5, A. Makrushin 6, D. Muramatsu 7, J. Putz-Leszczynska 8, M. Kudelski 8, M. Faundez-Zanuy 9, J. M. Pascual-Gaspar 10, V. Cardeñoso-Payo 10, C. Vivaracho-Pascual 10, E. Argones Rúa 11, J. L. Alba-Castro 11, A. Kholmatov 12, B. Yanikoglu 12

1 Institut TELECOM; TELECOM SudParis; Dept EPH, Evry, France
2 Universidade Federal de Sergipe, Brazil
3 TECNED Tecnologias Educacionais, Brazil
4 Shenzhen Institutes of Advanced Technology, China
5 Brandenburg University of Applied Sciences, Germany
6 Otto-von-Guericke University of Magdeburg, Germany
7 Seikei University, Japan
8 Warsaw University of Technology / Biometrics and Machine Learning Group, Poland
9 Escola Universitaria Politecnica de Mataro, Spain
10 Universidad de Valladolid, Spain
11 University of Vigo, Spain
12 Sabanci University, Turkey

Abstract

In this paper, we present the main results of the BioSecure Signature Evaluation Campaign (ESRA 2011). The objective of ESRA 2011 is to evaluate through two different tasks the resistance of different online signature systems to skilled forgeries categorized automatically according to their quality. Task 1 aims at studying with only coordinate time functions the influence of acquisition conditions (digitizing tablet vs. PDA) on systems performance. The two BioSecure Data Sets DS2 and DS3 make this possible, since they contain data from the same 382 people, acquired respectively on a digitizer and on a PDA. Task 2 then aims at assessing the contribution of the five time functions available on a digitizer (coordinates, pressure, pen inclination) on systems resistance to different qualities of skilled forgeries. Results of the 13 systems involved in this competition are reported and analyzed for both tasks in this paper.
We observe that the best system in terms of performance on forgeries of bad quality is not necessarily the most resistant to an increased quality of skilled forgeries. We also note that mobile conditions remain threatening regardless of the quality of forgeries. Finally, when pen inclination time functions are added to pressure and coordinates, the performance gap between systems is wider than when only pen coordinates and pressure are considered.

978-1-4577-1359-0/11/$26.00 ©2011 IEEE

1. Introduction

In the online signature verification field, four international signature competitions have been conducted in the last ten years to compare different systems on the same databases and with the same evaluation protocols: SVC 2004 [1], BMEC 2007 [2], ICDAR 2009 [3] and BSEC 2009 [4]. In all these competitions, online signature verification systems were evaluated on databases containing different types of forgeries: random forgeries [1,2,3,4], skilled forgeries [1,2,3,4], and synthetic forgeries [2]. In this competition, we focus on skilled forgeries in order to assess the influence of their quality on the performance of state-of-the-art systems. Indeed, skilled forgery quality can vary greatly depending on:
- the ability of an impostor to stick to the genuine signature;
- the difficulty for the impostor in reproducing a target signature;
- the nature of the information available to the impostor about the target signature (dynamic or static information).

The originality of this competition (ESRA 2011) [5] is to refine the performance assessment of verification systems by considering attacks of different quality levels. These levels group skilled forgery samples of different qualities by means of a Hierarchical Clustering procedure [6] applied to a quantified forgery quality measure proposed by the organizers for online signatures [26]. This quality measure exploits the Personal Entropy measure [7] associated with the target writer. The principle of the forgery quality measure is to quantify to what extent the local estimated probability density functions (PDFs) of a forgery sample stick to the target local PDFs (those of the target writer). To this end, we compute a dissimilarity measure between the target Personal Entropy and the entropy of the forgery sample estimated with the target model [26].

This competition is carried out on the two largest existing databases containing the same persons, namely the BioSecure Signature Corpus DS2 and DS3 [8,9], acquired respectively on a digitizing tablet and on a Personal Digital Assistant (PDA). These two corpora involve two different protocols for skilled forgery acquisition.

The main objective of the BioSecure Signature Evaluation Campaign ESRA 2011 is to provide the scientific community with a new benchmarking methodology for performance assessment, through an increased quality of attacks. ESRA 2011 was divided into two tasks with the following challenges:
- The first challenge is to evaluate the impact of mobile acquisition conditions, and particularly of the skilled forgery acquisition protocol, on systems performance, considering only coordinate time functions;
- The second challenge is to study the impact of different time functions among coordinates, pen pressure and pen inclination on the resistance of systems to attacks.

This paper is organized as follows: Section 2 describes the retrieval of skilled forgery categories based on their quality. Section 3 describes the datasets used in this competition. Section 4 details the two tasks of the competition and the corresponding test protocols. Section 5 presents some statistics on the retrieved skilled forgery categories. In Section 6, the participants are described. Experiments are analyzed in Section 7 and Section 8 offers a brief conclusion.

2. Skilled forgery quality categories

Signature verification systems submitted to this competition are evaluated on two categories of skilled forgeries, namely of bad quality and of good quality, generated following a given signature representation. For each person, we cluster his/her skilled forgeries into two categories by applying Hierarchical Clustering [6] to the forgery quality values associated with his/her skilled forgeries. Performance evaluation of the different verification systems is then carried out on the two retrieved forgery quality categories. Figure 1 displays examples of target signatures and of skilled forgeries of good quality and bad quality, when the forgery quality measure is computed on a coordinate-based representation of signatures.

As the forgery quality measure is computed on a given feature representation of signatures, it can quantify to what extent a forgery sample gets close to the target signature in terms of different descriptions of a signature. Therefore, for each person, the labeling of skilled forgeries into good and bad quality will change according to the chosen representation of signatures.

Figure 1: Examples of target signatures and their associated skilled forgeries of (a) good and (b) bad quality.

3. Databases

Two datasets containing the same writers were used in this competition [8,9]: DS2 and DS3. The DS2 dataset contains two sessions, acquired on a fixed platform (digitizer) two weeks apart, each containing 15 genuine signatures per writer. For skilled forgeries, at each session, a donor is asked to imitate 5 times the shape of the signature of two other people after several minutes of practice. The DS3 subset is acquired on a mobile platform, the HP iPAQ hx2790 PDA. This dataset contains two sessions, acquired 4-5 weeks apart.
Each writer has, per session, 15 genuine signatures and 10 skilled forgeries of mixed type, namely static and dynamic at the same time, thanks to a particular acquisition interface displayed on the PDA touch screen. The impostor could first visualize on the PDA touch screen the writing sequence of the target signature and then, in a second step, sign directly on the resulting image of the stylus trajectory. These two steps of the forgery acquisition protocol allowed the impostor to stick simultaneously to the shape and the dynamics of the target signature.

For this competition, two development datasets of 50 people, taken respectively from BioSecure DS2 and DS3, were distributed to the participants [5]. Note that the 50 people are the same in the two development datasets. These datasets are called DS2-50 and DS3-50 in the following. The two test datasets of this competition, from BioSecure DS2 and DS3, contain signatures of the same 382 writers. These two test datasets were kept sequestered by the organizers [5] and are called DS2-382 and DS3-382 in the following.

4. Tasks and evaluation protocol

4.1. Task 1

Participants submitted two systems for this task: one tuned on DS2-50 and the other on DS3-50, considering in both cases only pen coordinates for representing signatures. Other features, such as speed and acceleration, could nevertheless be derived from the pen coordinates. The objective of this task was to assess how the quality of skilled forgeries affects the performance of both submitted systems. This quality depends greatly on the forgery acquisition protocol, which differs between the fixed platform and the mobile one, and is particularly efficient on the mobile platform since the forger can imitate very thoroughly both the dynamics of the target signature and its shape.

4.2. Task 2

Participants submitted only one system optimized on DS2-50, considering the 5 time functions available on a digitizer (pen coordinates, pen pressure and pen inclination) for representing signatures. Participants could consider only pen coordinates and pressure or all time functions, and they submitted their best system. The objective of this task is to test each system on the appropriate categorization of skilled forgeries, namely the one that depends on the signature representation chosen by each participant. For this reason, there are 2 groups of participant systems in this task: those tested on skilled forgery categories built considering pen coordinates and pressure, and those tested on skilled forgery categories built considering all time functions.

4.3. Methodology for performance evaluation

For both tasks, the evaluation protocol is the following: for each enrolled person, 5 genuine signatures of Session 1 are chosen randomly for the reference set; tests are carried out on the remaining 10 genuine signatures of Session 1, on 10 skilled forgeries of Session 1 and on 10 skilled forgeries of Session 2. Note that we did not use genuine signatures of Session 2, as we did not want to deal with the time variability issue.
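The per-writer split described in Section 4.3 can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_writer_samples`, its parameters and the use of a seeded random generator are hypothetical and are not taken from the campaign software.

```python
import random


def split_writer_samples(genuine_s1, forgeries_s1, forgeries_s2, n_ref=5, seed=0):
    """Split one writer's samples into a reference set and two test sets.

    genuine_s1   : genuine signatures of Session 1 (15 per writer in the paper)
    forgeries_s1 : skilled forgeries of Session 1 (10 per writer)
    forgeries_s2 : skilled forgeries of Session 2 (10 per writer)
    The seed parameter is an assumption added here for reproducibility.
    """
    if len(genuine_s1) <= n_ref:
        raise ValueError("not enough genuine signatures to build a test set")
    rng = random.Random(seed)
    # Draw the reference signatures at random among Session 1 genuine samples.
    ref_idx = set(rng.sample(range(len(genuine_s1)), n_ref))
    reference = [s for i, s in enumerate(genuine_s1) if i in ref_idx]
    # The remaining genuine signatures of Session 1 are used for testing.
    genuine_test = [s for i, s in enumerate(genuine_s1) if i not in ref_idx]
    # Forgeries of both sessions are pooled, as in the paper.
    forgery_test = list(forgeries_s1) + list(forgeries_s2)
    return reference, genuine_test, forgery_test
```

With 15 genuine signatures and 10 forgeries per session, this yields 5 reference signatures, 10 genuine test signatures and 20 forgery test samples per writer.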
In order to have enough forgery samples to perform a categorization according to forgery quality, we needed to mix forgeries of both sessions. Then, for each writer, his/her 20 skilled forgeries were labeled as being of bad quality or good quality by Hierarchical Clustering [6] on the 20 forgery quality values of these skilled forgeries.

For performance assessment, we plot DET (Detection Error Tradeoff) curves [10]. The Equal Error Rate (EER) operating point is also explicitly reported. For Task 2, we also analyze the relative degradation of the FAR as a function of the decision threshold for the best systems, when switching from bad quality to good quality forgeries. It is noteworthy that for all experiments we used a parametric function [27] to compute a marginal error (confidence interval) on FAR(t) and FRR(t) for the value of the decision threshold t corresponding to the EER. We found that for all experiments the confidence interval is lower than 0.01.

5. Statistics on skilled forgery categories

In this section, the resulting distribution of skilled forgeries into each quality category is analyzed according to:
- the session to which the forgery sample belongs;
- the time functions used for signature representation.

For both DS2 and DS3, the quality of skilled forgeries belonging to the two sessions is measured with regard to the target signatures of Session 1. We notice in Tables 1 and 2 that the number of skilled forgeries of bad quality and of good quality is similar for the forgeries of both sessions. This means that the quality of the skilled forgeries of Session 2 is not degraded compared to those of Session 1 with regard to the target signatures of Session 1. For this reason, in order to have enough skilled forgeries to perform a categorization according to their quality, and to be able to assess performance reliably, we mix the skilled forgeries of both sessions.
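The two-category labeling of one writer's forgeries can be illustrated with a small agglomerative clustering on scalar quality values. The sketch below rests on two assumptions that the text does not settle: it uses average linkage (the linkage criterion is not named), and it treats the quality value as a dissimilarity to the target, so that the cluster with the lower mean is the good-quality one. The function names are hypothetical.

```python
def two_clusters_average_linkage(values):
    """Agglomerative clustering of 1-D values, stopped at two clusters.

    Returns two lists of indices into `values`.
    Average linkage is an assumption; the source does not name the criterion.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to form two clusters")
    clusters = [[i] for i in range(len(values))]

    def avg_dist(a, b):
        # Mean absolute difference over all cross-cluster pairs.
        total = sum(abs(values[i] - values[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 2:
        # Find the closest pair of clusters and merge it.
        _, p, q = min(
            (avg_dist(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


def label_forgeries(dissimilarities):
    """Label each forgery 'good' or 'bad' from its dissimilarity to the target.

    Assumption: a lower dissimilarity means a better (more dangerous) forgery.
    """
    first, second = two_clusters_average_linkage(dissimilarities)

    def mean(indices):
        return sum(dissimilarities[i] for i in indices) / len(indices)

    good = set(first if mean(first) <= mean(second) else second)
    return ["good" if i in good else "bad" for i in range(len(dissimilarities))]
```

In the campaign setting, `dissimilarities` would hold the 20 forgery quality values of one writer; the labeling is recomputed for each signature representation, since the quality values change with it.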
Table 1: Distribution of skilled forgeries of bad quality of DS2 and DS3 on the two sessions

Bad quality   DS3-382   DS2-382   DS2-382   DS2-382
              (x,y)     (x,y)     (x,y,p)   (x,y,p,az,alt)
Session 1     1544      1767      2180      3149
Session 2     1466      1773      2196      3144

Table 2: Distribution of skilled forgeries of good quality of DS2 and DS3 on the two sessions

Good quality  DS3-382   DS2-382   DS2-382   DS2-382
              (x,y)     (x,y)     (x,y,p)   (x,y,p,az,alt)
Session 1     2276      2053      1640      671
Session 2     2354      2047      1624      676

Note that only a simple categorization into good and bad quality forgeries seems reasonable given the small number of forgery samples (20 per person). Indeed, if more samples had been available, we could have refined the forgery categorization. Table 3 shows the distribution of skilled forgeries into bad and good quality depending on the time functions considered for describing signatures.

Note in Table 3 that when dynamic features (pen pressure, pen inclination) are progressively added to the signature representation, the number of skilled forgeries of good quality decreases on DS2-382. This reflects how much easier it is for a forger to imitate only the spatial aspect of a target signature than to forge the dynamics as well. Also, note that it is more difficult for the forger to imitate several dynamic descriptors of the target signature simultaneously. Indeed, when pen inclination angles are added to pressure in the signature description, forgery quality drops significantly: almost 60% of good quality samples (1917 forgery samples) switch to the bad quality category.

Table 3: Distribution of skilled forgeries of good and bad quality on DS2-382 and DS3-382 depending on the signature description

Skilled forgeries   DS3-382   DS2-382   DS2-382   DS2-382
                    (x,y)     (x,y)     (x,y,p)   (x,y,p,az,alt)
Good quality        4630      4100      3264      1347
Bad quality         3010      3540      4376      6293

When comparing the number of good quality forgeries for DS2 and DS3, considering of course only pen coordinates (DS3 was acquired on a PDA), we observe that DS3 contains more forgery samples of good quality than DS2. This can be explained by the forgery acquisition protocol of DS3, which is better suited to capturing the dynamics of the target writer. Indeed, before forging the target signature, the impostor visualized on the PDA touch screen the writing sequence of the target signature and then signed directly on the resulting image of the stylus trajectory (see Section 4).

From this analysis of the resulting forgery categories, we conclude that the forgery quality measure used for labeling skilled forgery samples behaves well, giving coherent results. In the following, we confront the resulting forgery categories with the performance criterion.

6. Participants

Eleven (11) teams (from academia and industry) showed their interest in participating in this competition. All teams participated in both tasks and some teams submitted two systems for one task. The teams are from 7 different countries (Brazil, China, Germany, Japan, Poland, Spain, and Turkey). Thirteen (13) systems were evaluated together with the BioSecure Reference System of Telecom SudParis [22]. Table 9 gives the characteristics of the submitted systems in terms of the type of features extracted, the nature of the classifier used, and the computation of the final score.

7. Experimental results

7.1. Results of Task 1 on DS2-382 and DS3-382

7.1.1. Performance on DS2-382 with pen coordinates

Experimental results on DS2-382, displayed in Figure 2 and Table 4, show that when confronted with the two forgery quality categories, all systems give better performance on bad quality forgeries than on good quality ones, as expected. Indeed, at the Equal Error Rate (EER) operating point, system performance is degraded by at least 2.45% (VIGO system) when tested on skilled forgeries of good quality. This relative degradation can even exceed 20% (see Table 4) at the EER for some systems (Ref, BUAS, SU, SKU).

Figure 2: DET curves on DS2-382 with skilled forgeries of (a) bad quality and (b) good quality.

More precisely, on skilled forgeries of bad quality, we observe that the best performance is obtained with the Reference system (EER of 2.73%), closely followed by the VIGO system (EER of 2.78%). This is not surprising since the latter system is also based on Hidden Markov Models and performs the fusion of Likelihood and Viterbi scores, as the Reference System does [21,22]. Nevertheless, although the VIGO system gives the best results at low values of FAR, it is degraded significantly in terms of FRR for higher values of FAR, in comparison to the Reference system. These two systems are then closely followed by a DTW-based system that also exploits fusion, namely the SKU system (EER of 2.93%).

Table 4: EER values (in %) of systems on DS2-382 considering skilled forgeries of bad and good quality and their relative degradation (in %) when switching from bad to good quality forgeries.

Task 1 on DS2-382   Bad quality   Good quality   Relative degradation
BUAS                6.56          8.92           26.45
CNED                4.74          5.36           11.56
MGU                 6.44          7.82           17.64
Ref                 2.73          4.04           32.42
SIAT                4.07          4.86           16.25
SKU                 2.93          4.10           28.53
SU                  3.73          5.12           27.14
UFS1                4.40          5.13           14.23
UFS2                4.11          4.68           12.18
VDU                 4.38          5.08           13.78
VDU-EUPMt           5.96          6.40           6.87
VIGO                2.78          2.85           2.45
WUT                 4.32          4.87           11.3

On skilled forgeries of good quality, we clearly observe in Figure 2(b) and Table 4 that the best performance is given by the VIGO system (EER of 2.85%). This system thus seems more resistant than the others to challenging forgeries of DS2-382: at the EER, its relative performance is 30% better than that of the Reference System, which follows. On the other hand, the least robust systems to good quality attacks are based on global distance-based approaches (MGU, BUAS). In between, we find DTW-based approaches, kernel approaches and vector quantization approaches as well (ordered from best to worst as UFS2, SIAT, WUT, VDU, SU, UFS1, CNED, VDU-EUPMt).

7.1.2. Performance on DS3-382 with pen coordinates

Results, displayed in Figure 3 and Table 5, show that also on DS3-382, as expected, all systems give better performance on bad quality skilled forgeries than on good quality ones. The relative degradation at the EER (see Table 5) between the two quality categories lies between 2.48% and 30.07%. On DS3, however, the trend is reversed compared to DS2: HMM-based approaches rank behind DTW-based ones.
Indeed, on DS3, the SU system, based on a DTW approach, is the best system in terms of performance on both forgery quality categories. This could be explained by the fact that the HMM-based approaches exploit a so-called Viterbi score, which relies on the difference in the average length of each portion between the test signature and the reference set, the portions being retrieved by running the Viterbi algorithm with the target HMM [22]. This Viterbi score thus detects when a forgery is too far from the target signature in terms of signature length; but since the forgery acquisition protocol of the DS3 dataset is quite efficient, signature length is no longer discriminant with respect to forgeries, and the Viterbi score may actually bring forgeries closer to target signatures.

Figure 3: DET curves on DS3-382 with skilled forgeries of (a) bad quality and (b) good quality.

Besides, the intra-class variability of writers increases when acquiring signatures on a mobile platform, as previously shown by the organizers through the concept of Personal Entropy [11]. Indeed, Personal Entropy [7], a measure of the degree of disorder (chaos) in the reference set of a writer, increases when switching from a fixed platform to a mobile one. This phenomenon affects the stability of segmentation in the target signature class, and thus degrades the characterization of the target class by the Viterbi score. For these reasons, DTW-based approaches are more efficient in mobile acquisition conditions than the two score fusion-based systems relying on HMMs.
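The paper does not state how the relative degradation at the EER is computed. The published Table 4 values appear consistent, to within rounding, with normalizing the EER increase by the good-quality EER; the sketch below adopts that inferred formula as an assumption and checks it against three rows of Table 4. The function name is hypothetical.

```python
def relative_eer_degradation(eer_bad, eer_good):
    """Relative degradation (in %) when switching from bad to good quality forgeries.

    Assumed formula, inferred from Table 4 rather than stated in the paper:
        100 * (EER_good - EER_bad) / EER_good
    """
    if eer_good <= 0:
        raise ValueError("eer_good must be positive")
    return 100.0 * (eer_good - eer_bad) / eer_good


# (EER bad, EER good, published relative degradation) copied from Table 4.
table4_rows = {
    "Ref": (2.73, 4.04, 32.42),
    "VIGO": (2.78, 2.85, 2.45),
    "BUAS": (6.56, 8.92, 26.45),
}

for name, (bad, good, published) in table4_rows.items():
    computed = relative_eer_degradation(bad, good)
    # Small gaps are expected because the published EERs are themselves rounded.
    print(f"{name}: computed {computed:.2f}, published {published:.2f}")
```

Under this reading the degradation cannot exceed 100%, which is worth keeping in mind when comparing it with the FAR-based relative degradation curves of Task 2.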

Concerning the ranking, the 5 best systems on both bad and good quality forgery categories are the same: they are based either on DTW or on Gaussian Kernels (ranked from best to worst as: SU, VDU, UFS1, SKU, UFS2). On the other hand, the worst systems on both bad and good quality forgery categories on DS3 are the same as those on DS2, ranked from best to worst as MGU, BUAS (global distance approaches) and VDU-EUPMt (Vector Quantization).

Table 5: EER values (in %) of systems on DS3-382 considering skilled forgeries of bad and good quality and their relative degradation (in %) when switching from bad to good quality forgeries.

Task 1 on DS3-382   Bad quality   Good quality   Relative degradation
BUAS                13.07         13.67          4.38
CNED                9.33          10.56          11.64
MGU                 12.97         13.20          1.74
Ref                 8.13          10.92          25.54
SIAT                9.41          9.65           2.48
SKU                 7.33          9.19           20.24
SU                  6.05          7.15           15.38
UFS1                7.68          8.66           11.31
UFS2                7.20          9.42           23.56
VDU                 6.90          7.69           10.27
VDU-EUPMt           14.38         21.17          30.07
VIGO                8.07          9.57           15.67
WUT                 9.90          10.26          3.5

When comparing the results between DS2 (Table 4) and DS3 (Table 5), we observe a clear degradation of EER values on DS3 (roughly by a factor of 2), on both quality categories of skilled forgeries, even though the systems were tuned on each database. This indicates that mobile conditions are still threatening for verification systems.

7.2. Results of Task 2 on DS2-382

In this task, performance analysis is carried out on two different forgery categorizations:
- when coordinates and pressure are used in the signature representation;
- when pressure and pen inclination are both considered in addition to coordinates for the signature representation.

In each case, we analyze the performance of the systems exploiting the corresponding signature representation. Note that the test database varies according to the set of time functions considered.

7.2.1. Performance with pen coordinates and pressure

In this task, 4 systems exploited pen coordinates and pressure: WUT, SIAT, CNED and UFS systems. In Figure 4, we note that DTW-based systems (WUT and SIAT) do not behave well for low or high values of FAR: on the one hand, for low values of the FAR, the WUT system is degraded with regard to the other systems; on the other hand, for high values of the FAR, it is the SIAT system that is degraded compared to the other systems. The best systems are both kernel-based: the CNED system and the UFS1 system.

Figure 4: DET curves on DS2-382 with (x,y,p) on skilled forgeries of (a) bad quality and (b) good quality.

Table 6: EER values (in %) of systems on DS2-382 on forgeries of bad and good quality with (x,y,p) and their relative degradation (in %) when switching from bad to good quality forgeries.

Task 2 on DS2-382   Bad quality   Good quality   Relative degradation
CNED                3.32          4.38           24.20
SIAT                4.27          4.86           12.14
UFS1                3.51          4.31           18.56
WUT                 3.48          4.89           28.83

In order to analyze the robustness of the best systems to good quality forgeries, we display in Figure 5 the relative degradation of the FAR when switching from bad to good quality skilled forgeries. We notice that the CNED system is more robust to skilled forgeries of good quality than the UFS1 system, since for most values of the decision threshold, its relative degradation is lower than 50%.
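A curve of the relative degradation of the FAR as a function of the decision threshold can be sketched as below. Several points are assumptions, since the paper does not give the formula: scores are taken to be similarity scores (a sample is accepted when its score is at least the threshold), and the degradation is normalized by the bad-quality FAR, which is one reading compatible with degradations above 100%. The function names are hypothetical.

```python
def far(forgery_scores, threshold):
    """False Acceptance Rate: fraction of forgeries whose score reaches the threshold.

    Assumption: higher scores mean "more likely genuine".
    """
    if not forgery_scores:
        raise ValueError("forgery_scores must not be empty")
    accepted = sum(1 for s in forgery_scores if s >= threshold)
    return accepted / len(forgery_scores)


def relative_far_degradation(bad_scores, good_scores, thresholds):
    """Relative FAR degradation (in %) per threshold, from bad to good quality.

    Assumed formula: 100 * (FAR_good - FAR_bad) / FAR_bad.
    Returns None where FAR_bad is zero, since the ratio is undefined there.
    """
    curve = []
    for t in thresholds:
        far_bad = far(bad_scores, t)
        far_good = far(good_scores, t)
        if far_bad == 0.0:
            curve.append(None)
        else:
            curve.append(100.0 * (far_good - far_bad) / far_bad)
    return curve
```

For instance, with hypothetical scores `bad = [0.1, 0.2, 0.3, 0.6]` and `good = [0.4, 0.5, 0.6, 0.7]`, the FAR at threshold 0.5 rises from 0.25 to 0.75, a relative degradation of 200% under this normalization.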

Figure 5: Relative degradation of the FAR as a function of the decision threshold of (a) CNED system and (b) UFS1 system, when switching from bad to good quality skilled forgeries.

7.2.2. Performance with pen coordinates, pressure and pen inclination

In this task, 6 systems exploited all time functions: BUAS, MGU, SKU, VIGO, Ref and VDU-EUPMt systems. When considering all time functions available on the digitizer, we notice in Figure 6 that the gap between systems in terms of performance is widened.

Figure 6: DET curves on DS2-382 with (x,y,p,az,alt) on skilled forgeries of (a) bad quality and (b) good quality.

Table 7: EER values (in %) of systems on DS2-382 on forgeries of bad and good quality with (x,y,p,az,alt) and their relative degradation (in %) when switching from bad to good quality forgeries.

Task 2 on DS2-382   Bad quality   Good quality   Relative degradation
BUAS                4.69          6.39           26.60
MGU                 4.64          6.35           26.92
Ref                 2.81          5.76           51.22
SKU                 2.69          3.68           26.90
VDU-EUPMt           4.41          4.81           8.31
VIGO                1.67          2.43           31.27

As in Task 1 on DS2, the worst-performing systems are the MGU system and the BUAS system, both based on global distance approaches. Also, as in Task 1 on DS2, the score fusion-based VIGO system relying on HMMs gives the best results on both quality categories of skilled forgeries (see also Table 7 for the EER operating point). This system is then followed by the DTW-based SKU system. The robustness of these two systems to good quality skilled forgeries differs, as shown in Figure 7: although the VIGO system is the best in absolute terms, its relative degradation (see Figure 7(a)) in the most critical region (low values of the FAR) is larger than that of the SKU system, which follows (see Figure 7(b)).

Figure 7: Relative degradation of the FAR as a function of the decision threshold of (a) VIGO system, (b) SKU system and (c) Reference system, when switching from bad to good quality skilled forgeries.

Finally, note in Figure 6(b) that the Reference system is substantially degraded on good quality forgeries compared with the two systems mentioned above. Figure 7(c) confirms this result, showing a relative degradation of the Reference system higher than 400% for most values of the decision threshold.

8. Conclusion

In this paper, we presented the most recent online signature competition, namely ESRA 2011, held in conjunction with the International Joint Conference on Biometrics (IJCB 2011). This competition focused on the evaluation of online signature systems on skilled forgeries of different quality levels, available in the two BioSecure Data Sets DS2 and DS3, which contain the same 382 persons and were acquired respectively on a fixed platform and a mobile one.

In this competition two different tasks were defined: Task 1 evaluates the impact of mobile acquisition conditions, and particularly of the skilled forgery acquisition protocol, on system performance, considering only coordinate time functions; Task 2 assesses the impact of different time functions among coordinates, pen pressure and pen inclination on the resistance of systems to different qualities of attacks. 12 participant systems from 11 teams from academia and industry were involved in this competition.

The results show that the best system in terms of absolute performance is not necessarily the most resistant to an increased quality of skilled forgeries. In Task 1, we first noted a global degradation of performance on DS3 compared to DS2; secondly, we noted that the ranking of systems in terms of performance is not the same on DS2 and DS3: indeed, the VIGO system is the winner on DS2, and the SU system is the winner on DS3. In Task 2, forgery categories differ according to the time functions used for representing signatures. We noted that when pen inclination time functions are introduced in addition to pressure and coordinates, the performance gap between systems widens.
In this case, the winning system (the VIGO system) significantly outperforms the others, even in terms of resistance to attacks of increased quality.

Acknowledgments

We would like to thank the BioSecure Association for putting the BioSecure DS2 and DS3 Signature Datasets at our disposal and for its support of this evaluation.

References

[1] D.Y. Yeung, H. Chang, Y. Xiong, S. George, R. Kashi, T. Matsumoto, and G. Rigoll, "SVC2004: First International Signature Verification Competition", Int. Conference on Biometric Authentication (ICBA), Springer LNCS Vol. 3072, pp. 16-22, China, 2004.
[2] http://biometrics.it-sudparis.eu/bmec2007/
[3] V.L. Blankers, C.E. van den Heuvel, K.Y. Franke, and L.G. Vuurpijl, "The ICDAR 2009 Signature Verification Competition", Proc. of the 10th Int. Conference on Document Analysis and Recognition, 2009.
[4] http://biometrics.it-sudparis.eu/bsec2009/
[5] http://biometrics.it-sudparis.eu/esra2011/
[6] A.K. Jain, M.N. Murty, and P.J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323, 1999.
[7] S. Garcia-Salicetti, N. Houmani, and B. Dorizzi, "A Novel Criterion for Writer Enrolment based on a Time-Normalized Signature Sample Entropy Measure", EURASIP Journal on Advances in Signal Processing, Vol. 2009, Article ID 964746, 12 pages, doi:10.1155/2009/964746, 2009.
[8] http://biosecure.it-sudparis.eu/ab/
[9] J. Ortega-Garcia, J. Fierrez, F. Alonso-Fernandez, J. Galbally, M.R. Freire, J. Gonzalez-Rodriguez, C. Garcia-Mateo, J.L. Alba-Castro, E. Gonzalez-Agulla, E. Otero-Muras, S. Garcia-Salicetti, L. Allano, B. Ly-Van, B. Dorizzi, J. Kittler, T. Bourlai, N. Poh, F. Deravi, M.N.R. Ng, M. Fairhurst, J. Hennebert, A. Humm, M. Tistarelli, L. Brodo, J. Richiardi, A. Drygajlo, H. Ganster, F.M. Sukno, S.K. Pavani, A. Frangi, L. Akarun, and A. Savran, "The Multiscenario Multienvironment BioSecure Multimodal Database (BMDB)", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, Issue 6, June 2010.
[10] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, "The DET curve in assessment of detection task performance", Proc. of EUROSPEECH 97, Vol. 4, pp. 1895-1898, Rhodes, Greece, 1997.
[11] S. Garcia-Salicetti, N. Houmani, and B. Dorizzi, "A Client-entropy Measure for On-line Signatures", IEEE Biometrics Symposium (BSYM), pp. 83-88, Tampa, USA, September 2008.
[12] R.E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions", Machine Learning 37(3), pp. 297-336, 1999.
[13] D. Muramatsu and T. Matsumoto, "Online Signature Verification Algorithm with a User-Specific Global-Parameter Fusion Model", Proc. of IEEE Int. Conference on Systems, Man, and Cybernetics, pp. 492-497, 2009.
[14] J. Putz-Leszczyńska and M. Kudelski, "Hidden Signature for DTW Signature Verification in Authorizing Payment Transactions", Journal of Telecommunications and Information Technology (JTIT), Vol. 4, 2010.
[15] J. Putz-Leszczyńska, M. Chochowski, L. Stasiak, R. Wardziński, and A. Pacut, "Two-stage classifier for off-line signature verification", in the 13th Biennial Conference of the International Graphonomics Society, Melbourne, Australia, pp. 138-141, Nov. 2007.

[16] A. Juels and M. Wattenberg, "A Fuzzy Commitment Scheme", in Proc. of the ACM Conference on Computer and Communications Security, pp. 28-36, 1999.
[17] Y. Dodis, R. Ostrovsky, L. Reyzin, and A. Smith, "Fuzzy Extractors: How to Generate Strong Keys from Biometrics and Other Noisy Data", SIAM J. Comput. 38(1), pp. 97-139, 2008.
[18] C. Vielhauer and R. Steinmetz, "Handwriting: Feature Correlation Analysis for Biometric Hashes", in H. Bourlard, I. Pitas, K. Lam, Y. Wang (Eds.), EURASIP Journal on Applied Signal Processing, Special Issue on Biometric Signal Processing, Hindawi Publishing Corporation, Sylvania, OH, USA, 2004.
[19] C. Vielhauer, "Biometric User Authentication for IT Security: From Fundamentals to Handwriting", Springer, New York, 2006.
[20] A. Makrushin, T. Scheidat, and C. Vielhauer, "Handwriting Biometrics: Feature Selection based Improvements in Authentication and Hash Generation Accuracy", in C. Vielhauer et al. (Eds.): BioID 2011, LNCS 6583, pp. 37-48, doi:10.1007/978-3-642-19530-3_4, 2011.
[21] E. Argones Rua, D. Pérez-Pinar Lopez, and J.L. Alba Castro, "Ergodic HMM-UBM system for on-line signature verification", in Proc. of the 2009 joint COST 2101 and 2102 international conference on Biometric ID management and multimodal communication, BioID MultiComm'09, pp. 340-347, Springer-Verlag, Berlin, Heidelberg, 2009.
[22] B. Ly Van, S. Garcia-Salicetti, and B. Dorizzi, "On using the Viterbi path along with HMM likelihood information for online signature verification", IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, 37(5), pp. 1237-1247, 2007.
[23] A. Kholmatov and B.A. Yanikoglu, "Identity authentication using improved online signature verification method", Pattern Recognition Letters 26(15), pp. 2400-2408, 2005.
[24] B.A. Yanikoglu and A. Kholmatov, "Online Signature Verification Using Fourier Descriptors", EURASIP Journal on Advances in Signal Processing, Vol. 2009, Article ID 260516, 13 pages, doi:10.1155/2009/260516, 2009.
[25] J.M. Pascual-Gaspar, M. Faundez-Zanuy, and C. Vivaracho, "Fast online signature recognition based on VQ with time modelling", Engineering Applications of Artificial Intelligence, Vol. 24, pp. 368-377, March 2011.
[26] N. Houmani, S. Garcia-Salicetti, and B. Dorizzi, "On Measuring Forgery Quality of On-line Signatures", submitted to Pattern Recognition on June 29th, 2010, revised on July 1st, 2011.
[27] R.M. Bolle, N.K. Ratha, and S. Pankanti, "Error analysis of pattern recognition systems - the subsets bootstrap", Computer Vision and Image Understanding, 93(1), pp. 1-33, 2004.

Table 9: Description of the systems, grouped by classifier.

Classifier: DTW distance

SU (Sabanci University, Turkey)
- Pen coordinates and number of extra points in DTW alignment.
- Score computation: average DTW distance between the test and the 5 reference signatures with user-based normalization taking into account the mean of reference set signatures to the rest and the variation [23,24].

WUT (Warsaw University of Technology / Biometrics and Machine Learning Group, Poland)
- Time derivative of coordinates, pressure and signature length.
- Global and local classifiers [14,15].
- Employment of an abstract representation of a person's signatures estimated from the available enrollment signatures.

SIAT (Shenzhen Institutes of Advanced Technology, China)
- Speed along x and y directions, pressure.
- Score computation: normalized distance.

VDU (Universidad de Valladolid, Spain)
- Local features: y coordinate, pressure and time derivative of pen coordinates (dx,dy).
- Average DTW distance between the test and the 5 reference signatures.

SKU (Seikei University, Japan)
- Mean vector computed on the reference signatures considering as local features: the 5 time functions, pen direction and velocity.
- Score based on a fusion model generated by combining many perceptrons, relying on the reference set, using the AdaBoost algorithm [12,13].

Classifier: Other distances

MGU (Otto-von-Guericke University of Magdeburg, Germany)
- 131 features extracted from the 5 time functions.
- Feature selection: sequential forward search algorithm (SFS).
- Score based on a Hamming distance [16,17].

BUAS (Brandenburg University of Applied Sciences, Germany)
- 131 features extracted from the 5 time functions.
- Biometric Hash algorithm.
- Score based on a Canberra distance [18,19,20].

Classifier: HMM

VIGO (University of Vigo, Spain)
- 11 local features extracted from the 5 time functions.
- Weighted sum of 2 scores: likelihood score of an HMM-Universal Background Model, and Viterbi score of a user-specific HMM [21,22].

Ref (Telecom SudParis, France)
- 25 local features extracted from the 5 time functions.
- Score: fusion of likelihood score and segmentation score generated by the Viterbi algorithm [22].

Classifier: Vector Quantization

VDU-EUPMt (Escola Universitaria Politecnica de Mataro and Universidad de Valladolid, Spain)
- 11 local features derived from coordinates, pressure and time stamps.
- Vector quantization [25].

Classifier: Isotropic Gaussian Kernels

UFS1 (Universidade Federal de Sergipe, Brazil)
- 7 local features derived from coordinates and pressure.
- Dispersion optimized by a cross-validation strategy.
- Likelihood score.

UFS2 (Universidade Federal de Sergipe, Brazil)
- 5 local features derived from coordinates and pressure.
- Only k nearest neighboring Gaussian kernels.
- Likelihood score.

CNED (TECNED Tecnologias Educacionais, Brazil - Universidade Federal de Sergipe, Brazil)
- 4 local features derived from coordinates and pressure.
- Only k nearest neighboring Gaussian kernels.
- Likelihood score.
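Several entries of Table 9 score a test signature by its average DTW distance to the 5 reference signatures. A minimal sketch of that scoring idea is given below. It uses textbook DTW with a Euclidean local cost and no user-based normalization; it is not the implementation of any participant system, and the function names and toy (x, y) sequences are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW between two sequences of feature vectors,
    with Euclidean local cost and steps (i-1,j), (i,j-1), (i-1,j-1)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def average_dtw_score(test, references):
    """Dissimilarity score: mean DTW distance from the test signature
    to each reference signature (lower means more similar)."""
    return sum(dtw_distance(test, ref) for ref in references) / len(references)


# Toy data (hypothetical): 5 reference signatures as short (x, y) sequences.
references = [
    [(0.0, 0.0), (1.0, 1.0 + 0.1 * k), (2.0, 2.0)] for k in range(5)
]
genuine_test = [(0.0, 0.0), (1.0, 1.2), (2.0, 2.0)]
forged_test = [(0.0, 0.0), (1.0, 3.0), (2.0, 0.0)]

genuine_score = average_dtw_score(genuine_test, references)
forged_score = average_dtw_score(forged_test, references)
print(genuine_score, forged_score)
```

In a verification setting the resulting score would be compared with a decision threshold; sweeping that threshold yields the FAR and FRR pairs from which DET curves are drawn.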