
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 2, FEBRUARY 2010

The Biometric Menagerie

Neil Yager and Ted Dunstone, Member, IEEE

Abstract

It is commonly accepted that users of a biometric system may have differing degrees of accuracy within the system. Some people may have trouble authenticating, while others may be particularly vulnerable to impersonation. Goats, wolves, and lambs are labels commonly applied to these problem users. These user types are defined in terms of verification performance when users are matched against themselves (goats) or when matched against others (lambs and wolves). The relationship between a user's genuine and impostor match results suggests four new user groups: worms, doves, chameleons, and phantoms. We establish formal definitions for these animals and a statistical test for their existence. A thorough investigation is conducted using a broad range of biometric modalities, including 2D and 3D faces, fingerprints, iris, speech, and keystroke dynamics. Patterns that emerge from the results expose novel, important, and encouraging insights into the nature of biometric match results. A new framework for the evaluation of biometric systems based on the biometric menagerie, as opposed to collective statistics, is proposed.

Index Terms: Biometrics, performance evaluation, authentication, identification, recognition, fingerprint, face, speech, iris, keystroke dynamics.

1 INTRODUCTION

In a biometric identification system, it is often the case that users do not perform consistently well in terms of false match rates (FMR) and false nonmatch rates (FNMR). Where this occurs, researchers and system integrators are interested in identifying the groups of users who are performing poorly as they may be causing a disproportionate number of verification errors.
An analysis of these people and their common properties can expose fundamental weaknesses in a biometric system, and by targeting these weaknesses one may be able to develop more robust biometric systems [1].

Several problem user groups have been characterized and are familiar to biometric researchers and practitioners. These groups have been given animal names that analogously reflect their behavior. The concept of the biometric menagerie was formalized by Doddington et al. [2], and its original members are as follows:

• Sheep: Sheep make up the majority of the population of a biometric system. On average, they tend to match well against themselves and poorly against others.
• Goats: Goats are subjects who are difficult to match. They are characterized by consistently low match scores against themselves and may be involved in false rejects.
• Lambs: Lambs are vulnerable to impersonation. When being matched against, they result in relatively high match scores, leading to potential false accepts.
• Wolves: Wolves are exceptionally successful at impersonation and prey upon lambs. When matched against enrolled users, they receive relatively high match scores. In some systems, wolves may cause a disproportionate number of the system's false accepts.

The authors are with Biometix Pty Ltd, Suite 145, National Innovation Centre, Australian Technology Park, Eveleigh, NSW 1430, Australia. E-mail: {neil, ted}@biometix.com. Manuscript received 4 Dec. 2007; revised 7 July 2008; accepted 13 Nov. 2008; published online 3 Dec. 2008. Recommended for acceptance by P.J. Phillips. For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMI-2007-12-0810. Digital Object Identifier no. 10.1109/TPAMI.2008.291.
For any biometric system, the distribution of match scores will naturally vary across a range of results, and it is expected that there will be some genuine matches with low scores and some impostor matches with high scores. However, a few isolated incidences of failed verification do not warrant labeling a user a goat, lamb, or wolf. Of interest to biometric system designers are users who consistently receive poor scores, outside of what would be expected from random variation. In other words, the score distributions for these problem user groups (goats, lambs, and wolves) are fundamentally different from the distributions of the general population (sheep). Doddington et al. use a variety of tests to demonstrate that the animals defined above exist to a statistically significant degree in their biometric system [2].

The study by Doddington et al. was conducted on speaker verification data. However, the same concepts are applicable to all areas of biometric identification. Several subsequent studies have discovered the existence of the animals in other biometrics. Wayman [3] demonstrates the presence of lambs and wolves in a fingerprint-based data set with a high degree of certainty. Wittman et al. [4] examine face recognition, and find evidence for the existence of goats, wolves, and lambs in their data.

Quantitative methods for dealing with the existence of user variation are an active area of research. There are three primary user-specific schemes under investigation [5]: user-specific thresholds, which assign a different decision threshold to each user or template [6]; score normalization, which transforms the score distributions for each user or template to a standard form [7], [8]; and user-specific fusion, which takes into account the score distribution for each user or template when combining scores obtained from different base systems [5], [9].
The papers listed above have established the existence of goats, lambs, and wolves in several specific biometric applications, and presented methods for dealing with their existence. A natural question regards the relationship between these user groups. For example, if a user is known to be a lamb, does that make him or her more likely to be a goat? Doddington et al. [2] report a positive correlation between lambs and wolves in their study. This relationship is not surprising as it reflects a symmetry of the matching algorithm. Wittman et al. [4] demonstrate a relationship between goats and wolves.

A recent study by Yager and Dunstone [10] has noted that the traditional biometric animals are based on only genuine match scores (low for goats) or impostor match scores (high for lambs and wolves). A new class of animals can be defined in terms of a relationship between genuine and impostor scores. The animals are called worms, chameleons, phantoms, and doves, and have combinations of low/high and genuine/impostor match scores. Formal definitions for the new animals are established in this study, as well as a test for their existence. An extensive investigation across a broad range of biometric modalities is conducted, and examples of systems containing the new animals are presented. The implications and interpretation of the existence of the animals are discussed, laying the foundations for a new user and group-centric approach to the evaluation of biometric systems.

The remainder of the paper is organized as follows: In Section 2, a precise definition of each member of the biometric menagerie is presented. This is followed by experiments in Section 3 with the aim of hunting for the proposed user types in real biometric data. Section 4 contains case studies from fingerprint verification, face recognition, and iris recognition that investigate the causes of the user groups. Finally, the paper concludes with a summary of findings, and potential directions for future research in Section 5.

2 THE BIOMETRIC MENAGERIE

This section contains definitions for each denizen of the biometric menagerie.
The notation that will be used for the remainder of the paper is established in Section 2.1. Section 2.2 contains a discussion of various ways one might characterize a user's performance within a biometric system. Goats, lambs, and wolves are formally defined in Section 2.3, as well as the statistical tests for their existence. The new additions to the menagerie are presented in Section 2.4.

2.1 Notation

Consider a user population $\mathcal{P}$ and a set of match scores $\mathcal{S}$. For each pair of users $j, k \in \mathcal{P}$ there is a set $S(j, k) \subset \mathcal{S}$ containing the verification results obtained by matching one of $j$'s samples against an enrollment template belonging to $k$. User $k$'s genuine scores are represented by the set $G_k = S(k, k)$, and $k$'s impostor scores are the set $I_k = S(j, k) \cup S(k, j)$ for all $j \neq k$. A probability density function $f_S(\cdot \mid j, k)$ is the distribution of match scores obtained by matching samples from $j$ against templates from $k$. A zoo plot is a diagram that displays each user's performance in relation to the whole population.

2.2 User Performance Statistics

In order to proceed, it is necessary to define a statistic that quantifies a user's performance within a biometric system. Each user $k$ should be assigned two values: one indicating how well they match against themselves ($g_k$) and one indicating how well they match against others ($i_k$). In general, any well-defined measure of performance may be used; the choice will depend on the type of biometric system under evaluation and the goal of the analysis. A few possibilities follow:

• Error counts: Users can be assigned a statistic based on their actual number of verification errors within a system. Their genuine performance is based on a tabulation of their false rejects, and their impostor score is based on their number of false accepts. There are two disadvantages of this approach. First, the result is dependent on a specific match score threshold.
Second, for systems with very high accuracy rates it is likely that a majority of users are not involved in any match errors. When no errors occur, this approach has no ability to distinguish between the performance of individual users.
• Ranks: For identification systems, the most relevant performance measure is how often a user is returned among the top N ranked results.
• Maximum and minimum scores: Wittman et al. [4] base a user statistic for identification systems on maximum and minimum match scores. For example, the wolf score is obtained by selecting the top impostor match score for each of a user's probes. Their wolf statistic is the average of these top probe impostor scores. The main disadvantage of this approach is that it is strongly influenced by outlier scores.
• Mean scores: If we consider a user's genuine and impostor scores to be real-valued random variables, the distributions can be characterized by the expected value, which is the first moment or mean of the distributions: $g_k = \bar{G}_k$ and $i_k = \bar{I}_k$. If a user is receiving a lot of low genuine match scores, $\bar{G}_k$ will be relatively low, indicating that the user is more likely to have trouble with the biometric system. Conversely, a lot of high impostor scores will lead to a relatively high $\bar{I}_k$. The advantages of this measure are that it is intuitive, robust to outliers, and places minimal restrictions on the system error rates and structure of the match score data. The primary disadvantage of this approach is that there is not necessarily a direct relationship between a user's mean score values and their participation in system errors. This is a link that must be established experimentally in order to justify the use of central tendencies (see Section 3.2).

2.3 Goats, Lambs, and Wolves

2.3.1 Goats

Intuitively, goats are users who are difficult to match, and are characterized by having low genuine match scores.
For a goat, their genuine match score distribution is significantly different (lower) from those of the general population. Goats, like the other animals, do not necessarily represent a distinct, mutually exclusive subgroup of users [2], [3]. In fact, it is possible that they do not even exist in a particular system. The concept of goats is better thought of as a continuum, with users showing a varying degree of goat-like behavior.

2.3.2 Lambs and Wolves

Lambs, on average, tend to produce high match scores when being matched against by another user. Similarly, wolves receive high scores when matching against others. For both of these user groups, the match score distributions are significantly different (higher) from those of the general population.

The definitions of lambs and wolves are symmetric. For lambs, the person of interest is being matched against (i.e., is enrolled in the system, or belongs to the gallery). For wolves, the person of interest is being matched against others (i.e., is being authenticated, or is a probe). For most of our test sets there is no difference between the way verification data and enrollment data are gathered, and the tests consist of cross-matches of all available data. Therefore, lambs and wolves are equivalent and will be treated as such. However, for most real-world applications, there is a significant difference between the way the samples for verification and enrollment are collected, so it is appropriate to maintain the distinction between lambs and wolves.

As with goats, there are not necessarily distinct lamb and wolf populations. Rather, users will display varying degrees of lamb-like and wolfish behavior.

2.3.3 Existence Test

The definitions presented above do not label particular users as belonging to an animal group. The definitions, in essence, simply state that match score distributions are user dependent. Once this fact is established, it follows that some users are performing better than others. In this way, the presence of the animals is established without explicitly labeling users.

Hypothesis testing is used to demonstrate user dependent match score distributions. In Doddington et al.
[2], the null hypothesis is formulated as follows: the density function $f_S(\cdot \mid k, k)$ does not depend on $k$. In other words, there are no significant differences between the distributions for individual users. The authors show that the null hypothesis was rejected at the 0.01 significance level using both the F-Test (analysis of variance) and the Kruskal-Wallis test.

In general, the F-Test is not an appropriate method of hypothesis testing for biometric data due to its implicit assumption of normality. A cursory visual examination of the score distributions for the data sets presented in Section 3.1 suggests that few of the distributions even approximate normality. Therefore, nonparametric approaches should be used where possible. One-way analysis of variance (ANOVA) is a method for testing for differences between independent distributions. Like the F-Test, ANOVA has an implicit assumption of normality. The Kruskal-Wallis test is similar to ANOVA except that scores are replaced by ranks [11], thereby relaxing the assumption of normality. Therefore, the Kruskal-Wallis test is the method employed for this study. However, one limitation of the Kruskal-Wallis test is that it requires at least five random samples from each distribution.

The existence test for lambs and wolves is the same as for goats, except that impostor score distributions are used in place of genuine score distributions. The null hypothesis is that $f_S(\cdot \mid j, k)$ does not depend on $j$ or $k$. Once again, the Kruskal-Wallis method is used to test the null hypothesis. If the null hypothesis is rejected at the 0.05 significance level, the animal groups are said to exist.

2.4 Worms, Chameleons, Phantoms, and Doves

Goats, lambs, and wolves were defined in terms of a user's genuine or impostor match scores. The new animals differ in that they are defined in terms of a relationship between genuine and impostor match scores.
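Looking back at the existence test of Section 2.3.3, the following is a minimal sketch of a Kruskal-Wallis test applied to per-user score lists. It is an illustration under stated assumptions, not the implementation used in the study: the function names `kruskal_wallis` and `chi2_sf` and the input layout (one list of scores per user) are hypothetical, and the p-value relies on the usual chi-square approximation with (number of users - 1) degrees of freedom. A library routine such as `scipy.stats.kruskal` could be used instead.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    half = x / 2
    # Closed-form starting point: df = 1 (odd case) or df = 2 (even case).
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(half))
    else:
        a, q = 1.0, math.exp(-half)
    # Recurrence Q(a + 1, x) = Q(a, x) + x^a e^(-x) / Gamma(a + 1), in log space.
    while a < df / 2:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1))
        a += 1
    return q

def kruskal_wallis(groups):
    """Return (H, p) for a list of score lists, one list per user."""
    pooled = sorted((x, gi) for gi, grp in enumerate(groups) for x in grp)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    pos = 0
    while pos < n:
        end = pos
        while end < n and pooled[end][0] == pooled[pos][0]:
            end += 1
        avg_rank = (pos + 1 + end) / 2      # mean of the ranks pos+1 .. end
        for _, gi in pooled[pos:end]:
            rank_sums[gi] += avg_rank
        t = end - pos
        tie_term += t ** 3 - t
        pos = end
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:                     # every score identical: no evidence
        return 0.0, 1.0
    h = 12 / (n * (n + 1)) * sum(
        r * r / len(g) for r, g in zip(rank_sums, groups)) - 3 * (n + 1)
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)

# Three hypothetical users with five genuine scores each.
h, p = kruskal_wallis([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])
print(h, p, "goats exist" if p < 0.05 else "no evidence of goats")
```

In this sketch a small p-value for genuine scores would indicate goats, and the same call on impostor score lists would correspond to the lamb and wolf test.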
Unlike goats, lambs, and wolves, the specific users who belong to the new animal groups will be identified. The existence test is based on whether or not there are more or fewer members of an animal group than expected.

Let $\mathcal{G}$ be the set of average genuine performance measures for all users: $\mathcal{G} = \{\bigcup_{k \in \mathcal{P}} g_k\}$. Rank all users $k \in \mathcal{P}$ by increasing genuine performance statistic values $g_k$. Let $G_H \subset \mathcal{P}$ be the users whose corresponding scores are among the top 25 percent of $\mathcal{G}$. In other words, $G_H$ is the 25 percent of users with the highest genuine statistics. Let $G_L \subset \mathcal{P}$ be the 25 percent of users with the lowest genuine statistics. Similarly, let $\mathcal{I} = \{\bigcup_{k \in \mathcal{P}} i_k\}$, and $I_H \subset \mathcal{P}$ be the 25 percent of users with the highest impostor statistics, and $I_L \subset \mathcal{P}$ be the 25 percent of users with the lowest impostor statistics.

2.4.1 Chameleons

Intuitively, chameleons always appear similar to others, receiving high match scores for all verifications. Chameleons are users in the set $G_H \cap I_H$. Chameleons rarely cause false rejects, but are likely to cause false accepts. An example of a user who may be a chameleon is someone who has very generic features that are weighted heavily by the matching algorithm. In this case, he or she would receive both high genuine and impostor match scores.

The term chameleon has been proposed by Bolle et al. [12]; however, their definition differs from the one presented. In their case, users are chameleons when match scores are symmetric, causing lambs to be wolves and vice versa. The key distinction is that the proposed definition takes both genuine match scores and impostor scores into account.

2.4.2 Phantoms

Phantoms belong to the set $G_L \cap I_L$. Phantoms lead to low match scores regardless of who they are being matched against: themselves or others.

2.4.3 Doves

Doves are the best possible users in biometric systems. They are defined by the set $G_H \cap I_L$. They are pure and recognizable, matching well against themselves and poorly against others.
2.4.4 Worms

Worms are the worst conceivable users of a biometric system, and belong to the set $G_L \cap I_H$. If present, worms are the cause of a disproportionate number of a system's errors.
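The four set definitions above can be sketched in a few lines of Python. This is a hedged illustration only: it assumes mean scores as the performance statistics, a hypothetical input layout of one score list per user in two dictionaries, and it breaks ties at the quartile boundary arbitrarily (by sort order), which the definitions do not specify. The function name `classify_menagerie` is an assumption.

```python
from statistics import mean

def classify_menagerie(genuine, impostor, fraction=0.25):
    """Assign users to chameleons, phantoms, doves, and worms.

    genuine[k] and impostor[k] are lists of match scores for user k
    (hypothetical layout; a higher score means a stronger match).
    """
    users = sorted(genuine)
    g_stat = {k: mean(genuine[k]) for k in users}    # g_k
    i_stat = {k: mean(impostor[k]) for k in users}   # i_k
    n_top = int(len(users) * fraction)               # size of each 25% group

    by_g = sorted(users, key=lambda k: g_stat[k])
    by_i = sorted(users, key=lambda k: i_stat[k])
    # Slicing from len - n_top avoids the [-0:] pitfall when n_top is 0.
    g_low, g_high = set(by_g[:n_top]), set(by_g[len(users) - n_top:])
    i_low, i_high = set(by_i[:n_top]), set(by_i[len(users) - n_top:])

    return {
        "chameleons": g_high & i_high,   # G_H ∩ I_H
        "phantoms": g_low & i_low,       # G_L ∩ I_L
        "doves": g_high & i_low,         # G_H ∩ I_L
        "worms": g_low & i_high,         # G_L ∩ I_H
    }

# Eight hypothetical users with a single genuine and impostor score each.
genuine = {"a": [0.90], "b": [0.95], "c": [0.50], "d": [0.55],
           "e": [0.60], "f": [0.65], "g": [0.20], "h": [0.25]}
impostor = {"a": [0.10], "b": [0.40], "c": [0.20], "d": [0.22],
            "e": [0.25], "f": [0.30], "g": [0.45], "h": [0.05]}
print(classify_menagerie(genuine, impostor))
```

Any other performance statistic from Section 2.2 could be substituted for `mean` without changing the set logic.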

2.4.5 Existence Test

Each user group is defined in terms of a relationship between genuine and impostor match scores. For example, chameleons are users who tend to have high genuine match scores and high impostor match scores. Since the definitions are based on ranks and quartiles, the expected number of animals for each user type is $p\,|\mathcal{P}|$, where $p = (1/4)^2$. In other words, each user group should contain approximately 1/16th of the total user population. This is under the assumption that membership in $G_H$ or $G_L$ and $I_H$ or $I_L$ are independent. However, if there is a relationship between genuine and impostor performance, this need not be the case. A chameleon population will be indicated by an unusually large number of members of the combined set of high genuine and high impostor performances (i.e., $|G_H \cap I_H| \gg \tfrac{1}{16} |\mathcal{P}|$).

The null hypothesis is that a user's genuine and impostor performance statistics are independent, and therefore approximately 1/16th of the population belongs to each user type.

Assume that we are interested in the set of chameleons $C$ (the analysis is the same for all user types). Let $c$ be the number of chameleons, $c = |C|$. The null hypothesis states that the probability of a particular person being a chameleon is $p = 1/16$. Since each user is independent, this is a binomial experiment with $n = |\mathcal{P}|$ trials. The hypothesis is two sided and nondirectional. Assume that the number of observed chameleons is greater than the expected number. In order to test the null hypothesis, we calculate the probability of there being at least $c$ chameleons. This probability can be calculated using the binomial distribution:

$$f(c; n, p) = \sum_{i=c}^{n} \binom{n}{i} p^{i} (1 - p)^{n-i}. \tag{1}$$

For large values of $n$, the binomial distribution can be approximated using a normal distribution with the expected value $np$ and variance $np(1 - p)$. Assume our desired confidence level is $\alpha$. The null hypothesis is rejected if $f(c; n, p) < \alpha$.
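As a rough illustration of this existence test, the sketch below evaluates the binomial tail of (1) exactly rather than through the normal approximation. The function names and the returned labels are hypothetical, the terms are summed in log space so that large populations do not overflow, and the lower-tail branch mirrors the upper-tail one for the case of fewer animals than expected. It assumes $0 < p < 1$.

```python
from math import exp, lgamma, log

def _log_pmf(i, n, p):
    """Log of the Binomial(n, p) probability of exactly i successes."""
    return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p))

def binomial_upper_tail(c, n, p=1 / 16):
    """P(X >= c) for X ~ Binomial(n, p): the sum in Eq. (1)."""
    return sum(exp(_log_pmf(i, n, p)) for i in range(c, n + 1))

def binomial_lower_tail(c, n, p=1 / 16):
    """P(X <= c): the symmetric case when fewer animals are observed."""
    return sum(exp(_log_pmf(i, n, p)) for i in range(0, c + 1))

def animal_test(c, n, p=1 / 16, alpha=0.05):
    """Return 'presence', 'absence', or 'none' for c observed animals."""
    expected = n * p
    if c > expected and binomial_upper_tail(c, n, p) < alpha:
        return "presence"
    if c < expected and binomial_lower_tail(c, n, p) < alpha:
        return "absence"
    return "none"

# Hypothetical population of 160 users, so 10 animals are expected per group.
for observed in (25, 11, 2):
    print(observed, animal_test(observed, 160))
```

With 160 users, 25 observed chameleons lie far above the expected 10 and 2 lie far below it, so the sketch reports a significant presence and absence respectively, while 11 is unremarkable.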
For our experiments $\alpha = 0.05$. Since the test is two tailed, a symmetric argument applies if the observed number of chameleons is less than the expected number. This allows for two possibilities: the null hypothesis will be rejected if there is a significantly low or significantly high number of chameleons. In other words, we can test for a significant absence or presence of the user groups. This method of hypothesis testing is nonparametric and has low computational overhead.

Fig. 1 contains the zoo plot, which illustrates where all of the animals reside and their relation to each other. Note the relationship between the new animals and the original animals. The existence of the original animals is not assumed, but when they do exist there is some overlap between the groups. For example, worms are both goats and lambs/wolves.

Fig. 1. The relationship between genuine and impostor performance and the biometric menagerie.

3 EXPERIMENTS

There is no a priori reason to assume a relationship between a user's genuine and impostor performance. If a user matches well against himself or herself, this does not necessarily imply any information about how well they will match against others. Therefore, a series of tests is necessary in order to investigate whether any of the proposed animals exist in real biometric data. Experiments will be performed on a variety of data sets, as outlined in Section 3.1. An investigation into the use of mean scores as user performance measures is conducted in Section 3.2. The existence results for the animals are presented in Section 3.3. All experiments are conducted using Performix, a tool for the statistical analysis and management of biometric data [13].

3.1 Experimental Data

Tests are conducted on a variety of modalities, match algorithms, and data sets. A summary of the data sets is contained in Table 1.

Fingerprints. Data are reported for two fingerprint matching algorithms: Fingerprint - Alg I and Fingerprint - Alg II.
The matching algorithms differ only in the features used to calculate the match score. The fingerprint registration is performed using a two-stage optimization algorithm [14]. The first algorithm uses minutiae features to calculate the similarity score, and the second uses the nonminutiae features (ridge frequency, orientation, and curvature). The data set is DB1 from the FVC2002 fingerprint verification competition [15].

Iris. The iris data have been obtained from the UK National Physical Laboratory. The specific details of the iris matching algorithm and data set are not included for confidentiality reasons.

2D Face and speech. The match scores for the 2D Face and Speech experiments are from the publicly available XM2VTS multimodal fusion benchmark database, Protocol LP1 [16] (scores can be downloaded from http://www.idiap.ch/norman/fusion). The database contains results for a variety of different biometric algorithms for both face and speech samples. The data are partitioned into sets for training and testing score fusion models. Since the current study does not involve a training phase, the experiments are based on all available information (i.e., the scores from the training and evaluation sets are concatenated). This database contains the results for five face matching algorithms and three speech matching algorithms. Members of the biometric menagerie were found in the two face sets (DCTs,GMM) and (DCTb,GMM), which correspond to 2D Face - Alg I and 2D Face - Alg II, respectively. Speech corresponds to the (SSC,GMM) algorithm (the other two speech sets did not contain any members of the biometric menagerie).

TABLE 1. Summary of Properties of the Experimental Data Sets. Population specifies the number of people with both genuine and impostor match scores, and Genuine Matches and Impostor Matches specify the number (or number range) of matches for each person.

3D Faces. The 3D face recognition experiment uses the Face Recognition Grand Challenge (FRGC) 2.0 face image data corpus [17]. The matching algorithm uses a two-way decomposition to break incoming faces into a number of smaller regions in both space and frequency. Each of these regions is then classified using a specialized classifier and the output of these classifiers is combined using weighted score fusion [18].

Keystroke. Keystroke dynamics uses patterns of a person's typing style as the basis for the biometric [19]. There were two different algorithms tested. The first algorithm, Keystroke - Alg I, is based on the comparison of absolute timings between keystroke pairs. For example, the average time between an "a" and "n" keystroke for user A is compared to the same measurement for user B. The overall match score is based on a comparison of all available keystroke pairs. On the other hand, Keystroke - Alg II is based on relative timings between keystroke pairs. For example, user A may consistently type the pair "an" faster than the pair "nd". The relative timings are compared between users A and B to get an overall match score.
The data set used to test Keystroke - Alg I and Keystroke - Alg II contained only three samples for each member of the population. Consequently, there are only three genuine match scores available for each person. In order to apply the Kruskal-Wallis test, there must be at least five samples for each distribution. Therefore, this data set could not be tested for goats.

Synthetic. A baseline experiment was conducted using synthetic data. The parameters of the system were chosen to be consistent with plausible parameters for a real biometric system. The population consists of 300 imaginary people with identical score distributions. For each person, a random number of genuine matches and a random number of impostor matches were generated. The genuine matches were randomly drawn from a beta distribution with parameters $\alpha = 5$ and $\beta = 1$, and the impostor matches were drawn from a beta distribution with parameters $\alpha = 2$ and $\beta = 7$. The important properties of the synthetic data are:

• The number of matches for each person varies. This was designed to ensure that the existence tests are not sensitive to varying numbers of matches per person.
• The score distributions are not normal. The existence tests should not be sensitive to nonnormality.
• The user scores are all drawn from the same underlying distributions. Therefore, there should not be any goats, lambs, or wolves in the data.
• The genuine and impostor match scores are independent, so there should not be a significant presence or absence of worms, doves, chameleons, or phantoms.

3.2 Correlation between Mean User Scores and System Errors

An experiment was conducted to investigate the relationship between a user's mean genuine and impostor scores, and their contribution to system errors. This relationship does not necessarily exist.
For example, in a system where errors are caused predominantly by outlier matches, a user's mean genuine and impostor match scores would not be a good indicator of their likelihood for false accepts or rejects.

For each of the data sets (see Section 3.1), a global match threshold was set to the equal error rate (EER) threshold of the system. For each user, the number of genuine matches below this threshold (false rejects) and the number of impostor matches above the threshold (false accepts) were tabulated. Furthermore, for each user, the mean genuine and impostor match scores were computed. The Pearson product-moment correlation coefficient was calculated to determine the strength of the linear relationship between the mean genuine score and false rejects, as well as between the mean impostor score and false accepts.

The results can be found in Table 2. As can be seen, in all cases, there is a negative correlation between mean genuine scores and false reject counts, and a positive correlation between mean impostor scores and the number of false accepts. This result is moderate to strong for all data sets and algorithms, except for 2D Face - Alg II where the correlation is rather weak. The conclusion is that, in general, a user's average genuine and impostor match scores are a good indication of their likelihood of being involved in system errors. This justifies the use of central tendencies as a metric for average performance, which will be used as the performance statistics for the tests in the next section. However, it should be kept in mind that other performance measures can be substituted where appropriate.

TABLE 2. Correlation between Mean User Scores and Verification Errors. For each system, a global threshold is used (determined by the system EER), and user false accepts and false rejects are tabulated. These error counts are compared to the user's mean genuine and impostor match scores. All results are statistically significant at the 0.001 level.

3.3 Analysis of Animal Existence

The zoo plots for each experiment are contained in Fig. 2. Table 3 summarizes the experimental results, and the corresponding probability values can be found in Table 4. The following are the main conclusions that can be drawn:

Fig. 2. The zoo plots for all data sets. Each point represents a user, and the position reflects their average score when matching against themselves (x-axis) and others (y-axis). (a) Fingerprint - Alg I. (b) Fingerprint - Alg II. (c) 2D Face - Alg I. (d) 2D Face - Alg II. (e) Speech. (f) Iris. (g) Keystroke - Alg I. (h) Keystroke - Alg II. (i) 3D Face. (j) Synthetic.

3.3.1 Goats, Lambs, and Wolves Are Everywhere

Every system tested, except the synthetic data, has a population of goats, lambs, and wolves. Recall the definition of these animals: The match score distributions are not independent of the user being matched. Therefore, based on these results, it appears to be a widespread, general property of biometric systems that users have their own match score distributions.
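Returning briefly to the correlation analysis of Section 3.2, the procedure can be sketched as follows. This is an illustration under assumptions rather than the Performix implementation: the function names, the dictionary-of-lists input layout, and the example scores are hypothetical, and the threshold is taken as given instead of being derived from the system EER. The correlation is undefined when either input has zero variance, which this sketch does not guard against.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def error_correlations(genuine, impostor, threshold):
    """Correlate mean scores with per-user error counts at a global threshold."""
    users = sorted(genuine)
    g_mean = [mean(genuine[k]) for k in users]
    i_mean = [mean(impostor[k]) for k in users]
    # Genuine scores below the threshold are false rejects;
    # impostor scores above it are false accepts.
    false_rejects = [sum(s < threshold for s in genuine[k]) for k in users]
    false_accepts = [sum(s > threshold for s in impostor[k]) for k in users]
    return pearson(g_mean, false_rejects), pearson(i_mean, false_accepts)

# Three hypothetical users and an assumed global threshold of 0.5.
genuine = {"a": [0.9, 0.8], "b": [0.6, 0.4], "c": [0.3, 0.2]}
impostor = {"a": [0.1, 0.2], "b": [0.3, 0.6], "c": [0.6, 0.7]}
r_fr, r_fa = error_correlations(genuine, impostor, 0.5)
print(r_fr, r_fa)
```

For these toy scores the first coefficient is negative and the second positive, matching the sign pattern described for Table 2.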
3.3.2 The New Additions to the Biometric Menagerie Exist

It is apparent from the results in Table 3 that the new animals may exist in, or be conspicuously absent from, real biometric data. An example of a system containing, or not containing, every animal type has been found. The interpretation of this is as follows: In some systems, a user's probability of being falsely rejected is not independent of their probability of being falsely accepted. As far as the authors are aware, this is a result that has never been explicitly stated or demonstrated in the biometrics literature.

TABLE 3. The Biometric Menagerie for the Test Sets. For goats, lambs, and wolves, a check mark indicates that the null hypothesis was rejected, and a cross indicates that it was not. For the other animals, a check mark indicates a significant presence, and a cross indicates a significant absence, of the animal. These results are representative of the specific algorithms and data sets tested and should not be extrapolated to other biometric systems of the same modality.

3.3.3 The Reasons the Animals Exist Depend on a Variety of Factors

The existence of the animals may have some relationship to system accuracy. The two top performing systems (Fingerprint - Alg I and 3D Face) are also systems that do not have (or significantly lack) any of the new animal populations. Therefore, the presence of animals may be an indication of algorithmic weaknesses (more will be said on this in Sections 4 and 5). However, this is not necessarily the case since the synthetic data set can be constructed to have an arbitrarily low or high error rate, and will never contain any animals. Therefore, factors other than system accuracy must be involved.

The presence of animals varies widely between the systems. Consider the results for the two keystroke recognition algorithms, which are applied to the same user population. Chameleons and phantoms are present in Keystroke - Alg I, but not in Keystroke - Alg II. Therefore, a chameleon in one system is not necessarily a chameleon in another (see the fingerprint case study).
Consequently, a person cannot be labeled an animal independent of a specific algorithm and set of templates. In general, a person's success within a system is due to a combination of factors. In particular, it is affected by the matching algorithm being used and by its interaction with the user's physical and behavioral characteristics, which may lead to poor-quality templates. Furthermore, data quality is heavily influenced by the sensing hardware in use and the environmental conditions at the time of capture.

TABLE 4. Probability Values for the Existence (or Absence) of the Animals
An empty field indicates a probability value > 0.05, so the null hypothesis was not rejected.

3.3.4 People are Rarely Inherently Hard to Match

There is a degree of pessimism in the biometric community about people who are unsuitable for biometric identification. On one hand, there will always be some people who are unable to use a biometric system due to missing or damaged body parts. However, evidence is mounting that there are few people who are destined to have difficulty with biometric authentication due to an inherent unmatchability. The previous conclusion supports this idea, as it emphasizes the role of the overall system in the existence of problem user groups. Two recent studies support this claim.

It is often claimed that 2 percent of the population have fingerprints that are unsuitable for fingerprint recognition. However, an investigation by Hicklin et al. [20] of a large-scale fingerprint matching system (US-VISIT) has shown that this estimate is grossly overstated. The authors conclude that there are very few, if any, users who are intrinsically hard to match (goats and phantoms). Data quality and collection issues are the dominating factors.

The conductors of a recent iris recognition algorithm evaluation have come to a similar conclusion. IRIS06 evaluated three commercial iris recognition products in an effort to evaluate the state of the art of the field [21]. The researchers found examples of wolf-like behavior, but only for specific images of a person. The authors conclude that Doddington's Zoo phenomenon may be image specific as opposed to individual specific. Once again, this points to enrollment quality issues, rather than inherent properties of an individual.

This is an encouraging result for the biometric research community, as there does not seem to be an insurmountable obstacle of people who are inherently difficult to recognize. When errors are common for a particular individual, they can usually be addressed using improved enrollment/capture processes and robust matching algorithms. This is illustrated with several specific examples in the next section.

4 CASE STUDIES IN FINGERPRINTS, IRIS, AND FACE RECOGNITION

The previous section presented results that show the presence and absence of the new members of the biometric menagerie in real biometric data sets. However, the actual user groups present vary from experiment to experiment. Results from the iris, face, and fingerprint experiments are examined.
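
Whether a user counts as one of the new animals depends on where their average genuine and average impostor scores fall relative to the rest of the population. The following Python sketch illustrates that labeling step. It assumes the animal regions are the top and bottom quartiles of the per-user averages, and that higher scores mean stronger matches. The function name `classify_users` and the dictionary inputs are hypothetical, chosen only for illustration.

```python
from statistics import mean, quantiles


def classify_users(genuine, impostor):
    """Label users as worm, dove, chameleon, phantom, or None.

    genuine, impostor: dicts mapping a user id to that user's list of
    genuine / impostor match scores (hypothetical input layout).
    Assumes higher scores mean stronger matches and that the animal
    regions are the top/bottom quartiles of the per-user averages.
    """
    g_avg = {u: mean(scores) for u, scores in genuine.items()}
    i_avg = {u: mean(impostor[u]) for u in g_avg}

    # quantiles(..., n=4) returns the three quartile cut points.
    g_lo, _, g_hi = quantiles(g_avg.values(), n=4)
    i_lo, _, i_hi = quantiles(i_avg.values(), n=4)

    labels = {}
    for u in g_avg:
        g_high, g_low = g_avg[u] >= g_hi, g_avg[u] <= g_lo
        i_high, i_low = i_avg[u] >= i_hi, i_avg[u] <= i_lo
        if g_high and i_high:
            labels[u] = "chameleon"  # high scores against everyone
        elif g_low and i_low:
            labels[u] = "phantom"    # low scores against everyone
        elif g_high and i_low:
            labels[u] = "dove"       # high genuine, low impostor
        elif g_low and i_high:
            labels[u] = "worm"       # low genuine, high impostor
        else:
            labels[u] = None         # not in any of the four regions
    return labels
```

Plotting each user's pair of averages and shading the four corner regions gives a zoo plot of the kind discussed in the case studies.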
The case studies demonstrate the types of inferences that can be drawn using a biometric menagerie-based analysis.

4.1 Iris Recognition

The zoo plot (using average user match scores) for the iris recognition system can be found in Fig. 3. In this case, there is a very noticeable absence of worms and doves, with only 3 or 4 of the total population of 208 people falling in these regions. The plot illustrates a fairly strong positive correlation between the average genuine and average impostor match scores. The result of this is a significant phantom population in the lower left corner. Further analysis shows that many of the phantoms are actually people who were wearing glasses when they enrolled in the system. Wearing glasses would increase the difficulty of the feature extraction task, leading to unreliable biometric templates that receive low match scores in many transactions. In this case, the underlying cause for the phantom population is behavioral, and the problem can be addressed through a system policy that requires people to remove their glasses during enrollment.

Fig. 3. The plot of average user genuine and impostor match scores for the iris data set. The regions corresponding to worms (W), chameleons (C), phantoms (P), and doves (D) are labeled accordingly.

4.2 Face Recognition

In the test of a face recognition system (the details are not included for confidentiality reasons), there was a strong ethnic division within the target population. The recognition performance for one of the groups was considerably worse than for the main demographic, leading to strong chameleon and phantom populations. The likely cause of this behavior was that the algorithm had been tuned too heavily toward the main population demographic. This illustrates the importance of keeping target population demographics in mind when tuning recognition algorithms.

4.3 Fingerprint Recognition

The zoo plots for the Fingerprint - Alg I and Fingerprint - Alg II algorithms can be found in Figs. 4a and 4b. Neither algorithm contained, in a statistically significant sense, any of the new members of the biometric menagerie. However, the results are still interesting.

User A has been circled in both zoo plots and has strong phantom properties for both algorithms. An example enrollment image (from FVC2002 DB1 [15]) for this person can be found in Fig. 4c. In this case, the enrollment quality of the finger is very good. There is no noticeable noise or dirt, and the ridges are clearly defined. However, compared to other fingerprints in this data set, the ridge frequency is unusually low. This low ridge frequency leads to many spurious ridges during Gabor filtering (a common preprocessing step for fingerprint images [22]). Consequently, the minutiae data for this fingerprint is unreliable. The result of this is a low match score when matched against itself and others. This is a clear example of the types of situations that can lead to phantom-like behavior. In this case, it is a weakness of the feature extraction algorithm, and the fingerprint itself is not inherently hard to match. The problem can be addressed by improved filtering techniques.

User B has a high average genuine match score for both algorithms tested. An examination of the user's images reveals that all enrollments for this finger have large capture areas, clearly defined ridge structures, and consistent regions of capture. These properties make this individual easy to authenticate for both algorithms. The impostor results for this person are different between the algorithms. For the minutiae-based algorithm, the average
impostor scores are low, making this user a dove. The reasons for the low impostor scores are the same as the reasons for the high genuine scores: high-quality enrollments and large surface areas make it difficult for other fingers to obtain high match scores when matched against B. On the other hand, B has relatively high impostor match scores for the nonminutiae algorithm. The reason for this is that the nonminutiae algorithm is based heavily on overall ridge patterns. This fingerprint is a right loop, which is a common ridge pattern for fingerprints [22], leading to high match scores against many other fingers. Once again, this is an illustration of how a menagerie-based analysis can help expose algorithmic weaknesses and lead to the development of improved matching techniques.

Fig. 4. The score plots for the fingerprint matching algorithms. Both algorithms are tested on the same data set, but use different fingerprint features to calculate the match scores. The positions of two users are marked in the plots (a) and (b), and sample images (from FVC2002 DB1 [15]) are presented. (a) The zoo plot for Fingerprint - Alg I. (b) The zoo plot for Fingerprint - Alg II. (c) An enrollment sample for A. (d) An enrollment sample for B.

5 CONCLUSIONS AND FUTURE WORK

5.1 Conclusions

The following main conclusions have been drawn from this study:

- Goats, lambs, and wolves are very common in biometric systems. This is due to the fact that users tend to have individual genuine and impostor match score distributions. An important consequence of this is that some users will perform better than others within a system.
- Four new additions to the biometric menagerie have been identified: worms, doves, chameleons, and phantoms. These animals are defined in terms of a relationship between genuine and impostor match scores, and have been demonstrated to exist in, or be significantly absent from, a wide variety of real biometric data.
- The reasons that a particular animal group exists are complex and varied. They depend on a number of factors, including enrollment procedures, feature extraction and matching algorithms, data quality, and intrinsic properties of the user population.
- The notion that there are people who are inherently unsuitable for biometric identification has been exaggerated. Matching errors are more likely due to enrollment issues and algorithmic weaknesses than inherent properties of the people involved. This should be viewed as a positive result by those involved in biometric research.
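
The claim that an animal group is significantly present or absent can be illustrated with a simple counting argument. The Python sketch below is one possible check, not necessarily the exact test behind Tables 3 and 4. It assumes quartile-based animal regions, so that if genuine and impostor averages were independent, a user would land in any one region with probability 1/4 × 1/4 = 1/16. It then computes exact binomial tail probabilities for an observed count. The function name and the example counts are hypothetical.

```python
from math import exp, lgamma, log


def animal_tail_probabilities(n_users, n_in_group, p=1 / 16):
    """Return (p_presence, p_absence) for an observed animal count.

    p_presence = P[X >= n_in_group] and p_absence = P[X <= n_in_group]
    for X ~ Binomial(n_users, p). The default p = 1/16 is the assumed
    chance of landing in one quartile-by-quartile region when genuine
    and impostor averages are independent.
    """
    def pmf(k):
        # Work in log space so large populations do not overflow.
        log_choose = (lgamma(n_users + 1) - lgamma(k + 1)
                      - lgamma(n_users - k + 1))
        return exp(log_choose + k * log(p) + (n_users - k) * log(1 - p))

    p_presence = sum(pmf(k) for k in range(n_in_group, n_users + 1))
    p_absence = sum(pmf(k) for k in range(0, n_in_group + 1))
    return p_presence, p_absence


# Illustrative numbers only: 208 users with 2 of them in one region
# (about 13 would be expected under independence).
p_presence, p_absence = animal_tail_probabilities(208, 2)
significantly_absent = p_absence < 0.05
```

A small `p_absence` suggests fewer users in the region than independence would predict, while a small `p_presence` suggests more.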

The presence or absence of the new animals reflects properties of the matching system, the user population, or an interaction between the two. This suggests a new method for investigating and evaluating biometric systems. By comparing and contrasting the properties of the various animal groups, algorithmic strengths and weaknesses emerge, as illustrated in the case studies of Section 4. Traditional methods of evaluation focus on collective error statistics such as EERs and ROC curves. These statistics are useful for evaluating systems as a whole, but ignore problems associated with individuals and subgroups of the population. The biometric menagerie is a formal approach to user-centric analysis.

5.2 Future Work

There are several directions in which this work can be extended. First of all, further testing may reveal general patterns that are not apparent from the tests conducted so far. For example, it may turn out that certain animals are more common in some biometric modalities than others.

Another avenue for research would be to compare the results from two or more different algorithms on the same data set. If the animal populations are consistent between the algorithms, it would indicate substantial similarities between the algorithms. This could lead to a stronger definition of the animals based on membership in a group for several different algorithms. On the other hand, if the animal populations are largely different, this would indicate independence between the algorithms. Algorithms with a high degree of independence are strong candidates for multimodal combination.

It has been demonstrated that intraindividual correlation should be considered when computing confidence intervals for false accept or false reject rates at a given operating threshold [23]. The present research has demonstrated a potential correlation between genuine and impostor match scores, which have traditionally been treated as independent.
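
One simple way to look for such a relationship is to correlate each user's average genuine score with their average impostor score. The Python sketch below computes a plain Pearson coefficient. It is an illustrative diagnostic that assumes per-user averages are already available, not a procedure taken from this study, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean


def genuine_impostor_correlation(genuine_avgs, impostor_avgs):
    """Pearson correlation between per-user average scores.

    Both arguments are equal-length sequences with one entry per user.
    """
    if len(genuine_avgs) != len(impostor_avgs):
        raise ValueError("need one genuine and one impostor average per user")
    mg, mi = mean(genuine_avgs), mean(impostor_avgs)
    cov = sum((g - mg) * (i - mi) for g, i in zip(genuine_avgs, impostor_avgs))
    var_g = sum((g - mg) ** 2 for g in genuine_avgs)
    var_i = sum((i - mi) ** 2 for i in impostor_avgs)
    if var_g == 0 or var_i == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_g * var_i)
```

A clearly positive value is consistent with users spreading along the phantom-to-chameleon diagonal of a zoo plot, as in the iris case study. A clearly negative value would instead point toward worms and doves.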
This correlation may have implications for computing confidence intervals for performance measures that involve both false accept and false reject rates (such as ROC and DET curves).

It has been suggested that true animal-like behavior is more likely for biometrics that have a significant behavioral component. For example, consider a covert surveillance system that uses face recognition to identify people on a watch list. People who walk quickly with their head down will likely be phantoms, while people who walk slowly with their head held high are good candidates for doves. It seems likely that behavioral factors will play a significant role in many biometric systems. Validating and quantifying this role, and its contribution to system error rates, requires further investigation.

The traditional animals of the biometric menagerie have been used as a motivation for setting user-specific match thresholds [7]. The new animals proposed in this study give further insight into individual user performance, and may directly lead to more sophisticated schemes for setting user-specific thresholds or score normalization techniques.

ACKNOWLEDGMENTS

First, the authors would like to thank Jim Wayman for providing the initial inspiration for this paper by posing the simple, but insightful, question "Are lambs goats?" Second, Aidan Roy of the Institute for Quantum Information Science at the University of Calgary has provided valuable insight into formulating the existence test for the new animals. Finally, numerous people have generously agreed to share data or provide feedback for this study: Jamie Cook of the Queensland University of Technology, Tony Mansfield and Alex Bazin of the National Physical Laboratory, and Ross Summerfield of Centrelink.

REFERENCES

[1] T. Dunstone and N. Yager, Biometric System and Data Analysis: Design, Evaluation, and Data Mining. Springer, 2008.
[2] G. Doddington, W. Liggett, A. Martin, M. Przybocki, and D. Reynolds, "Sheep, Goats, Lambs and Wolves: A Statistical Analysis of Speaker Performance in the NIST 1998 Speaker Recognition Evaluation," Proc. Int'l Conf. Spoken Language Processing, 1998.
[3] J.L. Wayman, "Multi-Finger Penetration Rate and ROC Variability for Automatic Fingerprint Identification Systems," technical report, Nat'l Biometric Test Center, 1999.
[4] M. Wittman, P. Davis, and P. Flynn, "Empirical Studies of the Existence of the Biometric Menagerie in the FRGC 2.0 Color Image Corpus," Proc. Computer Vision and Pattern Recognition Workshop, 2006.
[5] N. Poh and J. Kittler, "Incorporating Model-Specific Score Distribution in Speaker Verification Systems," IEEE Trans. Audio, Speech, and Language Processing, vol. 16, no. 3, pp. 594-606, Mar. 2008.
[6] K. Chen, "Towards Better Making a Decision in Speaker Verification," Pattern Recognition, vol. 36, no. 2, pp. 329-346, 2003.
[7] N. Poh, A. Ross, and S. Bengio, "Revisiting Doddington's Zoo: Employing User-Dependent Performance Criterion for Multibiometric Fusion," Proc. Multimodal User Authentication Workshop, 2006.
[8] D. Ramos-Castro, J. Fierrez-Aguilar, J. Gonzalez-Rodriguez, and J. Ortega-Garcia, "Speaker Verification Using Speaker- and Test-Dependent Fast Score Normalization," Pattern Recognition Letters, vol. 28, no. 1, pp. 90-98, 2007.
[9] R. Snelick, U. Uludag, A. Mink, M. Indovina, and A. Jain, "Large Scale Evaluation of Multimodal Biometric Authentication Using State-of-the-Art Systems," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 450-455, Mar. 2005.
[10] N. Yager and T. Dunstone, "Worms, Chameleons, Phantoms and Doves: New Additions to the Biometric Menagerie," Proc. AutoID, 2007.
[11] W. Daniel, Applied Nonparametric Statistics. Wadsworth Publishing Company, 1989.
[12] R. Bolle, J. Connell, S. Pankanti, N. Ratha, and A. Senior, Guide to Biometrics. Springer-Verlag, 2003.
[13] Performix Biometric Research and Analysis Software, www.biometix.com/performix.htm, 2007.
[14] N. Yager and A. Amin, "Fingerprint Verification Using Two Stage Optimization," Pattern Recognition Letters, vol. 27, pp. 317-324, 2006.
[15] D. Maio, D. Maltoni, R. Cappelli, J.L. Wayman, and A.K. Jain, "FVC2002: Second Fingerprint Verification Competition," Proc. Int'l Conf. Pattern Recognition, vol. 3, pp. 811-814, 2002.
[16] N. Poh and S. Bengio, "Database, Protocol and Tools for Evaluating Score-Level Fusion Algorithms in Biometric Authentications," Pattern Recognition, vol. 39, no. 2, pp. 223-233, 2005.
[17] P.J. Phillips, P.J. Flynn, T. Scruggs, K.W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek, "Overview of the Face Recognition Grand Challenge," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[18] J. Cook, V. Chandran, and C. Fookes, "3D Face Recognition Using Log-Gabor Templates," Proc. British Machine Vision Conf., 2006.
[19] D. Gunetti and C. Picardi, "Keystroke Analysis of Free Text," ACM Trans. Information and System Security, vol. 8, no. 3, pp. 312-347, 2005.
[20] A. Hicklin, C. Watson, and B. Ulery, "The Myth of Goats: How Many People Have Fingerprints That Are Hard to Match?" Technical Report NIST IR 7271, Nat'l Inst. of Standards and Technology, 2005.
[21] Authenti-Corp, "IRIS06 Draft Final Report," http://www.authenti-corp.com/iris06/report/, 2007.
[22] D. Maltoni, D. Maio, A. Jain, and S. Prabhakar, Handbook of Fingerprint Recognition. Springer, 2003.
[23] T.J. Atkinson and M.E. Schuckers, "Approximate Confidence Intervals for Estimation of Matching Error Rates of Biometric Identification Devices," Proc. Biometric Authentication: European Conf. Computer Vision Int'l Workshop, 2004.

Neil Yager received the PhD degree from the University of New South Wales in 2007 in automated fingerprint recognition. His PhD thesis was awarded the 2007 Malcolm Chaikin Prize for Research Excellence in Engineering. Following his studies, he became employed as the principal research scientist at Biometix, working on the development, implementation, and analysis of biometric authentication systems. He has authored a dozen papers on biometrics and has cowritten a book on biometric data analysis with Ted Dunstone, which was published in December 2008.

Ted Dunstone received the PhD degree from Wollongong University, Australia, in 1996. He is currently the managing director of a biometric data analysis company, Biometix. His research interests include misclassification analysis, biometrics system performance, vulnerability detection, and machine learning. He was awarded the 2005 NSW State Pearcey Award for innovative and pioneering achievement and contribution to research and development within the IT&T industry. He is the founder of the Australasian Biometrics Institute and a member of the IEEE. He is also coauthor of a book with Neil Yager on biometric data analysis, which was published in December 2008.
