Determination of A Priori Decision Thresholds for Phrase-Prompted Speaker Verification

M. W. Mak, W. D. Zhang, and M. X. He
Centre for Multimedia Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong

Abstract
Speaker verification systems are often compared on the basis of an equal error rate (an equal chance of false acceptance and false rejection) obtained by adjusting a decision threshold during verification. However, the threshold should be found before verification, because the identity of a claimant is actually unknown in real-world situations. This paper presents a novel method for determining the decision thresholds of speaker verification systems using enrollment data only. In the method, a speaker model is trained to differentiate the voice of the corresponding speaker from that of a general population. This is accomplished by using the speaker's utterances and those of some other speakers (denoted anti-speakers) as the training set. Then, an operating environment is simulated by presenting the utterances of some pseudo-impostors (none of whom is an anti-speaker) to the speaker model. The threshold is adjusted until the chance of falsely accepting a pseudo-impostor falls below an application-dependent level. Experimental evaluations based on 138 speakers of the YOHO corpus suggest that, with a simulated operating environment, the method is able to determine the best compromise between false acceptance and false rejection.

Keywords: Speaker verification; threshold determination; elliptical basis function networks.

(This project was supported by the H.K. Polytechnic University Grant No. 1.42.37.A42. M. X. He is with the Ocean Remote Sensing Institute, Ocean University of Qingdao, China.)

I. Introduction

The determination of decision thresholds is a very important problem in speaker verification. A large threshold can make the system annoying to users, while a small one can leave the system vulnerable. Conventional threshold determination methods [1], [2] typically compute the distributions of inter- and intra-speaker distortions and then choose a threshold that equalizes the overlapping areas of the distributions, i.e. that equalizes the false acceptance rate (FAR) and the false rejection rate (FRR). The success of this approach, however, relies on whether the estimated distributions match the speaker- and impostor-class distributions. Another approach derives the threshold of a speaker solely from his/her own voice and speaker model [3]. Session-to-session speaker variability, however, introduces a large bias into the threshold, rendering the verification system unusable.

Due to the difficulty of determining a reliable threshold, researchers often report the equal error rate (EER) of verification systems, based on the assumption that an a posteriori threshold can be optimally adjusted during verification. A real-world application, however, is only realistic with a priori thresholds, which must be determined before verification.

In recent years, research effort has focused on the normalization of speaker scores to minimize error rates. This includes the likelihood ratio scoring proposed by Higgins et al. [4], where verification decisions are based on the ratio of the likelihood that the observed speech is uttered by the true speaker to the likelihood that it is spoken by an impostor. The a priori threshold is then set to 1.0, with the claimant being accepted (rejected) if the ratio is greater (less) than 1.0.
Subsequent work based on likelihood normalization [5], [6], cohort normalized scoring [7], and minimum verification error training [8] also shows that including an impostor model during verification not only improves speaker separability but also allows decision thresholds to be set easily. Although these approaches help to select an appropriate threshold, they may cause the system to favor rejecting true speakers, resulting in a high FRR. For example, Higgins et al. [4] reported an FRR more than 10 times larger than the FAR. A recent report [9] based on a similar normalization technique but a different threshold-setting procedure also found that the average of the FAR and FRR is about 3 to 5 times larger than the EER, suggesting that the EER can be an over-optimistic estimate of the true system performance.

This paper proposes an a priori threshold determination method to address this problem. The method differs from that of Higgins et al. in that, rather than using a ratio speaker set formed by pooling the nearest reference speakers, we use two speaker sets, namely an anti-speaker set and a pseudo-impostor set, to determine the threshold. For each speaker, a speaker model is trained to differentiate the voices of the speaker and the anti-speakers. Then, the pseudo-impostor set is used to determine the threshold. To enhance the capability of the speaker models without increasing the enrollment time, we sample the utterances of 45 anti-speakers and 45 pseudo-impostors to form a training set for building the speaker models and for determining the thresholds. In this way, an operating environment for the speaker model is effectively simulated. Experimental results show that the simulated operating environment enables the verification performance to be predicted accurately at enrollment time, thereby providing a reliable means of determining the decision thresholds.

This paper is organized as follows. Section II outlines the speaker models and the verification procedure. The a priori threshold determination methods are explained in Section III, and their performance is compared in Section IV. The proposed speaker models and threshold determination methods are compared with those of Higgins et al. [4] in Section V. Finally, we conclude in Section VI.

II. Speaker Verification

A. Speaker Models: EBF Networks

Elliptical basis function (EBF) networks are used as the speaker models in this work [10]. EBF networks can be applied to speaker verification as follows. Each registered speaker is assigned an EBF network with two outputs. The first output is trained to produce a '1' for the speaker's speech and a '0' for other speakers' utterances, and vice versa for the second output. Therefore, two sets of data are required for constructing a speaker model: one derived from the speaker and another from other speakers. We denote the second set of data as the anti-speaker set hereafter. Of particular interest is that EBF networks incorporate the idea of likelihood ratio scoring in their discriminative training procedure. An EBF network does not require a set of cohort or background speakers during verification; rather, it embeds the characteristics of the background speakers in its parameter estimation procedure during enrollment.

B. Verification Procedure

For each verification session, the feature vectors derived from the utterances of a claimant are concatenated to form a vector sequence X = [x_1, x_2, ..., x_T]. The sequence is then divided into a number of overlapping segments containing T_s consecutive vectors. Note that this approach is similar to that of [11], where each segment is considered to be independent. For a segment X_s of T_s consecutive vectors, the normalized average outputs

    z_k = (1 / T_s) \sum_{x \in X_s} e^{\tilde{y}_k(x)} / \sum_{r=1}^{2} e^{\tilde{y}_r(x)},   k = 1, 2,    (1)

corresponding to the speaker and anti-speaker classes are computed, where \tilde{y}_k(x) = y_k(x) / P(C_k) represents the scaled network output and P(C_k) the prior probability of class C_k. Verification decisions are based on the criterion

    if z_1 - z_2 > \theta, accept the claimant; otherwise, reject the claimant,    (2)

where \theta \in [-1, 1] is an a priori threshold that has been determined during enrollment (see Section III below). A verification decision is made for each segment, and the error rate (either the FAR or the FRR) is the proportion of incorrect verification decisions to the total number of decisions. Details of the verification procedure can be found in [10].

III. Threshold Determination

A. FARs and FRRs versus Thresholds

To determine the a priori thresholds, we need to obtain the FAR and FRR as functions of the threshold using enrollment data only. We propose three methods to achieve this goal. They are denoted Baseline, Pseudo-Impostor Based Threshold Determination (PIBTD), and Sampling Pseudo-Impostor Based Threshold Determination (SPIBTD) in this paper.

A.1 Baseline

This method, which is very similar to that of Higgins et al. [4], serves as a baseline for comparison. Specifically, for each registered speaker in the system, five other speakers whose speech is closest to that of the speaker are selected from the population to form an anti-speaker set. Note that this is analogous to the ratio speakers of Higgins et al. The speech of the speaker and the anti-speakers is used to train a speaker model. Then, the same speech data from the anti-speakers are applied to the speaker model according to the verification procedure above. The FAR as a function of the threshold is obtained by adjusting the threshold, resulting in an FAR curve. Similarly, the speaker's utterances that were used to train the speaker model are presented to the model to obtain an FRR curve. A typical example of these curves is shown in Fig. 1.
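To make the segment scoring of (1), the decision rule of (2), and the construction of the FAR and FRR curves of Section III-A concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the array layout, the helper names, and the threshold grid are illustrative assumptions.

    import numpy as np

    def segment_scores(outputs, priors, seg_len):
        """Per-segment scores z1 - z2 used in Eqs. (1) and (2).

        outputs : (T, 2) array of raw EBF network outputs for one claimant
                  (column 0: speaker class, column 1: anti-speaker class).
        priors  : (2,) class priors P(C_k) used to scale the outputs.
        seg_len : segment length T_s in frames; the window shifts by one frame.
        """
        scaled = outputs / priors                        # y~_k(x) = y_k(x) / P(C_k)
        e = np.exp(scaled)
        norm = e / e.sum(axis=1, keepdims=True)          # normalized outputs per frame
        scores = []
        for start in range(len(norm) - seg_len + 1):
            z = norm[start:start + seg_len].mean(axis=0)  # Eq. (1)
            scores.append(z[0] - z[1])                    # compared with theta in Eq. (2)
        return np.array(scores)

    def decide(score, theta):
        """Eq. (2): accept the claimant if z1 - z2 exceeds the a priori threshold."""
        return score > theta

    def far_frr_curves(genuine_scores, impostor_scores, thresholds):
        """FAR and FRR as functions of the decision threshold (Section III-A)."""
        far = np.array([(impostor_scores > t).mean() for t in thresholds])
        frr = np.array([(genuine_scores <= t).mean() for t in thresholds])
        return far, frr

During enrollment, the impostor scores would come from the anti-speakers (Baseline) or the pseudo-impostors (PIBTD and SPIBTD), and the genuine scores from the registered speaker's utterances; sweeping the threshold then yields enrollment-time curves of the kind shown in Fig. 1.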
Fig. 1 shows that the FAR during verification is considerably higher than that during enrollment, suggesting that the system is vulnerable to impostors' attacks.

A.2 Pseudo-Impostor Based Threshold Determination (PIBTD)

Using the same set of utterances both for training the speaker model and for producing the FAR and FRR curves has a serious drawback. After training, the speaker model is likely to be biased towards representing the training utterances of the speaker and the anti-speakers. If the same set of utterances is applied to the speaker model to produce the FAR and FRR curves, the resulting curves are likely to be biased as well. To resolve this problem, PIBTD uses an alternative set of speakers, called the pseudo-impostor set, together with another set of utterances produced by the registered speaker (which the speaker model has never seen before) to obtain the FAR and FRR curves. More specifically, after training the speaker model, five pseudo-impostors are randomly selected from the population and applied to the speaker model. These pseudo-impostors, being different from the anti-speakers and never seen by the speaker model, are more likely to form a good representation of the impostor population. This prevents the FAR curve from shifting drastically along the threshold axis during verification, as shown in Fig. 1. Therefore, the verification error rate becomes more predictable.

A.3 Sampling Pseudo-Impostor Based Threshold Determination (SPIBTD)

Obviously, the representation of the impostor population can be improved by increasing the number of pseudo-impostors and anti-speakers. However, increasing the size of these sets would also lead to an unrealistic enrollment time. SPIBTD aims at reducing the error rate and improving the robustness of the thresholds without increasing the enrollment time. The basic idea is to randomly select feature vectors from a large number of pseudo-impostors and anti-speakers for training a speaker model as well as for determining a threshold.

In this way, the number of training vectors and the enrollment time remain the same as for PIBTD. Another advantage of this sampling strategy is that the resulting training vectors are more representative of the impostor population, because they are derived from more pseudo-impostors than in PIBTD. As shown in Fig. 1, this makes the position of the FAR curves more predictable than in PIBTD. The reduction in the displacement between the FAR curves obtained during enrollment and during verification means that the enrollment FAR curve can be used to determine the threshold. More specifically, the threshold is adjusted until the FAR obtained during enrollment falls below an application-dependent level.

B. Threshold Selection Schemes

Once the FAR and FRR curves are obtained, a threshold can be determined as follows. If the FAR and FRR curves cross each other, the crossing point is chosen as the a priori threshold; the a priori threshold then equalizes the chances of false acceptance and false rejection during enrollment. However, when the FAR and FRR curves do not cross each other, there exists a range of thresholds for which both the FAR and the FRR are zero. Four threshold selection schemes are proposed to handle this situation; they are summarized in Table I.

IV. Experimental Evaluations

In this work, all 138 speakers (108 male, 30 female) of the YOHO corpus [12] were used for the experimental evaluations. For each speaker in the corpus, there are 4 enrollment sessions with 24 utterances per session and 10 verification sessions of 4 utterances each. Each utterance is composed of three 2-digit numbers (e.g. 34-52-67). All sessions were recorded in an office environment using a high-quality telephone handset and sampled at 8 kHz.

The enrollment process involves two steps. First, for each speaker in the corpus, 72 utterances from the speaker's first three enrollment sessions and 48 utterances from the 4 enrollment sessions of 5 anti-speakers (Baseline and PIBTD) were used to train a speaker model. For SPIBTD, the 48 utterances were randomly selected from 45 anti-speakers. Second, the a priori threshold was determined by using either the anti-speaker set (Baseline) or the pseudo-impostor set (PIBTD and SPIBTD), together with the speaker's speech. Five pseudo-impostors were used in PIBTD, whereas in SPIBTD the pseudo-impostor set was constructed by randomly selecting the feature vectors of 45 pseudo-impostors. The speaker's speech was derived from the training utterances (Baseline) or from other utterances that the model had never seen before (PIBTD and SPIBTD).

Verification was performed using each speaker in the corpus as a claimant, with 45 impostors being randomly selected from the remaining speakers (excluding the anti-speakers and pseudo-impostors) and rotating through all speakers. The speaker's utterances, derived from his/her 10 verification sessions, were concatenated to form a sequence of feature vectors. Similarly, impostors' feature vectors were randomly selected from the utterances of the 45 impostors and then concatenated to form a vector sequence whose length is the same as that formed by the speaker's utterances.
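The random sampling that distinguishes SPIBTD from PIBTD and from Baseline can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the per-speaker feature arrays and the vector budget are assumed inputs.

    import numpy as np

    def sample_fixed_budget(speaker_feats, n_vectors, rng=None):
        """Randomly draw a fixed number of feature vectors from a pool of speakers.

        speaker_feats : list of (N_i, D) arrays, one per anti-speaker or
                        pseudo-impostor (e.g. 45 speakers for SPIBTD instead of 5).
        n_vectors     : total number of vectors to keep, so the enrollment time
                        stays the same as when only 5 speakers are used.
        """
        rng = np.random.default_rng() if rng is None else rng
        pool = np.vstack(speaker_feats)                   # all candidate vectors
        idx = rng.choice(len(pool), size=n_vectors, replace=False)
        return pool[idx]

    # anti_train  = sample_fixed_budget(anti_speaker_feats, n_vectors=budget)   # trains the EBF network
    # pseudo_eval = sample_fixed_budget(pseudo_impostor_feats, n_vectors=budget)  # drives the enrollment FAR curve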
Verification decisions were made according to (2), with the segment length T_s in (1) set to 300 vectors, which is roughly equivalent to the length of three utterances so that the results can be compared with those of [4]. For each genuine trial, a window covering 300 vectors was advanced along the vector sequence by one vector position. This arrangement produces a large number of genuine trials and impostor attempts for each speaker.

LP-derived cepstral coefficients were used as the acoustic features. For each utterance, the silent regions were removed by a silence detection algorithm based on the energy and zero-crossing rate of the signal. The remaining signal was pre-emphasized by a filter with transfer function 1 - 0.95 z^{-1}. Twelfth-order LP-derived cepstral coefficients were then computed using a 28 ms Hamming window at a frame rate of 14 ms. These feature vectors were used to train a set of speaker models (EBF networks) with 12 inputs, two outputs, and 32 centers, where 8 centers were contributed by the corresponding speaker and the remaining 24 by the anti-speakers.

A. FARs and FRRs versus Thresholds

Fig. 1 depicts the FARs and FRRs as functions of the threshold for one of the 138 speakers. Several interesting results can be observed. First, there is a large displacement between the FAR curve corresponding to enrollment and that corresponding to verification when the anti-speakers' utterances are used to determine the FAR curve during enrollment (Baseline). Second, when pseudo-impostors are used to obtain the FAR curve during enrollment (PIBTD), the displacement is considerably reduced. Third, the displacement is further reduced for SPIBTD, where feature vectors are randomly sampled from a large number of pseudo-impostors, suggesting that the verification performance of the system can be reliably predicted. Fig. 1 also suggests that the FAR curve provides a reliable means of determining the threshold.

B. Comparing Different Threshold Selection Schemes

B.1 Using Baseline

Fig. 2 compares the four selection schemes for the baseline method by plotting the a posteriori thresholds against the a priori thresholds for the 138 speakers of the YOHO corpus. The a posteriori thresholds were chosen to equalize the FAR and FRR during verification. Fig. 3 plots the FAR versus the FRR of the 138 speakers for the different threshold selection schemes. Fig. 2 shows that most of the a priori thresholds are greater than the a posteriori ones. This suggests that choosing the zero crossings of the FRR curves as thresholds (Scheme I) is likely to overestimate the thresholds, resulting in a high FRR during verification, as shown in Fig. 3.
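For reference, the a posteriori (equal error) thresholds used as the benchmark in Figs. 2, 4, and 8 can be obtained from verification-time scores as in the following small sketch; the function name and the threshold grid are illustrative rather than taken from the paper.

    import numpy as np

    def eer_threshold(genuine_scores, impostor_scores, thresholds):
        """A posteriori threshold that (approximately) equalizes FAR and FRR
        on verification data, together with the corresponding EER."""
        far = np.array([(impostor_scores > t).mean() for t in thresholds])
        frr = np.array([(genuine_scores <= t).mean() for t in thresholds])
        i = int(np.argmin(np.abs(far - frr)))   # point where the two curves meet
        return thresholds[i], 0.5 * (far[i] + frr[i])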

Fig. 1. FARs and FRRs as functions of the decision threshold for one speaker during enrollment (solid) and verification (dotted), using Baseline, PIBTD, and SPIBTD. The label "Th" denotes the a priori threshold found by the corresponding best threshold selection scheme (Baseline: Scheme II; PIBTD: Scheme IV; SPIBTD: Scheme IV).

TABLE I. Threshold selection schemes. All FAR and FRR curves are obtained at enrollment time. The last column lists the best threshold determination method(s) for each threshold selection scheme.

Scheme   Description                                                                       Best for
I        Select the zero crossings of the FRR curves as thresholds                         None
II       Select the middle of the zero crossings of the FAR and FRR curves as thresholds   Baseline
III      Select the zero crossings of the FAR curves as thresholds                         PIBTD
IV       Select the point on the FAR curve at which the FAR attains a pre-defined value    PIBTD & SPIBTD

Fig. 2. A posteriori equal error thresholds versus a priori thresholds found by choosing the zero crossings of the FRR curves (Scheme I), the middle of the zero crossings of the FAR and FRR curves (Scheme II), the zero crossings of the FAR curves (Scheme III), and the point at which the FAR attains 0.5% (Scheme IV) as thresholds. The baseline method was used to obtain the FAR and FRR curves in all cases.

On the other hand, Fig. 2 suggests that Schemes III and IV are likely to yield underestimated thresholds, resulting in a high FAR during verification (see Fig. 3). Therefore, a reasonable choice is to select the middle of the zero crossings of the FAR and FRR curves as the threshold, i.e. Scheme II. This scheme not only results in comparable a priori and a posteriori thresholds, as shown in Fig. 2, but also minimizes the FAR and FRR simultaneously for some of the speakers during verification, as shown in Fig. 3. However, many speakers still have a low FAR but a very high FRR, or vice versa, suggesting that the baseline method is not very robust.

B.2 Using PIBTD

Fig. 4 plots the a posteriori thresholds versus the a priori thresholds for PIBTD using the four threshold selection schemes mentioned above. As with Baseline, choosing the zero crossings of the FRR curves is likely to overestimate the thresholds, resulting in a high FRR, as shown in Fig. 5. The overestimation is reduced by choosing the middle of the zero crossings of the FAR and FRR curves, as shown in Fig. 4. However, Fig. 5 shows that this still leads to a high FRR for many speakers. For PIBTD, both Scheme III and Scheme IV give a very good match between the a priori and a posteriori thresholds, as shown in Fig. 4. They also give a reasonable trade-off between FARs and FRRs, as illustrated in Fig. 5.
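For illustration, the four selection schemes of Table I can be read as simple operations on the enrollment-time FAR and FRR curves. The sketch below is an interpretation of Table I rather than the authors' code; it assumes the curves are sampled on a common, increasing threshold grid, so that the FAR is non-increasing and the FRR non-decreasing.

    import numpy as np

    def a_priori_threshold(thresholds, far, frr, scheme, target_far=0.005):
        """A priori threshold from enrollment FAR/FRR curves (Table I).

        thresholds : increasing grid of candidate thresholds.
        far, frr   : enrollment FAR and FRR evaluated on that grid
                     (FAR non-increasing, FRR non-decreasing in the threshold).
        """
        # If the curves cross, the crossing point equalizes FAR and FRR at enrollment.
        if not np.any((far == 0) & (frr == 0)):
            return thresholds[np.argmin(np.abs(far - frr))]

        far_zero = thresholds[far == 0].min()   # zero crossing of the FAR curve
        frr_zero = thresholds[frr == 0].max()   # zero crossing of the FRR curve

        if scheme == "I":                        # FRR zero crossing (tends to overestimate)
            return frr_zero
        if scheme == "II":                       # middle of the two zero crossings
            return 0.5 * (far_zero + frr_zero)
        if scheme == "III":                      # FAR zero crossing
            return far_zero
        if scheme == "IV":                       # smallest threshold with FAR <= target (e.g. 0.5%)
            return thresholds[far <= target_far].min()
        raise ValueError("scheme must be 'I', 'II', 'III', or 'IV'")

When the two curves cross, the crossing point is used for all schemes, which is consistent with the remark in Section IV-D that the average enrollment FAR for Scheme IV can end up slightly above the pre-defined 0.5%.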

Fig. 3. FARs versus FRRs corresponding to the 138 speakers. All errors are based on the a priori thresholds determined by choosing the zero crossings of the FRR curves (Scheme I), the middle of the zero crossings of the FAR and FRR curves (Scheme II), the zero crossings of the FAR curves (Scheme III), and the point at which the FAR attains 0.5% (Scheme IV) as thresholds. The baseline method was used in all cases.

Fig. 4. A posteriori equal error thresholds versus a priori thresholds found by choosing the zero crossings of the FRR curves, the middle of the zero crossings of the FAR and FRR curves, the zero crossings of the FAR curves, and the point at which the FAR attains 0.5% as thresholds. PIBTD was used to obtain the FAR and FRR curves.

Fig. 5. FARs versus FRRs corresponding to the 138 speakers. All errors are based on the a priori thresholds determined by choosing the zero crossings of the FRR curves, the middle of the zero crossings of the FAR and FRR curves, the zero crossings of the FAR curves, and the point at which the FAR attains 0.5% as thresholds. PIBTD was used in all cases.

B.3 Using SPIBTD

In SPIBTD, the FAR curves obtained during enrollment become very close to those obtained during verification (see Fig. 1) because the speech is sampled from a large set of pseudo-impostors. Therefore, it makes sense to use the FAR curves obtained during enrollment to determine the threshold. The question that remains is which part of the FAR curve should be used for determining the threshold. To this end, we plot the FAR curves of 10 speakers in Fig. 6. A closer look at this figure reveals that the FAR curves become flat when the error rate is close to zero. This is caused by the fact that the feature vectors of the 45 pseudo-impostors form a much more scattered distribution in the feature space than those formed by 5 anti-speakers or 5 pseudo-impostors. Recall from (1) and (2) that verification decisions are based on the average difference between the two scaled network outputs. Therefore, if the pseudo-impostors' vectors spread over a wide region of the feature space, the chance that some of these vectors lie close to the speaker's vectors becomes high. This phenomenon is illustrated in Fig. 7, where the distributions of the speaker's speech and the impostors' speech are assumed to be unimodal. In the first diagram of Fig. 7, the false acceptance region is small because the pseudo-impostor patterns spread over a small area of the feature space, resulting in a small FAR (E1). The second diagram shows that if the pseudo-impostor patterns spread over a large area, more patterns will be falsely accepted.

Fig. 6. FAR curves of 10 speakers obtained using SPIBTD.

Fig. 7. Diagrams showing the effect of a scattered distribution of pseudo-impostor patterns on the FAR curves. The dashed lines represent the decision boundaries formed by setting the decision threshold to T1 and T2.

The consequence is that the FAR changes by an insignificant amount over a large range of threshold values, suggesting that using the zero crossings of the FAR curves may result in unreliable thresholds. Fig. 6 shows that the threshold becomes very sensitive to the FAR when the latter falls below 0.5%. To overcome this difficulty, Scheme IV focuses on the region where the threshold is less sensitive to the FAR and selects the threshold at which the FAR attains an application-dependent level. In this work, the level was set to 0.5%.

The Scheme I panels of Fig. 8 are comparable to those of Fig. 2 and Fig. 4, suggesting that using the zero crossings of the FRR curves as thresholds is not appropriate for SPIBTD either. The large number of speakers with a high FRR in Fig. 9 agrees with this observation. While choosing the middle of the zero crossings of the FAR and FRR curves as the threshold brings the a priori thresholds slightly closer to the a posteriori thresholds, it still causes an unacceptably high FRR for many speakers, as shown in Fig. 9. Fig. 9 also suggests that selecting the zero crossings of the FAR curves as thresholds reduces the number of speakers with a high FRR only slightly. This is because the thresholds are very sensitive to the FAR in the region of the zero crossings, resulting in overestimated thresholds. Comparisons between the Scheme III and Scheme IV panels of Fig. 8, and between the corresponding panels of Fig. 9, reveal that choosing a threshold that produces a pre-defined FAR at enrollment time is the best approach, as it gives the best compromise between FAR and FRR. We can see from Fig. 9 that the number of speakers with a high FRR is progressively reduced as the focus shifts from the FRR curves to the FAR curves (Scheme I to Scheme IV). Clearly, the sampling strategy of SPIBTD not only makes the verification performance more predictable, but also provides a reliable means of determining the thresholds. The main reason is that sampling the speech of a large number of impostors and anti-speakers produces a better representation of the impostor population and builds more robust speaker models.

C. Comparisons Based on the Best Threshold Selection Scheme

The above results show that Baseline, PIBTD, and SPIBTD require different threshold selection schemes to achieve the best trade-off between FAR and FRR (see Table I). It is of interest to compare these methods using their respective best threshold selection schemes. (Here, we consider the scheme that produces the best balance between FAR and FRR to be the best scheme.) To this end, we compare the corresponding panels of Figs. 2, 4, and 8 in terms of the robustness of threshold determination, and the corresponding panels of Figs. 3, 5, and 9 in terms of the error rates obtained during verification. Evidently, the a priori and a posteriori thresholds obtained by the baseline method show the largest difference, causing a small FAR but a very large FRR, or vice versa, for most of the speakers, as shown in Fig. 3. This makes the performance of the system difficult to predict. While the number of speakers with a high FAR is smaller in Fig. 5, this is achieved by increasing the number of speakers with a high FRR. Among all the methods, SPIBTD produces the most predictable system when it is combined with Scheme IV.
D. Comparisons Based on Average Error Rates

Table II summarizes the average FAR, FRR (over the 138 speakers of the YOHO corpus), and EER obtained by Baseline, PIBTD, and SPIBTD with the different threshold selection schemes. The results show that Scheme I produces a very high FRR during verification, although both the FAR and FRR during enrollment are very small. Scheme II is appropriate only for the baseline method, as it produces a comparatively high FRR during verification for PIBTD and SPIBTD.

Fig. 8. A posteriori equal error thresholds versus a priori thresholds found by choosing the zero crossings of the FRR curves, the middle of the zero crossings of the FAR and FRR curves, the zero crossings of the FAR curves, and the point at which the FAR (during enrollment) attains 0.5% as thresholds. SPIBTD was used to obtain the FAR and FRR curves.

Fig. 9. FARs versus FRRs corresponding to the 138 speakers. All errors are based on the a priori thresholds determined by choosing the zero crossings of the FRR curves, the middle of the zero crossings of the FAR and FRR curves, the zero crossings of the FAR curves, and the point at which the FAR attains 0.5% as thresholds. SPIBTD was used in all cases.

TABLE II. Average error rates obtained by the different methods. For Scheme IV, the pre-defined FAR was set to 0.5%.

Method    Scheme   Enroll. FAR %   Enroll. FRR %   Verif. FAR %   Verif. FRR %   EER %   FAR % (zero threshold)
Baseline  I        0.00            0.26            0.3            55.21          1.11    59.74
          II       0.00            0.00            4.4            3.21
          III      0.4             0.00            43.81          0.16
          IV       0.00            0.00            46.2           0.14
PIBTD     I        0.25            0.36            0.21           38.7
          II       0.26            0.26            0.57           1.8
          III      0.29            0.24            3.9            4.71
          IV       0.33            0.26            3.41           4.31
SPIBTD    I        0.18            0.29            0.16           41.72          0.7     1.38
          II       0.18            0.18            0.18           18.99
          III      0.19            0.16            0.32           9.81
          IV       0.62            0.15            1.12           3.94

Scheme III produces a good compromise between the FAR and FRR for PIBTD, but not for Baseline and SPIBTD. Finally, Scheme IV gives the lowest FRR and the best balance between the FAR and FRR for SPIBTD, as well as a good compromise for PIBTD. One should bear in mind that the figures in Table II are averages over 138 speakers; a good match between the average FAR and FRR does not mean that a good match is also obtained for individual speakers. For example, Fig. 5 shows that combining PIBTD with Scheme IV produces unmatched FARs and FRRs for many speakers, although closely matched averages (3.41% versus 4.31%) are obtained. One may also notice that the FARs during enrollment for Scheme IV in Table II are not equal to 0.5%. This is because the FAR and FRR curves of some speakers cross each other and the intersection point was chosen as the threshold, resulting in an FAR higher than 0.5%; the average FAR is therefore slightly higher than the pre-defined value.

Table II also demonstrates that SPIBTD is the best method in terms of equal error rate, suggesting that sampling a large set of pseudo-impostors and anti-speakers improves the capability of the EBF networks in modeling the anti-speakers and rejecting impostors. The EER of SPIBTD is also about half of that of Higgins et al. [4] (0.7% against 1.8%), suggesting that sampling the utterances of a large number of anti-speakers produces a more robust speaker model. Note also that even our baseline method has a lower EER than that of Higgins et al., which implies that incorporating anti-speakers' speech into a speaker model has merit. The last column of Table II lists the FAR obtained by setting all thresholds to zero. Recall from Section II that the average difference between the two normalized network outputs is compared with a threshold to make a verification decision. Recall also that Baseline, PIBTD, and SPIBTD use the same set of speaker data (but different sets of anti-speaker data) for building a speaker model. Therefore, for a fixed threshold, the FAR is an indicator of how well the anti-speakers are modeled by the EBF networks. The last column of Table II clearly shows that SPIBTD is more capable of modeling the anti-speakers.

V. Comparison with Higgins' Model

Our proposed speaker models and threshold determination methods have three advantages over those of Higgins et al. [4]. First, Higgins' model requires selecting the 5 closest speakers (among 137) to form the ratio speaker set during verification, which imposes a computational burden.
Our methods, on the other hand, embed the features of the ratio speakers in the speaker models during enrollment. Rather than selecting the 5 closest ratio speakers, our methods only need to sample the feature vectors of 45 anti-speakers to construct a speaker model, which takes only a fraction of the time. One may argue that we simply shift the computational burden from the verification sessions to the enrollment sessions.

However, a low computational overhead during verification is certainly an advantage in real-world applications where real-time response is essential. During verification, our methods require the computation of only two likelihood functions (one for the speaker class and one for the anti-speaker class), whereas Higgins' model requires six (one for the speaker and another five for the ratio speakers).

The second advantage of the proposed speaker models is that they are more robust in rejecting impostors. If an impostor's speech is closer to the speaker's speech than to the speech of the 5 ratio speakers, Higgins' model will accept the impostor. In our case, the speaker model is constructed by sampling the speech of the corresponding speaker and of 45 anti-speakers. The latter forms a better representation of the impostor population than the speech of only 5 ratio speakers. Therefore, it is more likely that an impostor's speech is closer to the anti-speakers' speech than to the speaker's speech, resulting in a rejection.

The third advantage is that our methods provide several means of finding a threshold that strikes a balance between the FAR and FRR, whereas Higgins' model makes no attempt to achieve this goal. Our experimental results show that the proposed methods not only produce an EER that is about half of that obtained by Higgins et al. (0.7% versus 1.8%), but also produce a good compromise between the FAR and FRR. For example, Higgins et al. obtained an FRR of 4.2% and an FAR of 0.37% using 3 utterances per trial, whereas we obtained an FRR of 3.94% and an FAR of 1.12% using speech segments whose length is approximately equal to three utterances.

VI. Conclusions

This paper has addressed the problem of determining a priori thresholds for phrase-prompted speaker verification. Conventional approaches have been compared with the proposed one through experimental evaluations based on 138 speakers of the YOHO corpus. It was shown that robust thresholds can be obtained by simulating an operating environment that is as close as possible to the real one. The proposed method is able to predict the verification performance accurately using enrollment data only, leading to more reliable thresholds. With the proposed method, a better balance between FARs and FRRs can be found.

References

[1] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-29, no. 2, pp. 254-272, 1981.
[2] D. K. Burton, "Text-dependent speaker verification using vector quantization source coding," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-35, no. 2, pp. 133-143, 1987.
[3] J. M. Naik, L. P. Netsch, and G. R. Doddington, "Speaker verification over long distance telephone lines," in Proc. ICASSP'89, 1989.
[4] A. Higgins, L. Bahler, and J. Porter, "Speaker verification using randomized phrase prompting," Digital Signal Processing, vol. 1, pp. 89-106, 1991.
[5] T. Matsui and S. Furui, "Likelihood normalization for speaker verification using a phoneme- and speaker-independent model," Speech Communication, vol. 17, pp. 109-116, 1995.
[6] C. S. Liu, C. H. Lee, B. H. Juang, and A. E. Rosenberg, "Speaker recognition based on minimum error discriminative training," in Proc. ICASSP'94, vol. 1, pp. 325-328, 1994.
[7] A. E. Rosenberg, J. DeLong, C. H. Lee, B. H. Juang, and F. K. Soong, "The use of cohort normalized scores for speaker verification," in Proc. ICSLP'92, vol. 2, pp. 599-602, 1992.
[8] A. E. Rosenberg, O. Siohan, and S. Parthasarathy, "Speaker verification using minimum verification error training," in Proc. ICASSP'98, pp. 105-108, 1998.
[9] J. B. Pierrot et al., "A comparison of a priori threshold setting procedures for speaker verification in the CAVE project," in Proc. ICASSP'98, pp. 125-128, 1998.
[10] M. W. Mak and C. K. Li, "Elliptical basis function networks and radial basis function networks for speaker verification: A comparative study," in Proc. IJCNN'99, July 1999.
[11] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, 1995.
[12] J. P. Campbell, Jr., "Testing with the YOHO CD-ROM voice verification corpus," in Proc. ICASSP'95, pp. 341-344, 1995.