Text-Independent Speaker Verification Using Utterance Level Scoring and Covariance Modeling


IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 6, SEPTEMBER 2002

Ran D. Zilca, Member, IEEE

Abstract—This paper describes a computationally simple method to perform text independent speaker verification using second order statistics. The suggested method, called utterance level scoring (ULS), allows obtaining a normalized score using a single pass through the frames of the tested utterance. The utterance sample covariance is first calculated and then compared to the speaker covariance using a distortion measure. Subsequently, a distortion measure between the utterance covariance and the sample covariance of data taken from different speakers is used to normalize the score. Experimental results from the 2000 NIST speaker recognition evaluation are presented for ULS, used with different distortion measures, and for a Gaussian mixture model (GMM) system. The results indicate that ULS is a viable alternative to GMM whenever computational complexity and verification accuracy need to be traded.

Index Terms—Covariance modeling, speaker verification, text independent, utterance level scoring.

I. INTRODUCTION

MOST current text independent speaker verification systems are based on the following approach. When a new speaker enrolls in the system, an explicit expression of the probability density function (PDF) representing the speech features in the enrollment session is estimated and serves as a speaker model. Typically this estimation is based on a predefined PDF structure such as a Gaussian mixture model (GMM) [1], and the PDF is obtained by estimating the GMM parameters, namely means, variances, and weights. Modeling of the opposite hypothesis (i.e., that the speech originated from a different speaker than the claimant) is performed in one of the two following ways.
1) World modeling or universal background model (UBM): estimating the PDF of a large session that includes many different speakers [2].

2) Cohort modeling: training a few different speaker models and taking the average score against the cohort speakers as the world's score [1].

When a test utterance needs to be verified against a claimant speaker model, each of the speech frames is scored against the claimant speaker model and against the world model or cohort background models. The normalized scores of individual frames are then accumulated into a total score for the tested utterance, typically by averaging. We refer to this approach as frame level scoring (FLS).

The FLS approach is based on the notion that the tested utterance includes only a subset of the phonetic space modeled by the text independent model, and therefore cannot be compared to it in its entirety. In fact, each frame score is effectively influenced only by the closest Gaussian components. This multimodal nature of modeling and verification mandates extensive computation, since each frame is scored separately against multimodal models during verification. In addition, training multimodal speaker and background models usually requires iterative procedures. At the heart of the FLS approach lies the assumption that the exact position in the feature space of a vector extracted from a single speech frame carries the information regarding the speaker's identity, rather than the nature of the distribution of the entire set of utterance frames.

(Manuscript received January 22, 2001; revised January 24. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Philip C. Loizou. The author was with the Research and Development Division, Amdocs, Israel. He is now with the IBM T. J. Watson Research Center, Yorktown Heights, NY USA; e-mail: zilca@us.ibm.com.)
However, in practice it is clear that the exact position of the feature vector extracted from a single frame is significantly affected by the channel, handset type, and additive noise. Although averaging the individual frame scores reduces these effects, this observation motivates the pursuit of a speaker verification system that scores an utterance by comparing the utterance distribution in feature space to the speaker and world models, rather than averaging frame scores. Following this approach, once the utterance model is computed the utterance score is calculated in a single operation. We therefore term it utterance level scoring (ULS). ULS was suggested previously for multimodal models [3]. Yet, it was shown to be suitable mainly when long test utterances are available, due to the need to train a multimodal utterance model. We model each tested utterance solely by its sample covariance matrix. This simple utterance model does not include many parameters to be estimated and therefore allows shorter utterances.

In brief, we are looking for a speaker verification method that is computationally light and compact. This is obtained first by using unimodal speaker and world models, composed only of the second order statistics of the training data. In addition, the use of ULS allows further reduced computation, since it requires going through the frames of a speech utterance only once, to compute its sample covariance (the utterance model), and subsequently scoring against the speaker and world/background models with a single operation each. We also conjecture that the suggested approach may have desirable robustness properties, due to the use of ULS rather than FLS. In a previous study we presented preliminary experimental results of ULS with second order statistics [4]. This paper describes more thorough experimentation and provides an in-depth discussion of the robustness properties of the suggested method. The rest of the paper is organized as follows.
Section II provides an overview of covariance models for speaker verification, both with FLS and with ULS. The computational advantages of using ULS with covariance modeling are given in Section III. This is followed by a description of the experiments that were conducted in the 2000 NIST speaker recognition evaluation in Section IV, and a presentation of the experimental results in Section V. We then discuss the results and provide our conclusions.

II. COVARIANCE MODELS

The term covariance model (CM) [4] refers to a single Gaussian speaker model, ignoring the mean vector. Training a speaker CM therefore involves only computing the sample covariance matrix, Sigma_S, of the training data:

    Sigma_S = (1/N) sum_{i=1..N} (x_i - mu)(x_i - mu)^T        (1)

where N is the number of training feature vectors, x_i is the ith training vector, and mu is the sample mean of the training data. The purpose of ignoring the mean vector is to focus only on the shape of the distribution of features extracted from the speaker data, rather than their exact location in feature space. However, many speaker verification systems that use cepstral vectors perform cepstral mean subtraction [5] to compensate for channel effects, resulting in a zero mean regardless of the CM approach.

A. Using CM With Frame Level Scoring

CM may be used for FLS, in which case each speech frame is scored using an explicit Gaussian expression or a Mahalanobis distance. We refer to this CM-based system as a single Gaussian full covariance GMM, or SG. When used for speaker verification, the score with respect to the speaker CM, Sigma_S, should be normalized using the background or world CM, Sigma_W. The background CM is calculated similarly to (1), but using vectors collected from a variety of different speakers.
Since SG follows a likelihood ratio classification approach [6], the normalized score of a single frame, s(x), is obtained by subtracting the two scores:

    s(x) = log p_S(x) - log p_W(x)        (2)

where x is a d-dimensional feature vector extracted from a speech frame and p_S and p_W are the explicit Gaussian PDF expressions, e.g.,

    p_S(x) = (2 pi)^(-d/2) |Sigma_S|^(-1/2) exp( -(1/2) (x - mu_S)^T Sigma_S^{-1} (x - mu_S) ).        (3)

Up to an expression that depends only on the covariance determinants, the Gaussian expressions in (3) may be replaced with Mahalanobis distances when substituted in (2):

    s(x) = (x - mu_W)^T Sigma_W^{-1} (x - mu_W) - (x - mu_S)^T Sigma_S^{-1} (x - mu_S).        (4)

When using SG, the means mu_S and mu_W may either be used directly as described above, or set to zero in order to disregard the exact location in feature space. As mentioned earlier, when using cepstral coefficients with cepstral mean subtraction, the means equal zero regardless of the modeling approach.

B. Using CM With Utterance Level Scoring

An alternative approach to SG is to first calculate the sample covariance of an utterance and then score the utterance with a single operation with respect to each model (i.e., the speaker and background covariances). Unlike the SG/GMM system presented in the previous subsection, the ULS approach does not concern single frames. Instead, it involves measuring the similarity between the shape of the distribution of the entire tested utterance and the shapes of the speaker and world distributions. As for the speaker and world data, the covariance matrix of the tested utterance represents its shape. Therefore the first step in performing ULS verification is to compute the sample covariance matrix of the utterance, i.e.,

    C_X = (1/M) sum_{j=1..M} (y_j - m)(y_j - m)^T        (5)

where M is the number of frames within the utterance, y_j is the jth utterance feature vector, and m is the sample mean of the utterance data. Once C_X is computed, it should be compared with the world and claimed speaker covariances to find which it matches best. Since this scoring stage may be performed in a single operation, using distortion measures between covariance matrices, we term this approach ULS.
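To make CM training per (1) and SG frame level scoring per (2)-(3) concrete, here is a minimal NumPy sketch on synthetic two-dimensional "features"; the function names and the toy data are illustrative, not from the paper, and zero means are assumed (as after cepstral mean subtraction).

```python
import numpy as np

def covariance_model(frames):
    """Train a covariance model (CM): the sample covariance of eq. (1)."""
    mu = frames.mean(axis=0)
    centered = frames - mu
    return centered.T @ centered / len(frames)

def log_gaussian(x, mu, cov):
    """Log of the Gaussian PDF in eq. (3)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def sg_utterance_score(frames, cov_s, cov_w):
    """Frame level scoring (FLS): average the per-frame log likelihood
    ratio of eq. (2) over all frames (zero means, as after CMS)."""
    zero = np.zeros(frames.shape[1])
    return np.mean([log_gaussian(x, zero, cov_s) - log_gaussian(x, zero, cov_w)
                    for x in frames])

# Toy illustration: a "speaker" with unit variance vs. a broader "world"
rng = np.random.default_rng(0)
cov_s = covariance_model(rng.normal(0, 1.0, size=(2000, 2)))
cov_w = covariance_model(rng.normal(0, 3.0, size=(2000, 2)))
target_score = sg_utterance_score(rng.normal(0, 1.0, size=(300, 2)), cov_s, cov_w)
impostor_score = sg_utterance_score(rng.normal(0, 3.0, size=(300, 2)), cov_s, cov_w)
print(target_score > 0, impostor_score < 0)
```

Note that every test frame is scored against every model, which is exactly the repeated per-model pass that ULS avoids.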
The distortion measure between the utterance and the speaker covariances should be normalized by the similarity to the world covariance. Let d(·,·) denote a general distortion measure between two covariance matrices. We suggest using the following form of normalized utterance score for ULS:

    S(X) = d(C_X, Sigma_W) / d(C_X, Sigma_S).        (6)

As seen from this expression, a speech utterance having a small distortion measure with respect to the claimant speaker's CM, relative to its distortion measure from the world CM, will produce a high score, and vice versa. Several distortion measures between covariance matrices were previously used for speaker identification, with no world/background normalization (e.g., [7]-[9]). We chose to investigate the new normalization method suggested in (6) using two of these distortion measures: the divergence shape and the arithmetic-harmonic sphericity measure.

The main computational advantage of ULS over likelihood based methods is that although the utterance sample covariance is calculated frame by frame, each frame calculation is performed independently of the speaker model. Consequently, once the utterance covariance is obtained, scoring with respect to a model (either speaker or world) involves only a single operation. Speaker verification involves scoring with respect to a speaker model and at least one additional world or background model, in order to obtain a normalized score. Therefore, ULS obtains a normalized score while going through each frame only once (unlike likelihood based systems, where the process of going through the entire utterance frame by frame needs to be repeated for every model). This computational advantage is more significant when several models, rather than a single world model, are used for score normalization; for example, when using cohort based normalization [1], or when gender and handset-type dependent world models are used not only for normalization but also as gender

and handset-type detectors (by using the best scoring world model for normalization).

1) Divergence Shape Ratio: The divergence shape is an information-theoretic measure between two Gaussian classes, introduced by Campbell as a measure of dissimilarity between a reference speaker utterance and a tested utterance [8]. Given the PDFs of two classes, p_1(x) and p_2(x), the symmetric directed divergence (resulting from the sum of the discriminating information for class 1 versus class 2 and vice versa) equals

    J = integral (p_1(x) - p_2(x)) ln( p_1(x) / p_2(x) ) dx.        (7)

Assuming the two classes follow Gaussian distributions with means mu_1, mu_2 and covariances Sigma_1, Sigma_2, respectively, and substituting Gaussian PDF expressions of the form (3) in (7), we obtain an expression composed of a sum of two components, as follows:

    J = (1/2) tr[ (Sigma_1 - Sigma_2)(Sigma_2^{-1} - Sigma_1^{-1}) ] + (1/2) tr[ (Sigma_1^{-1} + Sigma_2^{-1}) delta delta^T ]        (8)

where tr[·] is the trace operator and delta = mu_1 - mu_2. The first component depends solely on the covariances, and may therefore be attributed to the dissimilarity between the shapes of the two classes. The second component is a distortion measure between the two means, and is related to the distance in feature space between the two distributions. Campbell suggested using only the first component, termed the divergence shape (DS), i.e.,

    DS(Sigma_1, Sigma_2) = (1/2) tr[ (Sigma_1 - Sigma_2)(Sigma_2^{-1} - Sigma_1^{-1}) ].        (9)

The divergence shape was introduced in the context of closed set speaker identification. We propose using this measure for speaker verification score normalization, as part of the approach stated in (6). Also, while the DS was previously tested only on clean speech using line spectrum pairs (LSP), the experiments presented herein are performed on conversational telephone-quality speech, using Mel frequency cepstral coefficients (MFCC).
We suggest a new normalized ULS method, called the divergence shape ratio (DSR), where the utterance score, S_DSR(X), is calculated following (6) with the divergence shape as the distortion measure, i.e.,

    S_DSR(X) = DS(C_X, Sigma_W) / DS(C_X, Sigma_S).        (10)

2) Sphericity Measure Ratio: The sphericity measure ratio (SMR) is computed by applying (6) to a distortion measure termed the arithmetic-harmonic sphericity measure, or the sphericity measure (SM). The expression for the SM is derived from the analysis in [9], and it was recently used in [10]. As with the DS, this measure was previously used only for closed set speaker identification. We used the SM for background normalization in speaker verification in accordance with (6), and tested it on the 2000 NIST speaker recognition evaluation. A detailed analysis of using second order statistical measures, and the derivation of the arithmetic-harmonic sphericity measure, may be found in [9]. The SM is computed as follows:

    SM(Sigma_1, Sigma_2) = log( tr[Sigma_1 Sigma_2^{-1}] tr[Sigma_2 Sigma_1^{-1}] / d^2 )        (11)

and the SMR is

    S_SMR(X) = SM(C_X, Sigma_W) / SM(C_X, Sigma_S).        (12)

III. COMPUTATIONAL ADVANTAGES OF UTTERANCE LEVEL SCORING

In order to demonstrate the computational efficiency of the CM/ULS approach, let us consider three criteria for the evaluation of computational complexity: the number of memory cells required for storing a speaker model; the CPU time for speaker model training; and the CPU time for verifying a test utterance. SG shares the modest storage requirements of DSR and SMR (resulting from using CM), but requires more verification CPU time, since verification is performed at the frame level, for each model.

A. Speaker Model Storage Requirements

A CM includes d(d+1)/2 parameters, where d is the dimension of the speech feature vector. A diagonal covariance GMM with K components requires K(2d+1) parameters, and a full covariance GMM requires K(d(d+1)/2 + d + 1). When no compression is used, each parameter requires a single memory cell. For example, given an 18th order cepstral feature vector, the CMs require only 171 memory cells.
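The two distortion measures and the ratio score of (6) can be sketched in a few lines of NumPy; the diagonal test matrices below are illustrative only.

```python
import numpy as np

def divergence_shape(c1, c2):
    """Divergence shape, eq. (9): the covariance-only part of the
    symmetric divergence between two Gaussian classes."""
    return 0.5 * np.trace((c1 - c2) @ (np.linalg.inv(c2) - np.linalg.inv(c1)))

def sphericity_measure(c1, c2):
    """Arithmetic-harmonic sphericity measure, eq. (11)."""
    d = c1.shape[0]
    return np.log(np.trace(c1 @ np.linalg.inv(c2))
                  * np.trace(c2 @ np.linalg.inv(c1)) / d ** 2)

def uls_score(c_utt, cov_s, cov_w, distortion):
    """Normalized ULS score of eq. (6): distortion to the world CM over
    distortion to the speaker CM; gives DSR (10) or SMR (12)."""
    return distortion(c_utt, cov_w) / distortion(c_utt, cov_s)

# Both measures vanish for identical covariances and grow with shape mismatch
a = np.diag([1.0, 1.0])          # stand-in speaker CM
b = np.diag([4.0, 0.25])         # stand-in world CM
print(divergence_shape(a, a), sphericity_measure(a, a))  # 0.0 0.0
# An utterance covariance close to the speaker CM scores well above 1
c_utt = np.diag([1.1, 0.9])
dsr = uls_score(c_utt, a, b, divergence_shape)
smr = uls_score(c_utt, a, b, sphericity_measure)
print(dsr > 1, smr > 1)
```

Once `c_utt` is computed in a single pass over the frames, each additional model costs one call to the distortion function, which is the computational point of Section III.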
A 16 component full covariance GMM requires 3040 cells, a 64 component diagonal covariance GMM requires 2368 cells, and a 512 component diagonal GMM adapted from a background model [2] requires 18 944 cells (!).

B. Enrollment and Verification CPU Times

CPU times cannot be evaluated as straightforwardly as the storage requirements, since computation time depends on implementation specifics (e.g., the number of training iterations, or trading elaborate functions for look-up tables). In addition, executing identical experiments on different computing environments would result in different measured times, depending on the compiler/interpreter and the hardware setup. However, the dramatic difference in computational requirements between the CM/ULS approach and multimodal FLS systems may be demonstrated using a single example. CPU times, in seconds, are shown for such an example in Table I. All times were measured for the utterance eaaa and speaker 4309, chosen randomly from the NIST-1999 evaluation. Training and verification were performed using a Matlab simulation on a P-III 450 MHz workstation. To allow a fair comparison, approximated enrollment and verification were used for the different GMM configurations as follows. Nonadapted GMMs were trained using DB-GMM (K-means with no EM iterations) [11]. Verification for these methods was computed using only the 20 nearest components of the world model for each frame. Adapted GMM verification was performed using a five-component approximation, as described in [12]. The computational advantage of the covariance modeling methods is evident. Training is four orders of magnitude faster than GMM. SG verification is 17 to 24 times faster than GMM,
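The storage figures above follow directly from the parameter counts in Section III-A; a quick arithmetic sketch:

```python
# Model storage in memory cells (one cell per parameter), following Sec. III-A.
def cm_cells(d):
    return d * (d + 1) // 2                 # symmetric covariance only

def gmm_diag_cells(d, k):
    return k * (2 * d + 1)                  # d means + d variances + 1 weight

def gmm_full_cells(d, k):
    return k * (d * (d + 1) // 2 + d + 1)   # full covariance + mean + weight

d = 18  # 18th-order cepstral feature vector
print(cm_cells(d))             # 171
print(gmm_full_cells(d, 16))   # 3040
print(gmm_diag_cells(d, 64))   # 2368
print(gmm_diag_cells(d, 512))  # 18944
```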

TABLE I: CPU TIME [SECONDS] FOR ENROLLMENT AND VERIFICATION

and DSR verification is more than three orders of magnitude faster than GMM verification, owing to utterance level scoring. Trials on other utterances and speakers exhibited similar results.

IV. EXPERIMENTAL SETTINGS

The experiments described in this section were performed as part of the participation in the 2000 NIST speaker recognition evaluation. The data includes conversational telephone-quality speech taken from the Switchboard 2 corpus. The evaluation includes scoring and verifying 6096 target trials and impostor trials involving 926 target speakers, male and female. All of the test segments are recorded from calls made from a telephone number that is different from the one used for enrollment. Therefore, all test utterances may be considered to be collected using a different handset than the one used for training the speaker model. Each speaker is trained using a single two-minute session, while test utterances range between a few seconds and a minute (with a primary focus on utterances with s duration). All of the speech utterances (both training and testing) are labeled by the type of handset (electret or carbon-button) using the MIT-LL handset classifier [13]. A detailed description of the evaluation settings and rules may be found in [14].

A. Signal Processing

An identical signal processing procedure was applied to all the tested modeling and verification systems. The speech signal was first segmented into 25 ms frames every 12.5 ms. Each frame was multiplied by a Hamming window and analyzed using a voicing detector, and only voiced frames were selected for further use. Very aggressive voicing detection was applied, filtering out about half of the speech data. FFT-based 18th order MFCCs were extracted from each speech frame using a triangular filter bank.
Cepstral mean subtraction was then performed to compensate for convolutional channel effects; i.e., the mean vector of each utterance was subtracted from every single MFCC vector within the utterance.

B. Tested Systems

Four systems were tested, as follows.
1) Frame level scoring:
   a) diagonal covariance GMM obtained using Bayesian adaptation [2];
   b) SG.
2) Utterance level scoring:
   a) DSR;
   b) SMR.
SG, DSR, and SMR are described in this paper. In GMM, the model is trained by adapting from a background model that includes 512 components, which is also used for background normalization during verification.

C. Modeling Settings

As with the signal processing portion of the experiments, the following modeling settings were kept identical for all the tested systems. Four different world models were trained for each system, depending on the handset type (carbon-button or electret) and gender. The world models were trained using the entire test data of the 1999 NIST evaluation, about two hours of data for each model. For the CM systems (SG, DSR, and SMR) this involved simply calculating the sample covariance matrix of this data. For the GMM, the world model was trained using the DB-GMM procedure [11]. That is, the data was clustered using the K-means algorithm, and the GMM parameters were calculated as follows: the weights were given by the relative number of vectors in each cluster, and the means and variances were the sample mean and variance of each cluster. No EM iterations were performed. The speaker GMMs were adapted from the respective world model. The choice of world model for normalization of the verification trials was according to the type of handset in the training session and the claimant speaker's gender (the evaluation does not include any cross-gender trials). For GMM, the verification procedure was approximated using five Gaussian components, as described in [12].

Fig. 1. DET curves for all trials. GMM bold, DSR solid, SG dashed, and SMR dash-dot.
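The front-end steps of Section IV-A (25 ms frames every 12.5 ms, Hamming windowing, and per-utterance cepstral mean subtraction) can be sketched as follows; the sampling rate and the random "waveform" are illustrative assumptions, and the voicing detector and MFCC extraction are omitted.

```python
import numpy as np

def frame_signal(signal, sr, frame_ms=25, hop_ms=12.5):
    """Slice a waveform into overlapping Hamming-windowed frames
    (25 ms every 12.5 ms, as in Sec. IV-A)."""
    flen = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + (len(signal) - flen) // hop
    window = np.hamming(flen)
    return np.stack([signal[i * hop : i * hop + flen] * window
                     for i in range(n)])

def cepstral_mean_subtraction(cepstra):
    """Subtract the utterance mean from every cepstral vector (CMS)."""
    return cepstra - cepstra.mean(axis=0)

sr = 8000  # assumed telephone-band sampling rate
frames = frame_signal(np.random.default_rng(2).normal(size=sr), sr)
print(frames.shape)  # (79, 200): 200-sample frames every 100 samples
```

After CMS, each utterance has a zero mean vector, which is why the CM systems can ignore the means entirely.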

Fig. 2. DET curves for match/mismatch conditions. et/et bold, cb/et solid, et/cb dashed, and cb/cb dash-dot.

V. RESULTS

The results of the experiments are given as DET curves [15], showing the trade-off between false impostor acceptance and false utterance rejection that each system is capable of producing. Fig. 1 shows a performance comparison between the four tested systems, for the entire set of trials. It is clearly seen that GMM performs best, and also that SMR is the best CM system. For numerical comparison purposes, let us consider the equal error rate (EER), which is the point on the DET curve where the false acceptance and false rejection rates are equal. GMM obtains 17.3% EER, while SMR, SG, and DSR reach 23.8%, 26.4%, and 27.6%, respectively. In other words, the CM systems exhibit up to 60% relative degradation in EER with respect to GMM. These results may be expected considering the simplistic nature of CM. However, SMR performs only 6.5 percentage points below GMM (37% relative degradation).

In order to examine the robustness of the tested systems, let us consider the DET curves shown in Fig. 2. The NIST-2000 speaker recognition evaluation includes only handset-mismatched trials (different telephone number), so in terms of the handset, only robustness to the handset type may be evaluated. We use the following train/test notation for the different match/mismatch conditions:
- et/et: training and verification performed using an electret handset;
- cb/cb: training and verification performed using a carbon-button handset;
- et/cb: training performed using electret and verification with carbon button;
- cb/et: training performed using carbon button and verification with electret.
As expected, all the tested systems exhibit their best performance for the matched electret (et/et) condition and their worst performance for the mismatched conditions. This is related to the degradation caused by the low quality carbon-button microphone, as expressed in the speaker model, and to the mismatch in handset types.
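The EER used for comparison above can be estimated directly from raw target and impostor scores by sweeping a decision threshold; this is a generic sketch on synthetic Gaussian scores, not the evaluation's scoring tool.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Sweep a threshold over all observed scores and return the
    operating point where false rejection ~= false acceptance."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = 1.0, 0.0
    for t in thresholds:
        frr = np.mean(target_scores < t)      # true speakers rejected
        far = np.mean(impostor_scores >= t)   # impostors accepted
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer

rng = np.random.default_rng(3)
targets = rng.normal(1.0, 1.0, 1000)     # synthetic true-speaker scores
impostors = rng.normal(-1.0, 1.0, 1000)  # synthetic impostor scores
print(equal_error_rate(targets, impostors))  # roughly 0.16 for this overlap
```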
However, the worst degradation for mismatched conditions is observed for SG: 47.1% EER, which is very close to a 50% random guess.

TABLE II: ROBUSTNESS MEASURES

In order to compare the robustness of the systems, let us define the following robustness measures. The first measure, R_mis, represents the relative degradation in accuracy when the type of handset used to collect the verification utterance is different from the one used for training the claimed speaker's model. It is the average of two components:

    R_mis = (R_et + R_cb) / 2        (13)

where the first component, R_et = (EER_{et/cb} - EER_{et/et}) / EER_{et/et}, is the relative degradation in EER between the et/et and et/cb conditions, and the second, R_cb = (EER_{cb/et} - EER_{cb/cb}) / EER_{cb/cb}, is the degradation from cb/cb to cb/et. A second robustness measure, R_ht, represents system sensitivity to the type of handset in matched conditions. R_ht is calculated as the relative EER degradation between the matched et/et and cb/cb conditions, i.e.,

    R_ht = (EER_{cb/cb} - EER_{et/et}) / EER_{et/et}.        (14)

We measured the EER for the different conditions based on the performance shown in Fig. 2 and calculated the values of R_mis and R_ht, as shown in Table II. As seen, DSR exhibits high sensitivity to the handset type but is not very sensitive to handset type mismatch. SMR, on the other hand, is not sensitive to the handset type, but rather to mismatch. The extreme robustness of SMR to handset type is noteworthy: only 9% relative degradation. This property is also observable in the absolute EER values: in matched carbon-button (cb/cb) conditions SMR obtains 23% EER, which is only a relative degradation of 18% with respect to the corresponding GMM EER of 19.5%. Yet, it should be noted that from Fig. 2(d) it appears that SMR performs better for thresholds tuned around the EER than in other regions of the DET curve.
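The robustness measures (13)-(14) reduce to simple arithmetic over the per-condition EERs; in this sketch, the symbol names `r_mis`/`r_ht` stand in for the paper's (elided) notation and the EER values are made up for illustration.

```python
def relative_degradation(eer_base, eer_degraded):
    return (eer_degraded - eer_base) / eer_base

def robustness_measures(eer):
    """eer: dict keyed by 'train/test' handset condition.
    Returns (r_mis, r_ht) following eqs. (13)-(14)."""
    r_et = relative_degradation(eer["et/et"], eer["et/cb"])
    r_cb = relative_degradation(eer["cb/cb"], eer["cb/et"])
    r_mis = (r_et + r_cb) / 2          # mismatch sensitivity, eq. (13)
    r_ht = relative_degradation(eer["et/et"], eer["cb/cb"])  # eq. (14)
    return r_mis, r_ht

# Hypothetical per-condition EERs, for illustration only
example = {"et/et": 0.20, "et/cb": 0.30, "cb/cb": 0.22, "cb/et": 0.33}
r_mis, r_ht = robustness_measures(example)
print(round(r_mis, 3), round(r_ht, 3))  # 0.5 0.1
```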
In addition, this approach may be useful for other open set speaker recognition problems, such as the open-set speaker identification task. This task includes no knowledge of a claimant speaker, and the system is required to blindly recognize the speaker's identity among a predefined set of speaker models. In addition, it is required to decide whether the speaker exists at all in the speaker database, by thresholding the scores. Fast approximated open-set speaker identification may be performed by scoring against all the speaker models using ULS to determine the top scoring speakers, followed by a more precise GMM/FLS stage to determine the speaker's identity among the best scoring ones. This approach utilizes the tradeoff between the scoring accuracy and the number of best-detected speakers.

However, although the computational simplicity of CM/ULS may be useful in some applications, it is highly desirable to consider possible ways to improve its recognition accuracy. Since (6) is a form of score transformation, the first possibility that comes to mind for modifying CM/ULS is using a different form of score transformation than (6). However, in preliminary experiments we tried several different simple mathematical expressions involving the two distortion measures, and chose the form of (6) because of the resulting Gaussian-like score histograms. The fact that the score distributions are close to Gaussian is also verified in the DET curves, which are approximately linear. An example of score histograms is shown in Fig. 3. The figure shows the impostor score histogram superimposed on the true speaker histogram. The two histograms are displayed at different granularities in order to have similar occurrence values. Fig. 3(a) shows the score histograms for GMM, where score normalization is obtained using a frame level likelihood ratio, and Fig. 3(b) shows the histograms for SMR, where normalization is performed using (6).

Fig. 3. Score histograms, all trials: (a) GMM and (b) SMR.
The difference in verification accuracy is clear, as the GMM histograms are much more easily separable than the SMR histograms. However, the SMR histograms are seen to be Gaussian-like, the same as for GMM. This supports the use of a score transformation function of the form (6).
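The two-stage open-set identification idea sketched in the discussion above can be illustrated as a fast ULS pre-selection; the Frobenius distance below is a deliberately simple stand-in for the paper's distortion measures, and the enrolled CMs are made up.

```python
import numpy as np

def open_set_preselect(c_utt, speaker_covs, distortion, top_n=2):
    """First, fast stage: rank every enrolled speaker with a single ULS
    distortion each and keep the top_n candidates for a more precise
    (e.g., GMM/FLS) second stage."""
    scores = {name: distortion(c_utt, cov) for name, cov in speaker_covs.items()}
    return sorted(scores, key=scores.get)[:top_n]

# Illustration: Frobenius distance standing in for DS or SM
frobenius = lambda a, b: np.linalg.norm(a - b, "fro")
enrolled = {"spk1": np.eye(2) * 1.05,
            "spk2": np.eye(2) * 4.0,
            "spk3": np.eye(2) * 9.0}
shortlist = open_set_preselect(np.eye(2), enrolled, frobenius)
print(shortlist)  # ['spk1', 'spk2']
```

Only the short list is then rescored with the expensive frame level system, which is where the accuracy/complexity trade-off of the discussion lives.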

VII. CONCLUSIONS

In this paper, we have described a computationally simple method for performing text independent speaker verification. The method models the speaker and background using a sample covariance. Verification is performed by comparing the utterance sample covariance with the covariance models, using a distortion measure. Since this method allows scoring the utterance with respect to a covariance model in a single computational operation, rather than frame by frame, we termed the method utterance level scoring (ULS). The suggested approach provides substantial computational simplification with respect to likelihood ratio classifiers such as GMM. ULS using several distortion measures was tested on the 2000 NIST speaker recognition evaluation corpus and compared to GMM and a single Gaussian classifier. We also compared the tested systems in different conditions of mismatch between the types of handset used for enrollment and verification. The results indicate that ULS may be a viable technique when computational complexity needs to be gracefully traded with verification accuracy, in particular whenever it is required to score a tested utterance with respect to several models. Examples of such scenarios include open set speaker identification and using world models as gender and handset-type detectors (by using the best scoring world model for normalization).

Fig. 4. DET curves for SMR using UBM and cohort normalization, for all trials and for electret data.

Another modification of CM/ULS that might prove useful is using cohort-based normalization rather than a single world model. Since CM/ULS scoring involves a single operation once the utterance covariance has been calculated, the additional computational cost of using cohort normalization [1] instead of a single world model is relatively small.
When using cohort normalization, no world model is pre-trained, and for the verification procedure the denominator of (6) is replaced by an average of scores with respect to a few speaker models. Fig. 4 shows the performance of SMR with cohort normalization compared to normalization using a world model. We chose SMR due to its superior performance relative to DSR and SG. The cohort speakers were taken from the 1999 NIST speaker recognition evaluation. As with the world models, different normalization was used according to gender and handset type. There are 857 male-electret cohort speakers, 591 male-carbon-button, 1449 female-electret, and 523 female-carbon-button. Fig. 4 clearly indicates that cohort normalization performs worse than world normalization for SMR.

REFERENCES

[1] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Commun., vol. 17.
[2] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Process., vol. 10, no. 1-3.
[3] H. S. M. Beigi, S. H. Maes, and J. S. Sorensen, "A distance measure between collections of distributions and its application to speaker recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1998.
[4] R. D. Zilca, "Text independent speaker verification using covariance modeling," IEEE Signal Processing Lett., vol. 8, Apr. 2001.
[5] R. J. Mammone, X. Zhang, and R. P. Ramachandran, "Robust speaker recognition," IEEE Signal Processing Mag., vol. 13, Sept.
[6] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic.
[7] H. Gish, "Robust discrimination in automatic speaker identification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1990.
[8] J. P. Campbell, "Speaker recognition: A tutorial," Proc. IEEE, vol. 85, no. 9, Sept. 1997.
[9] F. Bimbot, I. M. Magrin-Chagnolleau, and L. Mathan, "Second order statistical measures for text independent speaker identification," Speech Commun., vol. 17.
[10] C. Alonso-Martinez and M. Faundez-Zanuy, "Speaker identification in mismatch training and testing conditions," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2000.
[11] R. D. Zilca and Y. Bistritz, "Distance-based Gaussian mixture model for speaker recognition over the telephone," in Proc. ICSLP-2000, Beijing, China, 2000.
[12] D. A. Reynolds, "Comparison of background normalization methods for text-independent speaker verification," in Proc. Eurospeech-97, Rhodes, Greece, 1997.
[13] D. A. Reynolds, "HTIMIT and LLHDB: Speech corpora for the study of handset transducer effects," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 2, 1997.
[14] [Online].
[15] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, "The DET curve in assessment of detection task performance," in Proc. Eurospeech-97, Rhodes, Greece, 1997.

Ran D. Zilca (M'92) received the B.S. degree with honors in electrical engineering from Ben Gurion University, Israel, in 1993, and the M.S. degree in electrical engineering from Tel Aviv University, Israel. From 1993 to 1999, he was with the Israel Defense Forces (IDF), where he served as an R&D Engineer and Project Manager. From 1999 to 2001, he was a Research Scientist with the R&D division of Amdocs, Israel, where he conducted research in text independent speaker recognition for landline and cellular environments, focusing on computationally efficient large-population open-set speaker identification. During this time, he also served as the project manager for the development of a robust voice portal for cellular providers. In 2001, he joined the Human Language Technology Department at the IBM T. J. Watson Research Center, Yorktown Heights, NY, where he is conducting research in telephone speaker recognition. His current research interests include statistical pattern recognition, and robust speaker verification and identification.


have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

Systematic reviews in theory and practice for library and information studies

Systematic reviews in theory and practice for library and information studies Systematic reviews in theory and practice for library and information studies Sue F. Phelps, Nicole Campbell Abstract This article is about the use of systematic reviews as a research methodology in library

More information

Memory-based grammatical error correction

Memory-based grammatical error correction Memory-based grammatical error correction Antal van den Bosch Peter Berck Radboud University Nijmegen Tilburg University P.O. Box 9103 P.O. Box 90153 NL-6500 HD Nijmegen, The Netherlands NL-5000 LE Tilburg,

More information

Multi-Lingual Text Leveling

Multi-Lingual Text Leveling Multi-Lingual Text Leveling Salim Roukos, Jerome Quin, and Todd Ward IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 {roukos,jlquinn,tward}@us.ibm.com Abstract. Determining the language proficiency

More information

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Email Marilyn A. Walker Jeanne C. Fromer Shrikanth Narayanan walker@research.att.com jeannie@ai.mit.edu shri@research.att.com

More information

Probability and Statistics Curriculum Pacing Guide

Probability and Statistics Curriculum Pacing Guide Unit 1 Terms PS.SPMJ.3 PS.SPMJ.5 Plan and conduct a survey to answer a statistical question. Recognize how the plan addresses sampling technique, randomization, measurement of experimental error and methods

More information

NCEO Technical Report 27

NCEO Technical Report 27 Home About Publications Special Topics Presentations State Policies Accommodations Bibliography Teleconferences Tools Related Sites Interpreting Trends in the Performance of Special Education Students

More information