Discriminative Scoring for Speaker Recognition Based on I-vectors

Jun Wang, Dong Wang, Ziwei Zhu, Thomas Fang Zheng and Frank Soong
Center for Speaker and Language Technologies (CSLT), Tsinghua University, Beijing 100084, P.R. China
E-mail: wangjun stock@hotmail.com; wangdong99@mails.tsinghua.edu.cn; zhuziwei1@outlook.com
Corresponding author: Thomas Fang Zheng, E-mail: fzheng@tsinghua.edu.cn
Microsoft Research Asia, Beijing 100084, P.R. China
E-mail: frankkps@microsoft.com

Abstract

The popular i-vector approach to speaker recognition represents a speech segment as an i-vector in a low-dimensional space. It is well known that i-vectors involve both speaker and session variances, and therefore additional discriminative approaches are required to extract speaker information from the total variance space. Among various methods, probabilistic linear discriminant analysis (PLDA) achieves state-of-the-art performance, partly due to its generative framework that represents speaker and session variances in a hierarchical way. A disadvantage of PLDA, however, lies in its Gaussian assumptions on the prior/conditional distributions of the speaker and session variables, which do not necessarily hold in reality. This paper presents a discriminative scoring approach which models i-vector pairs using a neural network (NN), so that the posterior probability that an i-vector pair belongs to the same person is read off from the NN output directly. This discriminative approach does not rely on any artificial assumptions about the data distribution and can learn speaker-related information with sufficient accuracy, provided that the network is large enough and the training data are abundant. Our experiments on the NIST SRE08 interview speech data demonstrate that the NN-based approach outperforms PLDA in the core test condition, and that combining the NN and PLDA scores leads to further gains.

I. INTRODUCTION

Joint factor analysis (JFA) has gained much success in speaker recognition.
This approach assumes that the speaker variance and session variance are derived from two independent random variables (factors) that follow standard Gaussian distributions a priori (usually in a low-dimensional subspace). The speaker representation of a speech segment is then derived by inferring the posterior probability of the speaker factor given the speech signal [1]. Recent research reveals that speaker and session variances may not be clearly separated by JFA, and session factors inferred from JFA may still involve some speaker information. A better approach is to represent speaker and session variances as a single total variance factor, so that more speaker information is retained in the posterior inference. In this approach, a speech segment is represented by an i-vector which corresponds to the mean of the inferred posterior distribution of the total variance factor. This is widely known as the total variance model or i-vector model [2].

Involving both speaker and session variances is a particular advantage of the i-vector model because more speaker-related information is retained; at the same time, however, it is also an obvious disadvantage: the mixed representation leads to less discrimination among speakers. It is therefore important to employ discriminative approaches that suppress session variance and accentuate speaker variance. For example, the within-class covariance normalization (WCCN) technique employs a linear transform derived by optimizing generalized linear kernels [3], and the nuisance attribute projection (NAP) seeks a projection that minimizes the discrepancy of signal pairs recorded over different channels [4]. These approaches, although originally proposed for the SVM-based framework, have been demonstrated to be very effective in i-vector systems as a post-processing step that enhances speaker discrimination with i-vectors [5].
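As a concrete illustration of the first of these transforms, the WCCN projection can be estimated from labeled i-vectors as sketched below. This is a minimal numpy sketch under the usual formulation (average within-class covariance, Cholesky factor of its inverse); the function name and data layout are illustrative, not taken from [3].

```python
import numpy as np

def wccn_transform(ivectors, labels):
    """Estimate a WCCN projection B such that B @ B.T equals inv(W),
    where W is the average within-class covariance of the i-vectors."""
    dim = ivectors.shape[1]
    classes = np.unique(labels)
    W = np.zeros((dim, dim))
    for c in classes:
        X = ivectors[labels == c]
        Xc = X - X.mean(axis=0)        # center each class around its mean
        W += Xc.T @ Xc / len(X)        # per-class (biased) covariance
    W /= len(classes)
    # Cholesky factor of inv(W); apply to each i-vector as B @ w
    L = np.linalg.cholesky(np.linalg.inv(W))
    return L.T
```

Applying the returned matrix to every i-vector whitens the average within-class covariance, which is exactly the normalization that makes cosine or inner-product scoring less sensitive to session variability.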
Another approach that remarkably improves the representative power of i-vectors is probabilistic linear discriminant analysis (PLDA) [6]. On the one hand, PLDA is a probabilistic version of LDA and so inherits LDA's discriminative nature; on the other hand, PLDA is a generative model which places a prior on the underlying class variable, and so can handle classes with very limited or even no training data. This is a big advantage of PLDA in speaker recognition, since in most situations only very few utterances are available for enrollment and test; a generative model like PLDA is thus much more reasonable than a discriminative model such as the support vector machine (SVM), because the latter usually requires a substantial number of utterances to train one-vs-all classifiers.

In spite of the great success achieved by PLDA, the model still has some limitations. In particular, it assumes Gaussian forms for the prior distributions of speaker factors as well as for the class conditional distributions of i-vectors, and it is not directly optimized with respect to the recognition task, i.e., the true/imposter speaker decision. We therefore seek a discriminative model which relaxes the Gaussian assumption and predicts the posteriors of true/imposter speakers directly. A possible approach is to design an SVM which conducts the prediction based on i-vector pairs, but our experiments showed this to be highly impractical, due to the large number of i-vector combinations and the high dimensionality of i-vectors. In this paper, we propose a discriminative approach which conducts the true/imposter decision with a neural network (NN). Unlike the naive method that builds discriminative models on raw i-vector pairs, we first extract discriminative features and then construct the NN on top of these features. This simple approach is demonstrated to be very effective and achieves better performance than PLDA on the NIST SRE08 core test.
The rest of the paper is organized as follows: Section II reviews the i-vector and PLDA techniques, Section III presents the NN-based scoring approach, and Section IV presents the experiments. The paper is concluded in Section V.

II. THEORY BACKGROUND

A. i-vector model

The conventional approach to speaker recognition is based on the universal background model / Gaussian mixture model (UBM-GMM) architecture. The i-vector approach extends the UBM-GMM approach by assuming that both the speaker and session variances of a speech segment concentrate on a low-dimensional subspace of the model supervector (the concatenation of the mean vectors of all the GMM components). This subspace is referred to as the total-variance space, and a speech segment can be represented by an identity vector (i-vector) in this space.

978-616-361-823-8, 2014 APSIPA

Mathematically, let the UBM have C Gaussian components, and let the acoustic features of a speech segment associated with the c-th component follow a Gaussian distribution with mean M_c and covariance Σ_c. The i-vector model assumes that M_c is generated from a low-dimensional variable w ∈ R^M, normally distributed in the i-vector space, via a linear transformation T_c:

M_c = m_c + T_c w    (1)

where m_c ∈ R^D is the mean of the c-th component of the UBM and T_c ∈ R^{D×M} is the loading matrix associated with the c-th component; w follows the standard normal distribution N(0, I). The loading matrices {T_c} can be trained by an EM procedure [7]. Once {T_c} has been obtained, a speech segment X can be represented by the posterior probability p(w|X), which can be inferred according to (1). Specifically, since the prior p(w) is Gaussian, the posterior p(w|X) is Gaussian as well:

p(w|X) = N(w̄, Ξ)    (2)

where the mean vector w̄ and the covariance matrix Ξ can be computed from the zero- and first-order statistics of X; details of the derivation can be found in [2]. In speaker recognition, the mean vector w̄ is taken as the identity vector (i-vector) of the speech segment, and the true/imposter decision is made based on the distance (cosine distance is a common choice) between the i-vectors of the test speech and the enrollment speech. Note that an i-vector involves both speaker and non-speaker (e.g., channel, content, emotion) information. In order to improve discrimination among speakers, transforms such as WCCN [3], NAP [4] and LDA [8] are usually applied before computing the distance.
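The posterior in (2) has a closed form given the zero-order statistics N_c and the centered first-order statistics F_c of the segment. Below is a minimal numpy sketch assuming diagonal UBM covariances; the function name and argument layout are our own illustration, not the formulation of [2].

```python
import numpy as np

def ivector_posterior(N, F, T, Sigma):
    """Posterior of the total-variance factor w given Baum-Welch stats.
    N:     (C,) zero-order statistics per UBM component
    F:     (C, D) first-order statistics, already centered by the UBM means m_c
    T:     (C, D, M) loading matrices T_c
    Sigma: (C, D) diagonal covariances of the UBM components
    Returns the posterior mean (the i-vector) and the posterior covariance."""
    C, D, M = T.shape
    # precision = I + sum_c N_c T_c^T Sigma_c^{-1} T_c
    precision = np.eye(M)
    b = np.zeros(M)
    for c in range(C):
        TtSi = T[c].T / Sigma[c]        # T_c^T Sigma_c^{-1} (diagonal case)
        precision += N[c] * TtSi @ T[c]
        b += TtSi @ F[c]
    Xi = np.linalg.inv(precision)       # posterior covariance
    w_bar = Xi @ b                      # posterior mean = the i-vector
    return w_bar, Xi
```

Note that with empty statistics (N = 0, F = 0) the posterior collapses to the prior N(0, I), which is a quick sanity check for any implementation.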
B. PLDA

It is well known that linear discriminant analysis (LDA) corresponds to the generative model

w_{i,j} = w_i + A u

where w_{i,j} is an observation vector (an i-vector in speaker recognition) of the i-th class, w_i is the mean vector of that class, and u follows a Gaussian distribution: u ~ N(0, I). This formulation implies that LDA assumes the class conditional distributions p(w_{i,j} | w_i) to be Gaussians sharing the same covariance. Ioffe extended this model by placing a Gaussian prior on w_i, which leads to the hierarchical Bayesian model shown in Figure 1.

Fig. 1. The graphical model of PLDA, where N is the number of samples of class i.

By this extension, the class mean w_i is treated as a continuous variable instead of a discrete parameter as in traditional LDA. This significantly improves the model's generalizability, so that classes with very few samples can still be well represented thanks to the prior [6]. This is particularly attractive for speaker recognition, where in most cases only a few enrollment/test utterances (and hence i-vectors) are available for a speaker. Numerous studies have reported that PLDA significantly improves the performance of i-vector systems and achieves state-of-the-art results [5].

III. NN-BASED DISCRIMINATIVE MODEL

A. Concerns about PLDA

In spite of its success in speaker recognition, PLDA has some limitations, particularly the underlying Gaussian assumptions on the prior and class conditional distributions. There is little justification for these assumptions other than computational tractability in model training and inference. The assumptions can be relaxed to some extent by replacing the Gaussians with Gaussian mixtures, as mentioned in [9]; however, this greatly increases model complexity, and its effectiveness has not yet been demonstrated in speaker recognition. Another concern with PLDA is generative modeling itself: the optimization objective is to fit the data.
Although the fitting takes class discrimination into account, it is still suboptimal with respect to the recognition task, i.e., the true/imposter decision. A desirable model should be discriminative in nature, with the true/imposter decision error rate as its objective function. A simple discriminative approach designs a one-vs-all classifier for each class, as in conventional SVM-based systems, and this can be directly migrated to i-vector systems [10]. This approach, however, requires building many classifiers and suffers from data sparsity. An ideal approach builds a single classifier that can make decisions for all speakers, as PLDA does. The most straightforward way is to collect a number of i-vector pairs, label them as same/different speakers, and train a discriminative model to predict the posterior probability that a pair of i-vectors belongs to the same speaker. According to our experiments, this approach is promising with a small set of enrollment speakers; however, when the number of speakers increases, the performance degrades drastically, suggesting that it is hard for a discriminative model to learn discriminative patterns from raw i-vectors.

B. NN-based scoring

We therefore turn to an alternative way of building the discriminative model. First, a number of i-vector pairs {(v_{i,1}, v_{i,2})} are collected and labeled as positive (+1) or negative (-1) samples according to whether v_{i,1} and v_{i,2} belong to the same speaker, leading to a training set {(v_{i,1}, v_{i,2}; l_i)}, where l_i is the label of the i-th pair. In order to obtain the most discriminative information while keeping the feature set compact, LDA is applied to project the i-vector pairs to a low-dimensional subspace, resulting in a projected training set {(v′_{i,1}, v′_{i,2}; l_i)}, where v′ is the image of v under the LDA projection.
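The pair construction and projection just described can be sketched as follows. The sampling here is deliberately naive; NN training in practice needs balanced positive and negative pairs, which this sketch does not enforce. The projection matrix L is assumed to be an LDA transform estimated elsewhere, and the helper name is our own.

```python
import numpy as np

def make_projected_pairs(ivectors, speakers, L, n_pairs, seed=0):
    """Collect labeled i-vector pairs and apply an LDA projection L.
    Returns (v1, v2, labels) with label +1 for same-speaker pairs, -1 otherwise.
    L is a (dim_out, dim_in) projection matrix estimated elsewhere."""
    rng = np.random.default_rng(seed)
    v1, v2, labels = [], [], []
    for _ in range(n_pairs):
        i, j = rng.integers(len(ivectors), size=2)   # naive random pairing
        v1.append(L @ ivectors[i])
        v2.append(L @ ivectors[j])
        labels.append(1 if speakers[i] == speakers[j] else -1)
    return np.array(v1), np.array(v2), np.array(labels)
```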
A number of discriminative features are then extracted, leading to a ready-to-use training set {(f_i; l_i)}, where f_i is the feature vector derived from the pair (v′_{i,1}, v′_{i,2}). In this work, we use a simple feature set: the squared differences on the first n dimensions of the two vectors in the pair, i.e.,

f_i(j) = (v′_{i,1}(j) − v′_{i,2}(j))²,  j = 0, ..., n − 1.

Note that v′_{i,1} and v′_{i,2} are in the LDA projection space, and hence their first n dimensions are assumed to retain the most discriminative information among all the dimensions. In addition, considering the success of the cosine distance in traditional i-vector systems, it is also taken as a feature. In summary, the feature set involves n + 1 elements:

[f_i(0), f_i(1), ..., f_i(n − 1), ⟨v_{i,1}, v_{i,2}⟩ / (‖v_{i,1}‖ ‖v_{i,2}‖)]

where f_i(j) = (v′_{i,1}(j) − v′_{i,2}(j))². With the training data, a discriminative model can be constructed and optimized with respect to the true/imposter decision error rate. The NN is chosen as the discriminative model in this study, though any discriminative model (e.g., an SVM) would work. The entire system is shown in Figure 2, where s_nn is the posterior probability that the input i-vector pair represents the same speaker.

Fig. 2. Architecture of the NN-based scoring.

It should be noted that training the NN model requires balanced positive and negative samples, so the output of the model is a class posterior based on equal priors, i.e., genuine speakers and imposters have equal weights. Therefore the model cannot be used to make decisions directly. A threshold on the posterior s_nn needs to be determined on a development set to achieve the best performance in terms of the evaluation metric, which is the equal error rate (EER) in this study. From this perspective, the NN-based approach is a scoring approach which extends the commonly used scoring based on cosine distance; in fact, if the feature set involves only the cosine distance, this approach reduces to cosine scoring.

C. PLDA-NN combination

The advantage of the NN-based approach over PLDA lies in relaxing PLDA's Gaussian assumptions. This advantage, however, holds only when the training data are abundant enough to ensure reliable learning of the discriminative boundary. This condition is not always satisfied, and data sparsity is a perennial challenge for speaker recognition. In areas of the i-vector pair space where few or no training samples are available, the NN approach is expected to be inferior to PLDA, whose Gaussian prior provides a sensible fallback in such regions.
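The n + 1 element feature set described above maps directly to code. A minimal sketch (the helper name `pair_features` is ours); following the paper, the squared differences are taken on the first n LDA-projected dimensions, with the cosine similarity appended:

```python
import numpy as np

def pair_features(v1, v2, n=10):
    """Discriminative features for an i-vector pair (after LDA projection):
    squared differences on the first n dimensions, plus the cosine similarity."""
    diffs = (v1[:n] - v2[:n]) ** 2
    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.concatenate([diffs, [cos]])
```

Stacking `pair_features` over all training pairs yields the (n + 1)-dimensional input matrix fed to the NN (or any other discriminative classifier).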
It is natural to combine the two approaches and leverage their respective advantages. In this paper we take a simple score averaging approach which combines the posterior probability from the NN (s_nn) and the likelihood ratio from PLDA (s_plda) by linear interpolation, and uses the combined score to make the decision:

s_cmb = α s_nn + (1 − α) s_plda    (3)

where α is a tunable parameter that can be determined on a development set.

IV. EXPERIMENTS

A. Databases

We conduct the experiments on the interview data of the NIST 2008 speaker recognition evaluation (SRE08). All the data are recordings of female speakers, and each enrollment or test segment consists of 2 minutes of speech. The test comprises three conditions, shown in Table I: condition 3 is the full trial set, while conditions 1 and 2 consider trials with the same and different microphone types in enrollment and test, respectively.

TABLE I
EVALUATION CONDITIONS

Condition | Trials | Description
c1        | 957    | enrollment & test are from the same mic type
c2        | 17941  | enrollment & test are from different mic types
c3        | 18898  | no consideration of mic types in enrollment & test

The i-vector system (including the parameters of the UBM and the T matrix) was trained with recordings of 7196 female speakers (12837 utterances in total) selected from the Fisher telephone speech database. The same database was also used to train the LDA and PLDA models.

B. Experimental setup

All the speech data used in this study are sampled at 8 kHz with 16-bit precision. The acoustic features are 19-dimensional Mel-frequency cepstral coefficients (MFCCs) together with the log energy. The first- and second-order derivatives are appended to the static features, resulting in 60-dimensional feature vectors. The UBM has 2048 Gaussian components and was trained with about 4000 female utterances randomly selected from the Fisher database.
The T matrix of the i-vector system was trained with all the female utterances in the Fisher database, and the dimensionality of the i-vectors is 400. The LDA and PLDA models were trained with utterances of 7196 female speakers, again randomly selected from the Fisher database. The dimension of the LDA projection space is set to 150. To train the NN model, we selected 32500 pairs of i-vectors extracted from speech segments randomly drawn from the Fisher database. As mentioned, the discriminative features are selected based on the first n dimensions of the LDA-projected i-vectors. To determine an appropriate n, we selected 100 speakers from the SRE08 database as a cross-validation (CV) dataset, which consists of about 3000 trials. A number of NN structures were tested, and the best one was selected based on CV performance: the optimal structure has 2 hidden layers of 200 units each.

Table II presents the performance of the i-vector baseline, i-vector plus LDA, and i-vector plus PLDA on the three evaluation conditions. Both the LDA and PLDA systems outperform the i-vector baseline, and the PLDA system obtains the best overall performance (condition 3), confirming the power of this model.

TABLE II
PERFORMANCE OF BASELINE SYSTEMS (EER)

Condition | i-vector | i-vector + LDA | i-vector + PLDA
c1        | 4.05%    | 1.56%          | 2.18%
c2        | 28.50%   | 23.50%         | 19.70%
c3        | 28.63%   | 23.35%         | 19.50%
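Since all results here are reported in terms of EER, a reference implementation of the metric is useful for reproducing the tables. This threshold-sweep version is our own simplification, not the NIST scoring tool:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where false acceptance and false rejection meet.
    scores: higher means 'same speaker'; labels: 1 = genuine trial, 0 = imposter."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_gap, eer = 1.0, 1.0
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)   # imposters accepted
        frr = np.mean(scores[labels == 1] < t)    # genuine trials rejected
        if abs(far - frr) < best_gap:             # closest FAR/FRR crossing
            best_gap = abs(far - frr)
            eer = (far + frr) / 2
    return eer
```

Sweeping only the observed score values is sufficient because FAR and FRR are step functions that change only at those points.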
C. Discriminative feature selection

The first experiment optimizes the selection of discriminative features for the NN. We choose the first n dimensions of the LDA-projected i-vectors to extract the discriminative features, build the NN on them, and test the performance on the CV dataset. Figure 3 shows the EER results with different n. It can be seen that n = 10 is a good trade-off: a smaller n may lead to loss of speaker information, while a larger n suffers from over-fitting to non-speaker variance.

Fig. 3. Performance of the NN-based scoring with various numbers of discriminative features. Results are reported in terms of EER on the CV set.

To investigate the generalizability of the feature selection, the NNs built with different n are tested on the evaluation dataset, leading to the results illustrated in Figure 4. The curves on the evaluation set show patterns similar to those on the CV set, although the optimal choices of n are not exactly the same. This suggests that the feature selection based on the CV set generalizes well.

Fig. 4. Performance of the NN-based scoring with various numbers of discriminative features. Results are reported in terms of EER on the three evaluation conditions.

D. NN-based scoring

Based on the selected discriminative features (n = 10), the NN-based system was constructed. The EER results on the three evaluation conditions are presented in Figure 5. The NN-based approach outperforms the three baselines on all three conditions.

Fig. 5. Performance of the NN-based system compared with three baseline systems. Results are reported in terms of EER and on the three evaluation conditions.

To confirm this observation, pairwise t-tests are conducted to compute the significance level (p value) among the three competitive models: LDA, PLDA and NN. The results are shown in Table III. Note that the dataset in condition 1 is too small to compute a reliable p, so only the results of conditions 2 and 3 are reported. Both PLDA and NN outperform LDA very significantly, whereas the NN system outperforms the PLDA system in a weakly significant way.

TABLE III
PAIRWISE T-TESTS (p VALUES)

Condition | LDA vs NN | PLDA vs NN | LDA vs PLDA
c2        | 1.53e-07  | 0.015      | 2.13e-07
c3        | 2.15e-07  | 0.040      | 1.59e-07

E. Combining NN and PLDA

Figure 6 presents the performance of the combined approach with various α (see (3)). The combined approach indeed provides better performance with an appropriate setting of α.

Fig. 6. Performance of the PLDA-NN combination system on the three evaluation conditions.

V. CONCLUSIONS

This paper presented an NN-based scoring approach for i-vector speaker recognition systems. We argue that by relaxing the Gaussian assumptions of PLDA and optimizing the model directly with respect to the decision task, the NN-based approach can achieve better performance than PLDA when training samples are abundant. Furthermore, the NN and PLDA approaches are complementary and can be combined to obtain further gains. These conjectures are confirmed by experiments conducted on the SRE08 interview data. We acknowledge that this study is preliminary; in particular, the discriminative features used here are rather simple and the combination approach is rather naive. A better feature selection method and a better combination approach may significantly improve the NN-based approach, which we leave as future work.

VI. ACKNOWLEDGEMENTS

This work was supported by the National Basic Research Program (973 Program) of China under Grant No. 2013CB329302 and the National Science Foundation of China (NSFC) under Projects No. 61371136 and No. 61271389.
REFERENCES

[1] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435-1447, 2007.
[2] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[3] A. O. Hatch, S. S. Kajarekar, and A. Stolcke, "Within-class covariance normalization for SVM-based speaker recognition," in INTERSPEECH, 2006.
[4] A. Solomonoff, C. Quillen, and W. M. Campbell, "Channel compensation for SVM speaker recognition," in Proc. Odyssey: The Speaker and Language Recognition Workshop, 2004, pp. 57-62.
[5] C. S. Greenberg, V. M. Stanford, A. F. Martin, M. Yadagiri, G. R. Doddington, J. J. Godfrey, and J. Hernandez-Cordero, "The 2012 NIST speaker recognition evaluation," 2013.
[6] S. Ioffe, "Probabilistic linear discriminant analysis," in ECCV, 2006, pp. 531-542.
[7] P. Kenny, G. Boulianne, and P. Dumouchel, "Eigenvoice modeling with sparse training data," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 345-354, 2005.
[8] M. McLaren and D. V. Leeuwen, "Source-normalised-and-weighted LDA for robust speaker recognition using i-vectors," IEEE Transactions on Audio, Speech, and Language Processing, pp. 5456-5459, 2011.
[9] S. J. Prince and J. H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in ICCV, 2007, pp. 1-8.
[10] N. Dehak, R. Dehak, P. Kenny, P. Ouellet, and P. Dumouchel, "Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification," in ICSLP, 2009, pp. 1559-1562.