Using MMSE to improve session variability estimation. Gang Wang and Thomas Fang Zheng*

Int. J. Biometrics, Vol. 2, No. 4, 2010

Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
E-mail: wang-g07@mails.tsinghua.edu.cn
E-mail: fzheng@tsinghua.edu.cn
*Corresponding author

Abstract: In this paper, the Session Variability Subspace Projection (SVSP) method based on model compensation for speaker verification is improved using the Minimum Mean Square Error (MMSE) criterion. The weakness of SVSP is that, when the SVSP matrix is estimated, the speaker's session-independent supervector is approximated by the average of all his or her session-dependent GMM supervectors. An error between the two clearly exists, and our goal is to minimise it using the MMSE criterion. Compared with the original SVSP, the proposed method achieves a relative error rate reduction of 6.7% for EER and 5.3% for the minimum detection cost function on the NIST SRE 2006 1C4W dataset.

Keywords: speaker verification; session variability; MMSE; model compensation.

Reference to this paper should be made as follows: Wang, G. and Zheng, T.F. (2010) Using MMSE to improve session variability estimation, Int. J. Biometrics, Vol. 2, No. 4, pp.350–357.

Biographical notes: Gang Wang is pursuing his PhD in the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His research interests focus on speaker recognition and multi-speaker segmentation.

Thomas Fang Zheng received his PhD Degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 1997. He is a Research Professor and Vice Dean of the Research Institute of Information Technology, Tsinghua University and Director of the Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology. He is an IEEE senior member and his current research interests include speech recognition, speaker recognition and natural language processing.

Copyright 2010 Inderscience Enterprises Ltd.

1 Introduction

The mismatch caused by session variability is still a major problem in speaker recognition, in spite of the great progress that has been made in this field. Session variability includes transmission channel effects, transducer characteristics, background noise, intra-speaker variability, and so on. Several methods have been proposed to solve this problem, and they can be categorised into three domains: the feature domain, the model domain and the score domain. In the feature domain, typical methods include cepstral mean subtraction (Furui, 1981), the RASTA filter (Hermansky and Morgan, 1994), feature warping (Pelecanos and Sridharan, 2001) and feature mapping (Reynolds, 2003); in the model domain, typical methods include speaker model synthesis (Teunen et al., 2000; Wu et al., 2007), factor analysis (Kenny et al., 2005; Vogt and Sridharan, 2006), Nuisance Attribute Projection (NAP) (Campbell et al., 2006) and SVSP based on model compensation (Deng et al., 2007); while in the score domain, typical methods include Hnorm (Reynolds, 1996), Tnorm (Auckenthaler et al., 2000), Znorm (Li and Porter, 1988) and ZTnorm (Kenny et al., 2008).

Methods in the model domain have been proposed recently, have become very popular, and have achieved impressive reductions in verification error rates (Kenny et al., 2005; Vogt and Sridharan, 2006; Campbell et al., 2006; Deng et al., 2007). SVSP (Deng et al., 2007) greatly reduces the computational complexity of session variability estimation compared with the traditional factor analysis method (Vogt and Sridharan, 2006). It can also be used directly in a GMM-UBM system, where it improves speaker verification performance over the conventional GMM-UBM method. However, SVSP is not perfect.
In the SVSP algorithm, the session variability of a test utterance is used to compensate the tentative speaker models, whose own session variability (which is, of course, mostly different from that estimated from the test utterance) has already been removed during the training phase. Its performance therefore depends to some extent on how accurately the session variability is estimated. But when estimating the SVSP matrix, SVSP uses the average of all of a speaker's session-dependent GMM supervectors to approximate the session-independent supervector of the speaker; this reduces the accuracy of the session variability estimate. Accordingly, the performance of speaker recognition will be degraded.

Considering the above, an improved method of SVSP matrix estimation based on MMSE is proposed in this paper. Its main idea is to estimate the SVSP matrix under the constraint of minimising the mean square error of the session-independent GMM supervector on the development data set, thereby improving the accuracy of the session variability estimate.

This paper is organised as follows. In Section 2, the SVSP method is briefly reviewed and then the proposed method is introduced. In Section 3, experiments are described and results are given. Finally, conclusions and perspectives are presented in Section 4.

2 SVSP review and improvement

2.1 SVSP review

Given a speaker's Gaussian mixture model, a GMM supervector M(s, i) can be formed by concatenating the GMM component mean vectors (Wu et al., 2007; Kenny et al., 2005;

Vogt and Sridharan, 2006). The supervector is the sum of a session-independent GMM supervector m(s) and an additional session-dependent supervector Uz(s, i) (Campbell et al., 2006), which is illustrated in Figure 1 and can be described as

$$ M(s, i) = m(s) + Uz(s, i). \tag{1} $$

Figure 1 The decomposition of GMM supervector

In equation (1), the GMM supervector M(s, i) is dependent on the ith session of speaker s; z(s, i) is the latent factor, which is assumed to follow a standard normal distribution. U is a $CD \times R_C$ low-rank matrix from the constrained session variability subspace, where C is the number of Gaussian components in a Universal Background Model (UBM), D is the dimension of the acoustic feature vectors, and $R_C$ is the rank of the U matrix, with $R_C \ll CD$. The computation method of U can be found in Vogt and Sridharan (2006). Note that the eigenvectors used to form the U matrix are orthogonal, so the derived projection matrix P can be written as

$$ P = UU^{t} \quad \text{and} \quad PU = UU^{t}U = U. \tag{2} $$

Figure 2 illustrates the detailed work flow of the SVSP method based on model compensation for GMM-UBM systems, which is divided into two parts. The part surrounded by the dashed rectangle will be modified in this paper so as to improve the estimation of session variability; details are introduced in Section 2.2.

The UBM is trained with the EM algorithm (Dempster et al., 1977) using hundreds of speakers' utterances containing all kinds of sessions, so it can be regarded as session-independent. A GMM supervector µ can be formed from the UBM. Given a training utterance i uttered by speaker s, the speaker model can be obtained from the UBM using the conventional MAP adaptation method (Reynolds et al., 2000) with only the mean vectors changed. A GMM supervector M(s, i) can then be formed from this speaker model.
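The properties in equations (1) and (2) can be checked numerically on a toy example. The following is a minimal sketch in plain Python, not the authors' implementation: the 4-dimensional supervector, the rank-2 matrix `U`, and the helper names `matmul`, `transpose`, `close` and `remove_session` are all assumptions chosen for illustration. It shows that PU = U holds when the columns of U are orthonormal, and that applying (I - P) to M(s, i) = m(s) + Uz(s, i) leaves no trace of the term Uz(s, i).

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def close(a, b, tol=1e-12):
    """Element-wise comparison of two matrices within a tolerance."""
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Hypothetical toy U with orthonormal columns: CD = 4 dimensions, R_C = 2.
U = [[0.6, 0.0],
     [0.8, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]

P = matmul(U, transpose(U))                  # equation (2): P = U U^t

m = [[1.0], [2.0], [3.0], [4.0]]             # assumed session-independent part m(s)
z = [[2.0], [-1.0]]                          # assumed latent session factor z(s, i)
Uz = matmul(U, z)
M = [[a[0] + b[0]] for a, b in zip(m, Uz)]   # equation (1): M = m + Uz


def remove_session(vec):
    """Apply (I - P) to a column vector."""
    projected = matmul(P, vec)
    return [[v[0] - p[0]] for v, p in zip(vec, projected)]


pu_equals_u = close(matmul(P, U), U)                                 # PU = U
session_term_removed = close(remove_session(M), remove_session(m))  # (I - P)Uz = 0
print(pu_equals_u, session_term_removed)
```

In this toy case (I - P)m differs from m, because the chosen m has a component inside the subspace spanned by U. That is the same effect the conclusions point to when they note that some speaker-dependent information may be removed by the projection.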
Afterwards, the session variability is removed from the GMM supervector by projection, which can be written as

$$ m(s) = (I - P)M(s, i) \tag{3} $$

where m(s) is the session-independent GMM supervector of speaker s.

During the recognition phase, given a test utterance j uttered by speaker t, a speaker model is adapted from the UBM with the conventional MAP method (Reynolds et al., 2000), and a GMM supervector M(t, j) is then formed from it. Afterwards, the session variability of the test utterance is calculated by equation (4)

$$ Uz(t, j) = PM(t, j). \tag{4} $$

Therefore, the session-j-dependent GMM supervector of speaker s, M(s, j), can be obtained after the session-independent GMM supervector m(s) is compensated with Uz(t, j) as

$$ M(s, j) = (I - P)M(s, i) + PM(t, j) \tag{5} $$

where M(s, j) can be regarded as the model of speaker s whose session is the same as that of the test utterance. Similarly, the session-j-dependent UBM supervector M(ubm, j) can be obtained after µ is compensated with Uz(t, j) as

$$ M(ubm, j) = \mu + PM(t, j) \tag{6} $$

where µ denotes the session-independent UBM supervector, and M(ubm, j) can be regarded as a session-dependent UBM supervector whose session is the same as that of the test utterance.

Figure 2 The block diagram of SVSP based on model compensation for GMM-UBM system

2.2 The proposed SVSP estimation method

In Campbell et al. (2006), the U matrix estimation is performed under the assumption that the session-independent GMM supervector of a speaker s is approximated by the average of all the speaker's session-dependent GMM supervectors, which can be written as

$$ \bar{M}(s) = \frac{1}{N_s}\sum_{i=1}^{N_s} M(s, i) \approx m(s) \tag{7} $$

where $N_s$ is the number of utterances spoken by speaker s, and each utterance is regarded as one session.

However, an error between $\bar{M}(s)$ and m(s) clearly exists. On the one hand, if the U matrix is estimated precisely, the error between $\bar{M}(s)$ and m(s) should be as small as possible. On the other hand, a U matrix is ideal if, after session variability has been removed with it, the differences among the session-independent GMM supervectors obtained from different sessions of the same speaker are minimal. From these two points, it can be concluded that a better U matrix should satisfy equation (9).

$$ \mathrm{MSE}(U) = \frac{1}{H}\sum_{s=1}^{H}\frac{1}{N_s}\sum_{i=1}^{N_s}\left(M(s, i) - \bar{M}(s)\right)^{2} \tag{8} $$

$$ U^{*} = \arg\min_{U} \mathrm{MSE}(U). \tag{9} $$

In equation (8), H is the number of unique speakers in the development data set, and MSE(U) is the mean square error of the U matrix.

$$ \bar{M}(s) = \frac{1}{N_s}\sum_{i=1}^{N_s} (I - P)M(s, i). \tag{10} $$

After the initial U matrix has been estimated using the original algorithm, equation (10) replaces equation (7) and the U matrix is re-estimated iteratively. The iteration stops when the reduction of MSE(U) becomes very small or when a predefined maximum number of iterations has been reached. The detailed flow chart is given in Figure 3.

Figure 3 The flow chart of U matrix estimation based on MMSE
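The iterative scheme described around equations (7) to (10) can be sketched as below. This is a toy illustration under several loudly stated assumptions, not the authors' procedure: the subspace is restricted to rank 1, the leading direction of the within-speaker deviations (found by power iteration) stands in for the eigen-decomposition of Vogt and Sridharan (2006), and MSE(U) is read as the spread of the compensated supervectors (I - P)M(s, i) around the centre given by equation (10). All function names and the 3-dimensional data are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def remove_session(u, v):
    """(I - P) v for the rank-1 projection P = u u^t."""
    c = dot(u, v)
    return [x - c * y for x, y in zip(v, u)]


def fit_direction(speakers, centres, iters=100):
    """Leading direction of the deviations M(s, i) - centre(s), by power iteration."""
    devs = [sub(m, c) for sessions, c in zip(speakers, centres) for m in sessions]
    dim = len(devs[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        w = [0.0] * dim
        for d in devs:
            coeff = dot(d, u)
            w = [wi + coeff * di for wi, di in zip(w, d)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        u = [wi / norm for wi in w]
    return u


def mse(speakers, u):
    """Equation (8), assumed here to be evaluated on compensated supervectors."""
    total = 0.0
    for sessions in speakers:
        cleaned = [remove_session(u, m) for m in sessions]
        centre = mean(cleaned)  # equation (10)
        total += sum(dot(sub(c, centre), sub(c, centre)) for c in cleaned) / len(sessions)
    return total / len(speakers)


def estimate_u(speakers, max_iters=8, tol=1e-6):
    """Initial estimate from equation (7), then refinement with equation (10)."""
    centres = [mean(sessions) for sessions in speakers]  # equation (7)
    u = fit_direction(speakers, centres)
    best = mse(speakers, u)
    for _ in range(max_iters):
        centres = [mean([remove_session(u, m) for m in s]) for s in speakers]
        candidate = fit_direction(speakers, centres)
        candidate_mse = mse(speakers, candidate)
        if best - candidate_mse < tol:  # reduction too small: stop
            break
        u, best = candidate, candidate_mse
    return u, best


# Hypothetical development set: 2 speakers, 3 sessions each, 3-D supervectors.
speakers = [
    [[-1.0, 2.1, 0.0], [0.5, 1.9, 0.1], [1.5, 2.0, -0.1]],
    [[3.0, -1.0, 3.1], [0.0, -1.1, 3.0], [1.0, -0.9, 2.9]],
]
u, final_mse = estimate_u(speakers)
raw_mse = mse(speakers, [0.0, 0.0, 0.0])  # no projection at all
print(u, final_mse, raw_mse)
```

The loop keeps the previous estimate whenever a candidate fails to reduce MSE(U) by more than `tol`, which is one possible reading of the stopping rule above. The cap of 8 iterations mirrors the 5 to 8 iterations reported in Section 3.2 but is otherwise arbitrary.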

3 Experiments and results

The experiments were performed on the National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) 2006 corpus (National Institute of Standards and Technology, 2004, 2005, 2006) and focused on the single-side one-conversation training, single-side one-conversation test condition. Features were extracted from 20-millisecond frames with a 10-millisecond shift. The pre-emphasis coefficient was 0.97 and a Hamming window was applied to each pre-emphasised frame. Energy-based voice activity detection was performed, with each frame labelled either valid or invalid. 16-dimensional MFCC features were extracted only from the valid frames, using 30 triangular Mel filters in the MFCC calculation. For each frame, the MFCC coefficients and their first delta coefficients formed a 32-dimensional feature vector. To reduce channel effects, mean and variance normalisation was applied to the extracted features.

Two gender-dependent UBMs, each with 1024 mixture components, were trained on the NIST SRE 2004 1C4W dataset using the EM algorithm (Dempster et al., 1977). For the MAP training (Reynolds et al., 2000), only the mean vectors were adapted, with a relevance factor of 16. The data used to estimate the U matrix came from the single-side 8-conversation training condition in NIST SRE 2005, which consisted of 279 females and 202 males.

The baseline system is a speaker verification system based on the conventional GMM-UBM. Equal Error Rate (EER) and the minimum detection cost function (Min-DCF) (National Institute of Standards and Technology, 2004, 2005, 2006) are used to evaluate the performance of the system. The parameters of the DCF are the same as in National Institute of Standards and Technology (2004, 2005, 2006).

3.1 The rank of the U matrix

The rank of the U matrix is critical for the SVSP algorithm because it affects the accuracy of the session variability estimate.
The results for different session variability subspace sizes are given in Table 1, where $R_C$ is the rank of the U matrix. The system used in this experiment was based on the original SVSP. Experimental results show that the system achieves the best Min-DCF and EER when $R_C$ = 50.

Table 1 Min-DCF and EER results for different U matrix rank

R_C    MSE(U)    Min-DCF (×10^-2)    EER (%)
10     1028.2    9.4                 9.56
30     940.5     7.7                 8.25
50     891.7     7.6                 7.93
70     856.8     7.8                 8.12
90     828.3     7.9                 8.37

3.2 Optimising the U matrix

The proposed method described in Section 2.2 was used to optimise the U matrix. After about 5–8 iterations, equation (9) can be satisfied. After optimisation, the Min-DCF and EER decrease as MSE(U) becomes smaller.

It can be seen from Table 2 that the best Min-DCF and EER are achieved when $R_C$ = 70.

Table 2 The DCF and EER results for the proposed method

R_C    MSE(U)    Min-DCF (×10^-2)    EER (%)
10     1025.5    8.2                 9.37
30     931.1     7.5                 8.15
50     870.3     7.4                 7.85
70     816.2     7.2                 7.40
90     818.7     7.6                 8.19

3.3 Comparison between the two methods

Figure 4 shows the performance comparison among three methods: baseline (GMM-UBM), original SVSP, and improved SVSP. Compared with the original SVSP, the proposed method achieves a relative reduction of 6.7% for EER and 5.3% for DCF, though the curves cross each other.

Figure 4 The DET curve comparison among baseline, the original SVSP, and the improved SVSP (see online version for colours)

4 Conclusions

In this paper, the evaluation of the U matrix was analysed and an evaluation criterion was given in equations (8) and (9). Here, a U matrix is said to perform better when the system using it achieves a smaller EER and DCF. To some extent, the smaller MSE(U) is, the better the U matrix performs. The proposed method estimates the U matrix better, so the session variability can be estimated more accurately and the effect of the session can be reduced further. The experimental results demonstrate the effectiveness of the proposed SVSP matrix estimation method based on MMSE.
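The relative reductions quoted in Section 3.3 can be reproduced from the tables. The short check below is a sketch that assumes the comparison is between the best operating point of each method, that is, the $R_C$ = 50 row of Table 1 and the $R_C$ = 70 row of Table 2. The paper does not state this pairing explicitly, and the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(before, after):
    """Relative reduction in percent when a metric drops from `before` to `after`."""
    return 100.0 * (before - after) / before


# Values copied from Table 1 (original SVSP, R_C = 50)
# and Table 2 (proposed method, R_C = 70); Min-DCF is in units of 10^-2.
original = {"eer": 7.93, "min_dcf": 7.6}
proposed = {"eer": 7.40, "min_dcf": 7.2}

eer_gain = relative_reduction(original["eer"], proposed["eer"])
dcf_gain = relative_reduction(original["min_dcf"], proposed["min_dcf"])
print(f"EER: {eer_gain:.1f}%  Min-DCF: {dcf_gain:.1f}%")  # EER: 6.7%  Min-DCF: 5.3%
```

Under this pairing the figures agree with the reported 6.7% and 5.3%, which suggests the comparison was indeed made between the best configuration of each method.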

However, it is still possible for some useful speaker-dependent information to be removed from a speaker model by equation (3). For sessions that do not exist in the development data set, the performance will not show improvement.

References

Auckenthaler, R., Carey, M. and Thomas, H.L. (2000) Score normalization for text-independent speaker verification system, Digital Signal Processing, Vol. 10, pp.1–16.
Campbell, W.M., Sturim, D.E., Reynolds, D.A. and Solomonoff, A. (2006) SVM based speaker verification using a GMM supervector kernel and NAP variability compensation, ICASSP, Vol. 1, pp.97–100.
Dempster, A., Laird, N. and Rubin, D. (1977) Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc., Vol. 39, pp.1–38.
Deng, J., Zheng, T.F. and Wu, W.H. (2007) Session variability subspace projection based model compensation for speaker verification, ICASSP, Vol. IV, pp.57–60.
Furui, S. (1981) Cepstral analysis technique for automatic speaker verification, IEEE Trans. Acoust. Speech Signal Processing, Vol. 29, pp.254–272.
Hermansky, H. and Morgan, N. (1994) RASTA processing of speech, IEEE Transactions on Speech and Audio Processing, Vol. 2, pp.578–589.
Kenny, P., Boulianne, G. and Dumouchel, P. (2005) Eigenvoice modeling with sparse training data, IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 3, pp.345–354.
Kenny, P., Dehak, N., Dehak, R., Gupta, V. and Dumouchel, P. (2008) The role of speaker factors in the NIST extended data task, Proceedings of Odyssey 2008: The Speaker and Language Recognition Workshop.
Li, K.P. and Porter, J.E. (1988) Normalizations and selection of speech segments for speaker recognition scoring, ICASSP, Vol. 1, pp.595–598.
National Institute of Standards and Technology (2004, 2005, 2006) NIST Speech Group Website, http://www.nist.gov/speech
Pelecanos, J. and Sridharan, S. (2001) Feature warping for robust speaker verification, Odyssey, pp.213–218.
Reynolds, D.A. (1996) The effect of handset variability on speaker recognition performance: experiments on the switchboard corpus, ICASSP, Vol. 1, pp.113–116.
Reynolds, D.A. (2003) Channel robust speaker verification via feature mapping, ICASSP, Vol. 2, pp.53–56.
Reynolds, D.A., Quatieri, T.F. and Dunn, R. (2000) Speaker verification using adapted Gaussian mixture models, Digital Signal Processing, Vol. 10, No. 1, pp.19–41.
Teunen, R., Shahshahani, B. and Heck, L.P. (2000) A model based transformational approach to robust speaker recognition, ICSLP, Vol. 2, pp.495–498.
Vogt, R. and Sridharan, S. (2006) Experiments in session variability modeling for speaker verification, ICASSP, Vol. 1, pp.897–900.
Wu, W., Zheng, T.F., Xu, M-X. and Soong, F. (2007) A cohort-based speaker model synthesis for mismatched channels in speaker verification, IEEE Trans. on Audio, Speech and Language Processing, Vol. 15, No. 6, pp.1893–1903.