Discriminative Scoring for Speaker Recognition Based on I-vectors

Jun Wang, Dong Wang, Ziwei Zhu, Thomas Fang Zheng and Frank Soong
Center for Speaker and Language Technologies (CSLT), Tsinghua University, Beijing 100084, P.R. China
E-mail: wangjun stock@hotmail.com; wangdong99@mails.tsinghua.edu.cn; zhuziwei1@outlook.com
Corresponding author: Thomas Fang Zheng, E-mail: fzheng@tsinghua.edu.cn
Microsoft Research Asia, Beijing 100084, P.R. China; E-mail: frankkps@microsoft.com

Abstract

The popular i-vector approach to speaker recognition represents a speech segment as an i-vector in a low-dimensional space. It is well known that i-vectors involve both speaker and session variances, and therefore additional discriminative approaches are required to extract speaker information from the total variance space. Among various methods, probabilistic linear discriminant analysis (PLDA) achieves state-of-the-art performance, partly due to its generative framework that represents speaker and session variances in a hierarchical way. A disadvantage of PLDA, however, lies in its Gaussian assumption on the prior/conditional distributions of the speaker and session variables, which does not necessarily hold in reality. This paper presents a discriminative scoring approach that models i-vector pairs with a neural network (NN), so that the posterior probability that an i-vector pair belongs to the same person is read directly from the NN output. This discriminative approach does not rely on any artificial assumptions about the data distribution and can learn speaker-related information to sufficient accuracy, provided that the network is large enough and the training data are abundant. Our experiments on the NIST SRE08 interview speech data demonstrate that the NN-based approach outperforms PLDA in the core test condition, and that combining the NN and PLDA scores leads to further gains.

I. INTRODUCTION

Joint factor analysis (JFA) has gained much success in speaker recognition. This approach assumes that the speaker variance and the session variance are derived from two independent random variables (factors) that follow standard Gaussian priors, usually in a low-dimensional subspace. The speaker representation of a speech segment is then derived by inferring the posterior probability of the speaker factor given the speech signal [1]. Recent research reveals that speaker and session variances may not be clearly separated by JFA, and that session factors inferred by JFA may still involve some speaker information. A better approach is to represent speaker and session variances with a single total-variance factor, so that more speaker information is retained in the posterior inference. Under this approach, a speech segment is represented by an i-vector, which corresponds to the mean of the inferred posterior distribution of the total-variance factor. This is widely known as the total variance model, or i-vector model [2].

Involving both speaker and session variances is a particular advantage of the i-vector model, because more speaker-related information is retained; at the same time, it is also an obvious disadvantage: the mixed representation leads to less discrimination among speakers. It is therefore important to employ discriminative approaches to suppress session variance and accentuate speaker variance.
For example, the within-class covariance normalization (WCCN) technique employs a linear transform derived by optimizing generalized linear kernels [3], and the nuisance attribute projection (NAP) seeks a projection that minimizes the discrepancy between signal pairs recorded over different channels [4]. Although originally proposed for the SVM-based framework, these approaches have been demonstrated to be very effective in i-vector systems as post-processing that enhances speaker discrimination [5]. Another approach that remarkably improves the representative power of i-vectors is probabilistic linear discriminant analysis (PLDA) [6]. On the one hand, PLDA is a probabilistic version of LDA and so inherits LDA's discriminative nature; on the other hand, PLDA is a generative model that places a prior on the underlying class variable, and so can handle classes with very limited or even no training data. This is a big advantage of PLDA in speaker recognition, since in most situations only very few utterances are available for enrollment and test; a generative model like PLDA is thus much more reasonable than a discriminative model such as the support vector machine (SVM), because the latter usually requires a fair number of utterances to train one-vs-all classifiers.

In spite of the great success achieved by PLDA, the model still has some limitations. In particular, it assumes Gaussian forms for the prior distributions of speaker factors as well as for the class-conditional distributions of i-vectors, and it is not directly optimized with respect to the recognition task, i.e., the true/imposter speaker decision. We therefore seek a discriminative model that relaxes the Gaussian assumption and predicts the posteriors of true/imposter speakers directly. A possible approach is to design an SVM that conducts the prediction based on i-vector pairs, but our experiments showed this to be highly impractical, due to the large number of i-vector combinations and the high dimensionality of i-vectors. In this paper, we propose a discriminative approach that conducts the true/imposter decision with a neural network (NN). Different from the naive method that builds discriminative models on raw i-vector pairs, we first extract discriminative features and then construct the NN on these features. This simple approach is demonstrated to be very effective and achieves better performance than PLDA in the NIST SRE08 core test.

The rest of the paper is organized as follows: Section II summarizes the i-vector and PLDA techniques, Section III presents the NN-based scoring approach, and Section IV presents the experiments. The paper is concluded in Section V.

II. THEORY BACKGROUND

A. i-vector model

The conventional approach to speaker recognition is based on the universal background model-Gaussian mixture model (UBM-GMM) architecture. The i-vector approach is an extension of the UBM-GMM approach and assumes that both speaker and session variances of a speech segment concentrate on a low-dimensional subspace of the model supervector (the concatenation of the mean vectors of all the GMM components). This subspace is referred to as the total-variance space, and a speech segment can be represented by an identity vector (i-vector) in this space.

Mathematically, let the UBM involve $C$ Gaussian components, and let the acoustic features of a speech segment associated with the $c$-th component follow a Gaussian distribution with mean $M_c$ and covariance $\Sigma_c$. The i-vector model assumes that $M_c$ is generated from a low-dimensional variable $w \in \mathbb{R}^M$, normally distributed in the i-vector space, via a linear transformation $T_c$:

$$M_c = m_c + T_c w \qquad (1)$$

where $m_c \in \mathbb{R}^D$ is the mean of the $c$-th component of the UBM, $T_c \in \mathbb{R}^{D \times M}$ is the loading matrix associated with the $c$-th component, and $w$ follows the standard normal distribution $\mathcal{N}(0, I)$. The loading matrices $\{T_c\}$ can be trained by an EM procedure [7]. Once $\{T_c\}$ has been obtained, a speech segment $X$ can be represented by the posterior probability $p(w|X)$, which can be inferred according to (1). Specifically, since the prior $p(w)$ is Gaussian, the posterior $p(w|X)$ is Gaussian as well:

$$p(w|X) = \mathcal{N}(\bar{w}, \Xi) \qquad (2)$$

where the mean vector $\bar{w}$ and covariance matrix $\Xi$ can be computed from the zero- and first-order statistics of $X$. Details of the derivation can be found in [2]. In speaker recognition, the mean vector $\bar{w}$ is taken as the identity vector (i-vector) of the speech segment, and the true/imposter decision is based on the distance (cosine distance is a common choice) between the i-vectors of the test speech and the enrollment speech. Note that an i-vector involves both speaker and non-speaker (e.g., channel, content, emotion) information. In order to improve the discriminative capability for speakers, transforms such as WCCN [3], NAP [4] and LDA [8] are usually applied before computing the distance.
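For concreteness, equations (1)-(2) admit a closed-form posterior: with zero-order statistics $N_c$ and UBM-mean-centered first-order statistics $F_c$, the posterior precision is $I + \sum_c N_c T_c^T \Sigma_c^{-1} T_c$ and the i-vector is $\bar{w} = \Xi \sum_c T_c^T \Sigma_c^{-1} F_c$ (see [2], [7]). Below is a minimal numpy sketch of this computation; the random statistics and all variable names are our stand-ins for illustration, not from the paper.

```python
import numpy as np

def extract_ivector(N, F, T, Sigma):
    """Posterior of the total-variance factor w given a segment.

    N     : (C,)      zero-order statistics (per-component occupancies)
    F     : (C, D)    first-order statistics, centered on the UBM means
    T     : (C, D, M) per-component loading matrices T_c
    Sigma : (C, D)    diagonal UBM covariances
    Returns the posterior mean (the i-vector) and covariance Xi of eq. (2).
    """
    C, D, M = T.shape
    L = np.eye(M)                       # precision: I + sum_c N_c T_c' Sigma_c^-1 T_c
    b = np.zeros(M)                     # sum_c T_c' Sigma_c^-1 F_c
    for c in range(C):
        SinvT = T[c] / Sigma[c][:, None]   # Sigma_c^{-1} T_c (diagonal covariance)
        L += N[c] * (T[c].T @ SinvT)       # accumulate posterior precision
        b += SinvT.T @ F[c]                # accumulate projected statistics
    Xi = np.linalg.inv(L)                  # posterior covariance
    return Xi @ b, Xi                      # posterior mean = the i-vector

# Toy usage with random statistics (C=4 components, D=6 features, M=3 dims).
rng = np.random.default_rng(0)
C, D, M = 4, 6, 3
w, Xi = extract_ivector(rng.uniform(1, 10, C), rng.normal(size=(C, D)),
                        rng.normal(size=(C, D, M)), rng.uniform(0.5, 2, (C, D)))
print(w.shape, Xi.shape)   # (3,) (3, 3)
```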

B. PLDA

It is well known that linear discriminant analysis (LDA) corresponds to a generative model given by:

$$w_{i,j} = w_i + Au$$

where $w_{i,j}$ is an observation vector (an i-vector in speaker recognition) of the $i$-th class, $w_i$ is the mean vector of the class, and $u$ follows a Gaussian distribution $u \sim \mathcal{N}(0, I)$. This formulation implies that LDA assumes the class-conditional distributions $p(w_{i,j}|w_i)$ to be Gaussians sharing the same covariance. Ioffe extended this model by placing a Gaussian prior on $w_i$, which leads to the hierarchical Bayesian model shown in Figure 1.

Fig. 1. The graphical model of PLDA, where N is the number of samples of class i.

By this extension, the class mean $w_i$ is treated as a continuous variable instead of a discrete parameter as in traditional LDA. This significantly improves model generalizability, so that classes with very few samples can be well represented thanks to the prior [6]. This is particularly attractive for speaker recognition, where in most cases only a few enrollment/test utterances (and hence i-vectors) are available per speaker. Numerous studies have reported that PLDA significantly improves the performance of i-vector systems and achieves state-of-the-art performance [5].
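The paper uses PLDA's likelihood ratio as a baseline score (Section IV) without reproducing its formula. Under the commonly used two-covariance view of Gaussian PLDA, the same/different-speaker likelihood ratio for a pair of centered i-vectors has a simple closed form; the sketch below illustrates that view with random, hypothetical covariances, and is not the paper's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(w1, w2, Sigma_ac, Sigma_wc):
    """Log-likelihood ratio that centered i-vectors w1, w2 share a speaker,
    under the two-covariance view: across-class covariance Sigma_ac models
    speaker variability, within-class covariance Sigma_wc models sessions."""
    St = Sigma_ac + Sigma_wc                 # total covariance of one i-vector
    x = np.concatenate([w1, w2])
    cov_same = np.block([[St, Sigma_ac],     # correlated if same speaker
                         [Sigma_ac, St]])
    cov_diff = np.block([[St, np.zeros_like(St)],
                         [np.zeros_like(St), St]])
    return (multivariate_normal.logpdf(x, cov=cov_same)
            - multivariate_normal.logpdf(x, cov=cov_diff))

# Toy usage with random positive-definite covariances in 5 dimensions.
rng = np.random.default_rng(1)
M = 5
A, B = rng.normal(size=(M, M)), rng.normal(size=(M, M))
Sigma_ac, Sigma_wc = A @ A.T, B @ B.T + np.eye(M)
w = rng.normal(size=M)
print(plda_llr(w, w + 0.1 * rng.normal(size=M), Sigma_ac, Sigma_wc))
```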
III. NN-BASED DISCRIMINATIVE MODEL

A. Concern for PLDA

In spite of the success of PLDA in speaker recognition, the model possesses some limitations, particularly the underlying Gaussian assumption on the prior and class-conditional distributions. There is little justification for this assumption other than computational tractability in model training and inference. The assumption can be relaxed to some extent by replacing the Gaussians with Gaussian mixtures, as mentioned in [9]; however, this greatly increases model complexity, and its effectiveness has not yet been demonstrated in speaker recognition. Another concern for PLDA is the generative modeling itself: the optimization objective is to fit the data. Although the fitting takes class discrimination into account, it remains suboptimal with respect to the recognition task, i.e., the true/imposter decision. A desirable model should be discriminative in nature, with the true/imposter decision error rate as its objective function.

A simple discriminative approach designs a one-vs-all classifier for each class, as in conventional SVM-based systems, and this can be readily migrated to i-vector systems [10]. This approach, however, requires building many classifiers and suffers from data sparsity. An ideal approach builds a single classifier that can make decisions for all speakers, as PLDA does. The most straightforward way is to collect a number of i-vector pairs, label them as same/different speakers, and train a discriminative model to predict the posterior probability that a pair of i-vectors belongs to the same speaker. According to our experiments, this approach is promising with a small set of enrollment speakers; however, as the number of speakers increases, performance decreases drastically, suggesting that it is hard for the discriminative model to learn discriminative patterns from raw i-vectors.

B. NN-based scoring

We therefore turn to an alternative way of building the discriminative model. First, a number of i-vector pairs $\{(v_{i,1}, v_{i,2})\}$ are collected and labeled as positive (+1) or negative (-1) samples according to whether $v_{i,1}$ and $v_{i,2}$ belong to the same speaker, leading to a training set $\{(v_{i,1}, v_{i,2}; l_i)\}$, where $l_i$ is the label of the $i$-th pair. In order to obtain the most discriminative information while keeping the feature set compact, LDA is applied to project the i-vector pairs onto a low-dimensional subspace, resulting in a projected training set $\{(v'_{i,1}, v'_{i,2}; l_i)\}$, where $v'$ is the image of $v$ under the LDA projection. A number of discriminative features are then extracted, leading to a ready-to-use training set $\{(f_i; l_i)\}$, where $f_i$ is the feature set derived from the pair $(v'_{i,1}, v'_{i,2})$. In this work, we use a simple feature set: the squared differences on the first $n$ dimensions of the two projected vectors, i.e., $f_i(j) = (v'_{i,1}(j) - v'_{i,2}(j))^2$ for $j = 0, \dots, n-1$. Note that $v'_{i,1}$ and $v'_{i,2}$ lie in the LDA projection space, and hence their first $n$ dimensions are assumed to retain the most discriminative information among all the dimensions. In addition, considering the success of the cosine distance in traditional i-vector systems, it is also taken as a feature. In summary, the feature set involves $n+1$ elements:

$$f_i = \left[ f_i(0), f_i(1), \dots, f_i(n-1), \frac{\langle v_{i,1}, v_{i,2} \rangle}{\|v_{i,1}\|\,\|v_{i,2}\|} \right]$$

where $f_i(j) = (v'_{i,1}(j) - v'_{i,2}(j))^2$.
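A minimal sketch of this feature extraction, using scikit-learn's LDA for the projection. The toy data, dimensionalities, and the choice to compute the cosine on the unprojected vectors are our assumptions for illustration; the paper is not explicit on the latter point.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def pair_features(v1, v2, lda, n=10):
    """n+1 pair features: squared differences of the first n LDA-projected
    dimensions, plus the cosine similarity of the raw i-vectors."""
    p1 = lda.transform(v1.reshape(1, -1))[0]
    p2 = lda.transform(v2.reshape(1, -1))[0]
    sq = (p1[:n] - p2[:n]) ** 2                          # f_i(j), j = 0..n-1
    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.append(sq, cos)

# Toy usage: fit LDA on random stand-in "i-vectors" with speaker labels.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 40))      # stand-ins for 400-dim i-vectors
y = np.arange(200) % 20             # 20 "speakers", 10 segments each
lda = LinearDiscriminantAnalysis(n_components=15).fit(X, y)
print(pair_features(X[0], X[1], lda).shape)   # (11,)
```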

With the training data, a discriminative model can be constructed and optimized with respect to the true/imposter decision error rate. The NN is chosen as the discriminative model in this study, although any discriminative model (e.g., SVM) would do. The entire system is shown in Figure 2, where $s_{nn}$ is the posterior probability that the input i-vector pair represents the same speaker.

Fig. 2. Architecture of the NN-based scoring.

It should be noted that training the NN model requires balanced positive and negative samples, so the output of the model is a class posterior based on equal priors, i.e., genuine speakers and imposters are weighted equally. The model therefore cannot be used to make decisions directly: a threshold on the posterior $s_{nn}$ needs to be determined on a development set to achieve the best performance in terms of the evaluation metric, which is the equal error rate (EER) in this study. From this perspective, the NN-based approach is a scoring approach that extends the commonly used cosine-distance scoring; in fact, if the feature set involves only the cosine distance, this approach reduces to cosine scoring.

C. PLDA-NN combination

The advantage of the NN-based approach over PLDA lies in the fact that it relaxes PLDA's Gaussian assumption. This advantage, however, holds only when the training data are abundant enough to ensure reliable learning of the discriminative boundary. This condition is not always satisfied, and data sparsity is a perennial challenge for speaker recognition. In areas of the i-vector pair space where little or no training data are available, the NN approach is expected to be inferior to the PLDA approach, thanks to the Gaussian prior assumption of the latter.

It is therefore natural to combine the two approaches and leverage their respective advantages. In this paper we take a simple score-averaging approach that combines the posterior probability from the NN ($s_{nn}$) and the likelihood ratio from PLDA ($s_{plda}$) by linear interpolation, and uses the combined score to make decisions:

$$s_{cmb} = \alpha s_{nn} + (1 - \alpha) s_{plda} \qquad (3)$$

where $\alpha$ is a tunable parameter determined on a development set.
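A minimal sketch of the interpolation in (3), together with a rough EER routine for tuning $\alpha$ on a development set. The grid search and the synthetic scores are our illustration, not the paper's tuning procedure.

```python
import numpy as np

def eer(scores, labels):
    """Rough equal error rate: sweep a threshold over the sorted scores."""
    order = np.argsort(scores)[::-1]
    labels = np.asarray(labels)[order]
    fa = np.cumsum(labels == 0) / max((labels == 0).sum(), 1)   # false accepts
    fr = 1 - np.cumsum(labels == 1) / max((labels == 1).sum(), 1)  # false rejects
    i = np.argmin(np.abs(fa - fr))
    return (fa[i] + fr[i]) / 2

def fuse(s_nn, s_plda, labels_dev, alphas=np.linspace(0, 1, 101)):
    """Pick alpha minimizing dev-set EER, then return the fused score
    s_cmb = alpha * s_nn + (1 - alpha) * s_plda, as in eq. (3)."""
    best = min(alphas, key=lambda a: eer(a * s_nn + (1 - a) * s_plda, labels_dev))
    return best, best * s_nn + (1 - best) * s_plda

# Toy usage with synthetic, correlated scores for 500 dev trials.
rng = np.random.default_rng(3)
lab = rng.integers(0, 2, 500)
s_nn = lab + rng.normal(scale=0.8, size=500)
s_plda = lab + rng.normal(scale=0.9, size=500)
alpha, s_cmb = fuse(s_nn, s_plda, lab)
print(round(alpha, 2), round(eer(s_cmb, lab), 3))
```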
IV. EXPERIMENTS

A. Databases

We conduct the experiments on the interview data of the NIST 2008 speaker recognition evaluation (SRE08). All the data are recordings of female speakers, and each enrollment or test speech segment consists of 2 minutes of speech. The test is composed of three conditions, shown in Table I: condition 3 is the full trial set, while conditions 1 and 2 consider trials with the same and different microphone types in enrollment and test, respectively.

TABLE I
EVALUATION CONDITIONS

Condition | Trials | Description
c1        |    957 | enrollment & test use the same mic types
c2        |  17941 | enrollment & test use different mic types
c3        |  18898 | mic types in enrollment & test not considered

The i-vector system (including the parameters of the UBM and the T matrix) was trained with speech recordings of 7196 female speakers (12837 utterances in total) selected from the Fisher telephone speech database. The same database was also used to train the LDA and PLDA models.

B. Experimental setup

All the speech data used in this study are sampled at 8 kHz with a sample precision of 16 bits. The acoustic features are 19-dimensional Mel-frequency cepstral coefficients (MFCCs) together with the log energy. The first- and second-order derivatives are appended to the static features, resulting in 60-dimensional feature vectors. The UBM involves 2048 Gaussian components and was trained with about 4000 female utterances selected randomly from the Fisher database. The T matrix of the i-vector system was trained with all the female utterances in the Fisher database, and the dimension of the i-vectors is 400. The LDA and PLDA models were trained with utterances of 7196 female speakers, again selected randomly from the Fisher database. The dimension of the LDA projection space is set to 150.

In order to train the NN model, we selected 32500 pairs of i-vectors extracted from speech segments randomly selected from the Fisher database. As mentioned, the discriminative features are based on the first n dimensions of the LDA-projected i-vectors. To determine an appropriate n, we selected 100 speakers from the SRE08 database as a cross-validation (CV) dataset, which consists of about 3000 trials. A number of NN structures were tested, and the best structure was selected based on performance on the CV set. The optimal structure we found involves 2 hidden layers, each containing 200 units.

Table II presents the performance of the i-vector baseline, i-vector plus LDA, and i-vector plus PLDA on the three evaluation conditions. Both the LDA and PLDA systems clearly outperform the i-vector baseline, and the PLDA system obtains the best overall performance (condition 3), confirming the power of this model.

TABLE II
PERFORMANCE OF BASELINE SYSTEMS (EER)

Condition | i-vector | i-vector + LDA | i-vector + PLDA
c1        |    4.05% |          1.56% |           2.18%
c2        |   28.50% |         23.50% |          19.70%
c3        |   28.63% |         23.35% |          19.50%
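As a sketch of the selected structure described above (2 hidden layers of 200 units), the pair classifier could be trained as below with scikit-learn. The stand-in features, optimizer, and iteration count are our assumptions; the paper does not specify the training recipe.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
# Stand-ins for the n+1 = 11 pair features and same/different-speaker labels;
# a real system would use features from LDA-projected i-vector pairs.
X = rng.normal(size=(5000, 11))
y = rng.integers(0, 2, size=5000)            # balanced +/- pairs assumed

nn = MLPClassifier(hidden_layer_sizes=(200, 200),  # 2 hidden layers, 200 units
                   max_iter=50)                    # optimizer settings are guesses
nn.fit(X, y)
s_nn = nn.predict_proba(X[:3])[:, 1]         # posterior that a pair is same-speaker
print(s_nn.round(3))
```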

C. Discriminative feature selection

The first experiment optimizes the selection of discriminative features for the NN. We choose the first n dimensions of the LDA-projected i-vectors to extract the discriminative features, build the NN on them, and test performance on the CV dataset. Figure 3 shows the EER results for different n. It can be seen that n = 10 is a good trade-off: a smaller n loses speaker information, while a larger n over-fits to non-speaker variance.

Fig. 3. Performance of the NN-based scoring with various numbers of discriminative features. Results are reported in terms of EER on the CV set.

To investigate the generalizability of the feature selection, the NNs built with different n are tested on the evaluation dataset, leading to the results illustrated in Figure 4. The curves on the evaluation set show patterns similar to those on the CV set, although the optimal choices of n are not exactly the same. This suggests that the feature selection based on the CV set generalizes well.

Fig. 4. Performance of the NN-based scoring with various numbers of discriminative features. Results are reported in terms of EER on the three evaluation conditions.

D. NN-based scoring

Based on the selected discriminative features, i.e., n = 10, the NN-based system was constructed. The EER results on the three evaluation conditions are presented in Figure 5. The NN-based approach clearly outperforms the three baselines on all three conditions.

Fig. 5. Performance of the NN-based system compared with the three baseline systems. Results are reported in terms of EER on the three evaluation conditions.

To confirm this observation, pairwise t-tests were conducted to compute the significance level (p-value) among the three competitive models: LDA, PLDA and NN. The results are shown in Table III. Note that the dataset in condition 1 is too small to compute a reliable p, so only results for conditions 2 and 3 are reported. Both PLDA and NN outperform LDA very significantly, whereas the NN system outperforms the PLDA system with weak significance.

TABLE III
PAIRWISE T-TESTS (p-VALUES)

Condition | LDA vs NN | PLDA vs NN | LDA vs PLDA
c2        |  1.53e-07 |      0.015 |    2.13e-07
c3        |  2.15e-07 |      0.040 |    1.59e-07

E. Combining NN and PLDA

Figure 6 presents the performance of the combined approach for various α (cf. (3)). The combined approach indeed provides better performance with an appropriate setting of α.

Fig. 6. Performance of the PLDA-NN combination system on the three evaluation conditions.

V. CONCLUSIONS

This paper presented an NN-based scoring approach for i-vector speaker recognition systems. We argue that by relaxing the Gaussian assumption in PLDA and optimizing the model directly with respect to the decision task, the NN-based approach can achieve better performance than PLDA in situations where training samples are abundant. Furthermore, the NN and PLDA approaches are complementary, and so can be combined to obtain further gains. These conjectures are confirmed by experiments on the SRE08 interview data. We acknowledge that this study is preliminary; in particular, the discriminative features used here are rather simple and the combination approach is rather naive. Better feature selection and combination methods may significantly improve the NN-based approach, which we leave as future work.
VI. ACKNOWLEDGEMENTS

This work was supported by the National Basic Research Program (973 Program) of China under Grant No. 2013CB329302 and by the National Science Foundation of China (NSFC) under Projects No. 61371136 and No. 61271389.

REFERENCES

[1] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435-1447, 2007.
[2] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[3] A. O. Hatch, S. S. Kajarekar, and A. Stolcke, "Within-class covariance normalization for SVM-based speaker recognition," in INTERSPEECH'06, 2006.
[4] A. Solomonoff, C. Quillen, and W. M. Campbell, "Channel compensation for SVM speaker recognition," in Proc. Odyssey: The Speaker and Language Recognition Workshop, 2004, pp. 57-62.
[5] C. S. Greenberg, V. M. Stanford, A. F. Martin, M. Yadagiri, G. R. Doddington, J. J. Godfrey, and J. Hernandez-Cordero, "The 2012 NIST speaker recognition evaluation," 2013.
[6] S. Ioffe, "Probabilistic linear discriminant analysis," in ECCV 2006, 2006, pp. 531-542.
[7] P. Kenny, G. Boulianne, and P. Dumouchel, "Eigenvoice modeling with sparse training data," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 345-354, 2005.
[8] M. McLaren and D. van Leeuwen, "Source-normalised-and-weighted LDA for robust speaker recognition using i-vectors," IEEE Transactions on Audio, Speech, and Language Processing, pp. 5456-5459, 2011.
[9] S. J. Prince and J. H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in ICCV'07. IEEE, 2007, pp. 1-8.
[10] N. Dehak, R. Dehak, P. Kenny, P. Ouellet, and P. Dumouchel, "Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification," in International Conference on Spoken Language Processing (ICSLP). IEEE, 2009, pp. 1559-1562.