SPEAKER VERIFICATION WITH A PRIORI THRESHOLD DETERMINATION USING KERNEL-BASED PROBABILISTIC NEURAL NETWORKS


Kwok-Kwong Yiu and Man-Wai Mak
Center for Multimedia Signal Processing, Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University, China

Sun-Yuan Kung
Dept. of Electrical Engineering, Princeton University, USA

ABSTRACT

This paper compares kernel-based probabilistic neural networks for speaker verification. Experimental evaluations based on 138 speakers of the YOHO corpus were conducted using probabilistic decision-based neural networks (PDBNNs), Gaussian mixture models (GMMs) and elliptical basis function networks (EBFNs) as speaker models. The original PDBNN training algorithm was also modified to make PDBNNs appropriate for speaker verification. Results show that the equal error rate obtained by PDBNNs and GMMs is about half of that of EBFNs (1.1 vs. 2.73), suggesting that GMM- and PDBNN-based speaker models outperform the EBFN one. This work also finds that the globally supervised learning of PDBNNs is able to find a set of decision thresholds that reduce the variation in the false acceptance rate (FAR), whereas the ad hoc approach used by the EBFNs and GMMs is not able to do so. This property makes the performance of PDBNN-based systems more predictable.

1. INTRODUCTION

In speaker verification systems, each registered speaker is assigned a speaker-dependent model characterizing his or her own voice. Typically, each of these models is trained to estimate the likelihood of the corresponding speaker given an utterance. Gaussian mixture models (GMMs) [1] and elliptical basis function networks (EBFNs) [2] have been widely used as speaker models because of their capability to model arbitrary density functions. However, GMMs and EBFNs do not provide a proper mechanism for setting decision thresholds, which makes the verification systems vulnerable to impostor attacks. Therefore, a more advanced speaker model is needed. Probabilistic decision-based neural networks (PDBNNs), proposed by Lin et al.
[3], can be considered as a special form of GMMs with trainable decision thresholds. PDBNNs were used to implement a hierarchical face recognition system in [3] with excellent results (97.75% recognition, 2.25% false rejection, 0% misclassification and 0% false acceptance). The characteristics of PDBNNs' decision boundaries have been investigated in our previous study [4], where the strengths of PDBNNs are highlighted by comparing the recognition accuracy and decision boundaries of PDBNNs against those of GMMs. We have also demonstrated in [4] that the thresholding mechanism of PDBNNs is very effective in detecting data not belonging to any known classes. In light of this finding, this paper applies PDBNNs to speaker verification in an attempt to improve the robustness of speaker verification systems against intruder attacks.

This project was supported by the Hong Kong Polytechnic University Grant No. G-W76 and RGC Project No. Poly 512/1E. S. Y. Kung was also a Distinguished Chair Professor of The Hong Kong Polytechnic University.

2. PROBABILISTIC DECISION-BASED NEURAL NETWORKS

Probabilistic decision-based neural networks (PDBNNs) are a probabilistic variant of their predecessor, DBNNs [5], for robust pattern classification. PDBNNs employ a modular network structure: a PDBNN is composed of a number of small subnetworks, each representing one class. Each class follows a probabilistic constraint, and the likelihood function for each class is a mixture of Gaussian distributions. The subnet discriminant functions of a PDBNN are designed to model log-likelihood functions of the form

\[ \phi(\mathbf{x}, \mathbf{w}_i) = \log p(\mathbf{x} \mid \omega_i) - T_i = \log \Big[ \sum_{r=1}^{R} P(\Theta_{r|i} \mid \omega_i)\, p(\mathbf{x} \mid \omega_i, \Theta_{r|i}) \Big] - T_i, \qquad (1) \]

where $\Theta_{r|i} = \{\boldsymbol{\mu}_{r|i}, \boldsymbol{\Sigma}_{r|i}\}$ represents the parameters of the $r$-th mixture component, $R$ is the total number of mixture components, $p(\mathbf{x} \mid \omega_i, \Theta_{r|i})$ is the probability density function of the $r$-th component, $P(\Theta_{r|i} \mid \omega_i)$ is the prior probability (also called the mixture coefficient) of the $r$-th component, and $T_i$ is the decision

threshold of the $i$-th subnet. Each component density $p(\mathbf{x} \mid \omega_i, \Theta_{r|i})$ is a Gaussian distribution with mean $\boldsymbol{\mu}_{r|i}$ and covariance $\boldsymbol{\Sigma}_{r|i}$.

Learning in PDBNNs is divided into two phases: locally unsupervised (LU) and globally supervised (GS). In the LU learning phase, each subnet is trained independently, and no mutual information across the classes is used. Specifically, PDBNNs adopt the expectation-maximization (EM) algorithm [6] to maximize the log-likelihood function

\[ l(\mathbf{w}_i; \mathcal{X}_i) = \sum_{n=1}^{N} \log p(\mathbf{x}_n \mid \omega_i) \qquad (2) \]

with respect to the parameters $\boldsymbol{\mu}_{r|i}$, $\boldsymbol{\Sigma}_{r|i}$ and $P(\Theta_{r|i} \mid \omega_i)$, where $\mathcal{X}_i = \{\mathbf{x}_n;\ n = 1, 2, \ldots, N\}$ denotes the set of independent and identically distributed training patterns from class $\omega_i$.

In the globally supervised (GS) training phase, target values are utilized to fine-tune the decision boundaries. The network weights are updated whenever misclassification occurs. Specifically, reinforced learning is applied to the subnet corresponding to the correct class so that the weight vector is updated in a direction parallel to the gradient of $\phi$, whereas anti-reinforced learning is applied to the (unduly) winning subnet to move the weight vector along the opposite direction:

\[ \mathbf{w}_i^{(j+1)} = \mathbf{w}_i^{(j)} + \eta\, \nabla \phi(\mathbf{x}, \mathbf{w}_i) \quad \text{(reinforced)}, \qquad \mathbf{w}_k^{(j+1)} = \mathbf{w}_k^{(j)} - \eta\, \nabla \phi(\mathbf{x}, \mathbf{w}_k) \quad \text{(anti-reinforced)}, \]

where $\omega_i$ is the correct class, $\omega_k$ is the class of the winning subnet, $\eta$ is a positive learning rate and $j$ indexes the updates. This has the effect of increasing the chance of classifying the same pattern correctly in the future.

3. APPLICATIONS TO SPEAKER VERIFICATION

3.1. Enrollment Procedures

Each registered speaker was assigned a personalized network (GMM, EBFN or PDBNN), which was trained to recognize the speech derived from two classes: the speaker class and the anti-speaker class. To this end, two groups of kernel functions (one representing the speaker himself/herself and the other representing the speakers in the anti-speaker class) were assigned to each network. We denote the group corresponding to the speaker class as the speaker kernels and the one corresponding to the anti-speaker class as the anti-speaker kernels. For each registered speaker, a unique anti-speaker set containing a predefined number of anti-speakers was created. This set was used to create the anti-speaker kernels.
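To make the role of a kernel group concrete, the following is a minimal sketch of how the mixture log-likelihood inside Eqn. (1) could be evaluated for one group of diagonal-covariance Gaussian kernels (speaker kernels or anti-speaker kernels). The function names are illustrative and do not come from the paper, and the log-sum-exp evaluation is an implementation choice of this sketch rather than something the text prescribes.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance, evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def mixture_log_likelihood(x, weights, means, variances):
    """log sum_r P_r * N(x; mu_r, Sigma_r) for one group of kernels.

    The sum is evaluated with the log-sum-exp trick (sketch assumption)
    so that very small component densities do not underflow.
    """
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def subnet_discriminant(x, weights, means, variances, threshold):
    """Eqn. (1): mixture log-likelihood minus the subnet's decision threshold."""
    return mixture_log_likelihood(x, weights, means, variances) - threshold
```

Calling `mixture_log_likelihood` once with the speaker kernels and once with the anti-speaker kernels gives the two per-vector log-likelihoods that a speaker model needs for a given feature vector.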
The anti-speaker kernels enable us to incorporate the idea of score normalization [7] in the training procedures, which enhances the networks' capability in discriminating the true speakers from the impostors.

Each of the GMMs and PDBNNs is composed of 12 inputs (12th-order cepstral coefficients were used as features), a predefined number of kernels, and one output. The EBFNs, on the other hand, contain 12 inputs, a predefined number of kernels, and 2 outputs, with each output representing one class (speaker class or anti-speaker class). All of the covariance matrices in the kernels are diagonal.

We applied the K-means algorithm to initialize the positions of the speaker kernels. Then, the kernels' covariance matrices were initialized by the K-nearest neighbors algorithm: all off-diagonal elements were zero, and the diagonal elements (being equal) of each matrix were initialized to the average Euclidean distance between the corresponding center and its K-nearest centers. The EM algorithm was subsequently used to fine-tune the mean vectors, covariance matrices, and mixture coefficients. The same procedure was also applied to determine the mean vectors and covariance matrices of the anti-speaker kernels, using the speech data derived from the anti-speaker set.

The enrollment process for constructing a PDBNN-based speaker model involves two phases: locally unsupervised (LU) training and globally supervised (GS) training. The LU training phase is identical to the GMM training described above. In the GS training phase, the speaker's enrollment utterances and the utterances from all enrollment sessions of the anti-speakers were used to determine a decision threshold (see Section 3.3 below).

For the EBFN-based speaker models, the speaker kernels and anti-speaker kernels obtained from the GMM training described above were combined to form a hidden layer. Finally, singular value decomposition was applied to determine the output weights.
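The kernel initialization described above (K-means for the centers, then a spread taken from the nearest centers) can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the fixed iteration count, the random seeding and the default of two nearest centers are all choices of the sketch, since the text does not report these settings. The EM fine-tuning step is not shown.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means; returns k centers (lists of floats)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: euclidean(p, centers[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                dim = len(members[0])
                centers[j] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return centers


def initial_spreads(centers, k_nearest=2):
    """Average distance from each center to its k_nearest closest centers.

    The returned value is used for every diagonal element of that kernel's
    covariance matrix; off-diagonal elements stay zero.
    """
    if len(centers) < 2:
        raise ValueError("at least two centers are needed")
    spreads = []
    for i, c in enumerate(centers):
        dists = sorted(euclidean(c, o) for j, o in enumerate(centers) if j != i)
        nearest = dists[:k_nearest]
        spreads.append(sum(nearest) / len(nearest))
    return spreads
```

The same two calls would be made twice per speaker model, once on the speaker's enrollment data and once on the data of the anti-speaker set.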
Details of the enrollment procedure for EBFNs can be found in [2].

3.2. Verification Procedures

Verification was performed using each speaker in the YOHO corpus as a claimant, with 64 impostors being randomly selected from the remaining speakers (excluding the anti-speakers and the claimant) and rotating through all the speakers. For each claimant, the feature vectors of the claimant's utterances from his/her verification sessions in YOHO were concatenated to form a claimant sequence. Likewise, the feature vectors of the impostors' utterances were concatenated to form an impostor sequence.

For PDBNNs and GMMs, the following steps were performed during verification. The feature vectors from the claimant's speech, $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$, were divided into a number of overlapping segments containing $L$ consecutive vectors as shown below.

1st segment: $\mathcal{T}_1 = \{\mathbf{x}_1, \ldots, \mathbf{x}_L\}$; 2nd segment: $\mathcal{T}_2 = \{\mathbf{x}_{s+1}, \ldots, \mathbf{x}_{s+L}\}$; and so on, where $L$ is the segment length and $s < L$ is the shift between successive segments.

For the $t$-th segment $\mathcal{T}_t$, the average normalized log-likelihood

\[ z_t = \frac{1}{L} \sum_{\mathbf{x} \in \mathcal{T}_t} \big\{ \phi_S(\mathbf{x}) - \phi_A(\mathbf{x}) \big\} \qquad (3) \]

of the PDBNN-based and GMM-based speaker models was computed, where $\phi_S(\mathbf{x})$ and $\phi_A(\mathbf{x})$ represent the log-likelihood function (Eqn. 2) of the speaker and anti-speakers respectively. Verification decisions were based on the following criterion:

\[ \text{If } z_t \begin{cases} > \zeta & \text{accept the claimant} \\ \le \zeta & \text{reject the claimant} \end{cases} \qquad (4) \]

where $\zeta$ is a speaker-dependent decision threshold (see Section 3.3 below for the procedure of determining $\zeta$). A verification decision was made for each segment, with the error rate (either false acceptance or false rejection) being the proportion of incorrect verification decisions to the total number of decisions. In this work, $L$ in Eqn. (3) was set to cover 7 seconds of speech, and successive segments were separated by five vector positions ($s = 5$). More specifically, the $t$-th segment contains the vectors $\mathcal{T}_t = \{\mathbf{x}_{5(t-1)+1}, \ldots, \mathbf{x}_{5(t-1)+L}\}$, where $t = 1, 2, \ldots$ Note that dividing the vector sequence into a number of segments has also been used successfully in [1], [2] for increasing the number of decisions.

For the EBFN-based speaker models, verification decisions were based on the difference between the scaled network outputs; details of the verification procedure can be found in [2].

In this work, the equal error rate (EER), at which the false acceptance rate equals the false rejection rate, was used as a performance index to compare the verification performance among different speaker models. As the speaker models remain fixed once they have been trained, EER can be used to compare the models' ability to discriminate the speaker features from the impostor features.

3.3. Determination of Decision Thresholds

The procedures for determining the decision thresholds of PDBNNs, GMMs and EBFNs are different.
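The threshold procedures operate on segment-level scores and decisions of the kind described in Section 3.2, so a minimal sketch of Eqns. (3) and (4) and of the error measures is given here as a reference point. It assumes the per-vector log-likelihoods of the speaker and anti-speaker kernels are already available as two lists; the function names and the simple threshold sweep used to approximate the EER are illustrative choices, not the authors' procedure.

```python
def segment_scores(speaker_ll, anti_ll, length, shift):
    """Average normalized log-likelihood (Eqn. 3) of each overlapping segment.

    speaker_ll[n] and anti_ll[n] are the log-likelihoods of the n-th feature
    vector under the speaker kernels and the anti-speaker kernels.
    """
    diffs = [s - a for s, a in zip(speaker_ll, anti_ll)]
    scores = []
    for start in range(0, len(diffs) - length + 1, shift):
        window = diffs[start:start + length]
        scores.append(sum(window) / length)
    return scores


def accept(score, threshold):
    """Eqn. (4): accept the claimant when the score exceeds the threshold."""
    return score > threshold


def error_rates(claimant_scores, impostor_scores, threshold):
    """Return (FAR, FRR) as fractions of the segment-level decisions."""
    far = sum(accept(z, threshold) for z in impostor_scores) / len(impostor_scores)
    frr = sum(not accept(z, threshold) for z in claimant_scores) / len(claimant_scores)
    return far, frr


def equal_error_rate(claimant_scores, impostor_scores):
    """Approximate EER: sweep candidate thresholds and take the point where
    FAR and FRR are closest (sketch assumption, no interpolation)."""
    best_gap, best_rate = None, None
    for thr in sorted(set(claimant_scores) | set(impostor_scores)):
        far, frr = error_rates(claimant_scores, impostor_scores, thr)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2.0
    return best_rate
```

Because every segment yields one decision, a short shift between segments produces many decisions from a single concatenated sequence, which is what makes the segment-level error rates usable.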
For the GMM-based and EBFN-based speaker models, the utterances from all enrollment sessions of 16 randomly selected anti-speakers were used for threshold determination. Specifically, these utterances were concatenated and the procedure described in Section 3.2 was applied. The threshold was adjusted until the false acceptance rate (FAR) fell below a predefined level. In this work, we set this level to 0.5%. The reason for using the anti-speakers' utterances rather than the speaker's utterances is that it is much easier to collect the speech of a large number of anti-speakers. Hence, the thresholds obtained are more reliable than those that would be obtained from the speaker's speech. In addition, using a predefined FAR to determine the decision thresholds enables us to predict the robustness of the verification system against impostor attacks [8].

To adapt PDBNNs to speaker verification and to determine the decision threshold of PDBNN-based speaker models, three modifications to the PDBNN training algorithm have been made.

First, the original PDBNNs use one threshold per network. In our case, however, for each speaker we use one network to model the speaker class and another one to model the anti-speaker class. To make PDBNNs applicable to speaker verification, we modified the likelihood computation such that only one threshold is required. Specifically, instead of comparing each subnet's log-likelihood against its corresponding threshold as in the original PDBNNs, we compare a normalized score against a single decision threshold as in Eqns. (3) and (4).

In the second modification, we changed the frequency at which the threshold is updated. The original PDBNN adopts batch-mode supervised learning. Our speaker verification procedure, however, adopts a segmental type of learning (see Section 3.2). Specifically, we modified the GS training to work in a segmental mode as follows.
Let $X_t$ be the $t$-th segment extracted from the speaker's speech patterns $\mathcal{X}_S$ or from the anti-speakers' speech patterns $\mathcal{X}_A$. The normalized segmental score is computed by

\[ z(X_t) = \frac{1}{L} \sum_{\mathbf{x} \in X_t} \big\{ \phi_S(\mathbf{x}) - \phi_A(\mathbf{x}) \big\}, \]

where $\phi_S(\mathbf{x})$ and $\phi_A(\mathbf{x})$ denote the log-likelihood (Eqn. (1)) of the speaker's speech and the impostors' speech respectively. For each segment, a verification decision was made according to the criterion

\[ \text{If } z(X_t) \begin{cases} > \zeta_{t-1}^{(j)} & \text{accept the claimant} \\ \le \zeta_{t-1}^{(j)} & \text{reject the claimant} \end{cases} \]

where $\zeta_{t-1}^{(j)}$ is the decision threshold of the PDBNN-based speaker model after learning from segment $X_{t-1}$ at epoch $j$. We adjusted the threshold whenever misclassification occurred. Specifically, we updated it according to

\[ \zeta_t^{(j)} = \begin{cases} \zeta_{t-1}^{(j)} - \eta_r\, l'\big(\zeta_{t-1}^{(j)} - z(X_t)\big) & \text{if } X_t \in \mathcal{X}_S \text{ and } z(X_t) \le \zeta_{t-1}^{(j)} \\ \zeta_{t-1}^{(j)} + \eta_a\, l'\big(z(X_t) - \zeta_{t-1}^{(j)}\big) & \text{if } X_t \in \mathcal{X}_A \text{ and } z(X_t) > \zeta_{t-1}^{(j)} \end{cases} \]

where $\eta_r$ and $\eta_a$ are respectively the reinforced and anti-reinforced learning rates (discussed next), $l(\cdot)$ is a penalty function, and $l'(\cdot)$ is its derivative. A falsely rejected speaker segment therefore lowers the threshold, and a falsely accepted anti-speaker segment raises it.

In the third modification, we introduced a new method to compute the learning rates. In the original PDBNNs, the learning rates for optimizing the thresholds are identical for both reinforced and anti-reinforced learning. However, in some situations there may be many false acceptances and only a few false rejections (or vice versa), which means that anti-reinforced learning will occur more frequently than reinforced learning (or vice versa). To reduce this imbalance in the learning frequency, we make the reinforced (anti-reinforced) learning rate proportional to the number of false acceptances (false rejections), i.e.,

\[ \eta_r^{(j)} = \eta\, \frac{N_{FA}^{(j-1)}}{N_{FA}^{(j-1)} + N_{FR}^{(j-1)}}, \qquad \eta_a^{(j)} = \eta\, \frac{N_{FR}^{(j-1)}}{N_{FA}^{(j-1)} + N_{FR}^{(j-1)}}, \]

where $N_{FR}^{(j-1)}$ and $N_{FA}^{(j-1)}$ represent respectively the total number of false rejections and false acceptances at epoch $j-1$, and $\eta$ is a positive learning parameter. As a result, a frequent learner will have a smaller learning rate while an infrequent learner will have a larger one. This arrangement prevents either reinforced or anti-reinforced learning from dominating the learning process.

In this work, we only modified the decision thresholds in the globally supervised training, with the mean vectors and covariance matrices remaining unchanged. This is because we want to maintain the maximum-likelihood nature of the models.

3.4. Speaker Verification Experiments

In this section, experimental evaluations on closed-set text-independent speaker verification based on all 138 speakers (male and female) in the YOHO corpus [9] are presented. Each speaker model was created using 16 anti-speaker kernels and 16 anti-speakers, in addition to its speaker kernels. The aim is to evaluate the robustness of different pattern classifiers (speaker models) for speaker verification.
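Since the experiments train the PDBNN threshold on enrollment data and then test it on verification data, a small sketch of the segmental threshold training of Section 3.3 may help in following the results. It rests on several assumptions that are labeled in the code: the penalty function is taken to be a sigmoid, speaker segments are visited before anti-speaker segments within an epoch, and when one type of error is absent the rates fall back to equal values so that learning cannot stall. The function names and default settings are illustrative only.

```python
import math


def penalty_derivative(d):
    """Derivative of l(d) = 1 / (1 + exp(-d)); the sigmoid form is an
    assumption of this sketch."""
    s = 1.0 / (1.0 + math.exp(-d))
    return s * (1.0 - s)


def balanced_rates(false_rej, false_acc, eta):
    """Learning rates for the next epoch from the previous epoch's errors."""
    if false_rej == 0 or false_acc == 0:
        # Sketch assumption: with only one kind of error, use equal rates so
        # the only active learner is not given a zero rate.
        return eta / 2.0, eta / 2.0
    total = false_rej + false_acc
    eta_r = eta * false_acc / total  # reinforced rate: few FAs means small steps
    eta_a = eta * false_rej / total  # anti-reinforced rate: few FRs means small steps
    return eta_r, eta_a


def train_threshold(speaker_scores, anti_scores, threshold=0.0, eta=0.5, epochs=200):
    """Adjust one decision threshold from segment scores, segment by segment."""
    eta_r, eta_a = eta / 2.0, eta / 2.0  # equal rates before any counts exist
    for _ in range(epochs):
        false_rej = false_acc = 0
        for z in speaker_scores:  # visiting order is a simplification
            if z <= threshold:  # false rejection: lower the threshold
                threshold -= eta_r * penalty_derivative(threshold - z)
                false_rej += 1
        for z in anti_scores:
            if z > threshold:  # false acceptance: raise the threshold
                threshold += eta_a * penalty_derivative(z - threshold)
                false_acc += 1
        if false_rej == 0 and false_acc == 0:
            break  # no misclassified training segment remains
        eta_r, eta_a = balanced_rates(false_rej, false_acc, eta)
    return threshold
```

The kernel parameters never appear in this routine, which mirrors the statement that only the threshold is modified during globally supervised training; the trained threshold would then be applied unchanged to the scores from the verification sessions.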
To demonstrate the robustness of different classifiers, speech from the enrollment sessions of the YOHO corpus was used for training, while the speech from the verification sessions was used for testing. Verification was performed using each speaker in the corpus as a claimant, with 64 impostors being randomly selected from the remaining speakers (excluding the anti-speakers) and rotating through all speakers.

[Figure 1: DET curves for speaker 277 (plot title: Speaker Detection Performance; x-axis: false alarm probability (in %); y-axis: miss probability (in %)). Thick curve: EBFN-based speaker model. Thin curve: GMM-based and PDBNN-based speaker models. Note that the DET curves for the GMM and the PDBNN are identical since the GS training only updates the threshold of the PDBNN.]

[Table 1: Average error rates obtained by the GMMs, EBFNs and PDBNNs (columns: Speaker Model, FAR (%), FRR (%), EER (%); rows: GMMs, EBFNs, PDBNNs). The predefined FAR for GMMs and EBFNs was set to 0.5%.]

Table 1 summarizes the average FAR, FRR, and EER obtained by the PDBNN-, GMM- and EBFN-based speaker models. All results are based on the average of 138 speakers in the YOHO corpus. The results, in particular the EER, demonstrate the superiority of the GMMs and PDBNNs over the EBFNs. The EERs of GMMs and PDBNNs are the same since their kernel parameters are identical.

Table 1 also demonstrates the superiority of the threshold determination procedure of PDBNNs. In particular, it shows that the globally supervised learning of PDBNNs can make the average FAR very small during verification, whereas the ad hoc approach used by the EBFNs and GMMs is not able to do so. Recall that the predefined FAR was set to 0.5%. The average FARs of EBFNs and GMMs are, however, very different from this value. This suggests that it may be difficult for us to predict the performance of the EBFNs and GMMs

in detecting impostor attacks.

[Figure 2: FRRs versus FARs (during verification) of the 138 speakers using (a) GMMs, (b) PDBNNs and (c) EBFNs as speaker models (x-axis: FAR (%); y-axis: FRR (%)).]

Figure 1 shows the DET curves [10] for speaker 277 using different types of speaker models. In the DET plots, a nonlinear scale is used for both axes so that systems producing Gaussian-distributed scores are represented by straight lines. This property helps spread out the receiver operating characteristics (ROCs), making comparison of well-performing systems much easier. Note that the DET curves for GMM and PDBNN are identical in this experiment because the globally supervised training updates the thresholds of PDBNNs only. It is evident from Figure 1 that the GMM- and PDBNN-based speaker models outperform the EBFN one.

Figure 2 depicts the FAR and FRR of individual speakers in the GMM-, EBFN- and PDBNN-based speaker verification systems. Evidently, most of the speakers in the PDBNN-based system exhibit a low FAR. On the other hand, the GMMs and EBFNs exhibit a much larger variation in FAR. We conjecture that the globally supervised learning in PDBNNs is able to find decision thresholds that minimize the variation in FAR.

4. CONCLUSIONS

This paper addresses the problem of building speaker verification systems using kernel-based probabilistic neural networks such as GMMs, EBFNs and PDBNNs. The modelling capabilities of these pattern classifiers are compared. Experimental results indicate that GMM- and PDBNN-based speaker models outperform the EBFN ones. This work also finds that the globally supervised learning of PDBNNs can reduce the FARs and keep their variation at a low level.

5. REFERENCES

[1] D. A. Reynolds and R. C. Rose. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. on Speech and Audio Processing, 3(1):72-83, 1995.
[2] M. W. Mak and S. Y. Kung. Estimation of elliptical basis function parameters by the EM algorithms with application to speaker verification. IEEE Trans. on Neural Networks, 11(4):961-969, 2000.
[3] S. H. Lin, S. Y. Kung, and L. J. Lin. Face recognition/detection by probabilistic decision-based neural network. IEEE Trans. on Neural Networks, Special Issue on Biometric Identification, 8(1):114-132, 1997.
[4] K. K. Yiu, M. W. Mak, and C. K. Li. Gaussian mixture models and probabilistic decision-based neural networks for pattern classification: A comparative study. Neural Computing & Applications, 8:235-245, 1999.
[5] S. Y. Kung. Digital Neural Networks. Prentice Hall, New Jersey, 1993.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. of Royal Statistical Soc., Ser. B., 39(1):1-38, 1977.
[7] C. S. Liu, H. C. Wang, and C. H. Lee. Speaker verification using normalized log-likelihood score. IEEE Trans. on Speech and Audio Processing, 4(1):56-60, 1996.
[8] W. D. Zhang, M. W. Mak, C. K. Li, and M. X. He. A priori threshold determination for phrase-prompted speaker verification. In Eurospeech, volume 2, 1999.
[9] J. P. Campbell, Jr. Testing with the YOHO CD-ROM voice verification corpus. In ICASSP'95, pages 341-344, 1995.
[10] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki. The DET curve in assessment of detection task performance. In Eurospeech'97, pages 1895-1898, 1997.
