Using MMSE to improve session variability estimation. Gang Wang and Thomas Fang Zheng*


350    Int. J. Biometrics, Vol. 2, No. 4, 2010

Using MMSE to improve session variability estimation

Gang Wang and Thomas Fang Zheng*

Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
E-mail: wang-g07@mails.tsinghua.edu.cn
E-mail: fzheng@tsinghua.edu.cn
*Corresponding author

Abstract: In this paper, the Session Variability Subspace Projection (SVSP) method based on model compensation for speaker verification is improved using the Minimum Mean Square Error (MMSE) criterion. The issue with SVSP is that, when estimating the SVSP matrix, a speaker's session-independent supervector is approximated by the average of all of his or her session-dependent GMM supervectors, although an error between the two clearly exists. Our goal is to minimise this error using the MMSE criterion. Compared with the original SVSP, the proposed method achieves a relative error rate reduction of 6.7% in EER and 5.3% in the minimum detection cost function on the NIST SRE 2006 1C4W dataset.

Keywords: speaker verification; session variability; MMSE; model compensation.

Reference to this paper should be made as follows: Wang, G. and Zheng, T.F. (2010) Using MMSE to improve session variability estimation, Int. J. Biometrics, Vol. 2, No. 4, pp.350–357.

Biographical notes: Gang Wang is pursuing his PhD in the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His research interests focus on speaker recognition and multi-speaker segmentation.

Thomas Fang Zheng received his PhD from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 1997. He is a Research Professor and Vice Dean of the Research Institute of Information Technology, Tsinghua University, and Director of the Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology. He is an IEEE Senior Member, and his current research interests include speech recognition, speaker recognition and natural language processing.

Copyright © 2010 Inderscience Enterprises Ltd.

1 Introduction

The mismatch caused by session variability remains a major problem in speaker recognition, in spite of the great progress made in this field. Session variability includes transmission channel effects, transducer characteristics, background noise, intra-speaker variability, and so on. Several methods have been proposed to address this problem; they can be categorised into three domains: the feature domain, the model domain and the score domain. In the feature domain, typical methods include cepstral mean subtraction (Furui, 1981), the RASTA filter (Hermansky and Morgan, 1994), feature warping (Pelecanos and Sridharan, 2001) and feature mapping (Reynolds, 2003). In the model domain, typical methods include speaker model synthesis (Teunen et al., 2000; Wu et al., 2007), factor analysis (Kenny et al., 2005; Vogt and Sridharan, 2006), Nuisance Attribute Projection (NAP) (Campbell et al., 2006) and SVSP based on model compensation (Deng et al., 2007). In the score domain, typical methods include Hnorm (Reynolds, 1996), Tnorm (Auckenthaler et al., 2000), Znorm (Li and Porter, 1988) and ZTnorm (Kenny et al., 2008).

Model-domain methods, which have been proposed more recently, have become very popular and have achieved impressive reductions in verification error rates (Kenny et al., 2005; Vogt and Sridharan, 2006; Campbell et al., 2006; Deng et al., 2007). SVSP (Deng et al., 2007) greatly reduces the computational complexity of session variability compensation compared with the traditional factor analysis method (Vogt and Sridharan, 2006); it can be applied directly to a GMM-UBM system and improves speaker verification performance over the conventional GMM-UBM method. However, SVSP is not perfect.
In the SVSP algorithm, the session variability of a test utterance is used to compensate the tentative speaker models, whose own session variability (which, of course, mostly differs from that estimated from the test utterance) has already been removed during the training phase. Its performance therefore depends, to some extent, on the accuracy of the session variability estimate. However, when estimating the SVSP matrix, SVSP approximates a speaker's session-independent supervector by the average of all of that speaker's session-dependent GMM supervectors, and this approximation reduces the accuracy of the session variability estimate. Conversely, if the approximation error is made smaller, the session variability can be estimated more accurately and the performance of speaker recognition will be enhanced.

Considering the above, an improved SVSP matrix estimation method based on MMSE is proposed in this paper. Its main idea is to estimate the SVSP matrix under the constraint of minimising the mean square error of the session-independent GMM supervectors on the development data set, thereby improving the accuracy of the session variability estimate.

This paper is organised as follows. In Section 2, the SVSP method is briefly reviewed and the proposed method is introduced. In Section 3, the experiments are described and the results are given. Finally, conclusions and perspectives are presented in Section 4.

2 SVSP review and improvement

2.1 SVSP review

Given a speaker's Gaussian mixture model, a GMM supervector M(s, i) can be formed by concatenating the GMM component mean vectors (Wu et al., 2007; Kenny et al., 2005; Vogt and Sridharan, 2006). The supervector is the sum of a session-independent GMM supervector m(s) and an additional session-dependent supervector Uz(s, i) (Campbell et al., 2006), which is illustrated in Figure 1 and can be written as

    M(s, i) = m(s) + U z(s, i).    (1)

Figure 1  The decomposition of a GMM supervector

In equation (1), the GMM supervector M(s, i) depends on the ith session of speaker s; z(s, i) is a latent factor assumed to follow a standard normal distribution; and U is a CD × R_C low-rank matrix spanning the constrained session variability subspace, where C is the number of Gaussian components in the Universal Background Model (UBM), D is the dimension of the acoustic feature vectors, and R_C is the rank of the U matrix with R_C << CD. The computation of U is described in Vogt and Sridharan (2006). Note that the eigenvectors used to form the U matrix are orthogonal, so the derived projection matrix P can be written as

    P = U U^t  and  P U = U U^t U = U.    (2)

Figure 2 illustrates the detailed work flow of the SVSP method based on model compensation for GMM-UBM systems, which is divided into two parts. The part surrounded by the dashed rectangle will be modified in this paper to improve the estimation of session variability; details are given in Section 2.2. The UBM is trained with the EM algorithm (Dempster et al., 1977) on hundreds of speakers' utterances covering all kinds of sessions, so it can be regarded as session-independent. A GMM supervector µ can be formed from the UBM. Given a training utterance i uttered by speaker s, the speaker model is obtained from the UBM using the conventional MAP adaptation method (Reynolds et al., 2000) with only the mean vectors adapted. A GMM supervector M(s, i) can then be formed from this speaker model.
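The projection identities in equation (2), and the way (I − P) strips the session component from a supervector, can be checked numerically on toy dimensions. This is a hedged NumPy sketch: the dimensions and random data are illustrative only, not the paper's configuration.

```python
import numpy as np

# Toy check of equation (2): U has orthonormal columns, P = U U^t is the
# projection onto the session variability subspace. Dimensions are
# illustrative (a real system would use CD = C * D, with R_C around 50).
rng = np.random.default_rng(0)
CD, R_C = 8, 2

U, _ = np.linalg.qr(rng.standard_normal((CD, R_C)))  # orthonormal columns
P = U @ U.T

assert np.allclose(P @ U, U)        # PU = U, as in equation (2)
assert np.allclose(P @ P, P)        # P is idempotent (a projection)

# Decomposition of equation (1): if m(s) lies outside the session
# subspace, (I - P) recovers it exactly from M(s, i) = m(s) + U z(s, i).
z = rng.standard_normal(R_C)
m = rng.standard_normal(CD)
m = m - P @ m                       # force m(s) into the complement of U
M = m + U @ z
assert np.allclose((np.eye(CD) - P) @ M, m)
```

The last assertion is the algebraic basis of the compensation scheme that follows: projecting with (I − P) leaves only the session-independent part of the supervector.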
Afterwards, the session variability is removed from the GMM supervector by projection:

    m(s) = (I − P) M(s, i)    (3)

where m(s) is the session-independent GMM supervector of speaker s.

During the recognition phase, given a test utterance j uttered by speaker t, a speaker model is adapted from the UBM with the conventional MAP method (Reynolds et al., 2000), and a GMM supervector M(t, j) is formed from it. The session variability of the test utterance is then calculated as

    U z(t, j) = P M(t, j).    (4)

Therefore, the session-j-dependent GMM supervector of speaker s, M(s, j), is obtained by compensating the session-independent GMM supervector m(s) with Uz(t, j):

    M(s, j) = (I − P) M(s, i) + P M(t, j)    (5)

where M(s, j) can be regarded as the model of speaker s under the same session as the test utterance. Similarly, the session-j-dependent UBM supervector M(ubm, j) is obtained by compensating µ with Uz(t, j):

    M(ubm, j) = µ + P M(t, j)    (6)

where µ denotes the session-independent UBM supervector, and M(ubm, j) can be regarded as a session-dependent UBM supervector under the same session as the test utterance.

Figure 2  The block diagram of SVSP based on model compensation for a GMM-UBM system

2.2 The proposed SVSP estimation method

In Campbell et al. (2006), the U matrix estimation is performed under the assumption that the session-independent GMM supervector of a speaker s is approximated by the average of all of the speaker's session-dependent GMM supervectors, which can be written as

    M̄(s) = (1/N_s) Σ_{i=1..N_s} M(s, i) ≈ m(s)    (7)

where N_s is the number of utterances spoken by speaker s, and each utterance is regarded as one session. However, an error between M̄(s) and m(s) clearly exists. On the one hand, if the U matrix is estimated precisely, the error between M̄(s) and m(s) should be as small as possible. On the other hand, a U matrix is ideal if, after applying it, the differences among the session-variability-removed supervectors of different sessions from the same speaker are minimal. From these two points, it follows that a better U matrix should satisfy equation (9):

    MSE(U) = (1/H) Σ_{s=1..H} (1/N_s) Σ_{i=1..N_s} || (I − P) M(s, i) − M̄(s) ||^2    (8)

    U* = argmin_U MSE(U).    (9)

In equation (8), H is the number of unique speakers in the development data set, and MSE(U) is the mean square error of the U matrix. Instead of equation (7), the average of the session-variability-removed supervectors

    M̄(s) = (1/N_s) Σ_{i=1..N_s} (I − P) M(s, i)    (10)

is used to iteratively re-estimate the U matrix after the initial U matrix has been estimated with the original algorithm. The iteration stops when the reduction of MSE(U) becomes very small or a predefined maximum number of iterations has been reached. The detailed flow chart is given in Figure 3.

Figure 3  The flow chart of U matrix estimation based on MMSE
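The loop of Figure 3 can be sketched on toy data. Note the hedge: the U re-estimation step below (top eigenvectors of the pooled scatter of the centred supervectors) is a simplified stand-in for the original SVSP estimator, so this illustrates the iteration structure rather than reproducing the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM, R_C = 8, 2
# supervectors[s]: the N_s session-dependent supervectors of speaker s
supervectors = [rng.standard_normal((5, DIM)) for _ in range(4)]

def estimate_U(centered):
    # Simplified stand-in for the SVSP U estimator: top-R_C eigenvectors
    # of the pooled scatter matrix of the centred supervectors.
    _, vecs = np.linalg.eigh(centered.T @ centered)
    return vecs[:, -R_C:]

def mse(U, means):
    # Equation (8): mean squared residual after removing the session
    # subspace and subtracting the per-speaker mean supervector.
    P = U @ U.T
    errs = [np.mean(np.sum(((np.eye(DIM) - P) @ M.T - mb[:, None]) ** 2,
                           axis=0))
            for M, mb in zip(supervectors, means)]
    return float(np.mean(errs))

# Initialisation: plain per-speaker averages, as in equation (7).
means = [M.mean(axis=0) for M in supervectors]
U = estimate_U(np.vstack([M - mb for M, mb in zip(supervectors, means)]))

history = [mse(U, means)]
for _ in range(8):                            # bounded iteration count
    P = U @ U.T
    # Equation (10): average the session-compensated supervectors.
    means = [((np.eye(DIM) - P) @ M.T).mean(axis=1) for M in supervectors]
    U = estimate_U(np.vstack([M - mb for M, mb in zip(supervectors, means)]))
    history.append(mse(U, means))
    if history[-2] - history[-1] < 1e-8:      # stop when the reduction is tiny
        break
```

The stopping rule mirrors the text: iterate until the reduction of MSE(U) is very small or the maximum number of iterations is reached.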

3 Experiments and results

The experiments were performed on the National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) 2006 corpus (National Institute of Standards and Technology, 2004, 2005, 2006) and focused on the single-side one-conversation training, single-side one-conversation test condition. Feature extraction was performed on 20-millisecond frames every 10 milliseconds. The pre-emphasis coefficient was 0.97, and a Hamming window was applied to each pre-emphasised frame. Energy-based voice activity detection was performed, with each frame labelled either valid or invalid. 16-dimensional MFCC features were extracted from the valid frames only, with 30 triangular Mel filters used in the MFCC calculation. For each frame, the MFCC coefficients and their first delta coefficients formed a 32-dimensional feature vector. To reduce channel effects, mean and variance normalisation was applied to the extracted features.

Two gender-dependent UBMs, each with 1024 mixture components, were trained on the NIST SRE 2004 1C4W dataset using the EM algorithm (Dempster et al., 1977). For the MAP training (Reynolds et al., 2000), only the mean vectors were adapted, with a relevance factor of 16. The data used to estimate the U matrix were from the single-side eight-conversation training condition of NIST SRE 2005, comprising 279 female and 202 male speakers. The baseline system is a speaker verification system based on the conventional GMM-UBM approach. Equal Error Rate (EER) and Min-DCF (National Institute of Standards and Technology, 2004, 2005, 2006) are used to evaluate system performance. The DCF parameters are the same as in National Institute of Standards and Technology (2004, 2005, 2006).

3.1 The rank of the U matrix

The rank of the U matrix is critical for the SVSP algorithm, as it affects the accuracy of the session variability estimate.
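The front-end described above (pre-emphasis, 20 ms frames with a 10 ms shift, energy-based VAD, mean and variance normalisation) can be sketched with NumPy. The MFCC computation itself is omitted for brevity, and the sampling rate and VAD threshold below are assumptions for illustration, not values from the paper.

```python
import numpy as np

def frontend(signal, sr=8000, frame_ms=20, shift_ms=10, preemph=0.97):
    # Pre-emphasis with coefficient 0.97, as in the experiments.
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    flen, fshift = sr * frame_ms // 1000, sr * shift_ms // 1000
    n_frames = 1 + max(0, (len(emphasized) - flen) // fshift)
    frames = np.stack([emphasized[i * fshift:i * fshift + flen]
                       for i in range(n_frames)])
    frames = frames * np.hamming(flen)        # Hamming window per frame
    # Energy-based VAD: keep frames above a fraction of the mean energy
    # (the 0.5 threshold is an assumption, not the paper's value).
    energy = np.sum(frames ** 2, axis=1)
    feats = frames[energy > 0.5 * energy.mean()]  # stand-in for MFCCs
    # Mean and variance normalisation to reduce channel effects.
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

rng = np.random.default_rng(3)
feats = frontend(rng.standard_normal(8000))   # one second of toy "audio"
```

In the real system, the windowed valid frames would feed a 30-filter Mel filterbank to produce the 16 MFCCs plus deltas; here the normalisation is applied directly to the windowed samples to keep the sketch self-contained.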
The results for different session variability subspace sizes are given in Table 1, where R_C is the rank of the U matrix. The system used in this experiment was based on the original SVSP. The experimental results show that the system achieves the best Min-DCF and EER when R_C = 50.

Table 1  Min-DCF and EER results for different U matrix ranks

    R_C    MSE(U)    Min-DCF (×10^-2)    EER (%)
    10     1028.2    9.4                 9.56
    30      940.5    7.7                 8.25
    50      891.7    7.6                 7.93
    70      856.8    7.8                 8.12
    90      828.3    7.9                 8.37

3.2 Optimising the U matrix

The proposed method described in Section 2.2 was used to optimise the U matrix. With about five to eight iterations, equation (9) can be satisfied. After optimisation, Min-DCF and EER are reduced as MSE(U) decreases.
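The EER and Min-DCF figures reported in the tables come from sweeping a decision threshold over the verification scores. A minimal sketch on synthetic scores follows; the DCF parameters (C_miss = 10, C_fa = 1, P_target = 0.01) follow the usual NIST SRE convention, and the score distributions are toy assumptions.

```python
import numpy as np

def error_curves(target_scores, impostor_scores):
    # Sweep thresholds over all scores in descending order: at position k
    # the k highest-scoring trials are accepted.
    scores = np.concatenate([target_scores, impostor_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(impostor_scores))])
    labels = labels[np.argsort(-scores)]
    p_miss = 1.0 - np.cumsum(labels) / labels.sum()       # false rejects
    p_fa = np.cumsum(1 - labels) / (1 - labels).sum()     # false accepts
    return p_miss, p_fa

def eer(p_miss, p_fa):
    i = np.argmin(np.abs(p_miss - p_fa))   # closest miss/false-alarm crossing
    return (p_miss[i] + p_fa[i]) / 2

def min_dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    # Minimum of the detection cost function over all thresholds.
    return float(np.min(c_miss * p_miss * p_target
                        + c_fa * p_fa * (1 - p_target)))

rng = np.random.default_rng(4)
pm, pf = error_curves(rng.normal(2.0, 1.0, 2000),   # toy target scores
                      rng.normal(0.0, 1.0, 2000))   # toy impostor scores
rate = eer(pm, pf)   # roughly 0.16 for unit-variance scores two sigma apart
```

On real trials the scores would be the compensated GMM-UBM log-likelihood ratios; the threshold sweep itself is identical.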

It can be seen from Table 2 that the best Min-DCF and EER are achieved when R_C = 70.

Table 2  Min-DCF and EER results for the proposed method

    R_C    MSE(U)    Min-DCF (×10^-2)    EER (%)
    10     1025.5    8.2                 9.37
    30      931.1    7.5                 8.15
    50      870.3    7.4                 7.85
    70      816.2    7.2                 7.40
    90      818.7    7.6                 8.19

3.3 Comparison between the two methods

Figure 4 shows the performance comparison among three methods: the baseline (GMM-UBM), the original SVSP, and the improved SVSP. Compared with the original SVSP, the proposed method achieves a relative reduction of 6.7% in EER and 5.3% in DCF, though the curves cross each other.

Figure 4  The DET curve comparison among the baseline, the original SVSP, and the improved SVSP (see online version for colours)

4 Conclusions

In this paper, the estimation of the U matrix was analysed and an evaluation criterion was given in equations (8) and (9). Here, a U matrix with better performance means that the system using it achieves smaller EER and DCF values; to some extent, the smaller MSE(U) is, the better the U matrix performs. The proposed method estimates the U matrix better, so the session variability can be estimated more accurately and the session effect can be better reduced. The experimental results show the effectiveness of the proposed MMSE-based SVSP matrix estimation method.

However, it is still possible that some useful speaker-dependent information is removed from a speaker model by equation (3). In addition, for sessions that do not exist in the development data set, no performance improvement can be expected.

References

Auckenthaler, R., Carey, M. and Lloyd-Thomas, H. (2000) Score normalization for text-independent speaker verification systems, Digital Signal Processing, Vol. 10, pp.1–16.

Campbell, W.M., Sturim, D.E., Reynolds, D.A. and Solomonoff, A. (2006) SVM based speaker verification using a GMM supervector kernel and NAP variability compensation, ICASSP, Vol. 1, pp.97–100.

Dempster, A., Laird, N. and Rubin, D. (1977) Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc., Vol. 39, pp.1–38.

Deng, J., Zheng, T.F. and Wu, W.H. (2007) Session variability subspace projection based model compensation for speaker verification, ICASSP, Vol. IV, pp.57–60.

Furui, S. (1981) Cepstral analysis technique for automatic speaker verification, IEEE Trans. Acoust. Speech Signal Processing, Vol. 29, pp.254–272.

Hermansky, H. and Morgan, N. (1994) RASTA processing of speech, IEEE Transactions on Speech and Audio Processing, Vol. 2, pp.578–589.

Kenny, P., Boulianne, G. and Dumouchel, P. (2005) Eigenvoice modeling with sparse training data, IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 3, pp.345–354.

Kenny, P., Dehak, N., Dehak, R., Gupta, V. and Dumouchel, P. (2008) The role of speaker factors in the NIST extended data task, Proceedings of Odyssey 2008: The Speaker and Language Recognition Workshop.

Li, K.P. and Porter, J.E. (1988) Normalizations and selection of speech segments for speaker recognition scoring, ICASSP, Vol. 1, pp.595–598.

National Institute of Standards and Technology (2004, 2005, 2006) NIST Speech Group Website, http://www.nist.gov/speech

Pelecanos, J. and Sridharan, S. (2001) Feature warping for robust speaker verification, Odyssey, pp.213–218.
Reynolds, D.A. (1996) The effect of handset variability on speaker recognition performance: experiments on the Switchboard corpus, ICASSP, Vol. 1, pp.113–116.

Reynolds, D.A. (2003) Channel robust speaker verification via feature mapping, ICASSP, Vol. 2, pp.53–56.

Reynolds, D.A., Quatieri, T.F. and Dunn, R. (2000) Speaker verification using adapted Gaussian mixture models, Digital Signal Processing, Vol. 10, No. 1, pp.19–41.

Teunen, R., Shahshahani, B. and Heck, L.P. (2000) A model based transformational approach to robust speaker recognition, ICSLP, Vol. 2, pp.495–498.

Vogt, R. and Sridharan, S. (2006) Experiments in session variability modeling for speaker verification, ICASSP, Vol. 1, pp.897–900.

Wu, W., Zheng, T.F., Xu, M.-X. and Soong, F. (2007) A cohort-based speaker model synthesis for mismatched channels in speaker verification, IEEE Trans. on Audio, Speech and Language Processing, Vol. 15, No. 6, pp.1893–1903.