SpeakerID @ Speech@FIT
Pavel Matějka, Lukáš Burget, Petr Schwarz, Ondřej Glembek, Martin Karafiát and František Grézl
November 13th 2006, FIT VUT Brno
Outline
- The task of speaker ID / speaker verification
- NIST 2005 data and evaluations
- From easy to complex:
  - GMM basic system
  - Dealing with channel and speaker variability
- Support vector machines (SVM):
  - What is an SVM?
  - GMM supervectors
  - MLLR transforms
- Conclusions
Speech@FIT BUT 13.11.2006
The task
Speaker recognition aims at recognizing "who said it". In speaker identification, the task is to assign a speech signal to one out of N speakers. In speaker verification, the claimed identity is known and the question to be answered is "was the speaker really Mr. XYZ, or an impostor?"
Good for
- Searching for information in private and public audio archives (lectures, presentations, meetings, TV shows, ...)
- Increasing the quality of service in call centers: speaker ID can detect a known client in a few seconds
- Security and defense: looking for a suspect in a large quantity of recordings, waiting on-line for a suspect, or verifying whether a given speech sample belongs to a suspect
NIST evaluations
NIST is an agency of the US government. Its speech group organizes regular evaluations of speech technologies (recognition, speaker detection, language recognition). All participating sites receive the same data and have limited time to submit results, which makes for an objective comparison. Results and system details are discussed at a workshop in Washington D.C. (or another nice place).
Speech@FIT track in NIST evals
- Language recognition: 2005
- Speaker recognition: 2006
- LVCSR (Large Vocabulary Continuous Speech Recognition): 2005, 2006
- Keyword spotting (Spoken Term Detection): will see in November 2006
NIST 2005 data
Test data: NIST 2005 evaluation data; 646 speakers; ~3,000 target trials and ~28,000 impostor trials; 5-minute conversations (2.5 minutes per speaker).
Train data: NIST 2004 evaluation data; 310 speakers; 10 utterances per speaker on average; 5-minute conversations (2.5 minutes per speaker), used for training the background model and the eigenchannel space.
Evaluation - DET curves
- False alarm: it was not the target speaker, but the system said it was.
- Miss: it was the target speaker, but the system said it was not.
- Equal Error Rate (EER): the operating point of the system at which it makes the same number of false alarms and misses.
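The EER definition above can be sketched in a few lines of numpy: sweep a threshold over the pooled scores and find the point where the miss rate and the false-alarm rate cross. This is an illustrative approximation (boundary handling at tied scores is simplified), not the official NIST scoring tool.

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Approximate Equal Error Rate: threshold where miss rate = false-alarm rate."""
    scores = np.concatenate([target_scores, impostor_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(impostor_scores))])
    order = np.argsort(scores)              # sweep threshold in ascending order
    labels = labels[order]
    # miss rate: fraction of targets scoring at or below the threshold
    miss = np.cumsum(labels) / labels.sum()
    # false-alarm rate: fraction of impostors scoring above the threshold
    fa = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()
    idx = np.argmin(np.abs(miss - fa))      # closest crossing point
    return (miss[idx] + fa[idx]) / 2
```

For perfectly separated scores this returns 0; for identical target and impostor score distributions it returns about 0.5, as expected.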
Basic structure of the system
Signal preprocessing
Feature extraction - MFCC
GMM
Why a Universal Background Model (UBM)?
There is not enough data to train a high-order GMM for each speaker. Instead, let's train one model on data from many speakers; it represents all speakers, and all of its parameters are well estimated. The speaker model is then derived from the UBM by shifting only those parameters for which we have data.
MAP adaptation of UBM to target speaker I.
MAP adaptation of UBM to target speaker II.
MAP adaptation of UBM to target speaker III.
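The MAP adaptation described on the slides above can be sketched as relevance-MAP of the means: collect soft counts and first-order statistics under the UBM, then interpolate each component mean between its maximum-likelihood estimate and the UBM mean. This is a minimal means-only sketch with an assumed relevance factor r = 16; the exact configuration of the BUT system is not given in the slides.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_vars, ubm_weights, X, r=16.0):
    """Relevance-MAP adaptation of UBM means (diagonal covariances).
    ubm_means, ubm_vars: (C, D); ubm_weights: (C,); X: (T, D) frames."""
    # log N(x | mu_c, diag(var_c)) for every frame/component pair
    diff = X[:, None, :] - ubm_means[None, :, :]             # (T, C, D)
    log_gauss = -0.5 * (np.log(2 * np.pi * ubm_vars)[None] +
                        diff**2 / ubm_vars[None]).sum(-1)    # (T, C)
    log_post = np.log(ubm_weights)[None] + log_gauss
    log_post -= log_post.max(1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(1, keepdims=True)                       # frame posteriors
    n_c = post.sum(0)                                        # soft counts, (C,)
    first = post.T @ X                                       # first-order stats, (C, D)
    mean_ml = first / np.maximum(n_c, 1e-10)[:, None]
    alpha = n_c / (n_c + r)                                  # data-dependent weight
    return alpha[:, None] * mean_ml + (1 - alpha[:, None]) * ubm_means
```

Components that saw little data keep the UBM mean (alpha near 0); well-observed components move toward the speaker's data.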
Things to improve GMM
[Chart: EER (%) over evaluation dates for successive improvements to the GMM system (baseline, + RASTA, + short-time Gaussianization, + double delta, + triple delta), together with the best stand-alone and best fused systems.]
RASTA
RASTA = RelAtive SpecTrAl Technique. The RASTA filter removes slowly varying, linear channel effects from the raw feature vectors:

c~_i(t) = h(t) * c_i(t)

It removes near-DC components along with some higher-frequency components. The standard RASTA IIR filter is

H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (z^-4 (1 - 0.98 z^-1))
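The filter above is easy to apply with scipy: the numerator taps are 0.1 * [2, 1, 0, -1, -2] and the single pole sits at 0.98, run independently along each cepstral trajectory (the pure-delay z^-4 factor only shifts the output in time, so it is ignored here).

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(cepstra):
    """Apply the standard RASTA band-pass filter along the time axis.
    cepstra: (T, D) array of cepstral coefficient trajectories."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # FIR part of H(z)
    a = np.array([1.0, -0.98])                       # pole at z = 0.98
    return lfilter(b, a, cepstra, axis=0)
```

Because the numerator taps sum to zero, the DC gain is zero: a constant (channel-like) offset in any dimension is filtered out entirely.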
Short-time gaussianization
HLDA
HLDA provides a linear transformation that de-correlates the features and reduces their dimensionality while preserving their discriminative power.
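As a simpler stand-in for HLDA (which additionally models the discarded dimensions), plain LDA illustrates the same idea of a discriminative dimensionality-reducing linear transform: maximize between-class scatter relative to within-class scatter and keep the top directions. This sketch is not the HLDA estimation actually used in the system.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, p):
    """Plain LDA: a (D, p) projection maximizing between-class over
    within-class scatter. X: (N, D) features, y: (N,) class labels."""
    D = X.shape[1]
    mu = X.mean(0)
    Sw = np.zeros((D, D))              # within-class scatter
    Sb = np.zeros((D, D))              # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    # generalized eigenproblem Sb v = lambda Sw v; keep top-p eigenvectors
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(D))
    return vecs[:, ::-1][:, :p]
```

For two well-separated clusters the leading projection direction aligns with the axis that separates the classes, discarding the uninformative one.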
Things to improve GMM
[Chart: EER (%) progression as above, now also including + HLDA.]
Feature mapping I. FM starts with a UBM trained on everything.
Feature mapping II. A channel-specific UBM is MAP-adapted from the general one.
Feature mapping III. Gaussian-dependent offsets are computed and removed from all features.
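The three steps above can be sketched as follows: score each frame against the channel-dependent model, pick the top Gaussian, and remove that Gaussian's channel offset so the frame looks as if it came from the channel-independent (root) model. This is a means-only simplification; full feature mapping also rescales by the variances.

```python
import numpy as np

def feature_map(X, chan_means, chan_vars, root_means, log_weights):
    """Map channel-dependent frames toward the root model.
    X: (T, D); chan_means, chan_vars, root_means: (C, D); log_weights: (C,)."""
    diff = X[:, None, :] - chan_means[None]
    ll = log_weights[None] - 0.5 * (np.log(2 * np.pi * chan_vars)[None] +
                                    diff**2 / chan_vars[None]).sum(-1)
    top = ll.argmax(1)                         # top-scoring Gaussian per frame
    # subtract the channel mean, add back the root mean of the same Gaussian
    return X - chan_means[top] + root_means[top]
```

After mapping, a frame that sat exactly on a channel-model mean lands exactly on the corresponding root-model mean.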
Things to improve GMM
[Chart: EER (%) progression as above, now also including + feature mapping.]
Eigen-channel adaptation I. We want to find the direction(s) of highest variability among supervectors obtained for different utterances of the same speaker: the eigenchannel(s).
Eigen-channel adaptation II. The directions are obtained by PCA of the average within-class covariance matrix, where each class consists of the supervectors corresponding to one speaker. Eigen-channel = unwanted variability.
Eigen-channel adaptation III. During the test, we adapt the speaker model and the UBM by moving their supervectors in the direction of the eigen-channel(s) so that they fit the test data optimally.
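The estimation step can be sketched directly: subtract each speaker's mean supervector (removing speaker identity), so only within-speaker, i.e. channel, variability remains, then take the top principal directions via SVD. A minimal sketch; the actual system estimated these on the NIST 2004 speakers.

```python
import numpy as np

def eigenchannels(supervectors, speaker_ids, k):
    """PCA of within-speaker variability. supervectors: (N, D);
    speaker_ids: (N,); returns a (D, k) eigenchannel basis."""
    S = np.asarray(supervectors, float)
    centered = np.empty_like(S)
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        centered[idx] = S[idx] - S[idx].mean(0)   # remove speaker identity
    # top-k right singular vectors = principal within-class directions
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:k].T
```

With two speakers whose sessions vary only along the first dimension, the first eigenchannel recovers exactly that axis, ignoring the dimension that encodes speaker identity.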
Things to improve GMM
[Chart: EER (%) progression as above, now also including + eigen-channel compensation.]
Things to improve GMM system

  System                                                EER [%]
  Baseline GMM (MFCC + c0, zero-mean norm, deltas,
    2048 Gaussians)                                      26.6
  + RASTA channel compensation                           14.3
  + short-time Gaussianization                           12.4
  + acceleration coefficients                            11.2
  + triple deltas (bad for 2006)                         10.6
  + HLDA (52 -> 39 dimensions)                            9.8
  + feature mapping                                       7.3
  + eigen-channel adaptation                              4.6
Systems based on SVM
- What is an SVM?
- GMM/SVM
- CMLLR-MLLR/SVM
SVM - What is an SVM?
Everybody knows neural networks: y = f(x), mapping inputs x1..xn through hidden units to outputs y1..ym. An SVM can be seen as a neural network with just one neuron (a perceptron) and no nonlinear output function.
Perceptron - basic definition
The task should be linearly separable (two point clouds in the (x1, x2) plane split by a line). The training algorithm and error criterion are important!
Perceptron error criterion
Perceptron: E = (y - t)^2, where t is the target (wanted) value. This criterion is very sensitive to the balancing of the data sets and very sensitive to the distribution of the data sets.
SVM error criterion
The SVM maximizes the margin between the two clusters and minimizes the structural error (the number of misclassified data points).
Linearly inseparable tasks
These can be solved by a nonlinear mapping of the features into a space of higher dimensionality (e.g. from 2D to 3D). Dot products in high-dimensional spaces are called kernels, K(x1, x2); here, however, we use only linear ones.
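The margin-maximizing linear SVM of the previous slides can be sketched as stochastic subgradient descent on the regularized hinge loss (Pegasos-style). This is an illustrative toy trainer, not the LIBSVM package actually used in the system; for simplicity there is no bias term, so the data are assumed roughly centered.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Minimal linear SVM via Pegasos-style subgradient descent.
    X: (N, D) features; y: (N,) labels in {-1, +1}. Returns weight vector w."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)          # decaying learning rate
            w *= 1.0 - eta * lam           # shrinkage = margin maximization
            if y[i] * (X[i] @ w) < 1:      # hinge: inside margin or wrong side
                w += eta * y[i] * X[i]     # pull toward correct classification
    return w
```

On linearly separable data the learned w separates the training points; points violating the margin are the only ones that ever trigger updates, mirroring the support-vector idea.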
Rank normalization
Helps to unify the dynamic range of the features: it normalizes an arbitrary distribution of input data to a uniform distribution between 0 and 1. The values in each dimension of the feature vector are replaced by their indexes in an array obtained by sorting all values over the data set, scaled to the interval [0, 1].
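The description above amounts to replacing each value by its rank within its feature dimension, scaled to [0, 1]. A minimal sketch (in practice the sorted reference array would come from the training set and test values would be looked up against it):

```python
import numpy as np

def rank_normalize(F):
    """Rank-normalize each column of F to a uniform [0, 1] distribution.
    Double argsort yields the rank of every element within its column."""
    ranks = F.argsort(axis=0).argsort(axis=0)
    return ranks / (F.shape[0] - 1)
```

Whatever the input distribution, the output values in each dimension are evenly spaced over [0, 1], which equalizes the dynamic range across SVM input features.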
SVM on GMM means
The GMM models consist of 512 Gaussian mixture components. Each Gaussian is represented by its mean vector, variance vector and weight. The means of all Gaussians are concatenated to form one long "supervector". These supervectors are the features for the support vector machine.
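Building the supervector is just a concatenation of the adapted means; with 512 components of 39-dimensional means the result has 512 * 39 = 19968 entries (matching the dimension quoted later for the SVM-GMM system):

```python
import numpy as np

def gmm_supervector(adapted_means):
    """Stack the (C, D) matrix of adapted GMM means into one (C*D,) supervector,
    component by component."""
    return np.asarray(adapted_means).reshape(-1)
```

Each utterance thus maps to a single fixed-length vector, regardless of its duration, which is what makes an SVM applicable.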
SVM on MLLR transforms
[Diagram: speech passes through the LVCSR system, producing text and CMLLR/MLLR speaker transforms; the transforms serve as SVM features.]
NAP
NAP = Nuisance Attribute Projection. It removes the unwanted variability from the features by projecting them onto the complement of the nuisance subspace.
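NAP is a single orthogonal projection, P = I - V V^T, where the columns of V are the (orthonormal) nuisance directions, e.g. the channel-variability eigenvectors. A minimal sketch:

```python
import numpy as np

def nap_project(supervectors, V):
    """Nuisance attribute projection P = I - V V^T.
    supervectors: (N, D) or (D,); V: (D, k) orthonormal nuisance directions.
    Removes the span of V, leaving the speaker-relevant complement intact."""
    S = np.atleast_2d(supervectors)
    return S - (S @ V) @ V.T
```

Any component of a supervector lying in the nuisance subspace is zeroed out, while the orthogonal (speaker-discriminative) part is untouched.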
Summary of results on NIST 2005
Summary of results on NIST 2005

  System            EER [%]
  Best GMM            4.62
  SVM-GMM + NAP       5.42
  SVM-MLLR + NAP      7.05
  Fusion              3.70
Comparison of results
Conclusions
We built three complementary systems for speaker verification and evaluated them in the NIST 2006 campaign with very promising results.
References
[Mason2005] M. Mason et al.: Data-Driven Clustering for Blind Feature Mapping in SpkID, Eurospeech 2005.
[Chang2001] C. Chang et al.: LIBSVM: a library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm
[Hain2005] T. Hain et al.: The 2005 AMI system for RTS, Meeting Recognition Evaluation Workshop, Edinburgh, July 2005.
[Stolcke2005/6] A. Stolcke: MLLR Transforms as Features in SpkID, Eurospeech 2005 and Odyssey 2006.
[Brummer2004] N. Brummer: SDV NIST SRE 04 system description, 2004.
[Brummer_FoCal] N. Brummer: FoCal: toolkit for fusion and calibration, www.dsp.sun.ac.za/~nbrummer/focal
[Campbell2006] W. M. Campbell et al.: SVM Based Speaker Verification Using a GMM Supervector and NAP Variability Compensation, ICASSP 2006.
SVM - GMM
Feature extraction and UBM adaptation are the same as for the GMM system, but with only 512 Gaussian components. The supervector has 512 * 39 = 19968 dimensions. NAP uses 30 eigenvectors derived on the 310 speakers of NIST 2004. Impostors: 230 speakers from NIST 2002 and 2606 speakers from Fisher. T-norm: 230 speakers from NIST 2002 and 800 speakers from Fisher.
SVM - CMLLR/MLLR [Stolcke2005/6]
The LVCSR system is adapted to the speaker (the VTLN factor and (C)MLLR transformations are estimated) using the ASR transcriptions provided by NIST. The AMI 2005(6) LVCSR system incorporates [Hain2005]:
- a 50k-word dictionary (pronunciations of OOVs were generated by grapheme-to-phoneme conversion based on rules trained from data)
- PLP features with HLDA
- CD-HMMs with 7500 tied states, each modeled by 18 Gaussians
- discriminative training using MPE
- speaker adaptation: VTLN, SAT based on CMLLR, MLLR
SVM - CMLLR/MLLR
A cascade of CMLLR and MLLR is used. CMLLR: 2 classes (silence and speech). MLLR: 3 classes (silence and 2 speech classes derived from data). The silence class is discarded for SRE. The supervector consists of 1 CMLLR + 2 MLLR transforms = 3 * 3 * 13^2 + 3 * 39 = 1638 dimensions (each transform has three 13x13 blocks plus a 39-dimensional bias). NAP uses 20 eigenvectors derived on NIST 2004. Impostors: 310 speakers from NIST 2004. T-norm: 310 speakers from NIST 2004.