The 2004 MIT Lincoln Laboratory Speaker Recognition System

D. A. Reynolds, W. Campbell, T. Gleason, C. Quillen, D. Sturim, P. Torres-Carrasquillo, A. Adami (ICASSP 2005). CS298 Seminar, Shaunak Chatterjee, 09-23-2011.

Actually, the talk draws on a series of papers:
Robust text-independent speaker identification using Gaussian mixture speaker models. Reynolds, Rose (1995)
Speaker verification using adapted Gaussian mixture models. Reynolds, Quatieri, Dunn (2000)
Speaker recognition based on idiolectal differences between speakers. Doddington (2001)
Generalized linear discriminant sequence kernels for speaker recognition. Campbell (2002)
Modeling prosodic dynamics for speaker recognition. Adami, Mihaescu, Reynolds, Godfrey (2003)
Speaker adaptive cohort selection for Tnorm in text-independent speaker verification. Sturim, Reynolds (2005)
The 2004 MIT Lincoln Laboratory Speaker Recognition System. Reynolds et al (2005)
The MIT Lincoln Laboratory 2008 Speaker Recognition System. Sturim, Campbell, Karam, Reynolds, Richardson (2009)

Douglas A. Reynolds. PhD (Georgia Tech, 1992). Currently Senior Member of Technical Staff at MIT Lincoln Lab. Most cited author in speaker recognition (by far?). Contributed several key ideas currently used in robust speaker recognition systems. MIT Lincoln Lab has won numerous awards at the NIST SRE over the years.

What can we learn from speech? (Slide courtesy: Reynolds, Heck)

Speaker Recognition. Identification: no identity claim is made; a classification task. Verification: an identity claim is made; a binary decision. Further distinctions: open-set vs closed-set; text-dependent vs text-independent.

Applications. (Telephonic) transaction authentication. Access control: physical facilities; computer and data networks. Parole monitoring. Information retrieval. Audio indexing in call centers. Forensics.

Components of a speaker recognition system. [Diagram labels: Universal Background Model, Background, Voiceprint.] (Slide courtesy: Reynolds, Heck)

Phases of speaker verification. (Slide courtesy: Reynolds, Heck)

Feature Extraction. [Diagram labels: Universal Background Model, Background, Voiceprint.]

Feature Extraction. Pre-processing: bandlimiting; silence and noise removal; channel bias removal (RASTA et al). Feature computation: MFCC computed every 10 ms over a 20 ms window; F0 and energy features; phonetic features.
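A minimal sketch of the framing step behind "every 10 ms over a 20 ms window". The function name `frame_signal` and the 8 kHz sampling rate in the usage note are illustrative assumptions, not details from the slides; windowing and the MFCC transform itself are omitted.

```python
def frame_signal(samples, sample_rate, window_ms=20, hop_ms=10):
    """Split a sample sequence into overlapping analysis frames.

    Hypothetical helper: one frame starts every hop_ms milliseconds and
    each frame covers window_ms milliseconds of signal.
    """
    window = int(sample_rate * window_ms / 1000)  # samples per frame
    hop = int(sample_rate * hop_ms / 1000)        # samples between frame starts
    frames = []
    start = 0
    # Keep only frames that fit entirely inside the signal.
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += hop
    return frames
```

With an assumed 8 kHz sampling rate, each frame holds 160 samples and consecutive frames overlap by 80 samples, so one second of audio yields 99 full frames.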

Speaker models. [Diagram labels: Universal Background Model, Background, Voiceprint.] (Slide courtesy: Reynolds, Heck)

Gaussian mixture models (GMMs). Trained using EM, which often converges within 5 iterations. Wide range of choices to constrain parameters.
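As a hedged illustration of what a GMM computes for one feature vector, the sketch below evaluates the mixture log-likelihood. Diagonal covariances are assumed here as one common way to constrain the parameters; the function names are illustrative, and EM training is not shown.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) = log sum_i w_i * N(x; mean_i, diag(var_i)).

    Assumption: diagonal covariances. The log-sum-exp trick keeps the
    sum numerically stable when component log densities are very negative.
    """
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))
```

The score of a whole utterance is then usually the sum (or average) of this quantity over its frames, under the standard assumption that frames are independent.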

Why GMMs? - I. Histogram of one cepstral coefficient for a 25-second speech sequence, shown against three models of it: a unimodal distribution, a Gaussian mixture model, and Vector Quantization (VQ) [Reynolds 95].

Why GMMs? - II. Each component of the GMM corresponds to a speaker-dependent vocal tract configuration [Reynolds 95]. (Image: Wikipedia)

Text-dependent vs text-independent. (Slide courtesy: Reynolds, Heck)


Hypothesis testing
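The slide's own formula is not available here. As a hedged reminder, the test commonly used in GMM-UBM verification compares the claimed speaker's model with the background model through an average log-likelihood ratio. The symbols below are assumed notation: X = {x_1, ..., x_T} are the test feature vectors and θ is a decision threshold.

```latex
\Lambda(X) = \frac{1}{T}\sum_{t=1}^{T}
  \Big[\log p(x_t \mid \lambda_{\mathrm{target}})
     - \log p(x_t \mid \lambda_{\mathrm{UBM}})\Big],
\qquad
\text{accept the claim if } \Lambda(X) \ge \theta,\ \text{reject otherwise.}
```

Raising θ lowers the false alarm rate at the cost of more misses, which is the trade-off a DET curve displays.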

2004 MIT Lincoln Lab Speaker Recognition System (MITLL). Seven core systems. Spectral based: GMM-UBM; (spectral) SVM. Prosodic based: pitch and energy GMM; slope and duration n-gram. Phonetic based: phone n-grams; phone SVM. Idiolectal based.

Feature Extraction (GMM-UBM). 19-dimensional MFCC every 10 ms using a 20 ms window. Bandlimiting: 300-3138 Hz. RASTA filtering to reduce channel bias effects. Δ-cepstral coefficients computed over ±2 frames. Silence removal, feature mapping, normalization.
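A sketch of how Δ-cepstral coefficients over ±2 frames might be computed. The slides give only the span, so the regression formula used here (the common least-squares slope estimate) and the edge handling (repeating the first and last frame) are assumptions, as is the function name.

```python
def delta(frames, n=2):
    """Delta features over a window of +/- n frames.

    Assumed formula: d_t = sum_k k * (c[t+k] - c[t-k]) / (2 * sum_k k^2),
    with k = 1..n. Frames beyond either end are replaced by the nearest
    edge frame (also an assumption).
    """
    denom = 2 * sum(k * k for k in range(1, n + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, n + 1):
                ahead = frames[min(t + k, last)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out
```

For a coefficient that rises by one unit per frame, the interior deltas come out as 1.0, which is the slope the formula is meant to estimate. The deltas are typically appended to the static coefficients, doubling the feature dimension.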

UBM training. Gender-independent 2048-mixture UBM trained from the Switchboard and OGI National Cellular Database corpora. The MIXER corpus (the test data) was not used. Target models (for individual speakers) are derived by Bayesian adaptation of the UBM parameters using each speaker's training data from MIXER, which compensates for the UBM itself not having seen MIXER data.
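A minimal sketch of Bayesian (MAP) adaptation of a UBM toward one speaker. It adapts only the component means and uses a relevance factor of 16; both choices are common in the GMM-UBM literature but are assumptions here, since the slides do not say which parameters were adapted. All function names are illustrative.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def posteriors(x, weights, means, variances):
    """Probability of each mixture component given frame x."""
    logs = [math.log(w) + log_gauss(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    probs = [math.exp(l - top) for l in logs]
    total = sum(probs)
    return [p / total for p in probs]

def map_adapt_means(data, weights, means, variances, relevance=16.0):
    """Means-only MAP adaptation (assumed variant, relevance factor assumed).

    For each component i:
      n_i    = sum_t Pr(i | x_t)
      E_i    = sum_t Pr(i | x_t) * x_t / n_i
      mean_i = alpha_i * E_i + (1 - alpha_i) * ubm_mean_i
      alpha_i = n_i / (n_i + relevance)
    """
    k, dim = len(weights), len(means[0])
    counts = [0.0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for x in data:
        post = posteriors(x, weights, means, variances)
        for i in range(k):
            counts[i] += post[i]
            for d in range(dim):
                sums[i][d] += post[i] * x[d]
    adapted = []
    for i in range(k):
        alpha = counts[i] / (counts[i] + relevance)
        # Components that saw no data keep the UBM mean.
        expected = [s / counts[i] for s in sums[i]] if counts[i] > 0 else means[i]
        adapted.append([alpha * e + (1.0 - alpha) * m
                        for e, m in zip(expected, means[i])])
    return adapted
```

Components that the speaker's data rarely visits stay close to the UBM, while well-observed components move toward the speaker's own statistics. This is why a large UBM can be adapted from a short enrollment recording.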

Support Vector Machines (SVM)

SVM - II

SVM - III

Spectral SVM (for speech). Campbell (2002) showed that good performance in speaker recognition tasks could be achieved using sequence kernels. A sequence kernel provides a numerical comparison of speech utterances as entire sequences. Campbell introduced a novel sequence kernel derived from generalized linear discriminants.

SVM setup in MITLL. Same front-end processing as before. The background (the "other" class) for every speaker consisted of a set of speakers taken from Switchboard. The speaker being trained had a target of +1 and every background speaker had a target of -1. SVM training was performed using the GLDS kernel.
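A rough sketch of the idea behind the GLDS kernel: each utterance is summarized by the average polynomial expansion of its frames, and two utterances are compared by a normalized inner product of those averages. The expansion degree of 2, the diagonal approximation of the normalizing correlation matrix (`r_diag`), and all function names are assumptions for illustration; none of them is stated in the slides.

```python
from itertools import combinations_with_replacement

def expand(x, degree=2):
    """All monomials of the entries of x up to the given degree."""
    terms = [1.0]
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), deg):
            prod = 1.0
            for i in idx:
                prod *= x[i]
            terms.append(prod)
    return terms

def mean_expansion(frames, degree=2):
    """Average expansion over the frames of one (non-empty) utterance."""
    total = None
    for x in frames:
        b = expand(x, degree)
        total = b if total is None else [t + bi for t, bi in zip(total, b)]
    return [t / len(frames) for t in total]

def glds_kernel(frames_a, frames_b, r_diag, degree=2):
    """Compare two utterances as whole sequences.

    r_diag is an assumed diagonal approximation of the correlation matrix
    of expanded background features; it must have one entry per monomial.
    """
    a = mean_expansion(frames_a, degree)
    b = mean_expansion(frames_b, degree)
    return sum(ai * bi / r for ai, bi, r in zip(a, b, r_diag))
```

Because each utterance collapses to one fixed-length vector, utterances of different durations can be compared directly, and a standard linear SVM can be trained on the normalized vectors.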

Prosodic based systems. Prosody: the rhythm, stress and intonation of speech. Spectral approaches focus on capturing short-term information; prosodic systems can model long-term information. Two systems in the 2004 MITLL SRS: a distribution-based pitch/energy classifier and a pitch/energy sequence modeling system.

Pitch and Energy GMM. Very similar to GMM-UBM; the main difference is the feature set. Log F0 and log energy estimated every 10 ms using RAPT, the Robust Algorithm for Pitch Tracking (Talkin 1995). Δ features (over a 50 ms window) appended. Silence and noisy region removal. UBM: 512 components (Switchboard).

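What is F0? The fundamental frequency of a human voice: roughly 85-180 Hz in males and 165-255 Hz in females. This range lies below the lower cutoff of most band limits (for example, the 300 Hz limit in the GMM-UBM front end), so only the higher harmonics are transmitted. F0 is not static.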

Slope and duration n-gram - I. The dynamics of F0 and energy also convey information about speaker identity. The dynamics of both trajectories jointly represent certain prosodic gestures characteristic of a speaker (Adami et al, 2003).

Slope and duration n-gram - II. F0 and energy trajectories are converted into a sequence of tokens. Each token reflects a joint state of the trajectories (rising or falling).
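A simplified sketch of the tokenization idea: label each step by whether F0 and energy are rising or falling, then merge runs of the same joint state into one token with a duration. The frame-by-frame difference used here is an assumption made for brevity (a real system would more likely fit line segments to the contours), and the function name and token format are illustrative.

```python
def joint_slope_tokens(f0, energy):
    """Turn two equal-length contours into (state, duration) tokens.

    state is two characters: the first for F0, the second for energy,
    each '+' (rising or flat) or '-' (falling). duration counts how many
    consecutive frame steps stayed in that joint state.
    """
    tokens = []
    for t in range(1, len(f0)):
        state = ("+" if f0[t] >= f0[t - 1] else "-") + \
                ("+" if energy[t] >= energy[t - 1] else "-")
        if tokens and tokens[-1][0] == state:
            # Same joint state as the previous step: extend the token.
            tokens[-1] = (state, tokens[-1][1] + 1)
        else:
            tokens.append((state, 1))
    return tokens
```

The resulting token stream can then be modeled with n-grams in the same way as a phone or word sequence.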

Phonetic based system - I. Gender-independent phone recognition. Phone recognizers trained on phonetically marked speech from the OGI multi-language corpus. Output token streams were processed to produce a sequence of token symbols.

Phonetic based system - II. Two systems. Standard n-gram modeling: bi-gram model estimated for each speaker (for each phone/language); UBM from Switchboard; 6 scores fused. Phone SVM: very similar to the Spectral SVM.
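A hedged sketch of bigram scoring against a background model, in the spirit of the n-gram systems described above. The count-weighted log-likelihood ratio, the probability floor for unseen bigrams, and the function names are assumptions; the slides do not give the scoring formula.

```python
import math
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent token pairs in a sequence."""
    return Counter(zip(tokens, tokens[1:]))

def bigram_llr(test_tokens, speaker_counts, background_counts, floor=1e-6):
    """Average log-likelihood ratio of the test bigrams.

    Assumed scoring: each test bigram contributes
    log p_speaker(bigram) - log p_background(bigram), weighted by how
    often it occurs in the test sequence. Probabilities are relative
    frequencies, floored to avoid log(0) for unseen bigrams.
    """
    spk_total = max(sum(speaker_counts.values()), 1)
    bg_total = max(sum(background_counts.values()), 1)
    test = bigram_counts(test_tokens)
    n = sum(test.values())
    if n == 0:
        return 0.0
    score = 0.0
    for bigram, count in test.items():
        p_spk = max(speaker_counts[bigram] / spk_total, floor)
        p_bg = max(background_counts[bigram] / bg_total, floor)
        score += count * (math.log(p_spk) - math.log(p_bg))
    return score / n
```

A positive score means the test sequence uses bigrams that the speaker produces more often than the background population does. The same scheme applies to word bigrams in an idiolectal system.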

Idiolectal differences. Only look at content! It is possible to determine authorship of papers/literary works by looking at them.

Idiolectal differences. Speech content is conventionally less constrained and therefore more distinctive. Unfortunately, a lot of data is needed for reasonable accuracy.

MITLL idiolectal based system. Only bigrams were considered; trigrams and higher did not improve performance. Switchboard data used to create the UBM. BBN Byblos 3.0 used for speech-to-text conversion.

System fusion. Perceptron classifier.
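A toy sketch of score fusion with a perceptron: each trial is a vector of per-system scores, and the perceptron learns one weight per system plus a bias. The classic perceptron update rule, the learning rate, and the epoch count are assumptions; the slides say only that a perceptron classifier was used.

```python
def train_perceptron(score_vectors, labels, epochs=50, lr=0.1):
    """Learn fusion weights from labeled trials.

    score_vectors: one list of system scores per trial.
    labels: +1 for target trials, -1 for impostor trials.
    Uses the classic perceptron rule (an assumption): update only on
    trials that are currently misclassified.
    """
    weights = [0.0] * len(score_vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(score_vectors, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias

def fuse(scores, weights, bias):
    """Fused score for one trial: weighted sum of system scores plus bias."""
    return sum(w * s for w, s in zip(weights, scores)) + bias
```

The fused score is thresholded like any single-system score. Systems whose scores separate targets from impostors well on the development data end up with larger weights.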

Performance measure. (Slide courtesy: Reynolds, Heck)

DET: different scenarios. (Slide courtesy: Reynolds, Heck)
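A DET curve plots miss probability against false alarm probability as the decision threshold sweeps over the scores. The sketch below computes one such operating point and a simple estimate of the equal error rate (EER). The threshold search over observed scores and the function names are illustrative assumptions.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Miss and false alarm rates when accepting scores >= threshold."""
    miss = sum(s < threshold for s in target_scores) / len(target_scores)
    false_alarm = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return miss, false_alarm

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: the operating point where miss ~= false alarm.

    Tries every observed score as a threshold and keeps the one with the
    smallest gap between the two error rates (a coarse estimate; no
    interpolation between points).
    """
    best = None
    for th in sorted(set(target_scores) | set(impostor_scores)):
        miss, fa = error_rates(target_scores, impostor_scores, th)
        gap = abs(miss - fa)
        if best is None or gap < best[0]:
            best = (gap, (miss + fa) / 2.0, th)
    return best[1], best[2]
```

Each threshold gives one point on the DET curve. Applications differ in which region of the curve matters: some need very few false alarms, others very few misses.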

Results - I

Results - II

No gain from higher-level information. Two possible reasons. First, all development data was English, which could have biased the UBMs. Second, the SRE04 dataset had substantial channel mismatch, which makes the task more difficult and potentially masks gains. Both are essentially mismatches between the training and test data distributions.

Results - III. All pool: all languages. Common pool: English only. Clear indication of cross-lingual degradation. N-gram system reduces error significantly.

Conclusions. The 2004 MITLL system attempted to exploit other levels of information (prosodic, phonetic, idiolectal) to better characterize and recognize a speaker. 7 core systems: generative, discriminative and discrete classifiers. Results reported on the challenging MIXER corpus (SRE04). System fusion, successful in the past, needs to be better tailored to cross-lingual environments.

2008 MITLL Speaker Recognition system (Interspeech 2009). Two main themes: variational nuisance modeling to allow for better compensation for channel variation; and fusing systems that target different linguistic tiers of information (high and low).

Thanks for the attention! QUESTIONS?