Automatic Speaker Recognition Qian Yang 04. June, 2013

Outline Overview Traditional Approaches Speaker Diarization

State-of-the-art speaker recognition systems use either a GMM-based framework or an SVM-based framework.

Outline Overview Traditional Approaches Features and Models Evaluation and Performance Speaker Diarization

Traditional Approaches: Features for Speaker Recognition
- The primary features used in speaker recognition systems are cepstral features (e.g. MFCC, PLP).
- Voice activity detection (VAD) removes non-speech frames.
- Some form of blind deconvolution, such as cepstral mean subtraction (CMS), removes stationary channel effects.
- Feature warping compensates for channel variability.
- Time-differential cepstra (delta and delta-delta features) are usually appended to the cepstral features.
- Typically, 24- to 40-dimensional features are used.
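The CMS and delta steps can be sketched as follows. This is a minimal illustration, not the presenter's front end: it assumes NumPy, uses random numbers as a stand-in for 13 cepstral coefficients per frame, and uses a regression window of 2 frames for the deltas (a common choice, not stated in the slides). VAD and feature warping are left out.

```python
import numpy as np

def cepstral_mean_subtraction(feats):
    """Subtract the per-utterance mean of every cepstral coefficient (CMS)."""
    return feats - feats.mean(axis=0, keepdims=True)

def delta(feats, width=2):
    """Regression-based time derivative over +/- `width` frames."""
    num_frames = feats.shape[0]
    # Repeat the first and last frame so edge frames have neighbours.
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    out = np.zeros_like(feats, dtype=float)
    for k in range(1, width + 1):
        ahead = padded[width + k : width + k + num_frames]
        behind = padded[width - k : width - k + num_frames]
        out += k * (ahead - behind)
    return out / denom

def add_deltas(feats):
    """Append delta and delta-delta features to the static cepstra."""
    d1 = delta(feats)
    d2 = delta(d1)
    return np.hstack([feats, d1, d2])

rng = np.random.default_rng(0)
cepstra = rng.normal(loc=3.0, size=(200, 13))  # placeholder for real MFCCs
features = add_deltas(cepstral_mean_subtraction(cepstra))
print(features.shape)  # (200, 39)
```

With 13 static coefficients this gives 39 dimensions per frame, which falls inside the 24-40 range mentioned above.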

Support Vector Machines Latest discriminative approach for speaker verification

Traditional Approaches: Support Vector Machines
- An SVM is a supervised method.
- An SVM is a binary discriminative classifier (target speaker vs. impostors).
- Data are transformed to a higher-dimensional space using kernel functions.
- A hyperplane is constructed in that space to maximize the margin between the two classes.
- The data points lying on the margin boundaries are the support vectors.
- Different kernels are designed for the speaker recognition task.
- During classification, the class decision depends on whether the value f(x) is above or below a threshold.
[Figure labels: kernel, ideal output (0/1), support vectors]
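The decision rule can be sketched with the standard SVM score f(x) = sum_i alpha_i * y_i * K(x_i, x) + b. The toy support vectors, weights and threshold below are hypothetical values chosen for illustration; a real system would obtain them from training.

```python
def linear_kernel(a, b):
    """Plain inner product; any valid kernel could be substituted."""
    return sum(x * y for x, y in zip(a, b))

def svm_score(x, support_vectors, labels, alphas, bias, kernel=linear_kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    total = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return total + bias

def verify(x, model, threshold=0.0):
    """Accept the claimed target speaker if f(x) is above the threshold."""
    return svm_score(x, **model) > threshold

# Hypothetical two-point model: +1 = target speaker, -1 = impostor.
model = {
    "support_vectors": [(1.0, 1.0), (-1.0, -1.0)],
    "labels": [1, -1],
    "alphas": [0.25, 0.25],
    "bias": 0.0,
}

print(svm_score((2.0, 0.0), **model))    # 1.0
print(verify((2.0, 0.0), model))         # True
print(svm_score((-0.5, -1.0), **model))  # -0.75
print(verify((-0.5, -1.0), model))       # False
```

Both support vectors score exactly +1 and -1, i.e. they lie on the margin boundaries.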

Traditional Approaches: Support Vector Machines
How can speaker utterances be represented in a higher-dimensional space?
Example: a linear kernel for the speaker verification task
1. Train a GMM on each of the two utterances using MAP adaptation.
2. Derive a distance metric between the two utterances using a modified KL divergence.
3. Define a linear kernel based on this distance.
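A sketch of these three steps, under assumptions the slides do not spell out: only the means of a diagonal-covariance UBM are MAP-adapted (relevance factor 16 is a hypothetical choice), and the kernel takes the commonly used GMM-supervector form K(a, b) = sum_i w_i * mu_i^a' * Sigma_i^-1 * mu_i^b, which follows from an upper bound on the KL divergence between two mean-adapted GMMs. The tiny two-component UBM and random utterances are placeholders.

```python
import numpy as np

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Means-only MAP adaptation of a diagonal-covariance UBM to one utterance."""
    diff = frames[:, None, :] - means[None, :, :]            # (T, C, D)
    log_prob = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_prob                    # (T, C)
    log_post -= log_post.max(axis=1, keepdims=True)          # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    counts = post.sum(axis=0)                                # soft counts per component
    data_means = (post.T @ frames) / np.maximum(counts, 1e-10)[:, None]
    alpha = (counts / (counts + relevance))[:, None]
    return alpha * data_means + (1.0 - alpha) * means

def supervector(weights, means, variances):
    """Stack sqrt(w_i) * Sigma_i^(-1/2) * mu_i over all components."""
    return (np.sqrt(weights)[:, None] * means / np.sqrt(variances)).ravel()

def supervector_kernel(means_a, means_b, weights, variances):
    """Linear kernel: inner product of the two normalized supervectors."""
    return float(supervector(weights, means_a, variances)
                 @ supervector(weights, means_b, variances))

rng = np.random.default_rng(0)
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[-1.0, 0.0], [1.0, 0.0]])
ubm_vars = np.ones((2, 2))
utt_a = rng.normal(loc=[1.0, 0.5], size=(50, 2))    # placeholder utterance A
utt_b = rng.normal(loc=[-1.0, -0.5], size=(50, 2))  # placeholder utterance B

means_a = map_adapt_means(utt_a, ubm_weights, ubm_means, ubm_vars)
means_b = map_adapt_means(utt_b, ubm_weights, ubm_means, ubm_vars)
print(supervector_kernel(means_a, means_b, ubm_weights, ubm_vars))
```

Because the kernel is an inner product of fixed-length supervectors, utterances of different durations become directly comparable inputs for the SVM.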

Outline Overview Traditional Approaches Speaker Diarization Overview Cross-show speaker diarization Speaker Tracking

Unlike speaker verification/identification, there is no enrollment data for training and no prior knowledge about the number of speakers!

Speaker Diarization: Generic Architecture
Three-step diarization process:
1. Voice Activity Detection (VAD): detects different acoustic events (music, speech, noise, ...).
2. Speaker Change Detection: detects speaker turn changes missed by VAD.
3. Speaker Clustering: groups segments from the same speaker together.
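The clustering step can be sketched as bottom-up agglomerative clustering that stops once the closest pair of clusters is too far apart, so the number of speakers is never given in advance. This is an illustrative simplification: each segment is represented by a hypothetical mean feature vector, and the Euclidean centroid distance and stop threshold stand in for whatever merge criterion a real system would use.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_segments(segment_vectors, stop_threshold):
    """Merge the closest clusters until they are farther apart than the threshold."""
    clusters = [[i] for i in range(len(segment_vectors))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = centroid([segment_vectors[k] for k in clusters[i]])
                cj = centroid([segment_vectors[k] for k in clusters[j]])
                d = distance(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > stop_threshold:
            break  # remaining clusters are treated as different speakers
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Four segments produced by the earlier steps (hypothetical mean vectors).
segments = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
print(cluster_segments(segments, stop_threshold=1.0))  # [[0, 1], [2, 3]]
```

The threshold determines how many speakers come out, which is why the stopping criterion matters so much when no prior speaker count is available.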

Cross-show Speaker Diarization
Given a set of shows from the same source, provide global IDs for speakers who appear across shows.
Why cross-show?
- In digital libraries or multimedia archives, some speakers, such as journalists and politicians, may appear multiple times.
Linkage to conventional speaker diarization:
- Global IDs vs. local IDs

Cross-show Speaker Diarization: Generic Architectures
[Figure: Scheme 1, Scheme 2, Scheme 3]

Cross-show Speaker Diarization: some results on English podcast news data, with cross-show DER as the metric
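For orientation, the diarization error rate (DER) is the sum of missed speech, false-alarm speech and speaker confusion, divided by the total reference speech time. The sketch below is a simplified frame-level version (no overlapping speech, no scoring collar) that finds the best one-to-one label mapping by brute force. It assumes the usual reading of the cross-show variant, in which one mapping is computed over all shows together, so a hypothesis ID reused for different people in different shows is penalized. All labels are made up.

```python
from itertools import permutations

def der(reference, hypothesis):
    """Frame-level DER; each label is a speaker ID, or None for non-speech."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    speech = sum(r is not None for r in reference)
    if speech == 0:
        raise ValueError("reference contains no speech")
    pairs = list(zip(reference, hypothesis))
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)
    both = [(r, h) for r, h in pairs if r is not None and h is not None]

    ref_ids = sorted({r for r, _ in both})
    hyp_ids = sorted({h for _, h in both})
    # Pad with None so surplus hypothesis speakers can stay unmapped.
    candidates = ref_ids + [None] * max(0, len(hyp_ids) - len(ref_ids))
    best_correct = 0
    for perm in permutations(candidates, len(hyp_ids)):
        mapping = dict(zip(hyp_ids, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)
    confusion = len(both) - best_correct
    return (missed + false_alarm + confusion) / speech

show1_ref = ["A", "A", "B", "B"]
show1_hyp = ["s1", "s1", "s2", "s2"]
show2_ref = ["A", "A", "C", "C"]
show2_hyp = ["s2", "s2", "s3", "s3"]   # "s2" is reused for a different person

print(der(show1_ref, show1_hyp))                          # 0.0
print(der(show2_ref, show2_hyp))                          # 0.0
print(der(show1_ref + show2_ref, show1_hyp + show2_hyp))  # 0.25
```

Each show is perfect on its own, but the global ID "s2" covers speaker B in one show and speaker A in the other, which only the cross-show score exposes.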

Speaker Tracking
Aims to determine if and when a target speaker speaks in a multi-speaker recording. The target speaker has been enrolled into the system in a previous phase.
Why speaker tracking?
- Tracking anchor speakers or politicians in broadcast news
Linkage to speaker diarization:
- Supervised approach (speaker models are required)
- Diarization results are used as inputs

KIT's speaker tracking system consists of two components: speaker segmentation + open-set speaker identification

System Architecture
Speaker segmentation
- Split the speech into segments such that each segment contains only one speaker.
Open-set speaker identification (SID)
- Training phase:
  - Train 4 UBMs for telephone/studio male/female speech (gender and channel dependent) on ESTER2 data.
  - Speaker models are obtained by MAP adaptation from the corresponding UBMs.
- Detection phase:
  - System 1 (baseline): gender/bandwidth classification (GMM-based classifier); each test segment is scored against the speaker and UBM models that match its gender/bandwidth condition.
  - System 2 (FSC): apply frame-based score competition (FSC) [JSW07].
  - 100 impostors (50 female + 50 male) from ESTER1 data for T-Norm.
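The open-set decision and T-Norm can be sketched as follows. This is a generic illustration rather than KIT's implementation: the scores are assumed to be log-likelihood ratios of a speaker model against the matching UBM, T-Norm rescales a segment's score by the mean and standard deviation of the scores that impostor models give to the same segment, and every number and speaker name below is hypothetical. FSC is not shown.

```python
from statistics import mean, stdev

def t_norm(raw_score, impostor_scores):
    """Normalize a score using impostor-model scores for the same test segment."""
    return (raw_score - mean(impostor_scores)) / stdev(impostor_scores)

def open_set_identify(score_by_speaker, threshold):
    """Return the best-scoring enrolled speaker, or None if nobody passes the threshold."""
    best = max(score_by_speaker, key=score_by_speaker.get)
    return best if score_by_speaker[best] > threshold else None

# Hypothetical raw scores of one test segment against two enrolled speakers.
raw_scores = {"anchor_1": 6.0, "politician_7": 3.0}
# Hypothetical scores of the same segment against the impostor cohort.
impostor_scores = [0.0, 2.0, 4.0]

normalized = {spk: t_norm(s, impostor_scores) for spk, s in raw_scores.items()}
print(normalized)                                    # {'anchor_1': 2.0, 'politician_7': 0.5}
print(open_set_identify(normalized, threshold=1.0))  # anchor_1
print(open_set_identify(normalized, threshold=3.0))  # None
```

Returning None corresponds to the open-set case in which the segment belongs to none of the enrolled target speakers.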

Some results on French broadcast news
ESTER2 data, French BN: 114 target speakers, 105 hours of training data, 6 hours of dev data
Metric: half total error rate (HTER)

System           HTER-time   Rel. improvement   HTER-speaker   Rel. improvement
Baseline         25.307%     --                 31.943%        --
Baseline+TNorm   25.314%     -0.03%             31.953%        -0.03%
FSC              24.098%     4.78%              31.319%        1.95%
FSC+TNorm        27.830%     -9.97%             33.260%        -4.12%
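HTER is conventionally the average of the false-alarm rate and the miss rate. The second figure reported after each HTER value is consistent with the relative HTER reduction with respect to the baseline, which the short check below reproduces from the numbers in the results. The function names are illustrative.

```python
def hter(false_alarm_rate, miss_rate):
    """Half total error rate: the mean of the two error rates."""
    return 0.5 * (false_alarm_rate + miss_rate)

def relative_improvement(baseline, system):
    """Relative error reduction in percent; negative means the system is worse."""
    return 100.0 * (baseline - system) / baseline

# (HTER-time, HTER-speaker) in percent, copied from the results.
results = {
    "Baseline": (25.307, 31.943),
    "Baseline+TNorm": (25.314, 31.953),
    "FSC": (24.098, 31.319),
    "FSC+TNorm": (27.830, 33.260),
}

base_time, base_speaker = results["Baseline"]
for name, (time_err, speaker_err) in results.items():
    print(name,
          round(relative_improvement(base_time, time_err), 2),
          round(relative_improvement(base_speaker, speaker_err), 2))
```

Only FSC without T-Norm improves on the baseline for both metrics; adding T-Norm makes the result worse in both configurations.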

Thanks for your attention! Questions?