Automatic Speaker Recognition


1 Automatic Speaker Recognition Qian Yang 04. June, 2013

2 Outline Overview Traditional Approaches Speaker Diarization

4 State-of-the-art speaker recognition systems use: GMM-based framework SVM-based framework

9 Outline Overview Traditional Approaches Features and Models Evaluation and Performance Speaker Diarization

14 Traditional Approaches: Features for Speaker Recognition
- The primary features used in speaker recognition systems are cepstral features (e.g. MFCC, PLP).
- Voice activity detection (VAD) removes non-speech frames.
- Some form of blind deconvolution, such as cepstral mean subtraction (CMS), removes stationary channel effects.
- Feature warping compensates for channel variability.
- Time-differential cepstra (delta and delta-delta features) are usually appended to the cepstral features.
- Typically dimensional features are used
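As a minimal sketch of two of the steps above, the following pure-Python code applies cepstral mean subtraction and appends delta and delta-delta features. The function names, the regression window width of 2, and the edge-padding rule are assumptions made for illustration; the slides do not specify them, and a real front end would first compute MFCC or PLP frames and apply VAD.

```python
from typing import List

Frame = List[float]  # one cepstral vector (e.g. MFCCs) for one time frame


def cepstral_mean_subtraction(frames: List[Frame]) -> List[Frame]:
    """Subtract the utterance-level mean of each coefficient from every frame."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


def delta(frames: List[Frame], width: int = 2) -> List[Frame]:
    """Regression-based time derivative; edge frames are repeated as padding."""
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, width + 1):
                nxt = frames[min(t + k, n - 1)][d]
                prv = frames[max(t - k, 0)][d]
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def append_deltas(frames: List[Frame]) -> List[Frame]:
    """Concatenate static, delta and delta-delta features frame by frame."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


if __name__ == "__main__":
    toy = [[2.0 * t, 1.0] for t in range(7)]  # toy 2-dimensional "cepstra"
    features = append_deltas(cepstral_mean_subtraction(toy))
    print(len(features), len(features[0]))
```

Subtracting the mean removes any constant offset in the cepstral domain, which is where a stationary convolutional channel shows up; the deltas are unaffected by that offset.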

16 Support Vector Machines Latest discriminative approach for speaker verification

22 Traditional Approaches: Support Vector Machines
- An SVM is a supervised method.
- An SVM is a binary discriminative classifier (target speaker vs. impostors).
- Data are transformed to a higher-dimensional space using kernel functions.
- A hyperplane is constructed in the higher-dimensional space to maximize the margin between the two classes.
- The data points lying on the margin boundaries are the support vectors.
- Different kernels are designed for the speaker recognition task.
- Figure labels: kernel, ideal output (0/1), support vectors.
- At classification time, the class decision is based on whether the value f(x) is above or below a threshold.
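The thresholded decision on f(x) can be sketched as follows. This uses the standard SVM form f(x) = sum_i alpha_i * y_i * K(x_i, x) + b with labels y_i in {+1, -1}, which is an assumption here (the slide only states that f(x) is compared with a threshold). The support vectors, weights, and bias are toy values, not a trained speaker model.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def linear_kernel(a: Vector, b: Vector) -> float:
    """Plain inner product; any other kernel function could be passed instead."""
    return sum(x * y for x, y in zip(a, b))


def svm_score(
    x: Vector,
    support_vectors: Sequence[Vector],
    alphas: Sequence[float],
    labels: Sequence[int],
    bias: float,
    kernel: Callable[[Vector, Vector], float] = linear_kernel,
) -> float:
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors."""
    total = 0.0
    for sv, alpha, y in zip(support_vectors, alphas, labels):
        total += alpha * y * kernel(sv, x)
    return total + bias


def accept(score: float, threshold: float = 0.0) -> bool:
    """Accept the target-speaker hypothesis when f(x) exceeds the threshold."""
    return score > threshold


if __name__ == "__main__":
    svs = [[1.0, 1.0], [-1.0, -1.0]]  # toy support vectors (target, impostor)
    score = svm_score([2.0, 0.0], svs, alphas=[0.5, 0.5], labels=[1, -1], bias=0.0)
    print(score, accept(score))
```

Moving the threshold trades false acceptances against false rejections, which is why it is kept as a separate parameter rather than fixed at zero.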

23 Traditional Approaches: Support Vector Machines
How can speaker utterances be represented in a higher-dimensional space?
Example: a linear kernel for the speaker verification task
- Train GMMs on the two utterances using MAP adaptation.
- Derive a distance metric between the two utterances using a modified KL divergence.
- Define a linear kernel based on this distance.
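One common way to realise the example above is the GMM-supervector kernel, sketched below. It assumes that both utterance GMMs are mean-only MAP adaptations of the same UBM, so they share mixture weights and diagonal covariances, and that the modified KL divergence is the weighted, variance-normalised squared distance between the adapted means. The slide does not give the formula, so treat the exact form and all names here as assumptions.

```python
import math
from typing import List, Sequence

Mean = Sequence[float]


def supervector(means: Sequence[Mean], weights: Sequence[float],
                variances: Sequence[Sequence[float]]) -> List[float]:
    """Stack sqrt(w_i) * mu_i / sigma_i over all mixture components."""
    sv: List[float] = []
    for mu, w, var in zip(means, weights, variances):
        sv.extend(math.sqrt(w) * m / math.sqrt(v) for m, v in zip(mu, var))
    return sv


def supervector_kernel(means_a, means_b, weights, variances) -> float:
    """Linear kernel: inner product of the two normalised supervectors."""
    a = supervector(means_a, weights, variances)
    b = supervector(means_b, weights, variances)
    return sum(x * y for x, y in zip(a, b))


def kl_style_distance(means_a, means_b, weights, variances) -> float:
    """0.5 * sum_i w_i * (mu_a - mu_b)^T Sigma_i^-1 (mu_a - mu_b), via the kernel."""
    kaa = supervector_kernel(means_a, means_a, weights, variances)
    kbb = supervector_kernel(means_b, means_b, weights, variances)
    kab = supervector_kernel(means_a, means_b, weights, variances)
    return 0.5 * (kaa - 2.0 * kab + kbb)


if __name__ == "__main__":
    weights = [0.5, 0.5]                    # shared UBM mixture weights
    variances = [[1.0, 4.0], [1.0, 1.0]]    # shared diagonal covariances
    means_a = [[1.0, 2.0], [0.0, 0.0]]      # adapted means, utterance A
    means_b = [[1.0, 2.0], [2.0, 0.0]]      # adapted means, utterance B
    print(supervector_kernel(means_a, means_b, weights, variances))
    print(kl_style_distance(means_a, means_b, weights, variances))
```

Because the distance is a squared Euclidean distance between the normalised supervectors, the matching kernel is a plain inner product, which is what makes it linear.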

30 Outline Overview Traditional Approaches Speaker Diarization Overview Cross-show speaker diarization Speaker Tracking

31 Unlike speaker verification/identification, there is no enrollment data for training and no prior knowledge of the number of speakers.

35 Speaker Diarization: Generic Architecture
3-step diarization process:
1. Voice Activity Detection (VAD): detects different acoustic events (music, speech, noise, ...).
2. Speaker Change Detection: detects speaker turn changes missed by VAD.
3. Speaker Clustering: groups segments from the same speaker together.

39 Cross-show Speaker Diarization
Given a set of shows from the same source, the system provides global IDs for speakers who appear across shows.
Why cross-show? In digital libraries or multimedia archives, some speakers, such as journalists and politicians, may appear multiple times.
Linkage to conventional speaker diarization: global IDs vs. local IDs.

40 Cross-show Speaker Diarization Generic Architectures Scheme 1 Scheme 2 Scheme 3

41 Cross-show Speaker Diarization: some results on English Podcast News data, with cross-show DER as the metric
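For orientation, the diarization error rate (DER) is the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below assumes that the cross-show variant pools these times over all shows after hypothesis speakers have been mapped to reference speakers with one global mapping; the slides do not spell this out, and the function and field names are illustrative.

```python
from typing import Iterable, Tuple

# (missed, false_alarm, confusion, total_reference_speech), all in seconds
ShowErrors = Tuple[float, float, float, float]


def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """DER = (missed + false alarm + speaker confusion) / reference speech time."""
    if total_speech <= 0:
        raise ValueError("total reference speech time must be positive")
    return (missed + false_alarm + confusion) / total_speech


def cross_show_der(shows: Iterable[ShowErrors]) -> float:
    """Pool error and speech time over shows (errors assumed globally mapped)."""
    missed = false_alarm = confusion = total = 0.0
    for m, fa, c, t in shows:
        missed += m
        false_alarm += fa
        confusion += c
        total += t
    return diarization_error_rate(missed, false_alarm, confusion, total)


if __name__ == "__main__":
    shows = [(2.0, 1.0, 3.0, 60.0), (0.0, 0.0, 6.0, 40.0)]
    print(cross_show_der(shows))
```

A speaker who is labelled consistently within each show but receives different IDs in different shows adds confusion time only under the global mapping, which is what distinguishes the cross-show figure from per-show DER.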

42 Speaker Tracking
Aims to determine if and when a target speaker speaks in a multi-speaker recording. The target speaker has been enrolled into the system in a previous phase.
Why speaker tracking? Tracking anchor speakers or politicians in broadcast news.
Linkage to speaker diarization:
- Supervised approach (speaker models are required)
- Diarization results are used as inputs

43 KIT's speaker tracking system consists of two components: speaker segmentation + open-set speaker identification

44 System Architecture
Speaker segmentation: split the speech into segments so that each segment contains only one speaker.
Open-set speaker identification (SID):
- Training phase: train 4 UBMs for telephone/studio male/female speech (gender and channel dependent) on ESTER2 data. Speaker models are obtained by MAP adaptation from the corresponding UBM.
- Detection phase:
  System 1 (baseline): gender/bandwidth classification (GMM-based classifier); the test segment is scored against the speaker and UBM models that match its gender/bandwidth condition.
  System 2 (FSC): apply frame-based score competition (FSC) [JSW07].
  T-Norm uses impostors (50 female + 50 male) from ESTER1 data.
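The T-Norm and open-set decision steps can be sketched as below. The code assumes that each raw score is already a log-likelihood ratio of a speaker model against the matching UBM, and that the cohort scores come from scoring the same test segment against the impostor models. It does not implement frame-based score competition, and the speaker names and threshold are hypothetical.

```python
import statistics
from typing import Dict, Optional, Sequence, Tuple


def t_norm(raw_score: float, cohort_scores: Sequence[float]) -> float:
    """Normalise a score by the mean and std of impostor-cohort scores
    obtained on the same test segment."""
    mu = statistics.mean(cohort_scores)
    sigma = statistics.pstdev(cohort_scores)
    if sigma == 0.0:
        raise ValueError("cohort scores have zero variance")
    return (raw_score - mu) / sigma


def open_set_identify(
    target_scores: Dict[str, float],
    cohort_scores: Sequence[float],
    threshold: float,
) -> Tuple[Optional[str], float]:
    """Pick the best-scoring enrolled speaker, then accept or reject it.

    Returns (speaker, normalised score); speaker is None when the segment
    is judged to come from an unknown (non-enrolled) speaker.
    """
    best = max(target_scores, key=target_scores.get)
    score = t_norm(target_scores[best], cohort_scores)
    return (best if score > threshold else None), score


if __name__ == "__main__":
    scores = {"spk_a": 4.0, "spk_b": 1.5}   # hypothetical LLRs vs. the UBM
    cohort = [0.0, 2.0]                     # hypothetical impostor scores
    print(open_set_identify(scores, cohort, threshold=2.0))
```

Normalising per test segment makes one global threshold usable across segments recorded under different conditions, which is the usual motivation for T-Norm.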

45 Some results on French broadcast news
ESTER2 data, French BN: 114 target speakers, 105 h training data, 6 h dev data
Metric: half total error rate (HTER)
System HTER-time HTER-speaker
Baseline % % --
Baseline+TNorm % -0.03% % -0.03%
FSC % 4.78% % 1.95%
FSC+TNorm % -9.97% % -4.12%
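The half total error rate is the average of the false acceptance rate and the false rejection rate. The sketch below assumes that HTER-time is computed from durations pooled over all targets, that HTER-speaker averages per-target HTERs, and that the second figure in each table cell is a change relative to the baseline; none of this is defined on the slide, and the numbers used are illustrative, not results from the talk.

```python
from typing import Sequence, Tuple

# (missed_target_time, total_target_time, false_alarm_time, total_nontarget_time)
Counts = Tuple[float, float, float, float]


def hter(miss: float, target_total: float,
         false_alarm: float, nontarget_total: float) -> float:
    """HTER = (false acceptance rate + false rejection rate) / 2."""
    frr = miss / target_total            # target speech that was rejected
    far = false_alarm / nontarget_total  # non-target speech that was accepted
    return 0.5 * (far + frr)


def hter_speaker(per_speaker: Sequence[Counts]) -> float:
    """Average of per-target HTERs, so every speaker counts equally."""
    return sum(hter(*c) for c in per_speaker) / len(per_speaker)


def relative_change(new: float, baseline: float) -> float:
    """Change relative to the baseline in percent (negative = improvement)."""
    return 100.0 * (new - baseline) / baseline


if __name__ == "__main__":
    counts = [(10.0, 100.0, 5.0, 50.0), (0.0, 100.0, 10.0, 50.0)]
    print(hter(*counts[0]), hter_speaker(counts), relative_change(9.0, 10.0))
```

Averaging the two error rates keeps the metric from being dominated by the much larger amount of non-target speech in broadcast data.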

46 Thanks for your attention! Questions?
