Pavel Matějka, Lukáš Burget, Petr Schwarz, Ondřej Glembek, Martin Karafiát and František Grézl

SpeakerID@Speech@FIT. Pavel Matějka, Lukáš Burget, Petr Schwarz, Ondřej Glembek, Martin Karafiát and František Grézl, November 13th 2006, FIT VUT Brno

Outline: the task of speaker ID / speaker verification; NIST 2005 data and evaluations; from easy to complex: the GMM basic system; dealing with channel and speaker variability; support vector machines (SVM): what is an SVM?, GMM supervectors, MLLR transforms; conclusions.

The task. Speaker recognition aims at recognizing "who said it". In speaker identification, the task is to assign a speech signal to one out of N speakers. In speaker verification, the claimed identity is known and the question to be answered is: "was the speaker really Mr. XYZ, or an impostor?"
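The verification question above reduces to a likelihood-ratio test: score the utterance against the claimed speaker's model and against a background model, and accept if the ratio exceeds a threshold. A minimal numpy sketch, assuming single diagonal-Gaussian models and synthetic data (the real system uses GMMs):

```python
import numpy as np

def gaussian_loglik(x, mean, var):
    """Per-frame log-likelihood under a diagonal-covariance Gaussian."""
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var), axis=1)

def verify(frames, spk_mean, ubm_mean, var, threshold=0.0):
    """Average log-likelihood ratio; accept the claimed identity if LLR > threshold."""
    llr = np.mean(gaussian_loglik(frames, spk_mean, var)
                  - gaussian_loglik(frames, ubm_mean, var))
    return llr, llr > threshold

rng = np.random.default_rng(0)
spk_mean = np.full(4, 1.0)      # toy speaker model
ubm_mean = np.zeros(4)          # toy background model
var = np.ones(4)
frames = rng.normal(spk_mean, 1.0, size=(200, 4))  # frames from the target speaker
llr, accept = verify(frames, spk_mean, ubm_mean, var)
```

Frames drawn near the speaker model give a positive LLR (accept); frames drawn from the background give a negative one (reject as impostor).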

What is it good for? Searching for information in private and public audio archives (lectures, presentations, meetings, TV shows, ...). Increasing the quality of service in call centers: speaker ID allows a known client to be detected in a few seconds. Security and defense: looking for a suspect in a large quantity of recordings, waiting on-line for a suspect, or verifying whether a given speech sample belongs to a suspect or not.

NIST evaluations. NIST is an agency of the US Government. Its speech group organizes regular evaluations of speech technologies (recognition, speaker detection, language recognition). All participating sites receive the same data and have limited time to submit results, which makes for an objective comparison. Results and system details are discussed at a workshop in Washington, D.C. (or another nice place).

Speech@FIT track in the NIST evals: Language Recognition, 2005. Speaker Recognition, 2006. LVCSR (Large Vocabulary Continuous Speech Recognition), 2005 and 2006. Keyword spotting (Spoken Term Detection): we will see in November 2006.

NIST 2005 data. Test data: NIST 2005 evaluation data, 646 speakers, ~3,000 target trials and ~28,000 impostor trials; 5-minute conversations (2.5 minutes per speaker). Train data: NIST 2004 evaluation data, 310 speakers, 10 utterances per speaker on average, 5-minute conversations (2.5 minutes per speaker), used for training the background model and the eigenchannel space.

Evaluation: DET curves. False alarm = it was not the target speaker, but the system said it was. Miss = it was the target speaker, but the system said it was not. Equal Error Rate (EER) = the operating point of the system where it makes the same number of false alarms and misses.
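The EER defined above can be computed directly from trial scores by sweeping a threshold until the miss rate and false-alarm rate cross. A small numpy sketch on synthetic scores, with trial counts mimicking the NIST 2005 setup:

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal error rate: sweep thresholds, find where miss rate = false-alarm rate."""
    tgt = np.sort(target_scores)
    imp = np.sort(impostor_scores)
    thresholds = np.concatenate([tgt, imp])
    thresholds.sort()
    # miss = fraction of target scores below the threshold
    miss = np.searchsorted(tgt, thresholds, side="left") / len(tgt)
    # false alarm = fraction of impostor scores at or above the threshold
    fa = 1.0 - np.searchsorted(imp, thresholds, side="left") / len(imp)
    i = np.argmin(np.abs(miss - fa))
    return (miss[i] + fa[i]) / 2

rng = np.random.default_rng(1)
tgt = rng.normal(2.0, 1.0, 3000)    # ~3,000 target trials, as in NIST 2005
imp = rng.normal(0.0, 1.0, 28000)   # ~28,000 impostor trials
rate = eer(tgt, imp)
```

For two unit-variance Gaussians separated by 2, the EER is near 16%; sweeping the same scores at all thresholds also yields the DET curve itself.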

Basic structure of the system

Signal preprocessing

Feature extraction - MFCC
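The slide only names MFCC, so the following is a from-scratch numpy sketch of the standard pipeline (frame, window, power spectrum, mel filterbank, log, DCT). The parameters (8 kHz telephone speech, 23 mel bands, 13 cepstra) are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, n_fft=256, hop=80, n_mels=23, n_ceps=13):
    """Minimal MFCC: frame -> Hamming window -> |FFT|^2 -> mel filterbank -> log -> DCT."""
    # slice the signal into overlapping windowed frames
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # triangular mel filterbank spanning 0 .. sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log-mel energies; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T

sig = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # 1 s of a 440 Hz tone
feats = mfcc(sig)  # one 13-dim cepstral vector per 10 ms frame
```

Deltas, double deltas and c0 (mentioned later in the deck) are appended on top of these base coefficients.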

GMM

Why a Universal Background Model (UBM)? There is not enough data to train a high-order GMM for each speaker. Instead, let's train one model on data from lots of speakers: it represents all speakers, and all parameters of the model are well estimated. The speaker model is then derived from the UBM by shifting only those parameters for which we have data.

MAP adaptation of UBM to target speaker I.

MAP adaptation of UBM to target speaker II.

MAP adaptation of UBM to target speaker III.
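The three MAP-adaptation slides can be condensed into code: relevance MAP moves each UBM mean toward the speaker's data in proportion to how much data that Gaussian actually saw, leaving data-starved Gaussians at their UBM values. A numpy sketch with a toy two-Gaussian UBM; the relevance factor r = 16 is a conventional choice, not stated on the slides:

```python
import numpy as np

def map_adapt_means(X, weights, means, variances, r=16.0):
    """Relevance-MAP adaptation of diagonal-covariance UBM means toward data X.
    alpha_c = n_c / (n_c + r) interpolates between the data mean and the UBM mean."""
    # posterior responsibilities gamma[t, c] of each Gaussian for each frame
    log_g = (np.log(weights)
             - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
             - 0.5 * (((X[:, None, :] - means) ** 2) / variances).sum(axis=2))
    gamma = np.exp(log_g - log_g.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    n_c = gamma.sum(axis=0)                       # soft frame counts per Gaussian
    first = gamma.T @ X                           # first-order statistics
    E = first / np.maximum(n_c[:, None], 1e-10)   # data mean per Gaussian
    alpha = (n_c / (n_c + r))[:, None]
    return alpha * E + (1 - alpha) * means

rng = np.random.default_rng(2)
means = np.array([[-2.0, 0.0], [2.0, 0.0]])       # toy 2-Gaussian UBM
variances = np.ones((2, 2))
weights = np.array([0.5, 0.5])
X = rng.normal([2.5, 0.5], 0.5, size=(300, 2))    # speaker data near Gaussian 1
adapted = map_adapt_means(X, weights, means, variances)
```

Gaussian 1, which accounts for essentially all the frames, shifts toward the speaker data; Gaussian 0 stays at its UBM position, exactly the "shift only the parameters for which we have data" behavior described earlier.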

Things to improve the GMM. (Chart: EER [%] plotted against the date of the experiment for the baseline, + RASTA, + short-time gaussianization, + double delta and + triple delta systems, compared with the best stand-alone and best fused systems.)

RASTA. RASTA = RelAtive SpecTrAl technique. The RASTA filter removes slowly varying, linear channel effects from the raw feature vectors: c_i(t) = h(t) * c_i(t). It removes near-DC components along with some higher-frequency components. Standard RASTA IIR filter: H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (z^-4 * (1 - 0.98 z^-1))
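The filter above can be sketched directly. The version below implements the causal form of H(z) (the z^4 advance in the numerator becomes a four-frame delay) and assumes the common 0.98 pole; it is applied along time to every cepstral dimension:

```python
import numpy as np

def rasta_filter(features):
    """Band-pass RASTA IIR filter over time, per cepstral dimension.
    Numerator 0.1 * [2, 1, 0, -1, -2], one pole at 0.98."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = 0.98
    T, D = features.shape
    out = np.zeros_like(features, dtype=float)
    for t in range(T):
        acc = np.zeros(D)
        for k in range(5):                 # FIR (numerator) part
            if t - k >= 0:
                acc += b[k] * features[t - k]
        if t >= 1:                         # IIR (denominator) feedback
            acc += a * out[t - 1]
        out[t] = acc
    return out

# a constant channel offset is a DC component: after the transient it is removed
feats = np.full((500, 13), 5.0)
filtered = rasta_filter(feats)
```

Since the numerator coefficients sum to zero, the filter has zero gain at DC, which is exactly the "remove near-DC components" property named on the slide.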

Short-time gaussianization
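The slide gives no detail, so here is a sketch of the usual sliding-window feature-warping scheme this name refers to: within a window around each frame, replace the frame's value by the standard-normal quantile of its local rank. The window length (301 frames) is an assumed, illustrative value:

```python
import numpy as np
from scipy.special import ndtri  # inverse of the standard-normal CDF

def st_gaussianize(x, win=301):
    """Short-time gaussianization of one feature stream: each frame's value is
    replaced by the normal quantile of its rank inside a sliding window."""
    half = win // 2
    out = np.empty(len(x), dtype=float)
    for t in range(len(x)):
        w = x[max(0, t - half):t + half + 1]
        # mid-rank of x[t] within its window (ties share rank)
        rank = np.sum(w < x[t]) + 0.5 * np.sum(w == x[t])
        out[t] = ndtri((rank + 0.5) / (len(w) + 1))
    return out

rng = np.random.default_rng(3)
skewed = rng.exponential(1.0, 2000)   # heavily non-Gaussian input stream
warped = st_gaussianize(skewed)       # locally warped to roughly N(0, 1)
```

Applied per dimension after RASTA, this makes the feature distribution approximately Gaussian regardless of channel, which is why it stacks with the other compensations in the EER chart.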

HLDA. HLDA provides a linear transformation that de-correlates the features...

HLDA (continued). ...and reduces the dimensionality while preserving the discriminative power of the features.

Things to improve the GMM. (Chart: EER [%] plotted against the date of the experiment, now adding + HLDA to the baseline, + RASTA, + short-time gaussianization, + double delta and + triple delta systems, compared with the best stand-alone and best fused systems.)

Feature mapping I. Feature mapping (FM) starts with a UBM trained on everything.

Feature mapping II. A channel-specific UBM is MAP-adapted from the general one.

Feature mapping III. Gaussian-dependent offsets are computed and removed from all features.

Things to improve the GMM. (Chart: EER [%] plotted against the date of the experiment, now adding + feature mapping to the baseline, + RASTA, + short-time gaussianization, + double delta, + triple delta and + HLDA systems, compared with the best stand-alone and best fused systems.)

Eigen-channel adaptation I. We want to find the direction(s) of highest variability among the supervectors obtained for different utterances from the same speaker: the eigenchannel(s).

Eigen-channel adaptation II. The direction is obtained by PCA of the average within-class covariance matrix, where each class consists of the supervectors belonging to one speaker. Eigen-channel = unwanted variability.

Eigen-channel adaptation III. During the test, we adapt the speaker model and the UBM by moving their supervectors in the direction of the eigen-channel(s) so that they fall optimally under the data.
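The estimation step from slide II can be sketched in a few lines: center each speaker's utterance supervectors, average the within-speaker covariances, and take the top eigenvectors. A toy numpy example where one known channel direction is planted in synthetic supervectors:

```python
import numpy as np

def eigenchannels(supervectors, speaker_ids, k=1):
    """Channel directions = top-k eigenvectors of the average within-speaker
    covariance of utterance supervectors."""
    sv = np.asarray(supervectors, dtype=float)
    dim = sv.shape[1]
    W = np.zeros((dim, dim))
    speakers = np.unique(speaker_ids)
    for spk in speakers:
        s = sv[speaker_ids == spk]
        c = s - s.mean(axis=0)          # remove the speaker's own mean
        W += c.T @ c / len(s)
    W /= len(speakers)
    vals, vecs = np.linalg.eigh(W)
    return vecs[:, np.argsort(vals)[::-1][:k]]

rng = np.random.default_rng(4)
channel_dir = np.array([1.0, 0.0, 0.0])   # planted nuisance direction
svs, ids = [], []
for spk in range(20):
    base = rng.normal(0.0, 1.0, 3)        # speaker location
    for _ in range(5):                    # 5 utterances per speaker
        svs.append(base + rng.normal(0.0, 2.0) * channel_dir
                   + rng.normal(0.0, 0.05, 3))
        ids.append(spk)
U = eigenchannels(np.array(svs), np.array(ids))
```

The recovered eigenvector aligns with the planted channel direction; in the real system the same PCA is run on NIST 2004 supervectors, and test-time adaptation moves models along the columns of U.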

Things to improve the GMM. (Chart: EER [%] plotted against the date of the experiment, now adding + eigen-channel compensation to the baseline, + RASTA, + short-time gaussianization, + double delta, + triple delta, + HLDA and + feature mapping systems, compared with the best stand-alone and best fused systems.)

Things to improve the GMM system (EER [%]):
Baseline GMM (MFCC + c0, zero-mean normalization, deltas, 2048 Gaussians): 26.6
+ RASTA channel compensation: 14.3
+ short-time Gaussianization: 12.4
+ acceleration coefficients: 11.2
+ triple deltas (bad for 2006): 10.6
+ HLDA (52 -> 39 dimensions): 9.8
+ feature mapping: 7.3
+ eigen-channel adaptation: 4.6

Systems based on SVM: What is an SVM? GMM/SVM. CMLLR-MLLR/SVM.

SVM: what is an SVM? Everybody knows neural networks: NN: y = f(x). (Diagram: a network mapping inputs x1, ..., xn to outputs y1, ..., ym.) An SVM is a neural network with just one neuron (a perceptron) and without a nonlinear output function.

Perceptron: basic definition. (Diagram: two classes in the 2D plane x1, x2, separated by a line.) The task should be linearly separable. The training algorithm and the error criterion are important!

Perceptron error criterion. Perceptron: E = (y - t)^2, where t is the target (wanted) value. This criterion is very sensitive to the balancing of the data sets and to the distribution of the data sets. (Diagram: y - t between the target values for class 1 and class 2.)

SVM error criterion. The SVM maximizes the margin between the two clusters and minimizes the structural error (the number of misclassified data points). (Diagram: two classes in the 2D plane x1, x2 with a maximum-margin boundary.)
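The margin criterion can be made concrete: a linear SVM can be trained by subgradient descent on the regularized hinge loss, where the L2 penalty maximizes the margin and the hinge term penalizes points inside it or on the wrong side. A self-contained numpy sketch on toy 2D data (the deck's actual systems use LIBSVM; this only illustrates the criterion):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=300):
    """Linear SVM via subgradient descent on
    lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))).
    Labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1             # margin violators
        grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(y)
        grad_b = -y[viol].sum() / len(y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([2, 2], 0.5, size=(50, 2)),     # class +1
               rng.normal([-2, -2], 0.5, size=(50, 2))])  # class -1
y = np.concatenate([np.ones(50), -np.ones(50)])
w, b = train_linear_svm(X, y)
accuracy = (np.sign(X @ w + b) == y).mean()
```

Unlike the squared-error perceptron of the previous slide, only margin violators contribute to the gradient, so points far on the correct side (and class imbalance among them) do not pull the boundary around.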

Linearly inseparable tasks. These can be solved by a nonlinear mapping of the features into a space of higher dimensionality. Dot products in high-dimensional spaces are called kernels K(x1, x2), but here we use just linear ones. (Diagram: mapping from 2D to 3D.)

Rank normalization. Helps to unify the dynamic range of the features: it normalizes an arbitrary distribution of input data to a uniform distribution between 0 and 1. The values in each dimension of the feature vector are replaced by their indexes in an array obtained by sorting all values over the data set, scaled to the interval between 0 and 1.
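The procedure on this slide is a direct mapping through the empirical CDF of the training data, which a few lines of numpy can show:

```python
import numpy as np

def rank_normalize(train, test):
    """Per dimension, replace each value by its index in the sorted training
    values, scaled to [0, 1] (the empirical CDF of the training data)."""
    sorted_train = np.sort(train, axis=0)
    out = np.empty(test.shape, dtype=float)
    for d in range(train.shape[1]):
        idx = np.searchsorted(sorted_train[:, d], test[:, d])
        out[:, d] = idx / len(sorted_train)
    return out

rng = np.random.default_rng(6)
train = rng.lognormal(0.0, 1.0, size=(1000, 3))   # arbitrary skewed distribution
normed = rank_normalize(train, train)             # roughly uniform on [0, 1]
```

Applied to the supervectors before the SVM, this keeps any single dimension with a wide dynamic range from dominating the linear kernel.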

SVM on GMM means. The GMM models consist of 512 Gaussian mixture components. Each Gaussian is represented by a mean vector, a variance vector and its weight. The means of all Gaussians are concatenated together to form one long "supervector". These supervectors are the features for the support vector machine.
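The construction is just a reshape; a small sketch, with one caveat: the sqrt(weight)/stddev scaling below is an assumption borrowed from the GMM-supervector SVM kernel of [Campbell2006], since the slide only states that the means are concatenated:

```python
import numpy as np

def gmm_supervector(weights, means, variances):
    """Stack the (optionally kernel-scaled) adapted Gaussian means into one
    long vector for the SVM. Shapes: weights (C,), means/variances (C, D)."""
    scaled = np.sqrt(weights)[:, None] * means / np.sqrt(variances)
    return scaled.reshape(-1)

# 512 Gaussians x 39-dim features -> a 512 * 39 = 19968-dim supervector,
# matching the dimension quoted later in the deck
weights = np.full(512, 1.0 / 512)
means = np.zeros((512, 39))
variances = np.ones((512, 39))
sv = gmm_supervector(weights, means, variances)
```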

SVM on MLLR transforms. CMLLR/MLLR speaker transforms. (Diagram: an LVCSR system takes speech and its transcription and produces the MLLR transforms.)

NAP. Nuisance Attribute Projection removes the unwanted variability from the features by projecting them onto the useful subspace.
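With orthonormal nuisance directions U (for example, the eigenchannels estimated earlier), the projection is a one-liner: subtract the component of the supervector lying in the span of U. A minimal sketch:

```python
import numpy as np

def nap_project(sv, U):
    """Nuisance attribute projection: (I - U U^T) sv, removing the component
    of a supervector that lies in the span of the nuisance directions U
    (columns assumed orthonormal)."""
    return sv - U @ (U.T @ sv)

# toy example: one nuisance direction along the first axis
U = np.array([[1.0], [0.0], [0.0]])
x = np.array([3.0, 2.0, 1.0])
cleaned = nap_project(x, U)   # the first component is zeroed out
```

The projected supervector is orthogonal to every nuisance direction, so channel variation along U can no longer influence the SVM's linear kernel.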

Summary of results on NIST 2005

Summary of results on NIST 2005 (EER [%]):
The best GMM: 4.62
SVM-GMM + NAP: 5.42
SVM-MLLR + NAP: 7.05
Fusion: 3.70

Comparison of results

Conclusions. We built three complementary systems for speaker verification and evaluated them in the NIST 2006 campaign, with very optimistic results.

References
[Mason2005] M. Mason et al.: Data-Driven Clustering for Blind Feature Mapping in SpkID, Eurospeech 2005.
[Chang2001] C. Chang et al.: LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm
[Hain2005] T. Hain et al.: The 2005 AMI System for RT05, Meeting Recognition Evaluation Workshop, Edinburgh, July 2005.
[Stolcke2005/6] A. Stolcke: MLLR Transforms as Features in SpkID, Eurospeech 2005 / Odyssey 2006.
[Brummer2004] N. Brummer: SDV NIST SRE 04 System Description, 2004.
[Brummer_FoCal] N. Brummer: FoCal: Toolkit for Fusion and Calibration, www.dsp.sun.ac.za/~nbrummer/focal
[Campbell2006] W. M. Campbell et al.: SVM Based Speaker Verification Using a GMM Supervector and NAP Variability Compensation, ICASSP 2006.

SVM-GMM details. Feature extraction and UBM adaptation are the same as for the GMM system, but with only 512 Gaussian components. Supervector dimension: 512 x 39 = 19968. NAP with 30 eigenvectors derived on 310 speakers from NIST 2004. Impostors: 230 speakers from NIST 2002 and 2606 speakers from Fisher. T-norm: 230 speakers from NIST 2002 and 800 speakers from Fisher.

SVM on CMLLR/MLLR [Stolcke2005/6]. The LVCSR system is adapted to the speaker (a VTLN factor and (C)MLLR transformations are estimated) using the ASR transcriptions provided by NIST. The AMI 2005(6) LVCSR system incorporates [Hain2005]: a 50k-word dictionary (pronunciations of OOVs generated by grapheme-to-phoneme conversion based on rules trained from data); PLP features with HLDA; CD-HMMs with 7500 tied states, each modeled by 18 Gaussians; discriminative training using MPE; speaker adaptation by VTLN and SAT based on CMLLR and MLLR.

SVM on CMLLR/MLLR, details. A cascade of CMLLR and MLLR is used. CMLLR: 2 classes (silence and speech). MLLR: 3 classes (silence and 2 speech classes derived from the data); the silence class is discarded for SRE. Supervector = 1 CMLLR + 2 MLLR transforms = 3 x 3 x 13^2 + 3 x 39 = 1638 dimensions (three 13 x 13 blocks per transform, plus the bias terms). NAP with 20 eigenvectors derived on NIST 2004. Impostors: 310 speakers from NIST 2004. T-norm: 310 speakers from NIST 2004.