U-NORM Likelihood Normalization in PIN-Based Speaker Verification Systems


D. Garcia-Romero, J. Gonzalez-Rodriguez, J. Fierrez-Aguilar, and J. Ortega-Garcia

Speech and Signal Processing Group (ATVS), Universidad Politécnica de Madrid (UPM)
dgromero@atvs.diac.upm.es, {jfierrez,jgonzalez,jortega}@diac.upm.es
http://www.atvs.diac.upm.es

Abstract. This paper presents a new likelihood normalization technique, named U-NORM, for speaker recognition systems based on short utterances. The new approach is compared with the widely used Z-NORM. Phonetic dependency between the speaker model and the test utterances is identified as the main obstacle to good Z-NORM performance. Experiments on a specifically acquired PIN-oriented database of real users show that the new technique performs better in PIN-based security applications. U-NORM provides a common likelihood scale for all system users, allowing speaker-independent thresholds that simplify the enrollment process and add robustness to PIN-based security applications.

1 Introduction

It is generally accepted that speaker recognition is one of the most feasible options for remote security applications, owing to the wide deployment of access points (landline, cellular telephone and Internet). This makes the voice signal a desirable biometric modality for any system that needs remote secure authentication. Furthermore, this ease of access enables fully automated remote acquisition of large databases, and consequently enough data to develop common benchmarks and obtain statistically significant assessments of speaker recognition technologies. A good example is the yearly NIST text-independent speaker recognition evaluation [1], whose baseline test is the most commonly used benchmark for speaker recognition systems. This baseline test provides 2 minutes of speech for speaker modeling and 30 seconds for test segments.

Although the yearly NIST evaluations have greatly contributed to the development of new speaker recognition algorithms, some of these technologies need to be tuned or even discarded when applied in a different framework. A good example is speaker recognition based on short utterances, where the amount of data available for training and testing generally differs significantly from the NIST case.

J. Kittler and M.S. Nixon (Eds.): AVBPA 2003, LNCS 2688, pp. 208-213, 2003. Springer-Verlag Berlin Heidelberg 2003

Besides the amount of data, a new artifact appears in the PIN framework due to the scarcity of acoustic information and of phonetic variability in the training data. This effect results in a strong dependency between the speaker model and the phonetic content of the speaker's PIN, even though the technology used (GMM [2]) is text independent. This artifact may be considered beneficial as long as the PIN is known only to the client, which is the most common scenario.

One of the most useful and common techniques that is strongly affected by the phonetic dependency of the speaker model is Z-NORM likelihood normalization [2]. This approach provides a common likelihood scale for all system clients by normalizing the speaker likelihoods with the a priori estimated mean and variance $\{\mu_{IMP}, \sigma_{IMP}\}$ of a generic set of impostors selected in the development phase. The a priori statistics are computed once for each client during the enrollment phase and then used to normalize the likelihoods of the speaker utterances. If the a priori estimates fit the real distribution well, this strategy yields an approximately zero-mean, unit-variance distribution for the impostors' likelihoods and higher likelihoods for the client's utterances (the more similar the test utterance is to the training utterances, the higher the likelihood).

Because of the phonetic dependency in PIN-based applications, two different classes of impostors may be considered: real impostors (those who know the client's PIN) and casual impostors (those who have no knowledge of the client's PIN). Since Z-NORM attempts to normalize likelihoods based on a priori knowledge of the impostor distribution, new considerations must be taken into account for PIN-based systems.
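As a rough illustration of the Z-NORM procedure described above, the following Python sketch estimates the impostor statistics for one client model and applies them to raw scores. The function names and all score values are hypothetical and are not taken from the paper; the sketch also assumes that the normalization divides by the standard deviation, as is conventional for Z-NORM.

```python
from statistics import mean, stdev


def znorm_stats(impostor_scores):
    """Estimate (mu_imp, sigma_imp) for one client model from impostor scores."""
    # Assumption: sigma is the sample standard deviation of the impostor scores.
    return mean(impostor_scores), stdev(impostor_scores)


def znorm(raw_score, mu_imp, sigma_imp):
    """Map a raw likelihood onto the impostor-normalized scale."""
    return (raw_score - mu_imp) / sigma_imp


# Hypothetical raw scores against ONE client model (illustrative values only).
casual_impostor_scores = [-1.2, -0.8, -1.0, -1.4, -0.6]  # impostors who do not know the PIN
real_impostor_score = 0.3                                 # impostor who utters the client's PIN
client_score = 1.5                                        # genuine client utterance

mu_imp, sigma_imp = znorm_stats(casual_impostor_scores)
normalized_casual = [znorm(s, mu_imp, sigma_imp) for s in casual_impostor_scores]
normalized_real = znorm(real_impostor_score, mu_imp, sigma_imp)
normalized_client = znorm(client_score, mu_imp, sigma_imp)
```

With these made-up numbers the casual-impostor scores end up with zero mean and unit variance by construction, while the real-impostor score lands well above zero, which is the kind of mismatch the two impostor classes can introduce.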
The main reason likelihood normalization matters is that it makes speaker-independent thresholds possible, which provides two major benefits: a simpler enrollment process [3] and reduced storage requirements in the recognition system.

Given these advantages and the need for new considerations in PIN-based frameworks, this paper reports a series of comparative experiments on likelihood normalization techniques and proposes a new algorithm, named U-NORM, that takes into account the specifics of PIN-based speaker recognition systems.

2 System Description

Current speaker recognition systems rely almost exclusively on short-time acoustic information. UBM-MAP-adapted Gaussian Mixture Models [4] represent the state-of-the-art technique in text-independent speaker recognition, achieving very good performance, although that performance is conditioned by the acoustic environment.

2.1 Baseline System

The baseline system is based on our MAP-GMM system used in the 2002 NIST evaluation [4]. A gender-independent 512-mixture UBM is trained with approximately one hour of gender-balanced microphone speech acquired in the same conditions as the database utterances (Section 3.1). Feature vectors consist of 19 MFCC + 19 ΔMFCC obtained from a 20 ms Hamming window shifted every 10 ms. Target speaker models are trained via MAP adaptation of the UBM with 10 iterations. Channel compensation is performed by means of Cepstral Mean Normalization (CMN), and UBM normalization is applied to the speaker likelihoods.

2.2 U-NORM

As stated above, a common likelihood scale for all speakers is desirable, since speaker-independent thresholds have many advantages. PIN-based applications add new considerations to the likelihood normalization procedure, since two different kinds of impostors must be considered. For Z-NORM to be used properly, the subset of impostors used to calculate the a priori statistics must know the PINs of all system clients, because the essence of the algorithm lies in the estimation of real-impostor likelihoods. This is impracticable in online systems, since the a priori subset of impostors has no knowledge of the clients' PINs in the development phase. As a result, only casual impostors can be used in the a priori estimation of impostor likelihoods, which produces a mismatch between the estimated likelihoods and the real-impostor distribution. The cause of this mismatch is the implicit phonetic dependency between the client model and the real-impostor utterance, which yields a higher likelihood than in the casual-impostor case. Hence, real impostors will score higher than the estimated impostor distribution, increasing the risk of likelihoods above the chosen threshold.

Consequently, impostor-based likelihood normalization techniques do not seem to fit PIN-based applications, since the likelihoods of real impostors remain unknown in the development phase. A new approach is to replace impostor-based likelihood normalization with user-based normalization.
This technique has been named U-NORM and is performed in two steps:

1. Estimation of the outcomes of the client model $q$ with a subset of the client utterances, calculating the mean and variance of the likelihood distribution $\{\mu_q, \sigma_q\}$.

2. During the testing phase, after the baseline system outputs the raw likelihood $\Lambda(X|q)$, the following normalization is performed:

$$\Lambda_{UNORM}(X|q) = \frac{\Lambda(X|q) - \mu_q}{\sigma_q}. \qquad (1)$$

Therefore, in the enrollment phase (either single-session or multi-session), some client utterances are used for training and others for U-NORM normalization.

Table 1. Database structure

# Sessions    1    2    3    4    5
Clients      10    6   12    0   19
Impostors    11    7   12    0    0
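The two U-NORM steps of Section 2.2 and Eq. (1) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the score values and the threshold are hypothetical, and the sketch assumes that the normalization divides by the standard deviation of the held-out client scores, which requires at least two held-out utterances.

```python
from statistics import mean, stdev


def unorm_stats(held_out_client_scores):
    """Step 1: estimate (mu_q, sigma_q) from client utterances not used for training."""
    # Assumption: sigma_q is the sample standard deviation (needs >= 2 scores).
    return mean(held_out_client_scores), stdev(held_out_client_scores)


def unorm(raw_score, mu_q, sigma_q):
    """Step 2: apply Eq. (1), (Lambda(X|q) - mu_q) / sigma_q."""
    return (raw_score - mu_q) / sigma_q


def accept(raw_score, mu_q, sigma_q, threshold):
    """Speaker-independent decision: the same threshold is used for every client."""
    return unorm(raw_score, mu_q, sigma_q) >= threshold


# Hypothetical raw scores of one client's held-out PIN utterances (illustrative only).
held_out = [1.4, 1.8, 1.6, 2.0, 1.2]
mu_q, sigma_q = unorm_stats(held_out)

THRESHOLD = -3.0  # hypothetical global threshold on the U-NORM scale

client_decision = accept(1.5, mu_q, sigma_q, THRESHOLD)    # score close to the client's own mean
impostor_decision = accept(0.3, mu_q, sigma_q, THRESHOLD)  # score far below the client's own mean
```

Because every client's scores are centered on that client's own statistics, only the pair (mu_q, sigma_q) has to be stored per client, and a single threshold can be shared by all of them.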

3 Experiments

3.1 Database

A total of 47 speakers are included in this database, which was acquired specifically for the assessment of PIN-based applications. All speaker utterances were collected within a one-month period using a Plantronics USB headset microphone. Up to 5 different sessions were used to collect the data. In the first session all the speakers were asked to utter 5 repetitions of their PIN, a number of real-impostor trials of their choosing, and then two more repetitions of their own PIN. In subsequent sessions only two utterances of their own PIN and two real-impostor utterances were requested. Neither the number of sessions in which each speaker took part nor the number of real-impostor trials is constant. It is important to note that when the speakers acted as impostors they knew not only the PIN but also the way the client uttered it. Table 1 shows the number of clients and impostors that attended one, two, three, four or five sessions.

3.2 Results

All the experiments were performed with the 47 system clients. The speaker models were trained with three PIN utterances in two different training conditions, namely mono-session (first-session utterances) and multi-session (utterances from different sessions). To assess system performance, false alarm probabilities were always computed with all the real-impostor utterances, whereas miss probabilities were computed in two different conditions, namely mono-session (first-session utterances of the client) and multi-session (client utterances from different sessions). Combining all training and testing conditions yields four possibilities, but only three of them were considered, since multi-session training with mono-session testing does not provide any interesting information.
Three normalization techniques are compared by applying them to the raw likelihoods generated by the baseline system:

1. Z-NORM with casual impostors, also called a priori since no knowledge of the client's PIN is necessary. 51 speakers were used as impostors for all clients.

2. Z-NORM with real impostors, also called a posteriori since knowledge of the client's PIN is necessary. This approach is not valid for online systems, but the results are computed to highlight the necessary distinction between real impostors and casual impostors.

3. U-NORM with the client utterances not used for training.

Table 2 summarizes the results of all the experiments in terms of equal error rate (EER). Baseline system performance is also shown in order to make the relative improvement of the likelihood normalization techniques apparent.
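For readers unfamiliar with the metric, the EER is the error rate at the operating point where the false alarm and miss probabilities are equal. The sketch below shows one simple way to approximate it from two lists of scores by sweeping the threshold over the observed values; it is an illustrative approach with hypothetical function names and score values, not the evaluation code used in the paper.

```python
def equal_error_rate(client_scores, impostor_scores):
    """Return (eer, threshold) at the point where miss and false-alarm rates are closest."""
    best = None
    # Candidate thresholds: every observed score. A trial is accepted when score >= threshold.
    for threshold in sorted(set(client_scores) | set(impostor_scores)):
        miss = sum(s < threshold for s in client_scores) / len(client_scores)
        false_alarm = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(miss - false_alarm)
        if best is None or gap < best[0]:
            # Report the average of the two rates as the EER estimate at this threshold.
            best = (gap, (miss + false_alarm) / 2, threshold)
    return best[1], best[2]


# Hypothetical normalized scores (illustrative values only).
client_scores = [2.0, 1.5, 1.0, 0.2]
impostor_scores = [0.5, -0.5, -1.0, -2.0]
eer, eer_threshold = equal_error_rate(client_scores, impostor_scores)
```

With so few trials the estimate is coarse; on real data the two error curves are usually interpolated, or read off a DET plot.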

Table 2. Experimental results in terms of EER (%) for different likelihood normalization techniques and training-testing conditions

Train-Test condition   None   Z-NORM a priori   Z-NORM a posteriori   U-NORM
Mono-Mono              4 %    18 %              4 %                   1.5 %
Mono-Multi             7 %    20 %              5 %                   3.5 %
Multi-Multi            5 %    19.5 %            5 %                   2 %

Figure 1 depicts DET plots for Z-NORM a posteriori and U-NORM in order to allow a more exhaustive comparison over all possible system operating points.

Fig. 1. DET plots for the Z-NORM a posteriori (top) and U-NORM (bottom) techniques in three different training and testing conditions

As shown in Table 2, the U-NORM technique provides the best results in the PIN-based experiments. Z-NORM a posteriori also performs well, but it is not practicable in online systems. Z-NORM a priori performs worst, even worse than the raw likelihoods. This is because casual-impostor statistics are not representative of real-impostor likelihoods.
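The relative improvement mentioned above can be made explicit with a small calculation. The sketch below copies the EER values from Table 2 and computes the relative EER reduction with respect to the unnormalized baseline; the definition of relative reduction used here, (baseline - normalized) / baseline, is an assumption of this sketch and is not stated in the paper.

```python
# EER values (%) copied from Table 2, in the column order:
# (None, Z-NORM a priori, Z-NORM a posteriori, U-NORM).
TABLE_2 = {
    "Mono-Mono": (4.0, 18.0, 4.0, 1.5),
    "Mono-Multi": (7.0, 20.0, 5.0, 3.5),
    "Multi-Multi": (5.0, 19.5, 5.0, 2.0),
}


def relative_reduction(baseline_eer, normalized_eer):
    """Relative EER reduction in percent; a negative value means the EER got worse."""
    return 100.0 * (baseline_eer - normalized_eer) / baseline_eer


# Relative change with respect to the "None" column for two of the techniques.
unorm_gain = {cond: relative_reduction(row[0], row[3]) for cond, row in TABLE_2.items()}
apriori_gain = {cond: relative_reduction(row[0], row[1]) for cond, row in TABLE_2.items()}
```

Under this definition, U-NORM reduces the baseline EER by 62.5 %, 50 % and 60 % in the Mono-Mono, Mono-Multi and Multi-Multi conditions respectively, whereas the Z-NORM a priori values are negative in all three conditions.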

4 Conclusions

This paper reported a series of comparative experiments on likelihood normalization techniques and proposed a new algorithm, named U-NORM, that takes into account the specific considerations of PIN-based security applications. From the results obtained in all the experiments we may conclude that the U-NORM technique provides excellent results for PIN-based applications, allowing the use of a common likelihood scale for all system clients and enabling speaker-independent thresholds that considerably simplify the enrollment process. The Z-NORM technique only performs correctly when used with real-impostor statistics (a posteriori), which are not available in online applications. Casual-impostor statistics are available in online applications but perform poorly because of the phonetic dependency inherent in PIN-based applications.

Acknowledgements

This work has been supported by the Spanish Ministry for Science and Technology under project TIC2000-1669-C04-01. J. F.-A. also thanks Consejeria de Educacion de la Comunidad de Madrid and Fondo Social Europeo for supporting his doctoral research.

References

[1] NIST 2003 Speaker Recognition Evaluation Plan, at http://www.nist.gov/speech/tests/spk/2003.
[2] Douglas A. Reynolds et al., Speaker Verification Using Adapted Gaussian Mixture Models, Digital Signal Processing, vol. 10, pp. 19-41 (2000).
[3] J.-B. Pierrot et al., A Comparison of a Priori Threshold Setting Procedures for Speaker Verification in the CAVE Project, ICASSP '98.
[4] D. García-Romero et al., ATVS-UPM Results and Presentation at NIST 2002 Speaker Recognition Evaluation, Vienna, VA, 2002.