IDIAP Research Report, 26th April 2004


RESEARCH REPORT IDIAP

Posteriori Probabilities and Likelihoods Combination for Speech and Speaker Recognition

Mohamed Faouzi BenZeghiba (a,b)   Hervé Bourlard (a,b)

IDIAP-RR 04-23, 26th April 2004

Dalle Molle Institute for Perceptual Artificial Intelligence, P.O. Box 592, Martigny, Valais, Switzerland
phone +41 27 721 77 11, fax +41 27 721 77 12, e-mail secretariat@idiap.ch, internet http://www.idiap.ch

(a) Institut Dalle Molle d'Intelligence Artificielle Perceptive (IDIAP), Martigny
(b) Swiss Federal Institute of Technology at Lausanne (EPFL), Switzerland

IDIAP Research Report 04-23

Posteriori Probabilities and Likelihoods Combination for Speech and Speaker Recognition

Mohamed Faouzi BenZeghiba, Hervé Bourlard

26th April 2004

Abstract. This paper investigates a new approach to performing simultaneous speech and speaker recognition. The likelihood estimated by a speaker identification system is combined with the posterior probability estimated by the speech recognizer, so that the joint posterior probability of the pronounced word and the speaker identity is maximized. A comparative study with other standard techniques is carried out in three different applications: (1) closed-set speech and speaker identification, (2) open-set speech and speaker identification, and (3) speaker quantization in speaker-independent speech recognition.

1 Introduction

The speech signal conveys (among other things) two major types of information: the speech content (text) and the speaker characteristics. Speech recognition systems aim to extract the lexical information from the speech signal, while speaker recognition systems aim to recognize (identify or verify) the speaker. Joint speech and speaker recognition systems aim to recognize simultaneously who is speaking and what was said. Such systems have several applications:

1. Speaker identification can be used as a front-end processor to a speech recognition system [1], and vice versa [2].

2. Performing continuous speaker recognition and knowledge/content recognition [3][4][5].

3. Automatic recognition of co-channel speech [2], where more than one speaker is speaking at the same time.

In this paper, a probabilistic approach to joint speech and speaker recognition is proposed. It is based on the combination of a likelihood-based speaker identification system with a posterior-probability-based speech recognizer. The approach is evaluated in three applications: closed-set speech and speaker identification (i.e., access is restricted to speakers enrolled in the system), open-set speaker identification (i.e., any speaker can access the system, and those that are not enrolled should be rejected), and speaker quantization for speaker-independent speech recognition [1] (i.e., the speech recognizer associated with the most likely speaker, as determined by the speaker identification system, is used to recognize the utterance pronounced by the speaker). In the closed-set and open-set experiments, our goal is to recognize correctly both the speaker identity and the command associated with a specific service; a typical application is a voice dialing system. The combination must be carried out for every enrolled speaker, which makes the computational requirements very costly. We will show how to reduce this cost without affecting recognition performance.
We also compare the proposed approach, in all these situations, with two other standard approaches.

2 Formulation

Our goal is to find the word (command) Ŵ from a finite set of possible words {W} and the speaker Ŝ from a finite set of registered speakers {S} that maximize the joint posterior probability P(Ŵ, Ŝ | X). Formally, this is expressed as follows:

    (Ŵ, Ŝ) = arg max_{W,S} P(W, S | X) = arg max_{W,S} [P(W | S, X) · P(S | X)]   (1)

Taking the logarithm, and using Bayes' rule with the assumption that the prior probability of the speaker P(S) is uniform over all speakers, equation (1) can be rewritten as:

    (Ŵ, Ŝ) = arg max_{W,S} [log P(W | S, X) + log P(X | S)]   (2)

The first term, log P(W | S, X), corresponds to the posterior probability of the word W, estimated in our case through a speaker-dependent hybrid HMM/ANN with parameters θ_s as follows:

    log P(W | S, X) = (1/T) Σ_{t=1}^{T} log p(q_k^t | x_t, θ_s)   (3)

where q_k^t represents the optimal state q_k decoded at time t along the Viterbi path, and T is the length of X after removing the decoded silence frames.
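For intuition, criterion (2) amounts to a joint arg max over a grid of combined scores once the per-speaker word posteriors and speaker likelihoods are available. The following is a minimal sketch in Python with NumPy; the scores are randomly generated stand-ins, not real system outputs:

```python
import numpy as np

# Hypothetical scores for 3 enrolled speakers and 4 vocabulary words.
# log_post[s, w] plays the role of log P(W_w | S_s, X) from the HMM/ANN;
# log_lik[s] plays the role of log P(X | S_s) from the speaker GMM.
rng = np.random.default_rng(0)
log_post = np.log(rng.dirichlet(np.ones(4), size=3))  # rows are word posteriors
log_lik = np.array([-52.1, -48.7, -50.3])

# Criterion (2): maximize log P(W | S, X) + log P(X | S) jointly over (W, S).
joint = log_post + log_lik[:, None]
s_hat, w_hat = np.unravel_index(np.argmax(joint), joint.shape)
```

In practice the grid is not evaluated exhaustively over all enrolled speakers; the likelihood term is used first to shortlist candidates, which is what keeps the cost manageable.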

The second term, log P(X | S), corresponds to the likelihood of the observed data, estimated by a text-independent GMM with parameters λ_s:

    log P(X | S) = (1/T) Σ_{t=1}^{T} log p(x_t | λ_s)   (4)

The terms log P(W | S, X) and log P(X | S) represent, respectively, the contributions of the speech and speaker recognition systems to the combined score. Evaluating (2) for all registered speakers is time consuming. Therefore, we generate a list of the N best speakers according to the likelihood criterion (4) and then re-score this list using (2).

3 Database and Experimental Setup

The experiments were conducted on the PolyVar database [6]. For the closed-set experiments, a set of 19 speakers (12 males and 7 females) who participated in more than 26 sessions was selected. Each session consists of one repetition of the same set of 17 words, common to all speakers. For each speaker, the first 5 sessions are used as training (adaptation) data and an average of 19 sessions as test data, resulting in a total of 6430 test utterances. For the open-set experiments, another set of 19 speakers pronouncing the same set of words is used as impostors, giving a total of 6452 impostor test utterances. As acoustic features, 12 MFCC coefficients with energy and their first derivatives were calculated every 10 ms over a 30 ms window. We also used the PolyPhone database [6] to train a speaker-independent (SI) speech recognizer and a Gaussian mixture model (GMM); these serve only as initial distributions for speaker adaptation. The SI speech recognizer is a hybrid HMM/MLP system [9] with parameters Θ. This SI-MLP has 234 input units (9 consecutive 26-dimensional acoustic vectors), 600 hidden units and 36 outputs. The GMM, with parameters Λ, consists of 240 diagonal-covariance Gaussians trained with the EM algorithm.
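Criterion (4), the per-frame averaged log-likelihood under a diagonal-covariance GMM such as the 240-component model above, can be sketched as follows. This is an illustration under our own naming, not the actual IDIAP implementation:

```python
import numpy as np
from scipy.special import logsumexp

def avg_gmm_loglik(X, weights, means, variances):
    """Criterion (4): (1/T) * sum_t log p(x_t | lambda_s) for a
    diagonal-covariance GMM.
    X: (T, D) feature frames; weights: (K,); means, variances: (K, D)."""
    diff = X[:, None, :] - means[None, :, :]                    # (T, K, D)
    # per-frame, per-component Gaussian log-density (diagonal covariance)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_frame = logsumexp(log_comp + np.log(weights), axis=1)   # (T,)
    return log_frame.mean()
```

The per-frame averaging makes scores from utterances of different lengths comparable, which matters when the same score is later combined with a word posterior.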
4 Speech and Speaker Recognition Approaches

In this work, we are interested in correctly recognizing both the pronounced word and the speaker identity for each test utterance. We examined and compared three techniques. All of them use the same text-independent, GMM-based speaker identification subsystem, but they employ different speech recognizers and different ways of integrating the speech and speaker recognizers.

4.1 Speaker identification subsystem

The speaker identification subsystem is text-independent and GMM-based [7]. The parameters λ_s of each speaker-dependent GMM are derived (using the speaker's training data) by adapting the mean parameters of the mixture components of the GMM Λ. The adaptation is performed using the MAP adaptation technique [8]. The correct speaker identification rate for the closed set is 95.9%.

4.2 Baseline approach

Here the speech recognizer is a speaker-independent hybrid HMM/MLP system. The parameters θ of the MLP are derived by re-training all the parameters Θ of the SI-MLP trained on PolyPhone. The re-training uses data (referred to as world data) from PolyVar, provided by 56 speakers pronouncing the same set of 17 words; cross-validation is used to avoid overtraining. The word recognition rate is 97.2% and 96.8% for the closed-set and open-set applications, respectively. The recognition of the pronounced word and the identification of the speaker are done independently. In the closed-set application, this is done as follows:

    Ŵ = arg max_{W} [log P(W | θ, X)]   (5)

    Ŝ = arg max_{S} [log P(X | λ_s)]   (6)

In the open-set application, the speaker is accepted if:

    LLR(X) = log P(X | λ_Ŝ) − log P(X | λ̄) ≥ δ   (7)

where LLR(X) is the likelihood ratio, δ is a speaker- and word-independent threshold, λ_Ŝ is the GMM of the most likely speaker Ŝ according to (4), and λ̄ is the background model, whose parameters are derived from Λ by MAP adaptation on the world data set.

4.3 Sequential approach

Here, speech recognition is performed with a speaker-dependent HMM/MLP. The parameters θ_s of the MLP are derived by re-training the parameters Θ of the SI-MLP on the speaker's training data. The most likely speaker, determined by the speaker identification subsystem using (6), selects the SD-MLP used for speech recognition. The speaker, in both the closed-set and open-set applications, is identified as in the baseline approach, using (6) and (7), respectively. The recognition of the pronounced word is performed as follows:

    Ŵ = arg max_{W} [log P(W | θ_Ŝ, X)]   (8)

where θ_Ŝ is the set of parameters of the MLP associated with the most likely speaker Ŝ. With perfect recognition of the speaker identity, the word recognition rate is 98.9%. The main advantage of this approach over the baseline is the gain in speech recognition performance, which should in turn improve simultaneous speech and speaker recognition. A system using this approach can be viewed as performing speaker quantization before speech recognition [1]: the speaker identification subsystem maps a new speaker to the most similar speaker in the enrolled speaker set.
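The open-set decision rule of (6) and (7), i.e., pick the most likely enrolled speaker and accept only if the log-likelihood ratio against the background model clears the threshold δ, can be sketched as follows. The function name and the numeric values are ours, chosen purely for illustration:

```python
import numpy as np

def accept_speaker(loglik_per_speaker, loglik_background, delta):
    """Criteria (6)-(7): select the most likely enrolled speaker, then
    accept only if the likelihood ratio against the background model
    reaches the speaker- and word-independent threshold delta."""
    s_hat = int(np.argmax(loglik_per_speaker))          # criterion (6)
    llr = loglik_per_speaker[s_hat] - loglik_background  # LLR(X) in (7)
    return (s_hat, llr) if llr >= delta else (None, llr)

# Hypothetical averaged log-likelihoods for three enrolled speakers.
speaker, llr = accept_speaker(np.array([-51.2, -49.8, -50.5]), -50.6, delta=0.5)
```

Raising δ trades false acceptances for false rejections, which is exactly the threshold sweep reported in the open-set experiments.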
4.4 Combined approach

Adapting the MLP to a specific speaker shifts the boundaries between the phone classes without strongly affecting the posterior probabilities of the speech sounds of other speakers. This makes the estimated posterior probabilities more effective for speech recognition but less effective for speaker recognition [10][11]. Nevertheless, these posterior probabilities can still improve speaker recognition performance when combined with more speaker-specific information. In the closed-set application, the recognition of both the pronounced word and the speaker identity is performed as follows:

    (Ŵ, Ŝ) = arg max_{W,S} [log P(W | θ_s, X) + log P(X | λ_s)]   (9)

Since the speaker identification and speaker-dependent speech recognition subsystems are trained with two different criteria, the posterior probability and likelihood scores estimated by each subsystem may carry complementary (new) information that can improve the performance of each individual subsystem, and of the joint speech and speaker recognition. Criterion (9) must be evaluated for every speaker and every word, which makes the computational requirements very costly: compared to the two previous approaches, we need an additional (N − 1) times the cost of a speech recognition pass, where N is the number of enrolled speakers. To reduce this cost, we first generate a list of the N best candidates using text-independent speaker identification (6). Then, for each speaker S in the list, we use the SD-MLP θ_s for speech recognition. Finally, we re-score the N-best list according to the combined likelihood and posterior probability scores using (9). This procedure generates a new N-best list, in which the most likely speaker is selected according to the following combined criterion:

    Ŝ = arg max_{S} [log P(W | θ_s, X) + log P(X | λ_s)]   (10)

For the open-set application, the goal is to detect an impostor and reject him/her, independently of what he/she pronounces. The criterion to accept a speaker is defined as follows:

    log P(W | θ_Ŝ, X) + log P(X | λ_Ŝ) − log P(X | λ̄) ≥ δ   (11)

which is equivalent to:

    log P(W | θ_Ŝ, X) + LLR(X) ≥ δ   (12)

In this work, we use a linear combination to combine posterior probabilities and likelihoods. The combined scores in (9) and (12) are then estimated, respectively, as:

    (Ŵ, Ŝ) = arg max_{W,S} [α₁ log P(W | θ_s, X) + log P(X | λ_s)]   (13)

    α₂ log P(W | θ_Ŝ, X) + LLR(X) ≥ δ   (14)

where α₁ and α₂ are determined a posteriori on the test set. We can also use (10) as a speaker selection criterion to improve the performance of the speaker quantizer.

5 Experiments and Results

The aim of these experiments is to evaluate, analyze and compare the effectiveness of the three approaches described above in three different tasks: closed-set speaker identification, open-set speaker identification and speaker quantization. In the first two tasks, our interest is in improving the simultaneous speaker and speech recognition performance; in the results, this is referred to as the overall recognition rate. In the third task, our aim is to improve speaker-independent speech recognition performance in an open-set application.

5.1 Closed-set results

The results of the closed-set experiments are shown in Table 1.
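Before turning to the numbers, the two-stage search of Section 4.4 (an N-best speaker shortlist from the likelihood criterion (6), re-scored with the weighted combination (13)) can be sketched as follows; the scores, the α value and the helper name are illustrative, not taken from the system:

```python
import numpy as np

def rescore_nbest(log_lik, word_scores, n_best=2, alpha=1.0):
    """Two-stage search: shortlist the n_best speakers by GMM likelihood
    (criterion (6)), then pick (speaker, word) maximizing the weighted
    combined score of criterion (13).
    log_lik[s] ~ log P(X | lambda_s); word_scores[s, w] ~ log P(W_w | theta_s, X);
    alpha plays the role of alpha_1."""
    shortlist = np.argsort(log_lik)[::-1][:n_best]   # top-N by likelihood
    return max(((s, w, alpha * word_scores[s][w] + log_lik[s])
                for s in shortlist
                for w in range(len(word_scores[s]))),
               key=lambda t: t[2])                   # (speaker, word, score)

# Hypothetical scores for 3 speakers and 2 words.
log_lik = np.array([-50.0, -49.0, -52.0])
word_scores = np.array([[-1.2, -0.3], [-0.9, -1.5], [-0.1, -2.0]])
s, w, score = rescore_nbest(log_lik, word_scores, n_best=2)
```

With n_best equal to the number of enrolled speakers this reduces to the exhaustive criterion (9); shrinking the shortlist is what removes the (N − 1) extra recognition passes.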
It gives the performance of each approach in terms of speech, speaker and overall recognition. It is worth noting that the best achievable overall recognition rate equals the lower of the recognition rates of the speech and speaker subsystems. From these results, we can see that:

                 Baseline   Sequential   Combined
  Speech Reco.   97.2%      98.7%        98.7%
  Speaker Reco.  95.9%      95.9%        96.8%
  Overall Reco.  93.4%      95.1%        95.9%

Table 1: Speech, speaker and overall recognition rates for the different approaches

1. The sequential approach gives a better overall recognition rate than the baseline approach. This is due mainly to the improvement in the speech recognition rate; from the computational cost point of view, both approaches are equal. It is interesting to note that the speech recognition rate of the sequential approach (98.7%) is almost equal to that obtained with perfect speaker identification (98.9%). This means that the hybrid HMM/MLP model θ_s of speaker S still recognizes the pronounced word correctly even if the speech segment comes from another, mis-identified speaker.

2. Compared to the other approaches, the combined posterior probability and likelihood criterion (9) gives the best overall recognition rate (95.9%). As a consequence, the speaker identification performance is also improved. This is because two speakers who are acoustically close in the speaker space are not necessarily close in the speech space, so selecting speakers based on only one of these two components is not optimal.

In Figure 1 we plot the speech, speaker and overall recognition rates as a function of the size of the N-best candidate list. It shows that the most significant improvement is obtained by keeping the two most likely speakers according to (6) and then using (9) for re-scoring. From the computational cost point of view, the combined approach needs only one additional speech recognition pass, whose cost depends on the size of the MLP and the length of the pronounced word.

[Figure 1: Speaker, speech and overall recognition rates as a function of the size of the N-best candidate list]

5.2 Open-set results

To evaluate our approach in a more practical setting, open-set experiments were conducted. The goal is to detect an unknown speaker (impostor) and reject him or her.
Three types of errors are considered here [5]: false acceptance (FA), false rejection (FR) and confusion acceptance (CA), i.e., when an authorized speaker is accepted but confused with another speaker. Figures 2 and 3 plot these errors as a function of the decision threshold for the sequential¹ and combined approaches, respectively. The EERs (FA = FR) obtained by the sequential and combined approaches are 14.5% and 13.1%, respectively. Moreover, the combined approach reduces the confusion acceptance errors. If we consider only the true speakers that have been accepted, the overall recognition rates with the baseline, sequential and combined approaches are 82%, 83.8% and 85.7%, respectively, confirming the tendency observed in the closed-set application.

¹ Both the baseline and sequential approaches use the same speaker identification criterion (7) in the open-set test.

[Figure 2: False acceptance, false rejection and confusion acceptance as a function of the threshold for the baseline and sequential approaches]

[Figure 3: False acceptance, false rejection and confusion acceptance as a function of the threshold for the combined approach]

5.3 Speaker quantizer results

In this experiment, we evaluate the use of the sequential and combined approaches to perform speaker-independent speech recognition in an open-set application. This is done by selecting the enrolled speaker who is acoustically closest to the test speaker and using the speech recognizer associated with that speaker to recognize the pronounced word. The main issue is the criterion used to select the closest enrolled speaker; we tested the two criteria described in (6) and (10). Speech recognition results are shown in Table 2. For comparison purposes, the average performance obtained by using each single enrolled speaker's recognizer is also reported (second column). Only impostor utterances were used (6452 utterances).

                 Single speaker   Sequential   Combined
  Speech Reco.   85.17%           92.3%        93.5%

Table 2: Speaker quantizer performance for speech recognition

The sequential criterion gives an 8.4% relative improvement over the single-speaker results, but the best improvement is achieved by the combined criterion (11% relative improvement). This is because, with (10), the selected reference speaker is acoustically close to the test speaker in the joint speech and speaker space.

6 Conclusion

In this paper, a probabilistic approach that maximizes simultaneous speech and speaker recognition performance was presented. It is based on the combination of the posterior probability estimated by a hybrid HMM/MLP system for isolated word recognition with the likelihood estimated by a text-independent GMM for speaker identification. We evaluated and compared three approaches for closed-set speaker identification, open-set speaker identification and speaker quantization for speaker-independent speech recognition. In all three applications, the results showed the effectiveness of the proposed approach.
7 Acknowledgment

The authors gratefully acknowledge the support of the Swiss National Science Foundation through the project MULTI: 2000-068231.02/1. This work was also carried out in the framework of the SNSF National Center of Competence in Research (NCCR) on Interactive Multimodal Information Management (IM2). The authors would like to thank Hynek Hermansky for helpful comments on the paper and Joanne Moore for proofreading it.

References

[1] D. A. Reynolds and L. P. Heck, "Integration of Speaker and Speech Recognition Systems," Proceedings of ICASSP '91, pp. 869-872, 1991.

[2] L. P. Heck, "A Bayesian Framework for Optimizing the Joint Probability of Speaker and Speech Recognition Hypotheses," The Advent of Biometrics on the Internet, COST 275 Workshop, Rome, Italy, 2002.

[3] Q. Li, B.-H. Juang, Q. Zhou and C.-H. Lee, "Automatic Verbal Information Verification for User Authentication," IEEE Trans. Speech and Audio Processing, Vol. 8, No. 5, 2000.

[4] S. H. Maes, "Conversational Biometrics," Proceedings of EUROSPEECH '99, Vol. 3, pp. 1219-1222, 1999.

[5] T. J. Hazen, D. A. Jones, A. Park, L. C. Kukolich and D. A. Reynolds, "Integration of Speaker Recognition into Conversational Spoken Dialogue Systems," Proceedings of EUROSPEECH '03, pp. 1961-1964.

[6] G. Chollet, J.-L. Cochard, A. Constantinescu, C. Jaboulet and P. Langlais, "Swiss French PolyPhone and PolyVar: Telephone Speech Databases to Model Inter- and Intra-Speaker Variability," IDIAP Research Report, IDIAP-RR-96-01, 1996.

[7] D. A. Reynolds, T. F. Quatieri and R. B. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models," Digital Signal Processing, Vol. 10, No. 1-3, pp. 19-41, 2000.

[8] J.-L. Gauvain and C.-H. Lee, "Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, Vol. 2, pp. 291-298, April 1994.

[9] S. Renals, N. Morgan, H. Bourlard, M. Cohen and H. Franco, "Connectionist Probability Estimators in HMM Speech Recognition," IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 1, Part II, 1994.

[10] D. Genoud, D. Ellis and N. Morgan, "Combined Speech and Speaker Recognition with Speaker-Adapted Connectionist Models," Proc. Automatic Speech Recognition and Understanding Workshop, Keystone.

[11] M. F. BenZeghiba and H. Bourlard, "User-Customized Password Speaker Verification Based on HMM/ANN and GMM Models," Proceedings of ICSLP 2002, pp. 1325-1328, 2002.