MULTILINGUAL TEXT-INDEPENDENT SPEAKER IDENTIFICATION

Geoffrey Durou
Faculté Polytechnique de Mons, TCTS
31, Bld. Dolez, B-7000 Mons, Belgium
Email: durou@tcts.fpms.ac.be

ABSTRACT

In this paper, we investigate two facets of speaker recognition: cross-language speaker identification and same-language non-native text-independent speaker identification. In this context, experiments have been conducted, using standard multi-Gaussian modeling, on the brand new multi-language TNO corpus. Our results indicate how speaker identification performance might be affected when speakers do not use the same language during training and testing, or when the population is composed of non-native speakers.

1. INTRODUCTION AND MOTIVATION

Speaker recognition systems working in text-independent (TI) mode are characterized by their flexibility but also by their lack of security. Indeed, because no particular words or sentences are imposed, the system can be broken with a pre-recorded voice of an authorized person. Nevertheless, text-independent speaker identification systems are involved in many applications, which is why much effort has been devoted to improving text-independent speaker recognition methods. Over the last decade, the technology in this field has made significant progress, and these techniques can now be used in real conditions, provided that the application field is well defined.

Nowadays, more and more users of such systems are polyglot. So, if we have no a priori knowledge of the mother tongue of the talker (or at least of the language he used during training), and if we cannot apply any language identification system, then speaker identification may have to be performed in a language different from the one used during training. Let us note that placing no restriction on the language would further increase the flexibility of the system. However, the system may still impose one specific language.
Since it should be open to all users, we can easily imagine that any given language might differ from the native language of some of the users.

In order to start a descriptive study of (a) the cross-language and (b) the same non-native language effects on speaker recognition performance, we carried out text-independent speaker identification experiments on a subset of 57 speakers extracted from the TNO multi-language database. Our system is based on the standard GMM technique, which has already been used successfully in the past for TI speaker recognition [3] [2] [4].

In section 2 we present in detail the TNO corpus and our identification system. The speaker identification experiments are described in section 3, which is subdivided into three items: (a) native speaker identification, acting as the reference experiment, (b) cross-language speaker identification, and (c) non-native same-language speaker identification. Results are then discussed and, in particular, cross-language speaker identification results are compared to performance recently obtained on the POLYCOST telephone speech corpus [5] [1].

2. EXPERIMENTAL SETUP

2.1. Database

Speech material for our experiments was taken from the new Dutch TNO corpus. This database consists of 82 Dutch speakers. All of them were prompted to pronounce 10 sentences in four different languages: Dutch, English, French, and German. All the sentences were read from a computer screen in an anechoic, silent recording room. For a given language, the first five sentences are common to all speakers, while the others differ from one speaker to another.

We decided to carry out the identification tests over all the speakers for whom speech data in the four languages are available. So we conducted our experiments on a subset of 57 speakers (68 % males and 32 % females). The first 5 utterances (per language, identical for all speakers) were used for training, while the other 5 sentences (per language, unique to each speaker) were reserved for the identification tests. In our experiments, we systematically considered four different training durations (10 s, 15 s, 20 s, and 25 s) and five different testing durations (5 s, 10 s, 15 s, 20 s, and 25 s).

2.2. Feature Extraction

Speech recordings were sampled at 16 kHz. Analysis windows consisted of 512 samples (32 ms) taken every 16 ms. After pre-emphasis (factor 0.95) and application of a Hamming window, 10 autocorrelation LPC coefficients were computed and transformed into 12 cepstral coefficients. Finally, training and testing features consist only of the 12 cepstral coefficients: neither the energy, nor dynamic information (delta coefficients), nor the pitch was used. No cepstral mean subtraction was applied.

2.3. Speaker Model

Our speaker identification system is based on statistical modeling by Gaussian mixtures [3] [2] [4]. Each mixture is composed of 12 Gaussian distributions with diagonal covariance matrices.

3. EXPERIMENTS

3.1. Native speaker identification

First of all, let us carry out a preliminary experiment in which both the training and test phases use the mother tongue of the speakers. This might be seen, in the context of this paper, as the reference experiment. Let us recall that for these experiments, and for all the experiments that follow, we systematically use the five sentences per language that are identical for all speakers for training, and the other five, unique to each speaker, for the identification tests. The identification error rates for various training and testing durations are given in Figure 1.

Figure 1: Identification error rates over 57 native speakers of Dutch as a function of test trial length for various training conditions

We can notice at this point that the closed-set speaker identification rate reaches 100 % for testing durations of 20 seconds and more, whatever the training duration considered.

3.2. Cross-language speaker identification

It is now interesting to measure the impact of language on our speaker recognition system. For that purpose, we conduct an experiment characterized by the use of different languages during training and testing: models are trained on native speech (i.e. Dutch), while identification tests are made on non-native speech (successively English, French, and German).
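The processing chain of sections 2.2 to 3.2 can be illustrated with a minimal Python sketch. It is not the system used in this paper. It assumes the common all-pole convention 1 / (1 - sum_k a_k z^-k) for the LPC-to-cepstrum recursion, and it assumes that a test utterance is assigned to the speaker whose mixture gives the highest average log-likelihood, a decision rule that the text does not state explicitly. All function names and the toy two-component, two-dimensional models are hypothetical, and training of the mixtures (e.g. by EM) is left out.

```python
import math

def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC coefficients a_1..a_p (assumed convention:
    H(z) = 1 / (1 - sum_k a_k z^-k)) into cepstral coefficients c_1..c_n.
    n_ceps may exceed p, as with 10 LPC -> 12 cepstral coefficients."""
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c

def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        comp = []
        for w, mu, var in zip(weights, means, variances):
            lp = math.log(w)
            for xi, mi, vi in zip(x, mu, var):
                lp -= 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
            comp.append(lp)
        m = max(comp)  # log-sum-exp over components for numerical stability
        total += m + math.log(sum(math.exp(l - m) for l in comp))
    return total / len(frames)

def identify(frames, models):
    """Closed-set decision (assumed rule): best-scoring speaker model."""
    return max(models, key=lambda spk: gmm_log_likelihood(frames, *models[spk]))

def error_rate(trials, models):
    """Identification error rate in percent over (true_speaker, frames) trials."""
    errors = sum(1 for spk, frames in trials if identify(frames, models) != spk)
    return 100.0 * errors / len(trials)

# Toy stand-ins: models as if trained on one language, trials from another.
MODELS = {
    "spk_a": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "spk_b": ([0.5, 0.5], [[5.0, 5.0], [6.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
TRIALS = [
    ("spk_a", [[0.2, 0.1], [0.9, 1.1]]),
    ("spk_b", [[5.5, 5.2], [6.1, 5.8]]),
]

if __name__ == "__main__":
    print(len(lpc_to_cepstrum([0.1] * 10, 12)))  # 12 coefficients from 10 LPC
    print(error_rate(TRIALS, MODELS))            # 0.0 on this toy data
```

In a real setup, each entry of `MODELS` would hold a 12-component mixture over 12-dimensional cepstral vectors estimated from the Dutch training sentences, and `TRIALS` would hold the frames of the English, French, or German test sentences. The recursion shows how 12 cepstral coefficients can be derived from only 10 LPC coefficients: beyond order p, each new coefficient depends only on the previous p cepstral values.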

Results for different training and testing durations are reported in Figure 2, Figure 3 and Figure 4.

Figure 2: Cross-language speaker identification error rates (Dutch / English) over 57 Dutch speakers

Figure 3: Cross-language speaker identification error rates (Dutch / French) over 57 Dutch speakers

Figure 4: Cross-language speaker identification error rates (Dutch / German) over 57 Dutch speakers

For training and testing durations that are large enough, we are still able, in the Dutch/English case, to reach the maximal performance. In contrast, we are unable to reach a 100 % identification rate in the Dutch/French case, given our proposed training and testing durations. When German is used for the test, error rates seem to converge to about 2 %.

Similar experiments have recently been conducted on a telephone speech database [1]. In that context, cross-language speaker identification tests on a set of 111 speakers showed that the performance degradation induced by the use of a non-native language for the test did not exceed 1 % (relative to the use of the native language for the test) in the case of a speaker identification system based on a vector quantization technique. We justified this very small difference by the fact that the spectral characteristics of a speaker's speech are not substantially modified when he speaks a second language. This corroborated another study which has shown that people who learn a second language at a later age (> 10 years old), instead of learning new phonemes, substitute phonemes from their native language and impose the rhythm of this native language when they speak a non-native language [8]. Let us also mention that this conclusion was consolidated by an experiment described in [6], which showed that the spectrum difference, measured by Kullback's divergence, on English and Japanese words pronounced by bilingual speakers was very small.
Here, in the case of maximal training and testing durations, we observe that the degradation easily exceeds 1 % in the Dutch-French (4.8 %) and Dutch-German (2.3 %) cases, even though the population size is smaller. However, we must be aware that, first, the maximal training duration here is 25 seconds, whereas each training session lasted about 90 seconds in the previous work. Secondly, our identification system is now based on statistical modeling by Gaussian mixtures. These two points make it difficult to compare the results of these experiments in absolute terms.

3.3. Non-native speaker identification

Let us finally consider a last set of experiments conducted on non-native talkers. We conducted three sets of experiments characterized by the use of the same non-native language during training and testing: models were trained and identification tests were made on non-native speech (successively English, French, and German). Once again, we report results on English, French, and German speech separately in Figure 5, Figure 6, and Figure 7, for different training and testing durations.

Figure 5: Identification error rates for non-native speakers of English as a function of test trial length for various training conditions

Figure 6: Identification error rates for non-native speakers of French as a function of test trial length for various training conditions

Figure 7: Identification error rates for non-native speakers of German as a function of test trial length for various training conditions

When English is chosen as the non-native language, we see that there is no big difference between these plots and the reference plots. Surprisingly enough, the system sometimes performs better when this non-native language is employed. The same observation holds if German is used. However, our system performs slightly worse if French is employed.

Globally, as expected, we observe through these experiments that even if non-native speakers use the phonetic and prosodic patterns of their first language, the identification scores are not really affected. Major aspects that can make non-native speech deviate from native speech are notably fluency, word stress, and intonation [7].
Although these factors might be responsible for a score degradation in the cross-language case, we can easily understand that they have a much more restricted effect on these last experiments. In particular, if a non-native talker tends to speak more slowly during training, he will also tend to speak in roughly the same way for the tests, because the language is the same. This should partly explain why the identification scores are not so affected.

4. CONCLUSION

The purpose of this paper was to describe and carry out multi-lingual speaker identification experiments on the TNO database, made of native speakers of Dutch, and to comment on the results. Various training and testing durations were considered. We first carried out a preliminary set of experiments (which we considered as the baseline experiments) where both the training of the speaker models and the identification tests were made on the speakers' mother tongue (i.e. Dutch). Then, relative to our baseline results, we measured how the performance of our speaker identification system evolves when (a) different languages are used during training and testing, and (b) the same non-native language is used both for training the speaker models and for the identification tests. Three non-native languages were tested: English, French, and German.

We also pointed out and partly justified the discrepancy between the conclusions about the effect of the language, depending on whether the performance degradation is measured on the microphone TNO corpus or on the telephone POLYCOST database.

5. REFERENCES

[1] G. Durou, F. Jauquet, "Cross-Language Text-Independent Speaker Identification", Proc. European Conference on Signal Processing (EUSIPCO'98), vol 3, pp 1481-1484, September 1998, Rhodes, Greece.

[2] D. A. Reynolds, "A Gaussian Mixture Modeling Approach to Text-Independent Speaker Identification", PhD Thesis, Georgia Institute of Technology, 1992.

[3] G. McLachlan and K. Basford, "Mixture Models: Inference and Applications to Clustering", Marcel Dekker, 1988.

[4] D. Titterington, A. Smith, and U. Makov, "Statistical Analysis of Finite Mixture Distributions", John Wiley and Sons, 1985.

[5] The European COST 250 action entitled "Speaker Recognition in Telephony". Information can be found on the web page: http://circhp.epfl.ch/polycost/

[6] M. Abe and K. Shikano, "Statistical analysis of bilingual speaker's speech for cross-language voice conversion", J. Acoust. Soc. Amer., Vol 90, pp 76-82, July 1991.

[7] C. Cucchiarini, H. Strik, and L. Boves, "Automatic evaluation of Dutch pronunciation by using speech recognition technology", Proc IEEE ASRU, Santa Barbara, Dec 1997.

[8] L. Neumeyer, H. Franco, M. Weintraub, and P. Price, "Automatic text-independent pronunciation scoring of foreign language student speech", Proc ICSLP'96, Philadelphia, pp 1457-1460, 1996.