Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification

Tomi Kinnunen and Ismo Kärkkäinen
University of Joensuu, Department of Computer Science, P.O. Box 111, 80101 JOENSUU, FINLAND
{tkinnu,iak}@cs.joensuu.fi

Abstract. We consider the distortion measure in a vector quantization based speaker identification system. The model of a speaker is a codebook generated from the set of feature vectors extracted from the speaker's voice sample. Matching is performed by evaluating the distortions between the unknown speech sample and the models in the speaker database. In this paper, we introduce a weighted distortion measure that takes into account the correlations between the known models in the database. Larger weights are assigned to vectors that have high discriminating power between the speakers, and vice versa.

1 Introduction

It is well known that different phonemes have unequal discrimination power between speakers [14, 15]. That is, the inter-speaker variation of certain phonemes is clearly different from that of other phonemes. This knowledge should be exploited in the design of speaker recognition [6] systems. Acoustic units that have higher discrimination power should contribute more to the similarity or distance scores in the matching.

The description of acoustic units in speech and speaker recognition is often done via short-term spectral features. The speech signal is analyzed in short segments (frames), and a representative feature vector is computed for each frame. In speaker recognition, cepstral coefficients [5] along with their 1st and 2nd time derivatives (Δ- and ΔΔ-coefficients) are commonly used. Physically, these represent the shapes of the vocal tract and their dynamic changes [1, 2, 5], and therefore carry information about the formant structure (vocal tract resonant frequencies) and dynamic formant changes.

In vector quantization (VQ) based speaker recognition [3, 8, 9, 10, 16], each speaker (or class) is represented by a codebook that approximates his/her data density by a small number of representative code vectors. Different regions (clusters) in the feature space represent acoustically different units. The question of how to benefit from the different discrimination power of phonemes in VQ-based speaker recognition reduces to the question of how to assign discriminative weights to different code vectors and how to incorporate these weights into the distance or similarity calculations in the matching phase.

As a motivating example, Fig. 1 shows two scatter plots of four different speakers' cepstral code vectors derived from the TIMIT speech corpus. In both plots, two randomly chosen components of the 36-dimensional cepstral vectors are shown. Each speaker's data density is presented as a
codebook of 32 vectors. As can be seen, different classes have strong overlap. However, some speakers do have code vectors that are far away from all other classes. [...] code vectors that are especially good for discriminating them from other speakers.

Fig. 1. Scatter plots of two randomly chosen dimensions of four speakers' cepstral data from the TIMIT database

There are two well-known ways of improving class separability in pattern recognition. The first is to improve separability in the training phase by discriminative training algorithms. Examples in the VQ context are the LVQ [12] and GVQ [8] algorithms. The second discrimination paradigm, score normalization, is used in the decision phase. For instance, matching scores of the client speaker in speaker verification can be normalized against matching scores obtained from a cohort set [3].

In this paper, we introduce a third alternative for improving class separability and apply it to the speaker identification problem. For a given set of codebooks, we assign a discriminative weight to each of the code vectors. In the matching phase, these weights are retrieved from a look-up table and used directly in the distance calculations. Thus, the time complexity of the matching remains the same as in the unweighted case.

The outline of this paper is as follows. In Section 2, we briefly review baseline VQ-based speaker identification. In Section 3, we give details of the weighted distortion measure. Experimental results are reported in Section 4. Finally, conclusions are drawn in Section 5.

2 VQ-Based Speaker Identification

Speaker identification is the process of finding the best matching speaker from a speaker database, given an unknown speaker's voice sample [6]. In VQ-based speaker identification [8, 9, 11, 16], vector quantization [7] plays two roles: it is used in both the training and the matching phases. In the training phase, the speaker models are
constructed by clustering the feature vectors into K separate clusters. Each cluster is represented by a code vector $c_i$, which is the centroid (average vector) of the cluster. The resulting set of code vectors is called a codebook, denoted here by $C^{(j)} = \{c_1^{(j)}, c_2^{(j)}, \ldots, c_K^{(j)}\}$. The superscript (j) denotes the speaker number. In the codebook, each vector represents a single acoustic unit typical of the particular speaker. Thus, the distribution of the speaker's feature vectors is represented by a smaller set of sample vectors whose distribution is similar to that of the full set of feature vectors. The codebook size should be set reasonably high, since previous results indicate that the matching performance improves with the size of the codebook [8, 11, 16]. For the clustering we use the randomized local search (RLS) algorithm [4] due to its superiority in codebook quality over the widely used LBG method [13].

In the matching phase, VQ is used in computing a distortion $D(X, C^{(i)})$ between an unknown speaker's feature vectors $X = \{x_1, \ldots, x_T\}$ and all codebooks $\{C^{(1)}, C^{(2)}, \ldots, C^{(N)}\}$ in the speaker database [16]. A simple decision rule is to select the speaker i* that minimizes the distortion, i.e.

$$i^* = \arg\min_{1 \le i \le N} D(X, C^{(i)}). \qquad (1)$$

A natural choice for the distortion measure is the average distortion [8, 16], defined as

$$D(X, C) = \frac{1}{T} \sum_{x \in X} d(x, c_{NN[x]}), \qquad (2)$$

where NN[x] is the index of the nearest code vector to x in the codebook and d(.,.) is a distance measure defined for the feature vectors. In words, each vector from the unknown feature set is quantized to its nearest neighbor in the codebook, and the sum of the distances is normalized by the length of the test sequence. A popular choice for the distance measure d is the Euclidean distance or its square. It is argued in [15] that the Euclidean distance between two cepstral vectors is a good measure of the dissimilarity of the corresponding short-term speech spectra.
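The matching rule (1) with the average distortion (2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`sq_dist`, `average_distortion`, `identify`) and the toy two-dimensional codebooks are assumptions made for the example, the squared Euclidean distance is used for d, and speaker indices are 0-based rather than 1-based as in the text.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_distortion(X, codebook):
    """Equation (2): mean distance from each test vector to its
    nearest code vector in the given codebook."""
    return sum(min(sq_dist(x, c) for c in codebook) for x in X) / len(X)


def identify(X, codebooks):
    """Equation (1): index (0-based) of the codebook with the smallest
    average distortion, together with all distortion scores."""
    scores = [average_distortion(X, cb) for cb in codebooks]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical toy data: two speakers, two code vectors each.
codebooks = [
    [(0.0, 0.0), (1.0, 0.0)],  # speaker 0
    [(5.0, 5.0), (6.0, 5.0)],  # speaker 1
]
X = [(0.1, 0.0), (0.9, 0.1)]   # unknown speaker's feature vectors

best, scores = identify(X, codebooks)
print(best)  # 0
```

The cost per test vector is one nearest-neighbor search in each codebook, which is the baseline that the weighted measure is later compared against.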
In this work, we use the squared Euclidean distance as the distance measure. In previous work [10], we suggested an alternative approach to the matching: instead of minimizing a distortion, we proposed maximizing a similarity measure. However, later experiments have shown that it is difficult to define a natural and intuitive similarity measure in the same way as the distortion (2) is defined. For that reason, we limit our discussion to distortion measures.

3 Speaker Discriminative Matching

As an introduction, consider the two speakers' codebooks illustrated in Fig. 2. Vectors marked by [...] represent an unknown speaker's data. Which speaker is this? We can see that the uppermost code vector $c_2^{(1)}$ is actually the only vector that clearly turns the decision to speaker #1. Suppose that this code vector were absent. Then
the average distortion would be approximately the same for both speakers. There are clearly three regions in the feature space that cannot distinguish these two speakers. Only the code vectors $c_2^{(1)}$ and $c_3^{(2)}$ can make the difference, and they should be given a large discrimination weight.

3.1 Weighted Distortion Measure

We define our distortion measure by modifying (2) as follows:

$$D(X, C) = \frac{1}{T} \sum_{x \in X} f(w_{NN[x]})\, d(x, c_{NN[x]}). \qquad (3)$$

Here $w_{NN[x]}$ is the weight associated with the nearest code vector, and f is a nonincreasing function of its argument. In other words, code vectors that have good discrimination (large weight) tend to decrease the distances d; conversely, nondiscriminative code vectors (small weight) tend to increase the distances. The product f(w) d(x, c) can be viewed as an operator that attracts (decreases the overall distortion of) vectors x that are close to c or whose corresponding weight w is large. Likewise, it repels (increases the overall distortion of) vectors x that are far away or are quantized with a small w.

Fig. 2. Illustration of code vectors with unequal discrimination powers

An example of the quantization of a single vector is illustrated in Fig. 3. Three speakers' code vectors and the corresponding weights are shown. For instance, the code vector at location (8, 4) has a large weight, because there are no representatives of other classes in its neighborhood. The three code vectors in the lower left corner, in turn, all have small weights because each has a representative of another class nearby. When quantizing the vector marked by [...], the unweighted measure (2) would give the same distortion value $D \approx 7.5$ for all classes (squared Euclidean distance). However, when using the weighted distortion (3), we get the distortion values $D_1 \approx 6.8$, $D_2 \approx 6.8$ and $D_3 \approx 1.9$ for the three classes, respectively. Thus, the vector is favored by class #3 due to the large weight of the code vector.

We have not yet specified two important issues in the design of the weighted distortion, namely:
how to assign the code vector weights, and the selection of the function f.

Fig. 3. Weighted quantization of a single vector

In this work, we fix the function f as a decaying exponential of the form

$$f(w) = e^{-\alpha w}, \qquad (4)$$

[...]

3.2 Assigning the Weights

The weight of a code vector should depend on the minimum distances to the code vectors of the other classes. Let $c \in C^{(j)}$ be a code vector of the jth speaker. Let us denote the index of its nearest neighbor in the kth codebook simply by $NN^{(k)}$. The weight of c is then assigned as follows:

$$w(c) = \frac{1}{\sum_{k \neq j} 1 / d(c, c_{NN^{(k)}})}. \qquad (5)$$

In other words, the nearest code vectors from all other classes are found, and the inverse of the sum of the inverse distances is taken. If any of the distances equals 0, we set w(c) = 0 for mathematical convenience. The algorithm is looped over all code vectors and all codebooks.

As an example, consider the code vector located at (1, 1) in Fig. 3. The distances (squared Euclidean) to the nearest code vectors in the other classes are 2.0 and 4.0. Thus, the weight for this code vector is w = 1/(1/2.0 + 1/4.0) = 1.33.

In the practical implementation, we further normalize the weights within each codebook such that their sum equals 1. Then all weights satisfy $0 \le w \le 1$, which makes them easier to handle and interpret.
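The weight assignment (5), the per-codebook normalization, and the weighted distortion (3) with the exponential decay (4) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: all function names are hypothetical, the toy codebooks are invented so that the code vector at (1, 1) reproduces the distances 2.0 and 4.0 of the worked example, and the default `alpha=1.0` is an arbitrary placeholder rather than the value used in the experiments. At least two codebooks are assumed.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def raw_weight(c, other_codebooks):
    """Equation (5): inverse of the summed inverse distances from c to
    its nearest code vector in every other speaker's codebook."""
    inv_sum = 0.0
    for cb in other_codebooks:
        d = min(sq_dist(c, v) for v in cb)
        if d == 0.0:
            return 0.0  # convention from the text: zero distance gives w = 0
        inv_sum += 1.0 / d
    return 1.0 / inv_sum


def codebook_weights(codebooks):
    """Weights for every code vector, normalized to sum to 1 per codebook."""
    result = []
    for j, cb in enumerate(codebooks):
        others = [o for k, o in enumerate(codebooks) if k != j]
        w = [raw_weight(c, others) for c in cb]
        total = sum(w)
        result.append([wi / total for wi in w] if total > 0 else w)
    return result


def weighted_distortion(X, codebook, weights, alpha=1.0):
    """Equation (3) with f(w) = exp(-alpha * w) from equation (4).
    alpha is a placeholder value chosen for this sketch."""
    total = 0.0
    for x in X:
        i = min(range(len(codebook)), key=lambda n: sq_dist(x, codebook[n]))
        total += math.exp(-alpha * weights[i]) * sq_dist(x, codebook[i])
    return total / len(X)


# Hypothetical toy codebooks for three speakers.
codebooks = [
    [(1.0, 1.0), (8.0, 4.0)],  # speaker 0: one crowded, one isolated vector
    [(2.0, 2.0)],              # speaker 1
    [(1.0, 3.0)],              # speaker 2
]
print(round(raw_weight((1.0, 1.0), codebooks[1:]), 2))  # 1.33
weights = codebook_weights(codebooks)
```

Because the weights depend only on the codebooks, they can be computed once after training and stored alongside the code vectors, so matching needs only a table look-up per quantized vector.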

4 Experimental Results

For testing purposes, we used a 100-speaker subset of the American English TIMIT corpus. We resampled the wave files down to 8.0 kHz with 16-bit resolution. The average duration of the training speech per speaker was approximately 15 seconds. For testing, we derived three test sequences from other files, with durations of 0.16, 0.8 and 3.2 seconds. The feature extraction was performed using the following steps:

- Pre-emphasis filtering with $H(z) = 1 - 0.97 z^{-1}$.
- 12th order mel-cepstral analysis with a 30 ms Hamming window, shifted by 15 ms.

The feature vectors were composed of the 12 lowest mel-cepstral coefficients (excluding the 0th coefficient). The Δ- and ΔΔ-cepstra were added to the feature vectors, giving a 3 × 12 = 36-dimensional feature space.

Fig. 4. Performance evaluation using ~0.16 s speech sample (~10 vectors) [plot: identification rate (%) versus codebook size, 2 to 128, for the unweighted and weighted methods]

Fig. 5. Performance evaluation using ~0.8 s speech sample (~50 vectors) [plot: identification rate (%) versus codebook size, 2 to 128, for the unweighted and weighted methods]
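The pre-emphasis filter and the framing parameters listed in the feature extraction steps can be sketched as below. This is a minimal sketch, not the authors' front end: the function names are hypothetical, the first output sample assumes a zero sample before the start of the signal, and the frame count assumes that only complete 30 ms windows (240 samples at 8 kHz) are kept with a 15 ms shift (120 samples).

```python
def pre_emphasis(signal, coeff=0.97):
    """Apply H(z) = 1 - coeff * z^-1, i.e. y[n] = x[n] - coeff * x[n-1].
    The sample before the start of the signal is assumed to be zero."""
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - coeff * signal[n - 1])
    return out


def num_frames(num_samples, window=240, shift=120):
    """Number of complete analysis windows that fit into the signal."""
    if num_samples < window:
        return 0
    return (num_samples - window) // shift + 1


SAMPLE_RATE = 8000  # Hz, as stated in the text
for duration in (0.16, 0.8, 3.2):
    print(duration, num_frames(round(duration * SAMPLE_RATE)))
# prints 9, 52 and 212 frames for the three durations
```

Under these assumptions the three test sequences yield 9, 52 and 212 feature vectors, which is in line with the approximate counts of 10, 50 and 200 vectors given in the figure captions.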

Fig. 6. Performance evaluation using ~3.2 s speech sample (~200 vectors) [plot: identification rate (%) versus codebook size, 2 to 128, for the unweighted and weighted methods]

The identification rates obtained using the reference method (2) and the proposed method (3) are summarized in Figs. 4-6 for the three test sequences, with codebook sizes varying from K = 2 to 128. The parameter [...] experiments to [...].

The following observations can be made from the figures:

- The proposed method does not perform consistently better than the reference method. In some cases the reference (unweighted) method outperforms the proposed (weighted) method, especially for small codebook sizes. For large codebooks the ordering tends to be the opposite. This is probably because small codebooks give a poorer representation of the training data, so the weight estimates cannot be good either.
- Both methods generally give better results with increasing codebook size and test sequence length.
- Both methods saturate at the maximum accuracy (97 %) with the longest test sequence (3.2 seconds of speech) and codebook size K = 64. In this case, using a codebook of size K = 128 does not improve the accuracy any further.

5 Conclusions

We have proposed a framework for improving class separability in pattern recognition and evaluated the approach on the speaker identification problem. In general, the results show that, with proper design, a VQ-based speaker identification system can achieve high recognition rates with very short test samples while keeping the model complexity low (codebook size K = 64).

The proposed method adapts to a given set of classes represented by codebooks by computing discrimination weights for all code vectors, and it uses these weights in the matching phase. The results obtained in this work show no clear improvement over the reference method. However, together with the results obtained in [10], we conclude that weighting can indeed be used to improve class separability. The critical question is how to take full advantage of the weights in the distortion or similarity measure. In future work, we will focus on the optimization of the weight decay function f.

References

1. Deller, J.R. Jr., Hansen, J.H.L., Proakis, J.G.: Discrete-time Processing of Speech Signals. Macmillan Publishing Company, New York, 2000.
2. Fant, G.: Acoustic Theory of Speech Production. The Hague, Mouton, 1960.
3. Finan, R.A., Sapeluk, A.T., Damper, R.I.: Impostor cohort selection for score normalization in speaker verification, Pattern Recognition Letters, 18: 881-888, 1997.
4. Fränti, P., Kivijärvi, J.: Randomized local search algorithm for the clustering problem, Pattern Analysis and Applications, 3(4): 358-369, 2000.
5. Furui, S.: Cepstral analysis technique for automatic speaker verification, IEEE Transactions on Acoustics, Speech and Signal Processing, 29(2): 254-272, 1981.
6. Furui, S.: Recent advances in speaker recognition, Pattern Recognition Letters, 18: 859-872, 1997.
7. Gersho, A., Gray, R.M., Gallager, R.: Vector Quantization and Signal Compression. Kluwer Academic Publishers, 1991.
8. He, J., Liu, L., Palm, G.: A discriminative training algorithm for VQ-based speaker identification, IEEE Transactions on Speech and Audio Processing, 7(3): 353-356, 1999.
9. Jin, Q., Waibel, A.: A naive de-lambing method for speaker identification, Proc. ICSLP 2000, Beijing, China, 2000.
10. Kinnunen, T., Fränti, P.: Speaker discriminative weighting method for VQ-based speaker identification, Proc. 3rd International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA): 150-156, Halmstad, Sweden, 2001.
11. Kinnunen, T., Kilpeläinen, T., Fränti, P.: Comparison of clustering algorithms in speaker identification, Proc. IASTED Int. Conf. Signal Processing and Communications (SPC): 222-227, Marbella, Spain, 2000.
12. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, Heidelberg, 1995.
13. Linde, Y., Buzo, A., Gray, R.M.: An algorithm for vector quantizer design, IEEE Transactions on Communications, 28(1): 84-95, 1980.
14. Nolan, F.: The Phonetic Bases of Speaker Recognition. Cambridge University Press, Cambridge, 1983.
15. Rabiner, L., Juang, B.: Fundamentals of Speech Recognition. Prentice Hall, 1993.
16. Soong, F.K., Rosenberg, A.E., Juang, B-H., Rabiner, L.R.: A vector quantization approach to speaker recognition, AT&T Technical Journal, 66: 14-26, 1987.