Performance Evaluation of Text-Independent Speaker Identification and Verification Using MFCC and GMM


IOSR Journal of Engineering (IOSRJEN) ISSN: 2250-3021 Volume 2, Issue 8 (August 2012), PP 18-22

Performance Evaluation of Text-Independent Speaker Identification and Verification Using MFCC and GMM

Palivela Hema 1, E. Venkatanarayana 2
1 (M.Tech-C&C), Department of ECE, Jawaharlal Nehru Technological University Kakinada (JNTUK), Kakinada, India
2 (Asst. Professor), Department of ECE, Jawaharlal Nehru Technological University Kakinada (JNTUK), Kakinada, India

Abstract: This paper presents the performance of a text-independent speaker identification and verification system using the Gaussian Mixture Model (GMM). We adopt Mel-Frequency Cepstral Coefficients (MFCC) as the speaker speech feature parameters and use the GMM for classification with log-likelihood estimation. Gaussian mixture modelling with diagonal covariance is increasingly being used for both speaker identification and verification. Speakers in the experiments were modelled with 13 mel-cepstral coefficients. Speaker verification performance was evaluated using the False Acceptance Rate (FAR), False Rejection Rate (FRR) and Equal Error Rate (EER).

Keywords: Equal Error Rate, Gaussian Mixture Model, Mel-Frequency Cepstral Coefficients, speaker identification, speaker verification.

I. Introduction
Speaker recognition can be classified into speaker identification and verification. Speaker identification is the process of determining which registered speaker produced a given utterance, i.e. identifying the speaker without any prior claim of identity. Speaker verification determines whether or not a speech sample belongs to a specific claimed speaker. Speaker recognition can be text-dependent or text-independent. The fundamental difference between identification and verification is the number of decision alternatives. In identification, the number of decision alternatives depends on the size of the population, whereas in verification there are only two choices, acceptance or rejection, regardless of population size.
Feature extraction deals with extracting the features of the speech from each frame and representing them as a vector. The feature here is the spectral envelope of the speech spectrum, represented by the acoustic vectors. Mel-Frequency Cepstral Coefficients (MFCC) are the most common feature extraction technique; they are computed on a warped frequency scale based on human auditory perception. The GMM [1,2] has been the classical method for text-independent speaker recognition. Reynolds et al. introduced the GMM for speaker identification and verification. The GMM is trained from a large database of different people. The speech in this database should be carefully selected from different people in order to get good results.

II. Speaker Identification and Verification System
The process of speaker recognition is divided into an enrolment phase and a testing phase. During enrolment, speech samples from each speaker are collected and used to train that speaker's model. The collection of enrolled models is saved in a speaker database. In the testing phase, a test sample from an unknown speaker is compared against the database. The basic structure of a speaker identification and a speaker verification system is shown in Figure 1 (a) and (b) respectively [3]. In both systems, the speech signal is first processed to extract useful information called features. In the identification system these features are compared to a speaker database representing the speaker set from which we wish to identify the unknown voice. The speaker associated with the most likely, or highest scoring, model is selected as the identified speaker. This is simply a maximum likelihood classifier [3].
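The maximum likelihood classifier just described can be sketched as follows. This is an illustrative NumPy sketch, not the paper's Matlab implementation: the single diagonal-covariance Gaussian per speaker is a deliberately tiny stand-in for an enrolled GMM, and the speaker names, feature dimension and test data are hypothetical.

```python
import numpy as np

def score(X, model):
    # Average log-likelihood of the feature vectors X under a single
    # diagonal-covariance Gaussian (a tiny stand-in for a full GMM).
    mu, var = model
    D = len(mu)
    ll = -0.5 * (((X - mu) ** 2 / var).sum(axis=1)
                 + D * np.log(2 * np.pi) + np.log(var).sum())
    return ll.mean()

def identify(X, speaker_db):
    # Maximum likelihood classifier: the enrolled speaker whose model
    # scores highest on the test features is selected.
    return max(speaker_db, key=lambda name: score(X, speaker_db[name]))

# Hypothetical two-speaker database in a 2-D feature space.
db = {"speaker_A": (np.array([0.0, 0.0]), np.array([1.0, 1.0])),
      "speaker_B": (np.array([4.0, 4.0]), np.array([1.0, 1.0]))}
X = np.random.default_rng(2).normal(4.0, 1.0, size=(30, 2))  # utterance near speaker_B
print(identify(X, db))  # speaker_B
```

With real MFCC features the scoring model would be the GMM of Section IV; only the argmax-over-models decision rule is the point here.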

Figure 1: Basic structure of (a) speaker identification and (b) speaker verification system.

The verification system essentially implements a likelihood ratio test to decide whether the test speech comes from the claimed speaker. Features extracted from the speech signal are compared to a model representing the claimed speaker, obtained from a previous enrolment, and to an imposter model. The ratio (or difference in the log domain) of the speaker and imposter match scores is the likelihood ratio statistic (Λ), which is then compared to a threshold (θ) to decide whether to accept or reject the speaker [3].

III. Feature Extraction
Preprocessing is usually necessary to facilitate high-performance recognition. A wide range of possibilities exists for parametrically representing the speech signal for the voice recognition task.

Mel Frequency Cepstral Coefficients (MFCC): MFCC are derived from the Fourier Transform (FFT) of the audio clip. The basic difference between the FFT and the MFCC is that in the MFCC the frequency bands are positioned logarithmically (on the mel scale), which approximates the human auditory system's response more closely than the linearly spaced frequency bands of the FFT. This allows for better processing of data. The main purpose of the MFCC processor is to mimic the behaviour of the human ear. Overall, the MFCC process has five steps, shown in Figure 2. In the frame blocking step a continuous speech signal is divided into frames of N samples, with adjacent frames separated by M samples (M < N). The values used are M = 128 and N = 256. The next step is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame.
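The five-step pipeline (frame blocking, windowing, FFT, mel-scale filtering, DCT) can be sketched in NumPy as below. The frame parameters M = 128 and N = 256 follow the text; the 20-filter triangular mel bank, the 8 kHz sampling rate and the use of a power spectrum are illustrative assumptions not taken from the paper.

```python
import numpy as np

def mel(f):
    # mel(f) = 2595 * log10(1 + f/700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters spaced linearly on the mel scale (an assumed design).
    mel_pts = np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)   # inverse of mel()
    bins = np.floor((n_fft + 1) * hz_pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def mfcc(signal, fs=8000, N=256, M=128, n_filters=20, n_ceps=13):
    # Frame blocking: frames of N samples, adjacent frames shifted by M (M < N).
    n_frames = 1 + (len(signal) - N) // M
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Hamming
    fb = mel_filterbank(n_filters, N, fs)
    feats = []
    for t in range(n_frames):
        frame = signal[t * M : t * M + N] * window         # windowing
        spec = np.abs(np.fft.rfft(frame)) ** 2             # power spectrum via FFT
        logmel = np.log(fb @ spec + 1e-10)                 # log mel spectrum
        # DCT-II compresses the log mel energies to n_ceps coefficients.
        n = np.arange(n_filters)
        ceps = np.array([np.sum(logmel * np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters)))
                         for k in range(n_ceps)])
        feats.append(ceps)
    return np.array(feats)

# One second of a synthetic 440 Hz tone at 8 kHz, purely for illustration.
x = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
F = mfcc(x)
print(F.shape)  # (61, 13)
```

Each row of F is one 13-dimensional feature vector, one per frame, matching the coefficient order of 13 used in the experiments.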
The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N−1, where N is the number of samples in each frame, then the result of windowing is the signal

y(n) = x(n) w(n), 0 ≤ n ≤ N−1 (1)

Typically the Hamming window is used, which has the form:

w(n) = 0.54 − 0.46 cos(2πn / (N−1)), 0 ≤ n ≤ N−1 (2)

Moving the signal from the time domain to the frequency domain is made possible by the Fourier coefficients, and in practice the spectrum is estimated with the Fast Fourier Transform:

X_k = Σ_{n=0}^{N−1} x_n e^{−j2πkn/N}, k = 0, 1, 2, ..., N−1 (3)

Psychophysical studies have shown that human perception of the frequency content of sounds, for speech signals, does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the mel scale [3],[4]. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz:

mel(f) = 2595 log10(1 + f/700) (4)

The final step of the Mel Frequency Cepstral Coefficients (MFCC) computation is to convert the log mel spectrum back to the time domain, where we obtain the mel frequency cepstral coefficients. Because the mel spectrum coefficients are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT) and obtain a feature vector. The DCT compresses these coefficients to 13 in number.

IV. The Gaussian Mixture Speaker Model
A mixture of Gaussian probability densities is a weighted sum of densities, as depicted in Figure 3, and is given by:

p(x|λ) = Σ_{i=1}^{M} p_i b_i(x) (5)

where x is a random vector of dimension D, b_i(x), i = 1, ..., M, are the component densities, and p_i, i = 1, ..., M, are the mixture weights. Each component density is a D-variate Gaussian function of the form:

b_i(x) = (1 / ((2π)^{D/2} |K_i|^{1/2})) exp(−(1/2) (x − µ_i)ᵀ K_i⁻¹ (x − µ_i)) (6)

with mean vector µ_i and covariance matrix K_i.
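Equations (5) and (6) can be evaluated directly when the K_i are diagonal, as in the paper's models. Below is a minimal NumPy sketch in the log domain; the log-sum-exp step is a standard numerical-stability device rather than something the paper specifies, and the two-component model at the end is hypothetical.

```python
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """Log of Eq. (5): p(x|λ) = Σ_i p_i b_i(x), with diagonal covariances.

    weights   : (M,)   mixture weights p_i, summing to 1
    means     : (M, D) mean vectors µ_i
    variances : (M, D) diagonals of the covariance matrices K_i
    """
    D = means.shape[1]
    # log b_i(x) for a D-variate Gaussian with diagonal K_i, per Eq. (6)
    log_b = (-0.5 * np.sum((x - means) ** 2 / variances, axis=1)
             - 0.5 * D * np.log(2 * np.pi)
             - 0.5 * np.sum(np.log(variances), axis=1))
    # log Σ_i exp(log p_i + log b_i), computed stably via log-sum-exp
    a = np.log(weights) + log_b
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# A hypothetical 2-component model in D = 3.
w = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
var = np.ones((2, 3))
print(gmm_logpdf(np.zeros(3), w, mu, var))  # ≈ -3.384
```

Working in the log domain avoids underflow when the per-frame log-likelihoods of Eq. (9) are later summed over hundreds of feature vectors.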
Note that the mixture weights satisfy Σ_{i=1}^{M} p_i = 1. The complete Gaussian mixture density is parameterized by the mean vectors, covariance matrices and mixture weights of all component densities (the model λ). These parameters are jointly represented by the following notation:

λ = {p_i, µ_i, K_i}, i = 1, ..., M (7)

The GMM can have different forms depending on the choice of covariance matrices. The model can have one covariance matrix per Gaussian component (nodal covariance), one covariance matrix for all Gaussian components of a given model (grand covariance), or a single covariance matrix shared by all models (global covariance). A covariance matrix can also be full or diagonal [4]. Since the Gaussian components act jointly to model the probability density function, full covariance matrices are usually not necessary. Even when the input vectors are not statistically independent, the linear combination of diagonal-covariance Gaussians in the GMM is able to model the correlation between the vector elements. The effect of using a set of full covariance matrices can be equally obtained by using a larger set of diagonal covariance matrices [5].

Given a set of training data, maximum likelihood estimation is required. In other words, this estimation finds the model parameters that maximize the likelihood of the GMM. The algorithm presented in [6] is widely used for this task.

Figure 3: Probability densities forming the GMM.

For a sequence of independent vectors used for

training X = {x_1, ..., x_T}, the likelihood of the GMM is given by:

p(X|λ) = Π_{t=1}^{T} p(x_t|λ) (8)

The likelihood of a true-speaker model is calculated directly in the log domain as

log p(X|λ) = (1/T) Σ_{t=1}^{T} log p(x_t|λ) (9)

The scale factor 1/T normalizes the likelihood by the duration of the utterance (the number of feature vectors). This normalized logarithmic likelihood is the response of the model λ.

The speaker verification system requires a binary decision, accepting or rejecting a speaker. The system uses two models that provide normalized logarithmic likelihoods for the input vectors x_1, ..., x_T: one for the claimed speaker and one, the background (imposter) model, intended to minimize the variation not related to the speaker and thereby provide a more stable decision threshold. If the system output (the difference between the two log-likelihoods) is higher than a given threshold θ, the speaker is accepted; otherwise the speaker is rejected, as shown in Figure 1(b). The background (imposter) model is built from a hypothetical set of false speakers and modelled via GMM. The threshold is chosen on the basis of experimental results.

V. Experimental Evaluation
This section presents the experimental evaluation of the Gaussian mixture speaker model for text-independent speaker identification and verification. A speaker identification experiment was evaluated in the following manner. The test speech was first processed by the front-end analysis to produce a sequence of feature vectors {x_1, ..., x_T}. To evaluate different test utterance lengths, the sequence of feature vectors was divided into overlapping segments of T feature vectors. The first two segments of a sequence would be:

Segment 1: x_1, x_2, ..., x_T
Segment 2: x_2, x_3, ..., x_{T+1}

A test segment length of 5 seconds corresponds to T = 500 feature vectors at a 10 ms frame rate. Each segment of T vectors was treated as a separate test utterance.
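The normalized log-likelihood of Eq. (9) and the accept/reject rule of Figure 1(b) can be sketched as follows. The single-component claimed-speaker and background models, the feature dimension and the threshold θ = 0 are all hypothetical stand-ins chosen only to make the example self-contained.

```python
import numpy as np

def avg_loglike(X, model):
    """Eq. (9): (1/T) Σ_t log p(x_t|λ) for a diagonal-covariance GMM."""
    w, mu, var = model          # weights (M,), means (M,D), variances (M,D)
    D = mu.shape[1]
    total = 0.0
    for x in X:
        log_b = (-0.5 * np.sum((x - mu) ** 2 / var, axis=1)
                 - 0.5 * D * np.log(2 * np.pi)
                 - 0.5 * np.sum(np.log(var), axis=1))
        a = np.log(w) + log_b
        m = a.max()
        total += m + np.log(np.exp(a - m).sum())
    return total / len(X)

def verify(X, claimed, imposter, theta=0.0):
    # Λ = difference of normalized log-likelihoods; accept iff Λ > θ.
    lam = avg_loglike(X, claimed) - avg_loglike(X, imposter)
    return bool(lam > theta)

# Hypothetical 1-component models in D = 2: claimed speaker near (0,0),
# background (imposter) model near (3,3).
claimed = (np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2)))
background = (np.array([1.0]), np.full((1, 2), 3.0), np.ones((1, 2)))
X = np.random.default_rng(0).normal(0.0, 1.0, size=(50, 2))  # utterance from the claimed speaker
print(verify(X, claimed, background))  # True
```

Because both scores are divided by the number of frames T, the statistic Λ and hence the threshold θ are comparable across utterances of different length, which is the stated purpose of the 1/T normalization.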
The identified speaker of each segment was compared to the actual speaker of the test utterance, and the number of correctly identified segments was tabulated. These steps were repeated for test utterances from each speaker in the population. The final performance was then computed as the percentage of correctly identified T-length segments over all test utterances:

% correct identification = (# correctly identified segments / total # of segments) × 100

The evaluation was repeated for different values of T to assess performance with respect to test utterance length.

Speaker verification: The acceptance or rejection of an unknown speaker depends on the threshold value determined from the trained speaker model. If the system accepts an impostor, it makes a false acceptance (FA) error. If the system rejects a valid user, it makes a false rejection (FR) error. The FA and FR errors can be traded off by adjusting the decision threshold, as shown by a Receiver Operating Characteristic (ROC) curve. The ROC curve is obtained by assigning the false rejection rate (FRR) and false acceptance rate (FAR) to the vertical and horizontal axes respectively, and varying the decision threshold. The FAR and FRR are obtained by equations (10) and (11) respectively:

FAR = (EI / I) × 100% (10)

where EI is the number of impostor acceptances and I is the number of impostor claims.

FRR = (ES / S) × 100% (11)

where ES is the number of genuine speaker (client) rejections and S is the number of genuine speaker claims.
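Equations (10) and (11), and the threshold sweep that traces the ROC curve, can be sketched as below. The synthetic Gaussian score distributions are purely illustrative; real scores would be the likelihood-ratio statistics Λ from the verification system.

```python
import numpy as np

def far_frr(genuine_scores, impostor_scores, theta):
    # Eq. (10): FAR = EI / I * 100%;  Eq. (11): FRR = ES / S * 100%
    far = 100.0 * np.mean(impostor_scores >= theta)  # impostors accepted
    frr = 100.0 * np.mean(genuine_scores < theta)    # genuine claims rejected
    return far, frr

def eer(genuine_scores, impostor_scores):
    # Sweep the decision threshold over all observed scores and return the
    # operating point where FAR and FRR are (nearly) equal.
    thetas = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best = min(thetas, key=lambda t: abs(np.subtract(*far_frr(genuine_scores,
                                                              impostor_scores, t))))
    return far_frr(genuine_scores, impostor_scores, best)

# Hypothetical likelihood-ratio scores: genuine claims score higher on average.
rng = np.random.default_rng(1)
genuine = rng.normal(2.0, 1.0, 1000)
impostor = rng.normal(-2.0, 1.0, 1000)
far, frr = eer(genuine, impostor)
print(far, frr)
```

For these synthetic distributions the two rates come out within a fraction of a percent of each other, a few percent in absolute terms; plotting FAR against FRR over the full threshold sweep gives the ROC curve of Figure 4.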

The operating point where the FAR and FRR are equal corresponds to the equal error rate (EER). The EER is a commonly accepted overall measure of system performance; it corresponds to the threshold at which the false acceptance rate equals the false rejection rate.

VI. Simulation Results
The system was implemented in Matlab 7 on the Windows XP platform. The results of the study are presented in Table 1. We used a coefficient order of 13 for all experiments. The model was trained with 16 Gaussian mixture components on training speech of length 10 sec. Testing was performed with test speech lengths of 3 sec and 8 sec. Here, the recognition rate is defined as the ratio of the number of speakers identified to the total number of speakers tested. FAR and FRR are estimated using expressions (10) and (11). Figure 4 shows a ROC plot of FRR vs FAR; the EER obtained is indicated in the figure.

Table 1: Performance evaluation (No. of Gaussians = 16)

Train speech | Test speech | Identification accuracy | %FAR | %FRR | %EER
10 s         | 3 s         | 93.5%                   | 2.23 | 1.65 | 1.94
10 s         | 8 s         | 97.5%                   | 0.77 | 0.38 | 0.57

Figure 4: ROC plot of FRR vs FAR.

VII. Conclusion
In this work we have demonstrated the importance of test speech length for the speaker recognition task. Speaker-discrimination information is effectively captured with a coefficient order of 13 using the GMM. The recognition performance depends on the training speech length selected to capture the speaker-discrimination information. The larger the test length, the better the performance, although a smaller length reduces computational complexity. The objective of this paper was mainly to demonstrate the significance of the speaker-discrimination information present in the speech signal for speaker recognition. We have not attempted to optimize the parameters of the model used for feature extraction, nor the decision-making stage.
Therefore the performance of speaker recognition may be improved by optimizing the various design parameters.

References
[1] D. A. Reynolds, "Speaker Identification and Verification using Gaussian Mixture Speaker Models," Speech Communication, vol. 17, pp. 91-108, 1995.
[2] Douglas A. Reynolds, Thomas F. Quatieri and Robert B. Dunn, "Speaker Verification using Adapted Gaussian Mixture Models," Digital Signal Processing, Academic Press, 2000.
[3] Douglas A. Reynolds, "Automatic Speaker Recognition: Current Approaches and Future Trends," MIT Lincoln Laboratory, Lexington, MA, USA.
[4] Reynolds, Douglas A., "Speaker Identification and Verification Using Gaussian Mixture Speaker Models," Speech Communication, vol. 17, pp. 91-108, 1995.
[5] Reynolds, Douglas A., "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, January 1995.
[6] Reynolds, Douglas A., "A Gaussian Mixture Modeling Approach to Text-Independent Speaker Identification," PhD Thesis, Georgia Institute of Technology, August 1992.
[7] Z. J. Wu and Z. G. Cao, "Improved MFCC-Based Feature for Robust Speaker Identification," Tsinghua Science and Technology, vol. 10, pp. 158-161, Apr. 2005.
[8] Wei Han, Cheong-Fat Chan, Chiu-Sing Choy and Kong-Pang Pun, "An Efficient MFCC Extraction Method in Speech Recognition," Proceedings, IEEE International Symposium on Circuits and Systems (ISCAS 2006), 21-24 May 2006.
[9] Tomi Kinnunen and Haizhou Li, "An Overview of Text-Independent Speaker Recognition: from Features to Supervectors," Speech Communication, July 2009.