Speaker Identification Based on GFCC Using GMM

Md. Moinuddin, M.Tech. Student, E&CE Dept., PDACE
Arunkumar N. Kanthi, Asst. Professor, E&CE Dept., PDACE

Abstract: The performance of a conventional speaker identification system degrades drastically in the presence of noise. The ability of the human ear to identify a speaker's identity in noisy environments motivates the use of an auditory-based feature called the gammatone frequency cepstral coefficient (GFCC). The GFCC is based on the gammatone filter bank, which models the basilar membrane as a series of overlapping band-pass filters. A speaker identification system using GFCC features and GMMs has been developed and analysed on the TIMIT and NTIMIT databases, and its performance is compared with a baseline system using the traditional MFCC features. The results show that the GFCC features give good recognition performance not only in a clean speech environment, but also in a noisy environment.

Keywords: Auditory-based feature, Gammatone Frequency Cepstral Coefficient (GFCC), MFCC, GMM, EM algorithm

I. INTRODUCTION

Speaker identification determines from which of the enrolled speakers a given utterance has come. The utterance can be constrained to a known phrase (text-dependent) or totally unconstrained (text-independent). The task consists of feature extraction, speaker modeling and decision making. Typically, the extracted speaker features are Mel-frequency cepstral coefficients (MFCCs). For speaker modeling, Gaussian mixture models (GMMs) are widely used to describe the feature distributions of individual speakers. Recognition decisions are usually made based on the likelihoods of observing feature frames given a speaker model. The poor performance of MFCCs in noisy or mismatched conditions can be attributed to the use of triangular filters for modelling the auditory critical bands.
To model the cochlear filter more accurately, gammatone filters are used instead of the triangular filters, and the extracted features are called gammatone frequency cepstral coefficients (GFCCs).

II. THE SYSTEM MODEL

The speaker identification system consists of two parts: a front-end and a back-end. The front-end of the system is a feature extractor, while the back-end consists of a classifier and a reference database. During training, the GFCC extractor feeds the GMM modeling stage, whose speaker models are stored in the database; during testing, the ML classifier scores the test utterance against the stored models to produce the identification result.

Figure 1: Architecture of speaker identification system

2014, IJIRAE - All Rights Reserved, Page 224
The main task of the front-end is to extract features from a speech signal. The aim is to represent the characteristics of the speech signal sufficiently, with reduced redundancy. Features are extracted frame by frame: one feature vector is calculated for every frame. After feature extraction, the sequence of feature vectors is passed to the back-end of the speaker identification system. Based on the feature vectors, the back-end selects the most likely speaker out of all the possibilities in the reference database. After training, the statistical models are stored in the database. When an unknown utterance is presented, its feature vectors are obtained; the classifier computes the log-likelihood under each stored model and decides the most likely speaker.

III. GFCC EXTRACTION

The GFCC features are based on the gammatone filter bank (GTFB). The feature vectors are calculated from the spectra of a series of windowed speech frames. The extraction pipeline is: pre-emphasis, framing & windowing, DFT and magnitude, GTFB, logarithmic compression, and DCT, which yields the GFCC features.

Figure 2: Block diagram of GFCC extraction

Pre-emphasis stage: The high-frequency components of a speech signal have lower amplitude than the low-frequency components due to the radiation effect of the lips. In order to spectrally flatten the speech signal, i.e. to obtain similar amplitude for all frequency components, the speech signal is passed through a pre-emphasis filter, a first-order FIR digital filter that effectively removes the spectral contribution of the lips. Speech sounds much sharper after pre-emphasis. The transfer function of the pre-emphasis filter is given by

H(z) = 1 - a*z^(-1)   (1)

where a is a constant with a typical value of 0.97.

Fig. 3: Pre-emphasis operation
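As an illustrative sketch (not part of the paper's implementation), the pre-emphasis filter of equation (1) corresponds in the time domain to y[n] = x[n] - a*x[n-1], which can be written in a few lines of Python:

```python
import numpy as np

def pre_emphasis(signal, a=0.97):
    """First-order FIR pre-emphasis: y[n] = x[n] - a*x[n-1], per Eq. (1)."""
    # Keep the first sample unchanged; subtract a scaled copy of the
    # previous sample from every subsequent sample.
    return np.append(signal[0], signal[1:] - a * signal[:-1])

# A decaying low-frequency-heavy test signal is flattened by the filter
x = np.array([1.0, 0.5, 0.25, 0.125])
y = pre_emphasis(x)
```

The filter attenuates slowly varying (low-frequency) content while passing sharp sample-to-sample changes, which is exactly the spectral flattening described above.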
Framing & windowing: A speech signal is non-stationary, i.e. its statistical characteristics vary with time. Since the glottal system cannot change instantaneously, speech can be considered time-invariant over short segments (20-30 ms); therefore the speech signal is split into frames of 20 ms. When the signal is framed, the edges of each frame must be treated carefully, otherwise they introduce spectral leakage, so a windowing function is used to taper the edges. The chosen window should have a narrow main lobe and attenuated side lobes, which makes the Hamming window the preferred choice. The Hamming window is given by

w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)),  0 <= n <= N-1   (2)

Fig. 4: Windowing operation

As a consequence of windowing, the samples are not assigned the same weight in the following computations; for this reason it is sensible to use an overlap (10 ms).

DFT: The windowed frame is transformed using a discrete Fourier transform and the magnitude is taken, because the phase does not carry any speaker-specific information.

Fig. 5: DFT operation

Gammatone filter bank stage: The gammatone filter bank consists of a series of band-pass filters, which model the frequency-selectivity property of the basilar membrane. The impulse response of each filter is given by

g_m(t) = a * t^(n-1) * exp(-2*pi*b_m*t) * cos(2*pi*f_m*t + phi),  1 <= m <= M   (3)

where a is a constant (usually equal to 1), n is the filter order (here n = 4), phi is the phase shift, f_m is the center frequency, and b_m is the attenuation factor of the filter, which is related to the bandwidth of the filter and determines the decay rate of the impulse response.

Fig. 6: Frequency response of a 64-channel gammatone filter bank
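The framing and windowing step above can be sketched as follows (a minimal illustration, assuming a 16 kHz sampling rate; function and variable names are our own, not the paper's):

```python
import numpy as np

def frame_signal(signal, fs=16000, frame_ms=20, hop_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window.

    20 ms frames with a 10 ms hop give the 10 ms overlap described in
    the text; np.hamming implements Eq. (2).
    """
    frame_len = int(fs * frame_ms / 1000)   # 320 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)           # 160 samples -> 10 ms overlap
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)          # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames

# One second of noise yields 99 overlapping 20 ms frames
frames = frame_signal(np.random.randn(16000))
```

Each row of `frames` would then be passed through the DFT and magnitude stages described next.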
The centre frequency of the m-th gammatone filter, with centres spaced uniformly on the ERB-rate scale, can be determined by

f_m = -C + (f_H + C) * exp((m/M) * ln((f_L + C)/(f_H + C))),  C = 228.83 Hz   (4)

where f_L and f_H are the lower and upper frequencies of the filter bank. The bandwidth of each filter is described by an Equivalent Rectangular Bandwidth (ERB). The ERB is a psychoacoustic measure of the width of the auditory filter at each point along the cochlea:

ERB(f) = 24.7 * (4.37*f/1000 + 1)   (5)

The bandwidth of each filter is then

b_m = 1.019 * ERB(f_m)   (6)

The FFT magnitude coefficients are binned by correlating them with each gammatone filter, i.e. each FFT magnitude coefficient is multiplied by the gain of the corresponding filter and the result is accumulated. Thus each bin holds the spectral magnitude in that filter-bank channel:

Y(m) = sum_k |X(k)| * G_m(k)   (7)

Fig. 7: Filter bank processing

Logarithmic compression & discrete cosine transform (DCT) stage: The logarithm is applied to each filter output to simulate the loudness perceived by humans at a given signal intensity, and to separate the excitation (source) produced by the vocal cords from the filter that represents the vocal tract. Since the log-power spectrum is real, a discrete cosine transform (DCT) is applied to the filter outputs, producing highly uncorrelated features. The envelope of the vocal tract changes slowly and thus appears at low quefrencies (lower-order cepstrum), while the periodic excitation appears at high quefrencies (higher-order cepstrum):

c_i = sqrt(2/M) * sum_{m=1}^{M} log(Y(m)) * cos(pi*i*(m - 0.5)/M),  where 1 <= i <= L   (8)
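The ERB-spaced filter-bank design of equations (4)-(6) can be sketched as below. This is a hedged illustration: the constant 228.83 = 9.26449 x 24.7 follows Slaney's formulation of the Patterson-Holdsworth filter bank, and all names are our own:

```python
import numpy as np

def erb(f):
    """Equivalent Rectangular Bandwidth, Eq. (5): 24.7*(4.37*f/1000 + 1)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def erb_center_freqs(f_low, f_high, n_filters):
    """Center frequencies equally spaced on the ERB-rate scale, Eq. (4)."""
    ear_q, min_bw = 9.26449, 24.7
    c = ear_q * min_bw                       # ~228.83 Hz
    m = np.arange(1, n_filters + 1)
    return -c + (f_high + c) * np.exp(
        m * (np.log(f_low + c) - np.log(f_high + c)) / n_filters)

# 64 channels spanning 50 Hz - 8 kHz, as in the 64-channel bank of Fig. 6
cfs = erb_center_freqs(50, 8000, 64)
bandwidths = 1.019 * erb(cfs)               # Eq. (6): b_m = 1.019*ERB(f_m)
```

By construction the last centre frequency equals f_L exactly, and the channels get progressively wider toward high frequencies, mirroring the basilar membrane's frequency resolution.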
Fig. 8: Logarithm and DCT operation

IV. GAUSSIAN MIXTURE MODEL

The task of the back-end is to classify the feature vectors. Each speaker is represented by a speaker model

lambda = { w_i, mu_i, Sigma_i },  i = 1, ..., M

where mu_i is the mean vector, Sigma_i is the covariance matrix and w_i is the mixture weight of the i-th component. The Gaussian mixture model (GMM) expresses the probability density function of a random variable as a weighted sum of components, each of which is described by a Gaussian density. The feature vectors extracted from the speech of an enrolled speaker are modelled as

p(x | lambda) = sum_{i=1}^{M} w_i * p_i(x)   (9)

where x is a D-dimensional random vector and p_i(x) = N(x; mu_i, Sigma_i) is the i-th component density. Let X = {x_1, ..., x_T} be the set of training feature vectors. Training a GMM requires the computation of w_i, mu_i and Sigma_i from the feature vectors belonging to a speaker; maximum-likelihood estimation is used to estimate these parameters.
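The mixture density of equation (9) can be evaluated as follows (a minimal sketch assuming diagonal covariance matrices, which is the common choice for speaker GMMs; the function name is ours):

```python
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """log p(x | lambda) for a diagonal-covariance GMM, i.e. the log of Eq. (9)."""
    D = x.shape[-1]
    log_comps = []
    for w, mu, var in zip(weights, means, variances):
        # Log of one diagonal Gaussian component density p_i(x)
        ll = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(var))
                     + np.sum((x - mu) ** 2 / var))
        log_comps.append(np.log(w) + ll)
    # log-sum-exp of the weighted components, for numerical stability
    m = max(log_comps)
    return m + np.log(sum(np.exp(c - m) for c in log_comps))
```

Working in the log domain avoids the underflow that occurs when many small frame likelihoods are multiplied together.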
Maximum-Likelihood Estimation

ML estimation aims to maximize the likelihood p(X | lambda) of the GMM over the given set of feature vectors X = {x_1, ..., x_T}:

p(X | lambda) = prod_{t=1}^{T} p(x_t | lambda)   (10)

Since the logarithm cannot be moved inside the summation of the mixture density, direct maximization is not possible. However, the estimates can be obtained iteratively using the expectation-maximization (EM) algorithm.

Expectation-Maximization Algorithm

1. Initialize: the means by clustering the feature vectors with the k-means algorithm; the mixture weights to be equally likely, by setting each weight to 1/M; the covariance matrices to identity matrices.

2. Expectation step: evaluate the responsibilities

gamma_t(i) = w_i * p_i(x_t) / sum_{j=1}^{M} w_j * p_j(x_t)

3. Maximization step: update the parameters using the current responsibilities

w_i = N_i / T
mu_i = (1/N_i) * sum_{t=1}^{T} gamma_t(i) * x_t
Sigma_i = (1/N_i) * sum_{t=1}^{T} gamma_t(i) * (x_t - mu_i)(x_t - mu_i)^T

where N_i = sum_{t=1}^{T} gamma_t(i).

4. Evaluate the log-likelihood

log p(X | lambda) = sum_{t=1}^{T} log p(x_t | lambda)
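The EM steps above can be sketched compactly with numpy. This is an illustrative implementation under simplifying assumptions (diagonal covariances, random data points in place of k-means initialization, a fixed iteration count rather than a convergence test); all names are ours:

```python
import numpy as np

def train_gmm(X, K, n_iter=50, seed=0):
    """Fit a K-component diagonal-covariance GMM to X (T x D) with EM."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    means = X[rng.choice(T, K, replace=False)]   # step 1: init means (k-means in the paper)
    weights = np.full(K, 1.0 / K)                # equal mixture weights
    covs = np.ones((K, D))                       # identity (diagonal) covariances

    for _ in range(n_iter):
        # E-step: responsibilities gamma_t(i), computed in the log domain
        log_g = (np.log(weights)
                 - 0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(covs), axis=1))
                 - 0.5 * np.sum((X[:, None, :] - means) ** 2 / covs, axis=2))
        log_g -= log_g.max(axis=1, keepdims=True)
        gamma = np.exp(log_g)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means and variances
        Nk = gamma.sum(axis=0)                   # N_i = sum_t gamma_t(i)
        weights = Nk / T
        means = (gamma.T @ X) / Nk[:, None]
        covs = (gamma.T @ X ** 2) / Nk[:, None] - means ** 2
        covs = np.maximum(covs, 1e-6)            # variance floor for stability
    return weights, means, covs
```

In practice a variance floor (as above) is commonly applied so that no component collapses onto a few frames.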
Check the convergence of the parameters or of the log-likelihood; if the convergence criterion is not satisfied, return to step 2.

V. SPEAKER IDENTIFICATION

Speaker identification is done by finding the speaker model that has the maximum a-posteriori probability for the given set of test feature vectors X = {x_1, ..., x_T}, i.e.

S_hat = argmax_{1<=s<=S} p(lambda_s | X)

By Bayes' rule we have

S_hat = argmax_{1<=s<=S} p(X | lambda_s) * p(lambda_s) / p(X)

Assuming the speakers to be equally likely, i.e. p(lambda_s) = 1/S, and noting that p(X) is independent of the speaker model, the above equation simplifies to

S_hat = argmax_{1<=s<=S} p(X | lambda_s)

Assuming the feature vectors to be occurrences of independent random variables and taking the logarithm, we get

S_hat = argmax_{1<=s<=S} sum_{t=1}^{T} log p(x_t | lambda_s)

VI. EXPERIMENTAL RESULTS

Speech database: Experiments are conducted on the TIMIT and NTIMIT speech databases. TIMIT consists of read speech recorded in a quiet environment without channel distortion; it has 630 speakers (438 males and 192 females) with 10 utterances per speaker, each about 3 seconds long on average. NTIMIT was created by transmitting all TIMIT utterances over actual telephone channels. The performance of the speaker identification system is evaluated using the 23-dimensional GFCC features and the baseline 13-dimensional MFCC features, for different orders of GMM.
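The decision rule of Section V, selecting the model with the maximum summed frame log-likelihood, can be sketched as follows (an illustrative fragment assuming diagonal-covariance speaker models stored as (weights, means, variances) tuples; the names are ours):

```python
import numpy as np

def frame_loglik(X, weights, means, covs):
    """Per-frame log p(x_t | lambda) for a diagonal-covariance GMM (X is T x D)."""
    D = X.shape[1]
    log_g = (np.log(weights)
             - 0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(covs), axis=1))
             - 0.5 * np.sum((X[:, None, :] - means) ** 2 / covs, axis=2))
    m = log_g.max(axis=1, keepdims=True)          # log-sum-exp over components
    return (m + np.log(np.exp(log_g - m).sum(axis=1, keepdims=True))).ravel()

def identify(X, models):
    """Return the index s maximizing sum_t log p(x_t | lambda_s)."""
    scores = [frame_loglik(X, *lam).sum() for lam in models]
    return int(np.argmax(scores))
```

Because the priors p(lambda_s) are equal and p(X) is common to all hypotheses, comparing these summed log-likelihoods is equivalent to the MAP rule derived above.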
1. For each database, 8 of the 10 utterances are used for training (about 24 s) and 2 for testing (about 6 s).
   i. Logarithmic compression:
   ii. Cubic-root compression:

2. For each database, 9 of the 10 utterances are used for training (about 27 s) and 1 for testing (about 3 s).
   i. Logarithmic compression:
   ii. Cubic-root compression:
VII. CONCLUSION

The results show that the gammatone frequency cepstral coefficient (GFCC) features capture speaker characteristics better than the conventional MFCC features and give good recognition performance not only in a clean speech environment (TIMIT) but also in a noisy environment (NTIMIT). Further, the modified MFCC (MMFCC) and modified GFCC (MGFCC) features, obtained by replacing the logarithm with the cubic root, show a drop in identification performance in both clean and noisy speech.

REFERENCES

[1] E. B. Tazi, A. Benabbou, and M. Harti, "Efficient Text Independent Speaker Identification Based on GFCC and CMN Methods," ICMCS 2012, pp. 90-95.
[2] He Xu, Lin Lin, Xiaoying Sun and Huanmei Jin, "A New Algorithm for Auditory Feature Extraction," CSNT 2012, pp. 229-232.
[3] Fengsong He and Xiao Cao, "An Auditory Feature Extraction Method for Robust Speaker Recognition," ICCT 2012, pp. 1067-1071.
[4] M. Slaney, "An Efficient Implementation of the Patterson-Holdsworth Auditory Filter Bank," Apple Technical Report No. 35, Advanced Technology Group, Apple Computer Inc., 1993.
[5] D. A. Reynolds and R. C. Rose, "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models," IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, 1995.
[6] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[7] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Pearson Education, 2005.