Gammachirp based speech analysis for speaker identification

MOUSLEM BOUCHAMEKH, BOUALEM BOUSSEKSOU, DAOUD BERKANI
Signal and Communication Laboratory, Electronics Department
National Polytechnics School, 10 Avenue Hacen BADI, 16200 Algiers, ALGERIA
mbouchamekh@gmail.com

Abstract: - Many modern speaker recognition systems use a bank of linear filters as the first step in the frequency analysis of speech and in the extraction of the acoustic parameters that characterize speaker identity. In this paper we illustrate the use of a novel feature set extracted from the speech signal. The new technique for extracting these parameters is based on the characteristics of the human auditory system and relies on the gammachirp filter to emulate its asymmetric and level-dependent frequency response. For evaluation, a comparative study was carried out against standard MFCC.

Key-Words: - Speaker identification, MFCC, Gammachirp, triangular

1 Introduction
Feature extraction is the key front-end process in speaker identification systems. The performance of the identification is highly dependent on the quality of the selected speech features. Most currently proposed speaker identification systems use mel frequency cepstral coefficients (MFCC) and linear predictive cepstral coefficients (LPCC) as feature vectors. It is known that auditory frequency selectivity is largely determined by signal processing in the cochlea [4, 5, 7]. The basilar membrane inside the cochlea is usually conceived (in psychoacoustical auditory masking models) as a bank of band-pass filters of increasing bandwidth. Irino and Patterson [4, 5] have developed a theoretically optimal auditory filter, the gammachirp, whose parameters can be chosen to fit observed physiological and psychoacoustical data. In this work, a new approach to speech analysis based on gammachirp filters is presented.
After extracting the parameters, we compare their performance with standard MFCC in a text-independent speaker identification system. The evaluation is conducted on a database of 168 speakers extracted from TIMIT. Our speaker identification system is based on a Gaussian Mixture Model (GMM) classifier [1].

2 The standard MFCC
The spectral features given by the Mel-Frequency Cepstral Coefficients have been proven to provide an accurate depiction of the spectral information of the human vocal tract. The mel-cepstral features are calculated by taking the cosine transform of the real logarithm of the short-term energy spectrum expressed on a mel-frequency scale.

Fig. 1: MFCC extraction procedure (block labels: Speech; Pre-emphasising and windowing; DFT (FFT); Triangular on mel scale; Log.; DFT; MFCC).

After pre-emphasizing the speech using a first-order high-pass filter and windowing the speech segments using a Hamming window of 20 ms length with 10 ms overlap, the Discrete Fourier Transform of these segments is taken. The magnitude of the Fourier Transform is then passed through a filter bank comprising twenty-five triangular filters. The start and end points of these filters were calculated by first spacing the triangular filters evenly on the mel scale and then using equation 1 to convert these values back to the linear scale.

$$\mathrm{Mel}(f) = 2595\,\log_{10}\left(1 + \frac{f}{700}\right) \qquad (1)$$

The resulting filters used in our experiments are shown in Fig. 2.

ISSN: 1790-5117 19 ISBN: 978-960-474-144-1
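The edge computation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the 0 to 8000 Hz band, and the layout in which each filter peaks where its neighbour starts are assumptions made for the example.

```python
import math

def hz_to_mel(f_hz):
    """Equation 1: map a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of equation 1: map a mel value back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def triangular_filter_edges(n_filters, f_low, f_high):
    """Return n_filters + 2 edge frequencies (Hz), evenly spaced in mel.

    Assumed layout: filter k starts at edges[k], peaks at edges[k + 1]
    and ends at edges[k + 2].
    """
    mel_low, mel_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (mel_high - mel_low) / (n_filters + 1)
    return [mel_to_hz(mel_low + i * step) for i in range(n_filters + 2)]

# Assumed setup: 25 filters over 0..8000 Hz (half of a 16 kHz sampling rate).
edges = triangular_filter_edges(25, 0.0, 8000.0)
```

Because the spacing is uniform in mel, the edges are denser at low frequencies, so the filters widen as the centre frequency increases.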
Fig. 2: Triangular Mel filter bank.

Lastly, the cepstral coefficients were calculated from the log-energy outputs of these filters by the equation:

$$c_n = \sum_{k=1}^{35} E_k \cos\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{35}\right], \quad n = 1, \dots, N \qquad (2)$$

where $N$ is the number of coefficients and $E_k$ is the log-energy output of the $k$-th filter.

3 Gammachirp filter
The gammachirp filter is a gamma distribution modulated at the central frequency $f_r$. Its impulse response is given by the following function [5]:

$$g_c(t) = a\,t^{\,n-1}\exp\left(-2\pi b\,\mathrm{ERB}(f_r)\,t\right)\cos\left(2\pi f_r t + c\ln t + \varphi\right) \qquad (3)$$

with $t > 0$, and where
$n$: a parameter defining the order of the corresponding filter.
$f_r$: the frequency of modulation of the gamma function.
$\varphi$: the initial phase.
$a$: amplitude normalization parameter.
The term $b\,\mathrm{ERB}(f_r)$ characterizes the equivalent rectangular bandwidth (ERB) of the filter, and $b$ is a parameter defining the envelope of the gammachirp filter. The function $\mathrm{ERB}$ is defined by the expression:

$$\mathrm{ERB}(f) = 24.7 + 0.108\,f \qquad (4)$$

$c$: a factor introducing the asymmetry of this filter. Psychoacoustic studies show that $c$ is strongly dependent on the signal power in the frequency band centered on $f_r$.

Fig. 3: Example of gammachirp impulse response.

Fig. 4: The power spectrum of the gammachirp function.

The Fourier spectrum of the gammachirp is given by [5]:

$$|G_C(f)| = \frac{|\Gamma(n + jc)|}{\Gamma(n)}\,|G_T(f)|\,e^{c\,\theta(f)} \qquad (6)$$

where $\theta(f) = \arctan\left(\frac{f - f_r}{b\,\mathrm{ERB}(f_r)}\right)$, and $G_T(f)$ is the spectrum of the corresponding gammatone function (obtained from $g_c(t)$ for $c = 0$).

4 Gammachirp based Speech Analysis
The analysis of speech signals is performed with a gammachirp filter bank. In this work we use 35 gammachirp filters in each filter bank (of 4th order, $n = 4$); the filter bank is applied on the frequency band $[0, f_s/2]$ (where $f_s$ is the sampling frequency). The speech signal is first framed and multiplied by a Hamming window of 20 ms duration. Each gammachirp filtering is obtained in two steps: in the first step, the speech frame is filtered by the corresponding 4th-order gammatone filter (obtained from $g_c(t)$ for $c = 0$); in the second step, we estimate the speech power and calculate the asymmetry parameter $c$, as shown in Fig. 5.

$$c = 3.38 - 0.107\,P_s \qquad (5)$$

where $P_s$ is the signal power.
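The impulse response of equation 3 and the bandwidth of equation 4 can be sampled directly, as in the sketch below. Several choices here are assumptions made for illustration and are not stated in the text: the envelope parameter b = 1.019, the 20 ms response length, the unit amplitude, and the reading of equation 5 with the level in dB and a minus sign, as the gammachirp is usually stated in [5].

```python
import math

def erb(f_hz):
    """Equation 4: equivalent rectangular bandwidth (Hz) at frequency f_hz."""
    return 24.7 + 0.108 * f_hz

def gammachirp(f_r, fs, c, n=4, b=1.019, a=1.0, phi=0.0, duration=0.02):
    """Sample the impulse response of equation 3 for t > 0.

    f_r: centre frequency (Hz); fs: sampling rate (Hz); c: asymmetry factor.
    b = 1.019 and duration = 20 ms are illustrative assumptions.
    """
    n_samples = int(round(duration * fs))
    response = []
    for k in range(1, n_samples + 1):  # start at k = 1 so that t > 0
        t = k / fs
        envelope = a * t ** (n - 1) * math.exp(-2.0 * math.pi * b * erb(f_r) * t)
        phase = 2.0 * math.pi * f_r * t + c * math.log(t) + phi
        response.append(envelope * math.cos(phase))
    return response

def asymmetry_from_level(p_s_db):
    """Equation 5, assuming the level is in dB and the sign used in [5]."""
    return 3.38 - 0.107 * p_s_db

# c = 0 removes the chirp term and leaves the gammatone of the same order.
gammatone = gammachirp(1000.0, 16000.0, c=0.0)
chirp = gammachirp(1000.0, 16000.0, c=asymmetry_from_level(50.0))
```

Both responses share the same gamma envelope; only the phase term c ln t differs, which is what makes the magnitude spectrum asymmetric around the centre frequency.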
Fig. 5: Gammachirp based speech analysis (block labels: Speech frame; Hamming; Amplitude normalisation; Gammatone filter; Ps estimation and calculation of c for each sub-band; Filter bank gammachirp; Asymmetry function).

Fig. 6: Example of 35 gammachirp filters.

5 Modeling by Gaussian Mixture Model (GMM)
In the speaker identification system under investigation, each speaker enrolled in the system is represented by a Gaussian mixture model (GMM). The idea of the GMM is to use a weighted sum of Gaussian functions to represent the probability density of the feature vectors produced by each speaker. The mathematical representation is [1]:

$$p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i\,\mathcal{N}(\mathbf{x};\,\mu_i, \Sigma_i) \qquad (7)$$

where $M$ is the number of mixtures, $\mathbf{x}$ is the feature vector, $w_i$ is the weight of the $i$-th mixture in the GMM, $\mu_i$ is the mean of the $i$-th mixture in the GMM, and $\Sigma_i$ is the covariance matrix of the $i$-th mixture in the GMM. The model parameters $\lambda = \{w_i, \mu_i, \Sigma_i\}$ characterize a speaker's voice in the form of a probability density function. They are determined by the Expectation Maximization (EM) algorithm.

In the identification phase, the log-likelihood score of the incoming sequence of feature vectors under each speaker model is calculated by:

$$L(X, \lambda) = \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda) \qquad (8)$$

where $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T\}$ is the sequence of speaker feature vectors, and $T$ is the total number of feature vectors. The GMM that generates the highest $L(X, \lambda)$ score is identified as the producer of the incoming speech signal. This decision method is called maximum likelihood (ML).

6 Experimental Evaluation
Three experiments have been conducted on a database of 168 speakers extracted from TIMIT. The first experiment is conducted on the original speech sampled at 16 kHz, and the last two experiments are conducted on a version of the speech downsampled to 8 kHz; the downsampling is done by filtering the speech to the band [0, 3400] Hz and then decimating by a factor of 2. The speech signal was extracted using an energy-based algorithm (silent segments are excluded). The analysis of the speech signal was conducted over frames of 20 ms with an overlap of 10 ms.
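The maximum-likelihood decision of Section 5 can be sketched as follows. This is an illustration under stated assumptions, not the system used in the experiments: the covariance matrices are taken to be diagonal, the score of equation 8 is taken to be the sum of per-frame log densities, and the two toy speaker models (with two mixtures each rather than 32) and the test frames are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (an assumption)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm_density(x, gmm):
    """log p(x | lambda) of equation 7, using log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, mean, var)
        for w, mean, var in gmm
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def log_likelihood(frames, gmm):
    """Score of equation 8, taken as the sum of per-frame log densities."""
    return sum(log_gmm_density(x, gmm) for x in frames)

def identify(frames, speaker_models):
    """Maximum-likelihood decision: name of the best-scoring model."""
    return max(
        speaker_models,
        key=lambda name: log_likelihood(frames, speaker_models[name]),
    )

# Hypothetical toy models: each GMM is a list of (weight, mean, variance).
speaker_models = {
    "speaker_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "speaker_b": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 6.0], [1.0, 1.0])],
}
test_frames = [[0.2, -0.1], [0.9, 1.1], [0.4, 0.6]]
winner = identify(test_frames, speaker_models)
```

Working in the log domain avoids numerical underflow when many frames are scored, since the product of per-frame densities quickly becomes too small to represent.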
In TIMIT, each speaker produces 10 sentences; 7 arbitrary sentences were used for training and the remaining 3 sentences were used for testing. The average sentence length is 3 seconds. In other words, there were 21 seconds of speech for training and 9 seconds for testing, split into 3 tests of 3 seconds each. The classification engine used in this work was a 32-mixture GMM classifier initialized by vector quantization [10]. The results obtained in the first experiment are summarized in Table 1.

Fig. 7: Speaker Gaussian Mixture model (GMM)

Number of coefficients   Mel triangular   Gammachirp
 2                       68.85            86.87
 4                       92.06            90.48
 6                       96.43            96.21
 8                       97.62            97.02
10                       98.41            99.20
12                       99.41            99.20
14                       99.01            99.80
16                       99.21            99.21
18                       99.41            99.41
20                       99.41            99.41
22                       99.80            99.60
Table 1: Identification rate (%) obtained on the 16 kHz database.

Fig. 8: Results for the first experiment.

As shown in Fig. 8, the identification rate increases with the number of coefficients for both parameter sets, standard MFCC and gammachirp based coefficients. We can also remark that for fewer than 12 coefficients the standard MFCC are slightly more efficient, and the opposite holds for more than 12 coefficients.

In the second experiment, the speech database is filtered to the band [0, 3400] Hz and downsampled to 8 kHz by decimation. The identification results are summarized in Table 2.

Number of coefficients   Mel triangular   Gammachirp
 2                       49.41            51.19
 4                       83.53            84.33
 6                       91.47            91.27
 8                       95.24            95.64
10                       97.62            97.02
12                       97.42            98.02
14                       98.61            98.02
16                       98.41            98.21
18                       98.21            98.02
20                       97.02            98.41
Table 2: Identification rate (%) obtained on the 8 kHz database.

Fig. 9: Results for the second experiment.

As we can see in Fig. 9, the identification rates are lower than in the first experiment. In general, the rates obtained with gammachirp are slightly higher than those obtained with the Mel triangular filters.

To evaluate performance in the presence of noise, in the third experiment we evaluate the speaker identification system on a noisy database: our database is corrupted with additive white Gaussian noise. The results obtained are summarized in Table 3.

SNR (dB)   Mel      Gammachirp
 0         81.27    49.60
 5         83.53    83.14
10         93.25    92.46
15         94.84    95.04
20         96.43    96.63
25         96.53    96.63
30         96.63    98.21
Table 3: Identification rates (%) for noisy speech.

Fig. 10: Identification rate for noisy speech

The identification rate increases with speech quality: the higher the signal-to-noise ratio, the higher the identification rate. The gammachirp based parameters are slightly more efficient than standard
MFCC for noisy speech (98.21% vs 96.62% at an SNR of 30 dB).

7 Conclusion
In this paper we have presented a new method of speech analysis that is based on the characteristics of the human auditory system and relies on the gammachirp filter. The extracted coefficients are evaluated using a GMM classifier and compared with standard MFCC parameters for text-independent speaker identification. The results obtained show that the new technique is useful for noisy speech and yields slightly better identification rates than standard MFCC.

References
[1] D. A. Reynolds and R. C. Rose, Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models, IEEE Transactions on SAP, vol. 3, pp. 72-83, January 1995.
[2] D. A. Reynolds, Experimental Evaluation of Features for Robust Speaker Identification, IEEE Transactions on SAP, vol. 2, pp. 639-643, October 1994.
[3] J. P. Campbell Jr., Speaker recognition: a tutorial, Proceedings of the IEEE, vol. 85, pp. 1437-1462, September 1997.
[4] T. Irino, R. D. Patterson, Temporal asymmetry in the auditory system, J. Acoust. Soc. Am. 99(4): 2316-2331, April 1996.
[5] T. Irino, R. D. Patterson, A time-domain, level-dependent auditory filter: the gammachirp, J. Acoust. Soc. Am. 101(1): 412-419, January 1997.
[6] T. Irino and M. Unoki, An analysis/synthesis auditory filterbank based on an IIR implementation of the gammachirp, J. Acoust. Soc. Japan 20(6): 397-406, November 1999.
[7] T. Irino, R. D. Patterson, A compressive gammachirp auditory filter for both physiological and psychophysical data, J. Acoust. Soc. Am. 109(5): 2008-2022, May 2001.
[8] J. O. Smith III, J. S. Abel, Bark and ERB bilinear transforms, IEEE Trans. on Speech and Audio Processing, vol. 7, no. 6, November 1999.
[9] J. E. Hawkins Jr. and S. S. Stevens, The masking of pure tones and of speech by white noise, J. Acoust. Soc. Am., vol. 22, pp. 6-13, 1950.
[10] Y. Linde, A. Buzo, R. Gray, An Algorithm for Vector Quantizer Design, IEEE Transactions on Communications, vol. 28(1), pp. 84-95, January 1980.