RECENT ADVANCES in COMPUTATIONAL INTELLIGENCE, MAN-MACHINE SYSTEMS and CYBERNETICS


Gammachirp based speech analysis for speaker identification

Mouslem Bouchamekh, Boualem Bousseksou, Daoud Berkani
Signal and Communication Laboratory, Electronics Department
National Polytechnics School, 10 Avenue Hacen BADI, 16200 Algiers, ALGERIA
mbouchamekh@gmail.com

Abstract: Many modern speaker recognition systems use a bank of linear filters as the first step in the frequency analysis of speech and in the extraction of the acoustic parameters that characterize the speaker's identity. In this paper we illustrate the use of a novel feature set extracted from the speech signal. The new extraction technique is based on the characteristics of the human auditory system and relies on the gammachirp filter to emulate its asymmetric, level-dependent frequency response. For evaluation, a comparative study against standard MFCC was conducted.

Key-Words: Speaker identification, MFCC, Gammachirp, Triangular filters

1 Introduction
Feature extraction is the key to the front-end process in speaker identification systems. Identification performance is highly dependent on the quality of the selected speech features. Most currently proposed speaker identification systems use mel frequency cepstral coefficients (MFCC) or linear predictive cepstral coefficients (LPCC) as feature vectors. It is known that auditory frequency selectivity is largely determined by signal processing in the cochlea [4, 5, 7]. The basilar membrane inside the cochlea is usually modeled (in psychoacoustical auditory masking models) as a bank of band-pass filters of increasing bandwidth. Irino and Patterson [4, 5] have developed a theoretically optimal auditory filter, the gammachirp, whose parameters can be chosen to fit observed physiological and psychoacoustical data. In this work, a new approach to speech analysis based on gammachirp filters is presented.
After extracting the parameters, we compare their performance with standard MFCC in a text-independent speaker identification system. The evaluation is conducted on a database of 168 speakers extracted from TIMIT. Our speaker identification system is based on a Gaussian Mixture Model (GMM) classifier [1].

2 The standard MFCC
The spectrum-based Mel-Frequency Cepstral Coefficients have been proven to provide an accurate depiction of the spectral information of the human vocal tract. The mel-cepstral features are calculated by taking the cosine transform of the real logarithm of the short-term energy spectrum expressed on a mel-frequency scale.

Fig. 1: MFCC extraction procedure (speech -> pre-emphasis and windowing -> DFT (FFT) -> triangular filter bank on the mel scale -> log -> DCT -> MFCC).

After pre-emphasizing the speech with a first-order high-pass filter and windowing it with a Hamming window of 20 ms length and 10 ms overlap, the Discrete Fourier Transform of each segment is taken. The magnitude of the Fourier transform is then passed through a filter bank comprising twenty-five triangular filters. The start and end points of these filters were obtained by evenly spacing the triangular filters on the mel scale and then using equation (1) to convert these values back to the linear scale:

    Mel(f) = 2595 * log10(1 + f/700)    (1)

The resulting filters used in our experiments are shown in Fig. 2.

ISSN: 1790-5117    ISBN: 978-960-474-144-1
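The MFCC front end described above (pre-emphasis, 20 ms Hamming windowing, DFT, 25 triangular mel filters, log, cosine transform) can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the authors' code; the pre-emphasis coefficient 0.97 and the FFT size equal to the frame length are assumed values the paper does not state.

```python
import numpy as np

def hz_to_mel(f):
    # Eq. (1): mel-scale warping
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters evenly spaced on the mel scale (Fig. 2)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, ctr, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, ctr):
            fbank[i - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):
            fbank[i - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fbank

def mfcc(frame, fs, n_filters=25, n_ceps=12):
    # First-order high-pass pre-emphasis, then a Hamming window
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    frame = frame * np.hamming(len(frame))
    spec = np.abs(np.fft.rfft(frame))                      # |DFT|
    log_e = np.log(mel_filterbank(n_filters, len(frame), fs) @ spec + 1e-10)
    # Eq. (2): cosine transform of the filter log-energies
    n = np.arange(1, n_ceps + 1)[:, None]
    k = np.arange(1, n_filters + 1)[None, :]
    return (log_e * np.cos(np.pi * n * (k - 0.5) / n_filters)).sum(axis=1)
```

A 20 ms frame at 16 kHz is 320 samples, so `mfcc(frame, 16000)` returns the first 12 cepstral coefficients of that frame.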

Fig. 2: Triangular mel filter bank.
Fig. 3: Example of a gammachirp impulse response.

Finally, the cepstral coefficients were calculated from the log-energy outputs of these filters by the equation:

    C_n = sum_{k=1..K} E_k * cos(n * (k - 1/2) * pi / K)    (2)

where n is the index of the coefficient, K is the number of filters, and E_k is the log-energy output of the k-th filter.

3 Gammachirp filter
The gammachirp filter is a gamma distribution modulated at a central frequency fr. Its impulse response is the following function [5]:

    gc(t) = a * t^(n-1) * exp(-2*pi*b*ERB(fr)*t) * cos(2*pi*fr*t + c*ln(t) + phi)    (3)

with
    n > 0 : a parameter defining the order of the corresponding filter;
    fr : the modulation frequency of the gamma function;
    phi : the initial phase;
    a : the amplitude normalization parameter.

The term b*ERB(fr) characterizes the equivalent rectangular bandwidth (ERB) of the filter, and b is a parameter defining the envelope of the gammachirp filter. The ERB function is defined by the expression:

    ERB(f) = 24.7 + 0.108 * f    (4)

c is a factor introducing the asymmetry of the filter. Psychoacoustic studies show that c depends strongly on the signal power in the frequency band centered at fr.

Fig. 4: Power spectrum of the gammachirp function.

The Fourier spectrum of the gammachirp can be written as [5]:

    |GC(f)| = (|Gamma(n + j*c)| / Gamma(n)) * e^(c*theta) * |GT(f)|    (6)

where theta = arctan((f - fr) / (b * ERB(fr))) and GT(f) is the spectrum of the corresponding gammatone function (obtained from gc for c = 0).

4 Gammachirp based speech analysis
The analysis of the speech signal is performed with a bank of gammachirp filters; in this work we use 35 gammachirps of 4th order (n = 4), applied over the frequency band 0 to fs/2 (where fs is the sampling frequency). The speech signal is first framed and multiplied by a Hamming window of 20 ms. Each gammachirp filtering is obtained in two steps: first, the speech frame is filtered by the corresponding 4th-order gammatone filter (obtained from gc for c = 0); second, we estimate the speech power Ps and calculate the asymmetry parameter c, as shown in Fig. 5:

    c = 3.38 - 0.107 * Ps    (5)

where Ps is the signal power.
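Equations (3) and (4) are straightforward to realize directly in the time domain. The sketch below generates a gammachirp impulse response; the envelope parameter b = 1.019 is a value commonly used for gammatone filters rather than one stated in the paper, and setting c = 0 recovers the gammatone used in the first filtering step.

```python
import numpy as np

def erb(f):
    # Eq. (4): equivalent rectangular bandwidth in Hz
    return 24.7 + 0.108 * f

def gammachirp(fr, fs, c=0.0, n=4, b=1.019, dur=0.025):
    # Eq. (3): t^(n-1) envelope, exponential decay set by b*ERB(fr),
    # carrier at fr with the chirp term c*ln(t); c = 0 gives a gammatone.
    t = np.arange(1, int(dur * fs)) / fs          # start at t > 0 so ln(t) is defined
    g = (t ** (n - 1) * np.exp(-2.0 * np.pi * b * erb(fr) * t)
         * np.cos(2.0 * np.pi * fr * t + c * np.log(t)))
    return g / np.max(np.abs(g))                  # amplitude normalization (parameter a)
```

Filtering a frame is then a convolution, e.g. `np.convolve(frame, gammachirp(1000.0, 16000, c=-1.0), mode='same')`.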

Fig. 5: Gammachirp-based speech analysis (speech frame -> Hamming window -> gammatone filter -> Ps estimation and computation of c for each sub-band -> asymmetry function -> gammachirp filter bank -> amplitude normalization).

Fig. 6: Example of 35 gammachirp filters.

5 Modeling by Gaussian Mixture Model (GMM)
In the speaker identification system under investigation, each speaker enrolled in the system is represented by a Gaussian mixture model (GMM). The idea of the GMM is to use a weighted series of Gaussian functions to represent the probability density of the feature vectors produced by each speaker. The mathematical representation is [1]:

    p(x|lambda) = sum_{i=1..M} w_i * N(x; mu_i, Sigma_i)    (7)

where M is the number of mixtures, x is the feature vector, and w_i, mu_i and Sigma_i are the weight, mean and covariance matrix of the i-th mixture. The model parameters lambda = {w_i, mu_i, Sigma_i} characterize a speaker's voice in the form of a probability density function; they are determined by the expectation-maximization (EM) algorithm.

In the identification phase, the log-likelihood score of the incoming sequence of feature vectors X = {x_1, ..., x_T} under each speaker model is calculated by:

    log p(X|lambda) = sum_{t=1..T} log p(x_t|lambda)    (8)

where T is the total number of feature vectors. The GMM that yields the highest score is identified as the producer of the incoming speech signal. This decision rule is called maximum likelihood (ML).
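Equations (7)-(8) and the ML decision translate directly into code. The sketch below is a simplified stand-in for the system described here: it uses diagonal covariances and random initialization for brevity, whereas the paper uses 32 mixtures initialized by vector quantization.

```python
import numpy as np

def fit_gmm(X, M=4, iters=50, seed=0):
    """Diagonal-covariance GMM trained with EM; X is (T, d). Models Eq. (7)."""
    rng = np.random.default_rng(seed)
    T, d = X.shape
    w = np.full(M, 1.0 / M)
    mu = X[rng.choice(T, M, replace=False)]
    var = np.full((M, d), X.var(axis=0) + 1e-6)
    for _ in range(iters):
        # E-step: responsibilities from the weighted Gaussian densities
        logp = (-0.5 * (np.log(2 * np.pi * var).sum(axis=1)
                + ((X[:, None, :] - mu) ** 2 / var).sum(axis=2))
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        Nk = r.sum(axis=0) + 1e-10
        w = Nk / T
        mu = (r.T @ X) / Nk[:, None]
        var = (r.T @ (X ** 2)) / Nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

def log_likelihood(X, model):
    """Eq. (8): sum over frames of log p(x_t | lambda), via log-sum-exp."""
    w, mu, var = model
    logp = (-0.5 * (np.log(2 * np.pi * var).sum(axis=1)
            + ((X[:, None, :] - mu) ** 2 / var).sum(axis=2))
            + np.log(w))
    m = logp.max(axis=1)
    return (m + np.log(np.exp(logp - m[:, None]).sum(axis=1))).sum()

def identify(X, models):
    """ML decision: the speaker whose model maximizes Eq. (8)."""
    return max(models, key=lambda s: log_likelihood(X, models[s]))
```

Here `models` maps each speaker label to a trained GMM, e.g. `models = {"spk1": fit_gmm(feats1), "spk2": fit_gmm(feats2)}`, and `identify(test_feats, models)` returns the most likely speaker label.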
6 Experimental evaluation
Three experiments were conducted on the 168-speaker database extracted from TIMIT. The first experiment uses the original speech sampled at 16 kHz; the last two use a version downsampled to 8 kHz, obtained by filtering the speech to the band [0, 3400] Hz and applying a decimation of factor 2. The speech was extracted with an energy-based algorithm (silence intervals are excluded). The analysis was carried out over speech frames of 20 ms with an overlap of 10 ms.

In TIMIT, each speaker produces 10 sentences; 7 arbitrary sentences were used for training and the remaining 3 for testing. The average sentence length is 3 seconds, i.e. 21 seconds of speech for training and 9 seconds for 3 tests of 3 seconds each. The classification engine was a 32-mixture GMM classifier initialized by vector quantization [10].

Fig. 7: Speaker Gaussian mixture model (GMM).

The results obtained in the first experiment are summarized in Table 1.

    Number of coefficients | Mel triangular | Gammachirp
     2 | 68.85 | 86.87
     4 | 92.06 | 90.48
     6 | 96.43 | 96.21
     8 | 97.62 | 97.02
    10 | 98.41 | 99.20
    12 | 99.41 | 99.20
    14 | 99.01 | 99.80
    16 | 99.21 | 99.21
    18 | 99.41 | 99.41
    20 | 99.41 | 99.41
    22 | 99.80 | 99.60

Table 1: Identification rate (%) obtained on the 16 kHz database.

Fig. 8: Results for the 16 kHz database.

As shown in Fig. 8, the identification rate increases with the number of coefficients for both parameterizations, standard MFCC and gammachirp-based coefficients. We can also remark that for fewer than 12 coefficients the standard MFCC are slightly more efficient, and conversely for more than 12 coefficients.

In the second experiment, the speech database is filtered to the band [0, 3400] Hz and downsampled to 8 kHz by decimation. The identification results are summarized in Table 2.

    Number of coefficients | Mel triangular | Gammachirp
     2 | 49.41 | 51.19
     4 | 83.53 | 84.33
     6 | 91.47 | 91.27
     8 | 95.24 | 95.64
    10 | 97.62 | 97.02
    12 | 97.42 | 98.02
    14 | 98.61 | 98.02
    16 | 98.41 | 98.21
    18 | 98.21 | 98.02
    20 | 97.02 | 98.41

Table 2: Identification rate (%) obtained on the 8 kHz database.

Fig. 9: Results for the 8 kHz database.

As the graphs show, the identification rates are lower than at 16 kHz; generally, we remark that the gammachirp rates are slightly superior to the mel-triangular ones.

To evaluate the performance in the presence of noise, in the third experiment we test the speaker identification system on a noisy database: our database is corrupted with additive white Gaussian noise. The obtained results are summarized in Table 3.

    SNR (dB) | Mel | Gammachirp
     0 | 81.27 | 49.60
     5 | 83.53 | 83.14
    10 | 93.25 | 92.46
    15 | 94.84 | 95.04
    20 | 96.43 | 96.63
    25 | 96.53 | 96.63
    30 | 96.63 | 98.21

Table 3: Identification rates (%) for noisy speech.

Fig. 10: Identification rate for noisy speech.

The identification rate increases with speech quality: the higher the signal-to-noise ratio, the higher the identification rate. The gammachirp-based parameters are slightly more efficient than standard
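For the third experiment, the clean database is corrupted with additive white Gaussian noise at a chosen SNR. A minimal version of that corruption step (an illustration, not the authors' tooling) is:

```python
import numpy as np

def add_awgn(x, snr_db, seed=0):
    # Scale white Gaussian noise so that 10*log10(P_signal / P_noise) = snr_db
    rng = np.random.default_rng(seed)
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(p_noise), x.shape)
```

Calling `add_awgn(signal, 0.0)` through `add_awgn(signal, 30.0)` reproduces the SNR conditions of Table 3.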

MFCC for noisy speech (98.21% vs 96.63% at 30 dB SNR).

7 Conclusion
In this paper we have presented a new method of speech analysis based on the characteristics of the human auditory system and relying on the gammachirp filter. The extracted coefficients are evaluated with a GMM classifier and compared with standard MFCC parameters for text-independent speaker identification. The results show that the new technique is especially useful for noisy speech, where it achieves better rates than standard MFCC.

References
[1] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, pp. 72-83, Jan. 1995.
[2] D. A. Reynolds, "Experimental evaluation of features for robust speaker identification," IEEE Transactions on Speech and Audio Processing, vol. 2, pp. 639-643, Oct. 1994.
[3] J. P. Campbell, Jr., "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, pp. 1437-1462, Sept. 1997.
[4] T. Irino and R. D. Patterson, "Temporal asymmetry in the auditory system," J. Acoust. Soc. Am., vol. 99, no. 4, pp. 2316-2331, April 1996.
[5] T. Irino and R. D. Patterson, "A time-domain, level-dependent auditory filter: the gammachirp," J. Acoust. Soc. Am., vol. 101, no. 1, pp. 412-419, Jan. 1997.
[6] T. Irino and M. Unoki, "An analysis/synthesis auditory filterbank based on an IIR implementation of the gammachirp," J. Acoust. Soc. Japan (E), vol. 20, no. 6, pp. 397-406, Nov. 1999.
[7] T. Irino and R. D. Patterson, "A compressive gammachirp auditory filter for both physiological and psychophysical data," J. Acoust. Soc. Am., vol. 109, no. 5, pp. 2008-2022, May 2001.
[8] J. O. Smith III and J. S. Abel, "Bark and ERB bilinear transforms," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 6, Nov. 1999.
[9] J. E. Hawkins, Jr. and S. S. Stevens, "The masking of pure tones and of speech by white noise," J. Acoust. Soc. Am., vol. 22, pp. 6-13, 1950.
[10] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. 28, no. 1, pp. 84-95, Jan. 1980.