Speaker Identification based on GFCC using GMM


Md. Moinuddin, M.Tech. Student, E&CE Dept., PDACE
Arunkumar N. Kanthi, Asst. Professor, E&CE Dept., PDACE

Abstract: The performance of a conventional speaker identification system degrades drastically in the presence of noise. The ability of the human ear to identify a speaker in a noisy environment motivates the use of an auditory-based feature called the gammatone frequency cepstral coefficient (GFCC). The GFCC is based on the gammatone filter bank, which models the basilar membrane as a series of overlapping band-pass filters. A speaker identification system using GFCC features and GMMs has been developed and analysed on the TIMIT and NTIMIT databases, and its performance is compared with a baseline system using traditional MFCC features. The results show that GFCC features give good recognition performance not only in a clean speech environment but also in a noisy one.

Keywords: Auditory-based feature, Gammatone Frequency Cepstral Coefficient (GFCC), MFCC, GMM, EM algorithm

I. INTRODUCTION

Speaker identification determines from which of the enrolled speakers a given utterance has come. The utterance can be constrained to a known phrase (text-dependent) or totally unconstrained (text-independent). The task consists of feature extraction, speaker modeling and decision making. Typically, the extracted speaker features are Mel-frequency cepstral coefficients (MFCCs). For speaker modeling, Gaussian mixture models (GMMs) are widely used to describe the feature distributions of individual speakers. Recognition decisions are usually made from the likelihoods of the observed feature frames given a speaker model. The poor performance of MFCCs in noisy or mismatched conditions can be attributed to the use of triangular filters to model the auditory critical bands. To model the cochlear filter more accurately, gammatone filters are used in place of the triangular filters, and the resulting features are called gammatone frequency cepstral coefficients (GFCCs).

II. THE SYSTEM MODEL

A speaker identification system consists of two parts: a front-end and a back-end. The front-end is a feature extractor, while the back-end consists of a classifier and a reference database.

Figure 1: Architecture of the speaker identification system (front-end: GFCC extractor for train and test utterances; back-end: GMM modeling into the model database, ML classifier producing the identification result)

The main task of the front-end is to extract features from the speech signal. The aim is to represent the characteristics of the signal sufficiently while reducing redundancy. Features are extracted frame by frame: one feature vector is calculated for every frame. After feature extraction, the sequence of feature vectors is passed to the back-end, which selects the most likely speaker from the reference database. During training, the statistical models are stored in the database. When an unknown utterance is presented, its feature vectors are extracted; the classifier then computes the log likelihood under each stored model and decides on the most likely speaker.

III. GFCC EXTRACTION

The GFCC features are based on the gammatone filter bank (GTFB). The feature vectors are calculated from the spectra of a series of windowed speech frames, as shown in the block diagram below.

Figure 2: Block diagram of GFCC extraction (speech utterance -> pre-emphasis -> framing & windowing -> DFT and |.|^2 -> GTFB -> logarithmic compression -> DCT -> GFCC features)

Pre-emphasis stage: The high-frequency components of a speech signal have lower amplitude than the low-frequency components due to the radiation effect of the lips. To spectrally flatten the signal, i.e. to obtain similar amplitudes for all frequency components, the speech is passed through a pre-emphasis filter, a first-order FIR digital filter that effectively removes the spectral contribution of the lips. Speech sounds noticeably sharper after pre-emphasis. The transfer function of the pre-emphasis filter is

H(z) = 1 - a z^{-1}    (1)

where a is a constant with a typical value of 0.97.

Figure 3: Pre-emphasis operation
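As a concrete illustration, a minimal NumPy sketch of this stage might look as follows; the coefficient a = 0.97 is taken from Eq. (1), while the function name is ours rather than from the paper.

```python
# Pre-emphasis sketch: first-order FIR filter H(z) = 1 - a*z^{-1}.
import numpy as np

def pre_emphasis(signal: np.ndarray, a: float = 0.97) -> np.ndarray:
    """y[n] = x[n] - a * x[n-1], per Eq. (1)."""
    # The first sample has no predecessor and is passed through unchanged.
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```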

Framing & windowing stage: Speech is non-stationary, i.e. its statistical characteristics vary with time. Since the glottal system cannot change instantaneously, speech can be considered time-invariant over short segments (20-30 ms), so the signal is split into frames of 20 ms. When the signal is framed, the edges of each frame must be treated carefully, otherwise they introduce artefacts; a windowing function is therefore used to taper the edges. The window should have a narrow main lobe and well-attenuated side lobes, which makes the Hamming window the preferred choice. It is given by

w(n) = 0.54 - 0.46 \cos(2\pi n / (N - 1)), \quad 0 \le n \le N - 1    (2)

Figure 4: Windowing operation

As a consequence of windowing, the samples are not weighted equally in the subsequent computations, so it is sensible to use an overlap between frames (10 ms).

DFT stage: Each windowed frame is transformed with a discrete Fourier transform and only the magnitude is kept, because the phase carries no speaker-specific information.

Figure 5: DFT operation

Gammatone filter bank stage: The gammatone filter bank consists of a series of band-pass filters that model the frequency selectivity of the basilar membrane. The impulse response of each filter is

g_m(t) = a \, t^{n-1} e^{-2\pi b_m t} \cos(2\pi f_m t + \phi), \quad 1 \le m \le M    (3)

where a is a constant (usually 1), n is the filter order (here n = 4), \phi is the phase shift, f_m is the centre frequency, and b_m is the attenuation factor of the filter, which is related to the filter's bandwidth and determines the decay rate of the impulse response.

Figure 6: Frequency response of a 64-channel gammatone filter bank
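A sketch of the framing, windowing and magnitude-DFT stages is given below; the 20 ms frames and 10 ms overlap follow the text, while the 16 kHz sampling rate (as in TIMIT) and the 512-point FFT are our assumptions.

```python
# Framing, Hamming windowing and magnitude DFT sketch.
import numpy as np

def frame_and_window(signal, fs=16000, frame_ms=20, hop_ms=10, n_fft=512):
    frame_len = int(fs * frame_ms / 1000)          # 320 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)                  # 160 samples -> 50% overlap
    window = np.hamming(frame_len)                 # Eq. (2)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[t * hop : t * hop + frame_len] * window
                       for t in range(n_frames)])
    # Keep only the magnitude: phase carries no speaker-specific information.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
```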

The centre frequency of the m-th gammatone filter is determined from the ERB scale; following Slaney's implementation [4],

f_m = -C + (f_H + C) \left( \frac{f_L + C}{f_H + C} \right)^{m/M}, \quad C = 228.83    (4)

where f_L and f_H are the lower and upper frequencies of the filter bank. The bandwidth of each filter is described by an Equivalent Rectangular Bandwidth (ERB), a psychoacoustic measure of the width of the auditory filter at each point along the cochlea:

ERB(f) = 24.7 \, (4.37 f / 1000 + 1)    (5)

The bandwidth of each filter is then

b_m = 1.019 \, ERB(f_m)    (6)

The FFT magnitude coefficients are binned by correlating them with each gammatone filter: each FFT magnitude coefficient is multiplied by the gain of the corresponding filter and the results are accumulated, so each bin holds the spectral magnitude in that filter bank channel:

E_m = \sum_k |X(k)|^2 \, G_m(k)    (7)

Figure 7: Filter bank processing

Logarithmic compression & discrete cosine transform (DCT) stage: The logarithm is applied to each filter output to simulate the loudness perceived by humans for a given signal intensity, and to separate the excitation (source) produced by the vocal cords from the filter that represents the vocal tract. The envelope of the vocal tract changes slowly and therefore appears at low quefrencies (lower-order cepstrum), while the periodic excitation appears at high quefrencies (higher-order cepstrum). Since the log-power spectrum is real, a DCT is applied to the filter outputs, producing highly uncorrelated features:

c_j = \sqrt{2/M} \sum_{m=1}^{M} \log(E_m) \cos\left( \frac{\pi j (m - 0.5)}{M} \right), \quad 1 \le j \le J    (8)
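The whole GTFB -> log -> DCT chain can be sketched as follows. The centre frequencies follow Slaney's ERB spacing of Eq. (4) and the bandwidths Eq. (6); the channel gains use a common closed-form approximation of the fourth-order gammatone magnitude response, |G(f)| ~ (1 + ((f - f_m)/b_m)^2)^{-n/2}, and the 50 Hz lower cutoff is our assumption.

```python
# GFCC sketch: ERB-spaced gammatone gains, log compression, DCT.
import numpy as np
from scipy.fftpack import dct

def erb(f):
    return 24.7 * (4.37 * f / 1000.0 + 1.0)                    # Eq. (5)

def erb_center_freqs(f_low, f_high, n_filters):
    c = 9.26449 * 24.7                                         # C in Eq. (4)
    m = np.arange(1, n_filters + 1)
    return -c + (f_high + c) * ((f_low + c) / (f_high + c)) ** (m / n_filters)

def gfcc(mag_spectra, fs=16000, n_fft=512, n_filters=64, n_ceps=23):
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    cfs = erb_center_freqs(50.0, fs / 2.0, n_filters)
    b = 1.019 * erb(cfs)                                       # Eq. (6)
    # Approximate 4th-order gammatone magnitude response per channel.
    gains = (1.0 + ((freqs[None, :] - cfs[:, None]) / b[:, None]) ** 2) ** -2.0
    fbank = (mag_spectra ** 2) @ gains.T                       # Eq. (7)
    log_fbank = np.log(np.maximum(fbank, 1e-10))               # log compression
    return dct(log_fbank, type=2, axis=1, norm='ortho')[:, :n_ceps]  # Eq. (8)
```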

Figure 8: Logarithm and DCT operation

IV. GAUSSIAN MIXTURE MODEL

The task of the back-end is to classify the feature vectors. Each speaker is represented by a speaker model

\lambda = \{ w_i, \mu_i, \Sigma_i \}, \quad i = 1, ..., M

where \mu_i is the mean vector, \Sigma_i is the covariance matrix and w_i is the mixture weight of the i-th component. The Gaussian mixture model (GMM) expresses the probability density function of a random variable as a weighted sum of components, each described by a Gaussian density. The feature vectors extracted from the speech of an enrolled speaker are modelled as

p(x | \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i), \quad \sum_{i=1}^{M} w_i = 1    (9)

where x is a D-dimensional random vector and \mathcal{N}(x; \mu_i, \Sigma_i) is the i-th component density. Let X = \{x_1, ..., x_T\} be the set of training feature vectors. Training a GMM requires estimating w_i, \mu_i and \Sigma_i from the feature vectors belonging to a speaker; maximum likelihood estimation is used for this purpose.
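Eq. (9) translates directly into code; below is a minimal SciPy sketch (the function name is ours) of the GMM density for one speaker model.

```python
# GMM density of Eq. (9): weighted sum of D-dimensional Gaussians.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, weights, means, covs):
    """p(x | lambda) = sum_i w_i * N(x; mu_i, Sigma_i)."""
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for w, mu, cov in zip(weights, means, covs))
```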

Maximum Likelihood Estimation

ML estimation aims to maximize the likelihood of the GMM given the set of feature vectors X = \{x_1, ..., x_T\}:

\log p(X | \lambda) = \sum_{t=1}^{T} \log \sum_{i=1}^{M} w_i \, \mathcal{N}(x_t; \mu_i, \Sigma_i)    (10)

Since the logarithm cannot be moved inside the summation, direct maximization is not possible; however, the estimates can be obtained iteratively with the Expectation Maximization (EM) algorithm.

Expectation Maximization Algorithm

1. Initialize: the means by clustering the feature vectors with the k-means algorithm; the mixture weights to be equally likely, by setting each weight to 1/M; the covariance matrices to the identity matrix.

2. Expectation step: evaluate the responsibilities

\gamma_{ti} = \frac{ w_i \, \mathcal{N}(x_t; \mu_i, \Sigma_i) }{ \sum_{j=1}^{M} w_j \, \mathcal{N}(x_t; \mu_j, \Sigma_j) }

3. Maximization step: update the parameters using the current responsibilities

\mu_i = \frac{1}{N_i} \sum_{t=1}^{T} \gamma_{ti} \, x_t, \quad
\Sigma_i = \frac{1}{N_i} \sum_{t=1}^{T} \gamma_{ti} (x_t - \mu_i)(x_t - \mu_i)^T, \quad
w_i = \frac{N_i}{T}, \quad \text{where } N_i = \sum_{t=1}^{T} \gamma_{ti}

4. Evaluate the log likelihood \log p(X | \lambda).
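A from-scratch sketch of steps 1-4 is given below, written for clarity rather than robustness (a production version would score in the log domain and floor the covariances); the function and variable names are ours.

```python
# EM training sketch following steps 1-4 above.
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.stats import multivariate_normal

def em_train(X, M, n_iter=100, tol=1e-4):
    T, D = X.shape
    means, _ = kmeans2(X, M, minit='++', seed=0)   # step 1: k-means means
    weights = np.full(M, 1.0 / M)                  # equal mixture weights
    covs = np.tile(np.eye(D), (M, 1, 1))           # identity covariances
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma[t, i] (step 2).
        dens = np.stack([w * multivariate_normal.pdf(X, mu, cov)
                         for w, mu, cov in zip(weights, means, covs)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and covariances (step 3).
        Nk = gamma.sum(axis=0)
        weights = Nk / T
        means = (gamma.T @ X) / Nk[:, None]
        for i in range(M):
            d = X - means[i]
            covs[i] = (gamma[:, i, None] * d).T @ d / Nk[i]
        ll = np.log(dens.sum(axis=1)).sum()        # step 4: log likelihood
        if ll - prev_ll < tol:                     # convergence check
            break
        prev_ll = ll
    return weights, means, covs
```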

Check the parameters or the log likelihood for convergence; if the convergence criterion is not satisfied, return to step 2.

V. SPEAKER IDENTIFICATION

Speaker identification is done by finding the speaker model with the maximum a posteriori probability for the given set of test feature vectors X = \{x_1, ..., x_T\}:

\hat{s} = \arg\max_{1 \le s \le S} p(\lambda_s | X)

By Bayes' rule,

\hat{s} = \arg\max_{1 \le s \le S} \frac{ p(X | \lambda_s) \, p(\lambda_s) }{ p(X) }

Assuming the speakers are equally likely, i.e. p(\lambda_s) = 1/S, and noting that p(X) is independent of the speaker model, the above equation simplifies to

\hat{s} = \arg\max_{1 \le s \le S} p(X | \lambda_s)

Assuming the feature vectors are occurrences of independent random variables and taking the logarithm, we get

\hat{s} = \arg\max_{1 \le s \le S} \sum_{t=1}^{T} \log p(x_t | \lambda_s)

VI. EXPERIMENTAL RESULTS

Speech database: The experiments are conducted on the TIMIT and NTIMIT speech databases. TIMIT consists of read speech recorded in a quiet environment without channel distortion; it has 630 speakers (438 male and 192 female) with 10 utterances per speaker, each about 3 seconds long on average. NTIMIT was created by transmitting all TIMIT utterances over actual telephone channels. The performance of the speaker identification system is evaluated using 23-dimensional GFCC features and baseline 13-dimensional MFCC features, for different orders of GMM.
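As a worked illustration of the Section V decision rule, one might score an unknown utterance against every enrolled model and take the argmax of the summed frame log likelihoods; here `models` is assumed to map speaker ids to the (weights, means, covs) triples produced by the `em_train` sketch above.

```python
# Identification sketch: argmax over summed frame log likelihoods.
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, weights, means, covs):
    dens = sum(w * multivariate_normal.pdf(X, mu, cov)
               for w, mu, cov in zip(weights, means, covs))
    return np.log(dens).sum()                      # sum_t log p(x_t | lambda)

def identify(X, models):
    return max(models, key=lambda spk: log_likelihood(X, *models[spk]))
```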

1. For each database, 8 of the 10 utterances are used for training (about 24 s) and 2 for testing (about 6 s). Identification accuracies are reported for (i) logarithmic compression and (ii) cubic-root compression.

2. For each database, 9 of the 10 utterances are used for training (about 27 s) and 1 for testing (about 3 s). Identification accuracies are again reported for (i) logarithmic compression and (ii) cubic-root compression.

Figures (not reproduced): identification accuracy for each of the four conditions as a function of GMM order.
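The two variants above differ only in the compression applied before the DCT: standard GFCC uses the logarithm, while the modified features (MGFCC in Section VII) replace it with a cubic root. A minimal sketch (function name ours):

```python
# Compression variants: logarithmic (GFCC) vs. cubic root (MGFCC).
import numpy as np

def compress(fbank_energies, kind='log'):
    if kind == 'log':
        return np.log(np.maximum(fbank_energies, 1e-10))
    if kind == 'cubic_root':
        return np.cbrt(fbank_energies)
    raise ValueError(f"unknown compression: {kind}")
```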

VII. CONCLUSION

The results show that gammatone frequency cepstral coefficient (GFCC) features capture speaker characteristics better than conventional MFCC features, giving good recognition performance not only in a clean speech environment (TIMIT) but also in a noisy environment (NTIMIT). Further, the modified MFCC (MMFCC) and modified GFCC (MGFCC) features, obtained by replacing the logarithm with the cubic root, show a drop in identification performance on both clean and noisy speech.

REFERENCES

[1] E. B. Tazi, A. Benabbou, and M. Harti, "Efficient Text Independent Speaker Identification Based on GFCC and CMN Methods," ICMCS 2012, pp. 90-95.
[2] He Xu, Lin Lin, Xiaoying Sun, and Huanmei Jin, "A New Algorithm for Auditory Feature Extraction," CSNT 2012, pp. 229-232.
[3] Fengsong He and Xiao Cao, "An Auditory Feature Extraction Method for Robust Speaker Recognition," ICCT 2012, pp. 1067-1071.
[4] M. Slaney, "An Efficient Implementation of the Patterson-Holdsworth Auditory Filter Bank," Apple Technical Report No. 35, Advanced Technology Group, Apple Computer Inc., 1993.
[5] D. A. Reynolds and R. C. Rose, "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models," IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, 1995.
[6] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[7] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Pearson Education, 2005.