SPEAKER RECOGNITION MODEL BASED ON GENERALIZED GAMMA DISTRIBUTION USING COMPOUND TRANSFORMED DYNAMIC FEATURE VECTOR

K Suri Babu 1, Srinivas Yarramalle 2, Suresh Varma Penumatsa 3
1 Scientist, NSTL (DRDO), Govt. of India, Visakhapatnam, India
2 Dept. of IT, GITAM University, Visakhapatnam, India
3 Aadikavi Nannaya University, Rajahmundry, India
{ 1 suribabukorada@gmail.com, 2 sriteja.y@gmail.com }

ABSTRACT

In this paper, we present an efficient speaker identification system based on the generalized gamma distribution. The system comprises three basic operations: speech feature extraction, classification, and evaluation metrics. The features extracted using MFCC are converted to shifted delta cepstral coefficients (SDC) and then passed through linear predictive coefficients (LPC) to obtain effective recognition. To demonstrate the method, a database of speakers is generated for training, with around 5 speech samples used for testing. An accuracy above 90% is reported.

KEYWORDS

Speaker identification, MFCC, LPC, Generalized Gamma, Shifted Delta Coefficients

1. INTRODUCTION

With recent advances in technology, large amounts of information can be stored in databases in formats such as audio, video or text, so searching for the exact piece of information is a difficult task [1]. Automatic indexing of multimedia content can solve this problem, but retrieving a speech signal from this metadata remains a crucial task. The speech signal to be retrieved is divided into small streams (segments) from which features are extracted. For feature extraction, MFCCs are mostly preferred [3], [4] since they are less vulnerable to noise and show less variability. For effective recognition it is also necessary to extract the first and second order time derivatives of the cepstral features, that is, the delta and delta-delta features [5]; however, these features are effective only for short-term speech samples, while for longer-term behaviour shifted delta coefficients (SDC) are preferred [6], [7], [8]. Hence, in this paper we develop a model for speaker identification in which the features obtained from MFCC are converted to shifted delta coefficients and, for comparison, to delta coefficients. It is observed that MFCC followed by SDC outperforms MFCC followed by delta.
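To make this cepstral front end concrete, the sketch below shows one way the 39-dimensional MFCC/delta/delta-delta vectors and the shifted delta cepstra described above could be computed. It is an illustration only, not the authors' implementation: the use of librosa and the common 7-1-3-7 SDC parameterization (N, d, P, k) are assumptions of this sketch, since the paper does not give these details.

```python
# Minimal front-end sketch (illustrative only; librosa and the 7-1-3-7 SDC
# parameters are assumptions, not details taken from the paper).
import numpy as np
import librosa

def mfcc_with_deltas(wav_path, n_mfcc=13):
    """Per-frame 13 MFCCs (12 + an energy-like C0) plus delta and delta-delta -> 39 dims."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    d1 = librosa.feature.delta(mfcc, order=1)                # first derivative
    d2 = librosa.feature.delta(mfcc, order=2)                # second derivative
    return np.vstack([mfcc, d1, d2]).T                       # shape: (n_frames, 3 * n_mfcc)

def shifted_delta_cepstra(cep, N=7, d=1, P=3, k=7):
    """SDC: stack k delta blocks of the first N cepstra, each block advanced by P frames."""
    T = cep.shape[0]
    blocks = []
    for i in range(k):
        ahead  = np.roll(cep[:, :N], -(i * P + d), axis=0)   # c(t + iP + d)
        behind = np.roll(cep[:, :N], -(i * P - d), axis=0)   # c(t + iP - d)
        blocks.append(ahead - behind)
    sdc = np.hstack(blocks)
    # keep only frames whose full temporal context lies inside the utterance
    return sdc[d : T - (k - 1) * P - d]
```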

The paper is organized as follows: Section 2 discusses feature extraction, Section 3 outlines the speaker recognition algorithm, Section 4 presents the generalized gamma mixture model, Section 5 reports experimental results, Section 6 deals with performance evaluation, and Section 7 presents the conclusions.

2. FEATURE EXTRACTION

In the proposed work the speech signals are framed into segments of roughly 20 to 30 ms, with the analysis window shifted by 10 ms. Each frame is transformed using cepstral coefficients, namely linear predictive coding (LPC) and Mel frequency cepstral coefficients (MFCC). MFCCs are chosen because they are based on the known variation of the human ear's critical bandwidth with frequency. Each frame is transformed into 12 MFCCs plus a normalized energy parameter, and with the first and second derivatives (delta and delta-delta) each frame yields a 39-dimensional feature vector. During feature extraction, the speech waves stored in .wav format are each converted to this parametric form. The speech signal remains approximately stationary over intervals of about 5 ms to 100 ms, with noticeable changes observed only over longer periods (about 0.2 s or more). Therefore, to capture speech variation over short time sequences, cepstral analysis is preferred, and hence MFCCs are used. Linear predictive coding (LPC) coefficients help to extract the signal more effectively in the presence of noise and when the speech signal is of very short duration, so in this paper we exploit MFCC combined with LPC to obtain an effective feature vector for identification.

In speech analysis, significant information is spread over a few hundred milliseconds, so there may be overlaps and the speech signals are not completely separated in time. These overlaps may result in ambiguities at classification time; to overcome this, features are extracted at modulation frequencies between 2 and 16 Hz, with a maximum around 4 Hz. To distinguish signals in such overlapping situations, delta features are mostly preferred: the delta coefficients are derivatives that estimate differences in the speech trajectories, and delta-delta coefficients capture an even longer temporal context. However, these features are effective only for short-term speech samples; for longer-term behaviour, shifted delta coefficients (SDC) are preferred. The features obtained from MFCC are therefore converted to shifted delta coefficients, and it is observed that MFCC followed by SDC outperforms MFCC followed by delta. SDC reflects the dynamic cepstral features along with pseudo-prosodic behaviour.

3. SPEAKER RECOGNITION ALGORITHM

The steps followed to recognize the speaker effectively are given below.

Step 1: Obtain the training set by recording the speech voices in .wav form.
Step 2: Pre-emphasize the speech signals to remove silence and noise.
Step 3: Extract the compound feature vectors of these speech signals using MFCC, LPC, SDC, delta, and delta-delta.

Step 4: Generate the probability density function (PDF) of the generalized gamma distribution for the entire trained data set.
Step 5: Follow the same procedure for the test sequence.
Step 6: Find the range of the test speech signal within the trained set.
Step 7: Calculate evaluation metrics such as Acceptance Rate (AR), False Acceptance Rate (FAR), and Missed Detection Rate (MDR) to find the accuracy of speaker recognition.

4. GENERALIZED GAMMA MIXTURE MODEL

Today most research in speech processing is carried out using the Gaussian mixture model, but its main disadvantages are that it relies exclusively on approximation and converges slowly, and when a Gaussian mixture model is used the speech and noise coefficients differ in magnitude [7]. To obtain more accurate feature modelling, maximum a posteriori estimation models should be considered [8]. Hence, in this paper a generalized gamma distribution is utilized for classifying the speech signal. The generalized gamma distribution represents the sum of n exponentially distributed random variables when both the shape and scale parameters take non-negative integer values [9], and it is defined in terms of scale and shape parameters [10]. The generalized gamma density is given by

f(x; a, b, c, k) = \frac{c\,(x-a)^{ck-1}}{b^{ck}\,\Gamma(k)} \exp\!\left[-\left(\frac{x-a}{b}\right)^{c}\right], \quad x > a        (1)

where k and c are the shape parameters, a is the location parameter, b is the scale parameter and Γ(·) is the complete gamma function [11]. The shape and scale parameters of the generalized gamma distribution help to classify the speech signal and identify the speaker accurately.

5. EXPERIMENTAL RESULTS

During the training phase, the signal is preprocessed and the features are extracted using MFCC. To obtain an effective recognition system, the data are sampled into short speech segments of different time frames, and the extracted MFCC features are converted both to delta coefficients and to shifted delta coefficients. It is observed that MFCC combined with delta coefficients does not recognize the speech samples as effectively as MFCC combined with SDC. The output is then fed to LPC (linear predictive coefficients). The resulting features are given as input to the classifier, the generalized gamma distribution, and using this feature set the speakers are recognized effectively. For the speech samples obtained from SDC-LPC, it can also be seen that as the sample size is increased, the extracted features help to classify the speakers more effectively. The results are presented in both tabular and graphical formats.
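Before turning to the figures, the following minimal sketch illustrates how Steps 4 to 7 and the density in Eq. (1) could be realized. It is an illustration under simplifying assumptions not specified in the paper: SciPy's gengamma distribution is used (its shape parameters a and c correspond to k and c in Eq. (1), with loc = a and scale = b), and feature dimensions are treated as independent.

```python
# Illustrative per-speaker generalized-gamma modelling and maximum-likelihood
# identification (a sketch; the independence assumption and use of SciPy's
# gengamma are choices of this example, not taken from the paper).
import numpy as np
from scipy.stats import gengamma

def train_speaker_models(train_features):
    """train_features: dict speaker_id -> array of shape (n_frames, n_dims)."""
    models = {}
    for spk, feats in train_features.items():
        # One generalized-gamma fit (a, c, loc, scale) per feature dimension.
        models[spk] = [gengamma.fit(feats[:, j]) for j in range(feats.shape[1])]
    return models

def identify_speaker(models, test_feats):
    """Return the speaker whose fitted densities give the highest total log-likelihood."""
    best_spk, best_ll = None, -np.inf
    for spk, dim_params in models.items():
        ll = 0.0
        for j, (a, c, loc, scale) in enumerate(dim_params):
            # Test values outside the fitted support yield -inf; clip them in practice.
            ll += gengamma.logpdf(test_feats[:, j], a, c, loc=loc, scale=scale).sum()
        if ll > best_ll:
            best_spk, best_ll = spk, ll
    return best_spk
```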

Fig. 1: Effect of recognition accuracy with the number of trained speakers (recognition accuracy vs. number of trained speakers; DELTA-LPC and SDC-LPC).

Fig. 2: Effect of recognition accuracy with speech duration (percentage of correctness vs. frame amount in seconds).

Fig. 3: Effect of recognition accuracy with test speech duration (recognition accuracy vs. test utterance length in seconds).

Table 1: Statistical data showing accuracy (%)

DELTA-LPC
  No. of trained speakers | Frame amount (sec) | Test utterance length (sec) | Recognition accuracy (%)
  to 5                    | to 5               | to 5                        | Less than 50
  to 5                    | to 1               | 5 to 1                      | Around
  to 3                    | 1 to 3             | 1 to 15                     | Above 62

SDC-LPC
  No. of trained speakers | Frame amount (sec) | Test utterance length (sec) | Recognition accuracy (%)
  to 5                    | to 5               | to 5                        | Less than 50
  to 5                    | to 1               | 5 to 1                      | Around 85
  to 3                    | 1 to 3             | 1 to 15                     | Above 90

From the above figures and table (Fig. 1 to Fig. 3 and Table 1), it can easily be seen that SDC-LPC outperforms DELTA-LPC, and the overall recognition rate of the developed model is above 90%.

6. PERFORMANCE EVALUATION

To evaluate the performance of the developed model, metrics such as Acceptance Rate (AR), False Acceptance Rate (FAR), and Missed Detection Rate (MDR) are considered. These metrics are computed in the usual way:

AR  = (number of genuine test samples correctly accepted) / (total number of genuine test samples)
FAR = (number of impostor test samples falsely accepted) / (total number of impostor test samples)
MDR = (number of genuine test samples rejected) / (total number of genuine test samples)

The developed model is tested for accuracy using the above metrics.

Fig. 4: Feature vector performance evaluation (AR, FAR and MDR for DELTA-LPC and SDC-LPC).
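The small sketch below shows how these three rates can be computed from raw accept/reject decisions over genuine and impostor trials; the trial bookkeeping is an assumption of the example, not a detail taken from the paper.

```python
# Illustrative computation of AR, FAR and MDR from verification trial outcomes
# (a sketch; the paper's own formula layout did not survive extraction cleanly).
def speaker_metrics(trials):
    """trials: list of (is_genuine, accepted) booleans, one pair per trial."""
    genuine  = [t for t in trials if t[0]]
    impostor = [t for t in trials if not t[0]]
    ar  = sum(1 for g, a in genuine  if a)     / max(len(genuine), 1)   # acceptance rate
    far = sum(1 for g, a in impostor if a)     / max(len(impostor), 1)  # false acceptance rate
    mdr = sum(1 for g, a in genuine  if not a) / max(len(genuine), 1)   # missed detection rate
    return ar, far, mdr

# Example: 8 of 10 genuine trials accepted, 1 of 10 impostor trials accepted.
trials = [(True, True)] * 8 + [(True, False)] * 2 + [(False, True)] + [(False, False)] * 9
print(speaker_metrics(trials))   # -> (0.8, 0.1, 0.2)
```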

The above Fig. 4 shows the performance, in terms of Acceptance Rate (AR), False Acceptance Rate (FAR) and Missed Detection Rate (MDR), of the DELTA-LPC and SDC-LPC feature vector combinations when classified with the generalized gamma distribution. It can clearly be seen that the SDC-LPC feature vector outperforms the other combination on all three metrics.

7. CONCLUSIONS

In this paper, we have developed a new model for speaker identification based on the generalized gamma distribution. The speech features are extracted using MFCC and combined with delta coefficients followed by LPC, and also with SDC followed by LPC. The model is demonstrated on a database of speech samples and tested with 5 samples; the accuracy is around 90%, and the model proves to be efficient.

REFERENCES

[1] Marko Kos, Damjan Vlaj, Zdravko Kacic (2011), "Speaker's gender classification and segmentation using spectral and cepstral feature averaging", 18th International Conference on Systems, Signals and Image Processing (IWSSIP 2011).
[2] J. Razik, C. Senac, D. Fohr, O. Mella and N. Parlangeau-Valles (2003), "Comparison of two speech/music segmentation systems for audio indexing on the Web", in Proc. WMSCI 2003, Florida, USA, July 2003.
[3] Corneliu Octavian D., I. Gavat (2005), "Feature Extraction Modeling and Training Strategies in Continuous Speech Recognition for Romanian Language", Proc. IEEE EUROCON 2005, pp. 1424-1428.
[4] Sunil Agarwal et al. (2010), "Prosodic Feature Based Text-Dependent Speaker Recognition Using Machine Learning Algorithms", International Journal of Engineering Science & Technology, Vol. 2(10), 2010, pp. 5150-5157.
[5] Dayana Ribas Gonzalez, Jose R. Calvo de Lara (2009), "Speaker verification with shifted delta cepstral features: its pseudo-prosodic behaviour", Proc. I Iberian SLTech 2009.
[6] P. A. Torres-Carrasquillo, E. Singer, M. A. Kohler, R. J. Greene, D. A. Reynolds and J. R. Deller Jr. (2002), "Approaches to Language Identification Using Gaussian Mixture Models and Shifted Delta Cepstral Features", Proc. ICSLP 2002, pp. 89-92.
[7] T. Kinnunen, C. W. E. Koh, L. Wang, H. Li, E. S. Chng (2006), "Temporal discrete cosine transform: Towards longer term temporal features for speaker verification", Proc. ICSLP 2006.
[8] J. Calvo, R. Fernandez and G. Hernandez (2007), "Channel/Handset Mismatch Evaluation in Biometric Speaker Verification Using Shifted Delta Cepstral Features", Proc. CIARP 2007, LNCS 4756, pp. 96-105.