TEXT-INDEPENDENT SPEAKER IDENTIFICATION SYSTEM USING AVERAGE PITCH AND FORMANT ANALYSIS

M. A. Bashar 1, Md. Tofael Ahmed 2, Md. Syduzzaman 3, Pritam Jyoti Ray 4 and A. Z. M. Touhidul Islam 5

1 Department of Computer Science & Engineering, Comilla University, Bangladesh
2 Department of Information & Communication Technology, Comilla University, Bangladesh
3,4 Department of Computer Science and Engineering, SUST, Bangladesh
5 Department of Information & Communication Engineering, University of Rajshahi, Bangladesh

ABSTRACT

The aim of this paper is to design a closed-set, text-independent Speaker Identification system using average pitch together with speech features obtained from formant analysis. The speech features carried by the speech signal are characterized by formant analysis (power spectral density). We design two methods: one for average pitch estimation based on autocorrelation, and another for formant analysis. The average pitches of the speech signals are calculated and combined with the formant analysis. A performance comparison with several existing methods shows that the speaker identification system built on the proposed method is superior to the others.

KEYWORDS

Speaker identification, average pitch, feature extraction, formant analysis

1. INTRODUCTION

Speaker Identification (SI) refers to the process of identifying an individual by extracting and processing information from his/her speech. It is the task of finding the best-matching speaker for an unknown speaker in a database of known speakers [1,2]. SI is a part of speech processing, which stems from digital signal processing, and an SI system enables people to have secure access to information and property. Speaker Identification methods can be divided into two categories.
In open-set SI, a reference model for the unknown speaker may not exist; an additional decision alternative, "the unknown does not match any of the models", is therefore required [3]. In closed-set SI, a set of N distinct speaker models is stored in the identification system, built by extracting abstract parameters from the speech samples of N speakers. In the identification task, the same parameters are first extracted from the new speech input, and the system then decides which of the N known speakers best matches the input [3-6]. Speaker Identification methods can also be divided into text-dependent and text-independent methods. A text-dependent method requires the speaker to provide utterances of key words or sentences with the same text in both the training and identification trials, whereas a text-independent method does not rely on a specific text being spoken.

DOI : 10.5121/ijit.2014.3303

The aim of this work is to design a closed-set, text-independent Speaker Identification System (SIS). The system has been developed in the MATLAB programming language [7-8].

2. RELATED WORKS

A brief review of the work relevant to this paper follows. The authors of Ref. [10] studied the performance of a text-independent, multilingual speaker identification system using the MFCC feature, the pitch-based DMFCC feature, and the combination of the two. They showed that combining features modeled on the human vocal tract and the auditory system gives better performance than either component model alone. Their study also revealed that the Gaussian Mixture Model (GMM) is efficient for language- and text-independent speaker identification. Reynolds et al. [11] showed that GMMs provide a robust speaker representation for text-independent speaker identification using corrupted, unconstrained speech. The authors of Ref. [12] implemented a robust and secure text-independent voice recognition system using three levels of encryption for data security and an autocorrelation-based approach to find the pitch of the sample. Their proposed algorithm outperforms conventional algorithms in actual identification tasks, even in noisy environments.

3. SPEAKER IDENTIFICATION CONCEPT

The overall architecture of the Speaker Identification System is illustrated in Fig. 1.

Figure 1. System architecture of the closed-set, text-independent SIS.

As the figure shows, a Speaker Identification system is composed of the following modules:

a) Front-end processing: the "signal processing" part, which converts the sampled speech signal into a set of feature vectors characterizing the properties of speech that can separate different speakers. Front-end processing is performed in both the training and identification phases.

b) Speaker modeling: performs a reduction of the feature data by modeling the distributions of the feature vectors.

c) Speaker database: stores the speaker models.

d) Decision logic: makes the final decision about the identity of the speaker by comparing the unknown speaker to all models in the database and selecting the best-matching model.

Among the several speech parameterization methods, we focus on average pitch estimation based on the autocorrelation method. There are many classification approaches, but each has limitations in particular settings. The current state-of-the-art classification engines in Speaker Identification technology are the Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Vector Quantization (VQ), Artificial Neural Networks (ANN) and formant-based methods [12]. In this paper the formant analysis is based on the power spectral density (PSD).

4. AVERAGE PITCH ESTIMATION

Pitch represents the perceived fundamental frequency (F0) of a sound and, along with loudness and quality, is one of the major auditory attributes of sound [13-14]. Here we want to find the average pitch of a speech signal, for which we designed a method we call Avgpitch. The flowchart of Avgpitch is shown in Fig. 2.

Figure 2. Flowchart of average pitch estimation (Avgpitch).

Average pitch is used to reduce the comparison workload in the formant analysis. We calculated the average pitch for speaker.wav (the unknown speaker in the identification phase) as well as for all trained files in the speaker database. The pitch contour and average pitch (158.6062 Hz) of the speaker.wav file are shown in Fig. 3.
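The paper's Avgpitch implementation is in MATLAB and is given only as a flowchart. As a rough illustration of the autocorrelation idea, a minimal Python sketch might look like the following (the 30 ms frame length and the 50-400 Hz pitch search range are our assumptions, not values from the paper):

```python
import numpy as np

def frame_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Estimate F0 of one frame from the dominant autocorrelation peak."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Search only lags that correspond to plausible pitch periods.
    lo = int(fs // fmax)
    hi = min(int(fs // fmin), len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return fs / lag

def average_pitch(signal, fs, frame_len=0.03):
    """Average the per-frame F0 estimates over the whole utterance."""
    n = int(frame_len * fs)
    pitches = [frame_pitch(signal[i:i + n], fs)
               for i in range(0, len(signal) - n, n)]
    return float(np.mean(pitches))

# Synthetic check: a 200 Hz sine should give an average pitch near 200 Hz.
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 200 * t)
print(round(average_pitch(tone, fs)))  # → 200
```

In practice a real pitch tracker would also need voiced/unvoiced detection, so unvoiced frames do not contribute spurious F0 values to the average.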

Figure 3. Pitch contour of the speaker.wav file.

We then calculated the average pitch differences between the speaker.wav file and all the trained speech files. To illustrate, we used 40 trained files in the database; Fig. 4 shows the average pitch differences between the unknown speaker and the 40 trained speakers.

Figure 4. Plot of the average pitch differences of 40 trained files from the speaker.wav file.

Fig. 4 gives a closer look at the identification task: some of the differences are small, while others are large. Since the average pitch difference partially characterizes a speaker, we can prune from consideration the trained files with high average pitch differences. In our proposed system we discard every trained file whose difference exceeds a fixed limit (roughly 40 Hz); only the remaining files move on to the next stage, the formant analysis. From Fig. 4 we find 10 speakers, with IDs (in order) 13, 6, 38, 39, 21, 36, 17, 26, 31 and 20, whose average pitch differences are not more than 40 Hz. Formant analysis is therefore performed on these ten selected trained files to identify the best-matching speaker ID for the unknown speaker (the speaker.wav file).

5. FORMANT ANALYSIS

Formants are the meaningful frequency components of human speech [3]. The information humans require to distinguish between vowels can be represented by the frequency content of the vowel sounds; in speech, formants are the characteristic parts that identify vowels to the listener. We designed an algorithm for formant analysis, whose flowchart is presented in Fig. 5. Applying this algorithm yields the PSD of a speech signal. The vector of peak positions in the power spectral density is also calculated, and it can be used to characterize a particular voice file. Fig. 6 shows the first four peaks in the power spectral density of the speaker.wav file.

Figure 5. Flowchart of formant analysis.

Figure 6. Plot of the first four peaks in the power spectral density of the speaker.wav file.

Formant analysis was also performed on the ten trained speaker files selected in the previous section. Fig. 7 shows the PSD of the ten trained files with IDs 13, 6, 38, 39, 21, 36, 17, 26, 31 and 20, respectively. We calculated the formant vector (the vector of peak positions) of the speaker.wav file as well as of the ten selected trained files. These formant vectors are used to find the differences between the peaks of the speaker.wav file and those of each trained file; the root mean square (rms) of the differences is then taken to obtain a single formant peak difference value per file. Fig. 8 shows the formant peak differences of the ten selected trained files from the speaker.wav file.
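The paper does not list its peak-picking code. One plausible sketch, in Python rather than the authors' MATLAB, uses a simple periodogram in place of whatever PSD estimator they used, plus an arbitrary power threshold (our addition) to suppress numerical-noise ripples:

```python
import numpy as np

def psd_peak_positions(signal, n_peaks=4, nfft=1024, rel_thresh=1e-3):
    """Return the bin indices of the first `n_peaks` local maxima of a
    periodogram PSD estimate -- the paper's 'formant vector'."""
    psd = np.abs(np.fft.rfft(signal, nfft)) ** 2 / nfft
    floor = rel_thresh * psd.max()          # ignore tiny noise ripples
    peaks = [i for i in range(1, len(psd) - 1)
             if psd[i] > floor and psd[i] >= psd[i - 1] and psd[i] > psd[i + 1]]
    return np.array(peaks[:n_peaks])

def formant_peak_difference(vec_a, vec_b):
    """Collapse the peak-position differences to one number via RMS."""
    return float(np.sqrt(np.mean((vec_a - vec_b) ** 2)))

# Toy check: four bin-aligned sinusoids produce peaks at known FFT bins
# (500 Hz at fs = 8000 with nfft = 1024 falls exactly in bin 64, etc.).
fs, nfft = 8000, 1024
t = np.arange(nfft) / fs
sig = sum(np.sin(2 * np.pi * f * t) for f in (500, 1000, 1500, 2000))
print(psd_peak_positions(sig, nfft=nfft))  # [ 64 128 192 256]
```

On real speech the PSD is much less clean, so a production version would smooth the spectrum (e.g. Welch averaging or LPC envelope fitting) before picking peaks.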

Figure 7. PSD of the ten selected trained files (IDs 13, 6, 38, 39, 21, 36, 17, 26, 31 and 20).

Figure 8. Plot of the formant peak differences between the speaker.wav file and the ten selected trained files.

6. RESULTS AND DISCUSSION

From the information in Fig. 8 the result of the system follows directly: the speaker ID with the minimum formant difference is the best match for the unknown speaker (speaker.wav). From Fig. 8, the lowest formant difference belongs to speaker ID 13. The next-best matching speakers are obtained by sorting the formant difference vector between the speaker.wav file and the ten selected trained files, as shown in Fig. 9. From Fig. 9 we get the best-matching speakers with IDs 13, 20, 17, 31, 38, 21, 26, 36, 39 and 6, respectively. We checked the trained file with ID 13 against the unknown speaker (speaker.wav) and confirmed that the two voices belong to the same speaker. The Speaker Identification code was written in MATLAB. The comparison based on average pitch reduced the number of trained files to be compared in the formant analysis, and the comparison based on formant analysis produced the most accurate results.
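Putting the two stages together, the decision logic described above can be sketched as follows (Python; the speaker records and all numbers below are hypothetical illustrations, not the paper's data, while the 40 Hz limit is the rough pruning threshold quoted earlier):

```python
import numpy as np

def identify(unknown_pitch, unknown_formants, speakers, pitch_limit=40.0):
    """Two-stage match: discard speakers whose average pitch differs by
    more than `pitch_limit` Hz, then rank the survivors by the RMS
    difference of their formant-peak vectors (smallest = best match)."""
    survivors = {sid: s for sid, s in speakers.items()
                 if abs(s["pitch"] - unknown_pitch) <= pitch_limit}

    def rms_diff(sid):
        d = survivors[sid]["formants"] - unknown_formants
        return float(np.sqrt(np.mean(d ** 2)))

    return sorted(survivors, key=rms_diff)

# Hypothetical database of three trained speakers.
speakers = {
    13: {"pitch": 150.0, "formants": np.array([64, 128, 192, 256])},
    20: {"pitch": 170.0, "formants": np.array([70, 140, 200, 270])},
    5:  {"pitch": 290.0, "formants": np.array([64, 128, 192, 256])},
}
ranked = identify(158.6, np.array([65, 129, 193, 257]), speakers)
print(ranked)  # [13, 20] -- speaker 5 is pruned on pitch; 13 outranks 20
```

Note that speaker 5 has formants identical to speaker 13's yet never reaches the formant stage, which is exactly the computational saving (and the risk) of the pitch-based pruning step.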

Figure 9. Plot of the formant peak differences between the speaker.wav file and the ten selected trained files (sorted).

To verify the performance of the proposed Speaker Identification system, the speech signals of 80 speakers were recorded in a laboratory environment. For the identification phase, some speech signals were also recorded in the laboratory and in a noisy environment. We obtained about 90% accuracy for normal voices (laboratory environment), about 75% accuracy for twisted voices (a changed speaking style), and about 70% when the testing signal was noisy.

7. CONCLUSIONS

In this paper a closed-set, text-independent Speaker Identification system has been proposed using average pitch and formant analysis. The highest Speaker Identification accuracy is 91.75%, which satisfies practical demands. All experiments were done in a laboratory environment that was not fully noise-proof; the accuracy of the system should increase considerably in a fully noise-proof environment. We successfully extracted the feature parameters of each speech signal with a MATLAB implementation of the feature extraction. To characterize the signal, it was broken down into discrete parameters, which significantly reduces the memory required to store the signal data. It also shortens the computation time, because only a small, finite set of numbers is used for the parallel comparison of speaker identities. We hope one day to expand this work into an even better version of the Speaker Identification system.

REFERENCES

[1] K. Shikano, "Text-Independent Speaker Recognition Experiments using Codebooks in Vector Quantization," CMU Dept. of Computer Science, April 9, 1985.
[2] S. Furui, "An Overview of Speaker Recognition Technology," ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, 1994.
[3] Wikipedia, http://en.wikipedia.org/wiki/.
[4] M. Lincoln, "Characterization of Speakers for Improved Automatic Speech Recognition," Thesis, University of East Anglia, 1999.
[5] B. Atal, "Automatic Recognition of Speakers from Their Voices," Proceedings of the IEEE, Vol. 64, pp. 460-475, April 1976.
[6] H. Poor, An Introduction to Signal Detection and Estimation, New York: Springer-Verlag, 1985.
[7] R. Chan and M. Ko, "Speaker Identification by MATLAB," June 14, 2000.
[8] V. K. Ingle and J. G. Proakis, Digital Signal Processing Using MATLAB V4, PWS Publishing Company, 1997.
[9] T. D. Ganchev, "Speaker Recognition," PhD Thesis, Wire Communication Laboratory, Dept. of Computer Science and Engineering, University of Patras, Greece, November 2005.
[10] S. S. Nidhyananthan and R. S. Kumari, "Language and Text-Independent Speaker Identification System using GMM," WSEAS Transactions on Signal Processing, Vol. 9, pp. 185-194, 2013.
[11] D. A. Reynolds and R. C. Rose, "Robust Text-Independent Speaker Identification using Gaussian Mixture Speaker Models," IEEE Transactions on Speech and Audio Processing, Vol. 3, pp. 72-83, 1995.
[12] A. Chadha, D. Jyoti, and M. M. Roja, "Text-Independent Speaker Recognition for Low SNR Environments with Encryption," International Journal of Computer Applications, Vol. 31, pp. 43-50, 2011.
[13] D. Gerhard, "Pitch Extraction and Fundamental Frequency: History and Current Techniques," Technical Report, Dept. of Computer Science, University of Regina, 2003.
[14] D. Terez, "Fundamental Frequency Estimation using Signal Embedding in State Space," Journal of the Acoustical Society of America, Vol. 112, No. 5, p. 2279, November 2002.