
PERFORMANCE ANALYSIS OF MFCC AND LPC TECHNIQUES IN KANNADA PHONEME RECOGNITION

1 Kavya B. M., 2 Sadashiva V. Chakrasali
Department of E&C, M.S. Ramaiah Institute of Technology, Bangalore, India
Email: 1 kavyabm91@gmail.com, 2 Sadashiva.c@msrit.edu

Abstract- Speech is one of the oldest and most natural means of information exchange among human beings. Accurate phoneme recognition forms the backbone of most successful speech recognition systems. A collection of techniques exists to extract relevant features from the steady-state regions of phonemes, in both the time and frequency domains. Here we build an automatic phoneme recognition system based on the Hidden Markov Model (HMM), a dynamic modeling scheme. The Mel-Frequency Cepstrum Coefficient (MFCC) and Linear Predictive Coding (LPC) techniques are used for feature extraction. The performance of the two techniques with HMM classification is compared, with the goal of achieving a high recognition rate at low computational complexity. MFCC features with HMM give the higher recognition rate, while LPC with HMM is computationally less complex. Comparing the two techniques also improves the reliability of the system. The study has been carried out on five Kannada phonemes.

Keywords: Automatic Speech Recognition, HMM, MFCC, LPC, Kannada.

1. INTRODUCTION

Speech and hearing are the means of communication humans use most. For centuries people have tried to develop machines that can understand and produce speech as naturally as humans do. Speech recognition can be defined as the process of converting an acoustic signal, captured by a microphone, into a set of words. Automatic Speech Recognition (ASR) is one of the fastest developing fields in speech science and engineering. An ASR system for any language must be able to recognize the spoken sentences, words, syllables, and phonemes of that language.

Speech technology is the technology of today and tomorrow, with a growing number of methods and tools for better implementation. Speech recognition has a number of practical applications, both serious and entertainment-oriented. ASR has an interesting and useful role in expert systems, a technology whereby a computer can act as a substitute for a human expert. In a country like India, with its many dialect variations, this technology helps reduce the need for human staff trained in different languages.

To demonstrate these concepts, we have built a database of 5 Kannada phonemes. Each phoneme is recorded 120 times at a sampling rate of 8 kHz; 90 recordings are used for training and 30 for testing. In total, there are 450 phonemes for training and 150 phonemes for testing.

2. METHODOLOGY

I. Construction of database

A speaker-dependent system is built. All samples, for both training and testing, were recorded from native Kannada speakers. Audacity software is used to record the phonemes, which are stored in .wav format. Details of the database are shown in Table 1.
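For concreteness, here is a minimal sketch of how such a database might be loaded and split 90/30 for training and testing. The directory layout data/<phoneme>/<take>.wav is a hypothetical assumption; the paper does not describe how its files are organized.

import os
from scipy.io import wavfile

PHONEMES = ["a", "i", "ou", "ai", "au"]   # the five Kannada phonemes used

def load_dataset(root="data"):
    train, test = [], []
    for ph in PHONEMES:
        files = sorted(os.listdir(os.path.join(root, ph)))[:120]
        for idx, name in enumerate(files):
            rate, signal = wavfile.read(os.path.join(root, ph, name))
            assert rate == 8000, "recordings are sampled at 8 kHz"
            # first 90 takes for training, remaining 30 for testing
            (train if idx < 90 else test).append((ph, signal))
    return train, test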

Table 1: Phonemes used in speech recognition

Sl. No.   Phoneme            No. of samples taken
1         Short vowel /a/    120
2         Short vowel /i/    120
3         Short vowel /ou/   120
4         Diphthong /ai/     120
5         Diphthong /au/     120

II. Pre-processing

Since the recordings were made under normal conditions, background noise is present and it is important to remove it. The signals are then normalized.

III. Feature extraction

The goal of feature extraction is to represent a speech signal by a finite number of measures of the signal. Here, Mel-Frequency Cepstrum Coefficients (MFCC) and Linear Predictive Coding (LPC) coefficients are used for feature extraction.

Mel-Frequency Cepstrum Coefficients (MFCC): The Mel-Frequency Cepstrum (MFC) is a representation of the short-term power spectrum of a sound, and the MFCCs are the coefficients that collectively make up the MFC. The difference between the ordinary cepstrum and the mel-frequency cepstrum is that in the MFC the frequency bands are equally spaced on the mel scale. The mel scale is linear below 1000 Hz and logarithmic above 1000 Hz; in other words, the frequency filters are spaced linearly at low frequencies and logarithmically at high frequencies, which captures the phonetically important characteristics of speech. This is an important property of the human ear, so MFCC mimics the human auditory system. Fig. 1 shows the flowchart of the MFCC feature extraction method.

[Fig. 1: MFCC feature extraction: speech signal S -> divide S into frames -> Hamming windowing of each frame -> FFT on each windowed frame, results collected in a matrix C -> mel filter bank -> log and DCT -> MFCC coefficients.]

Step 1: After normalizing the signal, the signal is divided into frames.

Step 2: Each frame is windowed. Windowing minimizes the signal discontinuities at the beginning and end of the frame: the window tapers the signal to zero at the frame edges, which reduces spectral distortion. A Hamming window is used, whose form is

    w(n) = 0.53836 - 0.46164 cos( 2 pi n / (N - 1) ),   0 <= n <= N-1   ------> 2.1

where N is the number of samples in a frame and n is the sample index within the frame.

Step 3: The Fast Fourier Transform (FFT) is applied to each windowed frame. This step converts the samples from the time domain to the frequency domain. The FFT is a fast algorithm for implementing the Discrete Fourier Transform (DFT), which is defined over a set of N samples as
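A minimal NumPy sketch of steps 1-3 (normalization, framing, Hamming windowing, FFT). The 25 ms frame length and 10 ms shift are assumptions; the paper does not state its framing parameters.

import numpy as np

def frames_to_spectra(signal, rate=8000, frame_ms=25, shift_ms=10):
    signal = signal / (np.max(np.abs(signal)) + 1e-12)      # Step 1: normalize
    flen = rate * frame_ms // 1000
    fshift = rate * shift_ms // 1000
    n = np.arange(flen)
    hamming = 0.53836 - 0.46164 * np.cos(2 * np.pi * n / (flen - 1))  # eq. 2.1
    spectra = []
    for start in range(0, len(signal) - flen + 1, fshift):
        frame = signal[start:start + flen] * hamming        # Step 2: window
        spectra.append(np.abs(np.fft.rfft(frame)) ** 2)     # Step 3: FFT -> power
    return np.array(spectra)    # matrix C of Fig. 1, one row per frame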

    X(k) = sum_{n=0}^{N-1} x(n) e^{-j 2 pi k n / N},   k = 0, 1, ..., N-1   ------> 2.2

Step 4: Once the FFT has been calculated for all frames, the values are passed through the mel filter bank. The approximate formula for the mel frequency corresponding to a given frequency f in Hz is

    mel(f) = 2595 log10( 1 + f / 700 )   ------> 2.3

Step 5: In the final step, the log mel spectrum is converted back to the time domain using the discrete cosine transform (DCT); the result is the Mel-Frequency Cepstrum Coefficients (MFCC). With K mel filter bank channels and log filter bank energies S_k, the MFCCs are calculated as

    c_n = sum_{k=1}^{K} (log S_k) cos[ n (k - 1/2) pi / K ],   n = 1, 2, ..., K   ------> 2.4

By applying the above procedure to each frame, a set of MFCCs is calculated. Fig. 2 shows the MFCC of phoneme /a/.

[Fig. 2: MFCC of /a/]

Linear Predictive Coding (LPC): Linear predictive coding is a technique, used mostly in speech processing, for estimating basic speech parameters such as pitch, formants, and the spectral envelope of the speech signal in compressed form, based on a linear predictive model. LPC is one of the most useful methods for encoding good-quality speech at low bit rates. The current sample is predicted as a linear combination of past samples, with the predictor coefficients obtained by the autocorrelation or autocovariance method.

The LPC coefficients are obtained as follows. First, each windowed frame is autocorrelated up to the prediction order. Once the autocorrelation coefficients are calculated, the Levinson-Durbin recursion is used to find the LPC coefficients. At stage i, the diagonal coefficient of column i is calculated as

    A(i,i) = ( B(i+1) - sum_{j=1}^{i-1} A(i-1,j) B(i+1-j) ) / E(i),   E(1) = B(1)   ------> 2.5

where
    A = matrix of LPC coefficients (column i holding the order-i solution)
    B = vector of autocorrelation coefficients
    E = vector of prediction error energies

At the first stage the sum is empty, so the first coefficient of the first column is simply A(1,1) = B(2)/B(1). The prediction error energy is updated at each stage by

    E(i+1) = ( 1 - A(i,i)^2 ) E(i)   ------> 2.6

In the second stage, the second coefficient of the second column is calculated using equation 2.5, and the remaining coefficients of the second column (and, in general, of column i) are obtained from the previous column as

    A(i,j) = A(i-1,j) - A(i,i) A(i-1, i-j),   j = 1, ..., i-1   ------> 2.7

The procedure is repeated in this way until all coefficients are found; the last column of the matrix A gives the LPC coefficients. Fig. 3 shows the LPC coefficients of phoneme /a/.

[Fig. 3: LPC coefficients of /a/]

IV. Recognition using HMM

The Hidden Markov Model is the dynamic modeling scheme used to recognize the phonemes. A finite state machine with probabilistic state transitions is a Markov model; the Markov property states that the probability distribution of future states depends only on the present state. An HMM is
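A minimal sketch of the Levinson-Durbin recursion of equations 2.5-2.7, computing the order-p LPC coefficients of one windowed frame. The prediction order p = 10 is an assumption; the paper does not state the order it uses.

import numpy as np

def lpc(frame, p=10):
    # autocorrelation coefficients R(0)..R(p); these are B(1)..B(p+1) in the text
    full = np.correlate(frame, frame, mode="full")
    B = full[len(frame) - 1 : len(frame) + p]   # requires len(frame) > p
    a = np.zeros(p)      # current column of the coefficient matrix A
    E = B[0]             # eq. 2.5 initialization: E(1) = B(1)
    for i in range(1, p + 1):
        # eq. 2.5: reflection coefficient A(i,i)
        k = (B[i] - np.dot(a[:i - 1], B[i - 1:0:-1])) / E
        if i > 1:
            a[:i - 1] = a[:i - 1] - k * a[i - 2::-1]   # eq. 2.7
        a[i - 1] = k
        E = (1 - k * k) * E   # eq. 2.6
    return a             # last column of A: the LPC coefficients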

similar to a Markov model except that the states are hidden. The Viterbi algorithm is used to find the most likely sequence of hidden states. The extracted features are used to train the HMMs, and finally the phonemes are recognized.

3. RESULTS AND DISCUSSION

In case 1, where MFCC is used for feature extraction and HMM for classification, the accuracy for each class is shown in Table 2.

Table 2: Accuracy using MFCC and HMM

Class   Match    Accuracy (%)
/a/     18/30    60
/i/     29/30    96.67
/ou/    29/30    96.67
/ai/    30/30    100
/au/    30/30    100

Total system efficiency using MFCC and HMM = 90.67%

In case 2, where LPC is used for feature extraction and HMM for classification, the accuracy for each class is shown in Table 3.

Table 3: Accuracy using LPC and HMM

Class   Match    Accuracy (%)
/a/     19/30    63.33
/i/     30/30    100
/ou/    26/30    86.67
/ai/    28/30    93.33
/au/    18/30    60

Total system efficiency using LPC and HMM = 80.67%

Figs. 4 and 5 show the MFCC and LPC coefficients of all 5 phonemes. MFCC with HMM works better than LPC with HMM in terms of overall system efficiency. However, MFCC involves more steps to compute its coefficients (windowing, FFT, DCT), while LPC (autocorrelation) has fewer steps; this makes MFCC computationally more complex than LPC.

[Fig. 4: All the MFCC coefficients in a 3D plot]

[Fig. 5: All the LPC coefficients]

4. CONCLUSION

In this work, speaker-dependent Hidden Markov Modeling was applied to Kannada phoneme recognition, with features extracted by two different methods: Mel-Frequency Cepstrum Coefficients and Linear Predictive Coding coefficients. MFCC with HMM gave 10% better efficiency than LPC with HMM, but the simulation results show that MFCC requires more computation than LPC; MFCC is computationally more complex. So, depending on the application, the feature extraction technique can be chosen with some relaxation of accuracy or efficiency. In this work only five Kannada phonemes were modeled. This should be expanded to all Kannada phonemes, giving a complete Kannada phoneme recognition system that can in turn help in building a word recognition system.

5. REFERENCES
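A minimal sketch of per-phoneme HMM training and maximum-likelihood classification. The use of the third-party hmmlearn library, the 3-state topology, and diagonal Gaussian emissions are all assumptions; the paper does not name its HMM toolkit or model configuration.

import numpy as np
from hmmlearn import hmm

def train_models(train_feats):
    # train_feats: {phoneme: list of (n_frames, n_coeffs) feature arrays}
    models = {}
    for ph, seqs in train_feats.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)        # Baum-Welch training on all takes of phoneme ph
        models[ph] = m
    return models

def classify(models, feats):
    # pick the phoneme whose HMM assigns the highest log-likelihood
    return max(models, key=lambda ph: models[ph].score(feats))

With this setup, a test recording is converted to a feature matrix (MFCC or LPC, as above) and assigned to whichever phoneme model scores it highest; the per-class counts in Tables 2 and 3 correspond to this kind of maximum-likelihood decision over the 30 test takes per phoneme.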

[1] Prashanth Kannadaguli, Ananthakrishna Thalengala, "Phoneme Modeling for Speech Recognition in Kannada using Hidden Markov Model", IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems (SPICES), Feb. 2015.
[2] Thomas F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Pearson Education, 2011.
[3] L. R. Rabiner, R. W. Schafer, Digital Processing of Speech Signals, Pearson Education, 2005.
[4] Peri Bhaskararao, "Salient phonetic features of Indian languages for speech technology", Sādhanā, Vol. 36, Part 5, Oct. 2011, pp. 587-599.
[5] Hemakumar G., Punitha P., "Speech Recognition Technology: A Survey on Indian Languages", International Journal of Information Science and Intelligent System, Vol. 2, No. 4, 2013.
[6] Yusnita M. A., "Phoneme-based or Isolated-word modeling Speech Recognition System?", IEEE 7th International Colloquium on Signal Processing and its Applications, 2011.
[7] Hemakumar G., "Acoustic Phonetic Characteristics of Kannada Language", International Journal of Computer Science Issues, Vol. 8, Issue 6, No. 2, Nov. 2011.
[8] Youngjik Lee, Kyu-Woong Hwang, "Selecting Good Speech Features for Recognition", ETRI Journal, Vol. 18, No. 1, April 1996.
[9] N. K. Narayanan, T. M. Thasleema, V. Kabeer, "Malayalam Vowel Recognition Based on Linear Predictive Coding Parameters and k-NN Algorithm", International Conference on Computational Intelligence and Multimedia Applications, 2007.
[10] Prashanth Kannadaguli, Vidya Bhat, "A Comparison of Bayesian Multivariate Modeling and Hidden Markov Modeling (HMM) Based Approaches for Automatic Phoneme Recognition in Kannada", Recent and Emerging Trends in Computer and Computational Sciences (RETCOMP), 2015.