PERFORMANCE COMPARISON OF SPEECH RECOGNITION FOR VOICE ENABLING APPLICATIONS - A STUDY

V. Karthikeyan 1 and V. J. Vijayalakshmi 2
1 Department of ECE, VCEW, Thiruchengode, Tamilnadu, India, Karthick77keyan@gmail.com
2 Department of EEE, KPRCET, Coimbatore, Tamilnadu, India, vijik810@gmail.com

Abstract - A performance comparison of speech recognition for voice enabling applications is presented. This paper focuses on a speaker independent system that gives access to all mobile phone applications without touching the device, especially for the visually challenged. The proposed system is evaluated with two classification models: template generation using Dynamic Time Warping (DTW), and Hidden Markov Model (HMM) / Vector Quantization (VQ), both using Mel Frequency Cepstral Coefficient (MFCC) features. The two models are compared on average recognition accuracy. The HMM/VQ model with MFCC features gives a recognition rate of 82.77% for the utterances, which is higher than the conventional method. A TMS320C2x DSP processor can be used to implement the voice enabled mobile phone.

Index Terms - Template generation, Dynamic Time Warping (DTW), Hidden Markov Model (HMM) / Vector Quantization (VQ), Mel Frequency Cepstral Coefficient (MFCC)

INTRODUCTION

Speech recognition is the task of recognizing spoken words [1]. It enables a system to identify the words that a person speaks into a microphone and to convert them into written text. Automatic Speech Recognition (ASR) is the process of determining the sequence of words spoken by a human using a machine; its goal is to make speech a medium of interaction between man and machine. ASR is commonly used in voice enabled mobile phones, which take voice as an input. In these systems, the voice input must be recognized so that the appropriate action can be enabled. A voice enabled mobile phone lets a user dial a contact in the device memory by saying the digit assigned to the contact in speed dial or the name of the contact [2], send and receive messages, and put the phone into hold mode or loudspeaker mode.

The first step in voice enabling an application is speech recognition. To do so, features must be extracted from the uttered word; the word is then recognized by comparing the test sample with the trained samples. The most commonly used feature extraction front end is the Mel frequency cepstral coefficient. The power spectrum is filtered with a filter bank that resembles the sensitivity curve of the human ear, the Mel scale filter bank: the ear is sensitive to frequency variations at lower frequencies, and this sensitivity is reduced for higher frequency signal components. Experimental results show that Mel Frequency Cepstral Coefficients (MFCC) are among the best acoustic features for automatic speech recognition [3]. MFCCs are robust, carry much information about the vocal tract configuration regardless of the source of excitation, and can represent all classes of speech sounds.

This paper is organized as follows. Section 2 discusses the general principles of an ASR system. Section 3 deals with the MFCC feature extraction technique used in the proposed system, and section 4 with the two classification models. Section 5 presents the results and conclusion, including the accuracy of the two classification models.

AUTOMATIC SPEECH RECOGNITION SYSTEM

In general, an Automatic Speech Recognition (ASR) system [4] operates in two modes, a training mode and a testing mode. The input speech signal is first pre-processed and features are extracted. From the extracted features, reference samples are created, against which comparison and recognition are carried out. The general block diagram of an ASR system is shown in Figure 1.

Training Mode: In both speaker dependent and speaker independent systems, the system has to be trained by speakers: samples are collected from different speakers using a microphone as the input device, and accuracy grows with the number of samples. Preprocessing methods extract the acoustic characteristics of the speech signal. The software analyses the extracted feature vectors, generates patterns from them, and stores each pattern in matrix form as a reference pattern. These reference patterns are matched against the input speech during the recognition mode.

Testing Mode: In this mode, the test sample is analyzed for its acoustic characteristics and the important features are extracted from the input speech sample. These feature vectors are used to generate an input pattern, which is stored in matrix form. This unknown pattern is compared against the known reference patterns, element by element. Once the best match is found, the appropriate action is enabled.

Fig. 1. Block Diagram of Automatic Speech Recognition System
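The two modes can be made concrete with a short sketch. The following Python fragment is illustrative only (the function names and the simple Euclidean distance are assumptions, not the authors' implementation): training stores each extracted feature matrix as a reference pattern, and testing compares an unknown pattern against every stored pattern, element by element.

```python
import numpy as np

# word label -> reference pattern (a frames x coefficients matrix)
reference_patterns = {}

def train(word, feature_matrix):
    """Training mode: store the pattern extracted from an utterance."""
    reference_patterns[word] = feature_matrix

def recognize(test_matrix):
    """Testing mode: return the stored word closest to the input."""
    best_word, best_dist = None, np.inf
    for word, ref in reference_patterns.items():
        # naive element-by-element comparison; assumes patterns of
        # similar length (DTW, introduced below, lifts this restriction)
        n = min(len(ref), len(test_matrix))
        dist = np.linalg.norm(ref[:n] - test_matrix[:n])
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word
```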

FEATURE EXTRACTION

Feature extraction is the key front-end process in speaker verification systems, and the performance of such a system depends heavily on the quality of the selected speech features. The speech signal varies slowly with time and can be treated as quasi-stationary over short intervals, so short-time spectral analysis is the most common way to characterize it. Before feature extraction, the speech signal is preprocessed in two steps: i) framing and ii) windowing.

The continuous speech signal is divided into short fixed-length frames of M samples each [5]; very often successive frames overlap by N samples. The proposed system uses a frame size of M = 256 with a 50% overlap, i.e. N = 128. After frame segmentation, windowing is carried out to minimize spectral distortion: the window tapers the signal at both ends, reducing the side effects of the signal discontinuities that framing introduces at the beginning and end of each frame. A Hamming window is used because its spectral leakage is low. Each frame is multiplied by the window function given in eqn. (1),

w(n) = 0.54 - 0.46 cos(2πn / (M - 1)), 0 ≤ n ≤ M - 1, (1)

where M is the number of samples in each frame.

Several feature extraction techniques are available, such as Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), Bark Frequency Cepstral Coefficients (BFCC), Perceptual Linear Prediction (PLP) and RASTA Perceptual Linear Prediction. Among these, MFCC provides good recognition accuracy. MFCC vectors provide an estimate of the vocal tract filter [6]. The background noise energy level is evaluated at the beginning and end of the speech signal, and energy thresholds are applied to find the speech beginning and end points. The pre-emphasized speech signal is blocked into frames of M = 256 samples, with adjacent frames separated by N = 128 samples, and windowing then minimizes the signal discontinuities at the beginning and end of each analysis frame. The general block diagram of the MFCC technique is shown in Figure 2.

Fig. 2. Block Diagram of MFCC Technique

After windowing, the Fast Fourier Transform (FFT) is applied and the spectrum is passed through a 20-channel Mel scale triangular filter bank. The Mel scale is a critical-band frequency scale that accounts for frequency perception in the human auditory system. The Discrete Cosine Transform (DCT) is applied to the log Mel scale filter outputs, yielding 12 Mel frequency cepstral coefficients. Cepstral filtering (liftering) is then performed with eqn. (2),

c'(i) = (1 + (L/2) sin(πi / L)) c(i), i = 1, 2, ..., L, (2)

where L is the number of cepstral coefficients. The MFCCs c(i) are calculated from the log filter bank amplitudes {mj} using eqn. (3),

c(i) = √(2/N) Σ_{j=1..N} mj cos(πi (j - 0.5) / N), (3)

where N is the number of filter bank channels.
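The pipeline above, framing with M = 256 and N = 128, Hamming windowing per eqn. (1), the FFT, a 20-channel Mel filter bank, and the DCT of eqn. (3) yielding 12 coefficients, can be sketched in NumPy as follows. This is a minimal rendering under the stated parameter values, not the authors' MATLAB code; pre-emphasis, endpoint detection and the liftering of eqn. (2) are omitted for brevity.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_fft=256, fs=8000):
    """Triangular filters spaced evenly on the Mel scale."""
    pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        for k in range(lo, mid):
            fbank[j - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[j - 1, k] = (hi - k) / max(hi - mid, 1)
    return fbank

def mfcc(signal, fs=8000, M=256, N=128, n_filters=20, n_ceps=12):
    """MFCCs of a 1-D NumPy signal (assumed longer than one frame)."""
    # framing: M-sample frames, adjacent frames separated by N samples
    n_frames = 1 + (len(signal) - M) // N
    idx = np.arange(M)[None, :] + N * np.arange(n_frames)[:, None]
    frames = signal[idx]
    # Hamming window, eqn (1)
    n = np.arange(M)
    frames = frames * (0.54 - 0.46 * np.cos(2.0 * np.pi * n / (M - 1)))
    # magnitude spectrum via the FFT
    spec = np.abs(np.fft.rfft(frames, M))
    # log Mel filter bank amplitudes m_j (20 channels)
    m = np.log(spec @ mel_filterbank(n_filters, M, fs).T + 1e-10)
    # DCT, eqn (3): 12 cepstral coefficients per frame
    i = np.arange(1, n_ceps + 1)[:, None]
    j = np.arange(1, n_filters + 1)[None, :]
    dct = np.sqrt(2.0 / n_filters) * np.cos(np.pi * i * (j - 0.5) / n_filters)
    return m @ dct.T  # shape: (n_frames, n_ceps)
```

At fs = 8 kHz, each 256-sample frame spans 32 ms, with a new frame starting every 16 ms.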

SPEAKER MODELLING

Since the extracted features require considerable storage memory, they are converted into compact vector representations through classification modelling, which brings the storage requirement within reach. The proposed system is verified with a template model and with a Hidden Markov Model (HMM) / Vector Quantization (VQ) model. The block diagram of the proposed system is shown in Figure 3.

Fig. 3. Block Diagram of Proposed System

The template generation model uses Dynamic Time Warping (DTW) for speech pattern matching [7] in a speaker dependent system: it expands or contracts the time axis non-linearly to match the input speech with the reference template. DTW is used here because DTW-based recognition engines have been widely embedded inside Qualcomm MSM (Mobile Station Modem) chips for phone dialing, owing to their low computational complexity.

Template Generation using DTW

Templates are reference patterns derived from features recorded over the length of the whole word rather than at particular points. The procedure is as follows: i) choose one utterance from the training data as the reference; ii) use DTW to align all the training data to the reference; iii) once the training data are aligned, compute the reference pattern vector as the centroid of the feature vectors (of cepstral coefficients) corresponding to all occurrences of the digit.

DTW is used both to create the reference templates and to find the best match between a reference template and the input template derived from the test speech sample. The matching process must compensate for length differences and account for the non-linear nature of the length differences within the words. A DTW grid is used to find the best match between the input data and a stored sequence: we find a path through the grid that minimizes the total distance between them, stretching or compressing the input as needed to match the reference. Once an overall path has been found, the total distance between the input sequence and the reference sequence can be calculated for that input template. When sequences of different lengths are compared, the sequence length is modified by repeating or omitting some frames so that both sequences have the same length; this modification of the sequences is called time warping.
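A minimal sketch of the DTW grid search described above: dynamic programming fills the grid so that each cell holds the cheapest total distance to reach it, allowing frames to be repeated or omitted, and the value in the final cell is the total distance along the minimizing path. The function names are illustrative.

```python
import numpy as np

def dtw_distance(ref, test):
    """Total distance along the best path through the DTW grid.

    ref, test: feature matrices (frames x cepstral coefficients).
    """
    R, T = len(ref), len(test)
    D = np.full((R + 1, T + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, R + 1):
        for j in range(1, T + 1):
            cost = np.linalg.norm(ref[i - 1] - test[j - 1])
            # stretch (repeat a frame), match, or compress (omit a frame)
            D[i, j] = cost + min(D[i - 1, j], D[i - 1, j - 1], D[i, j - 1])
    return D[R, T]

def recognize_dtw(test, templates):
    """Pick the reference template with the smallest warped distance."""
    return min(templates, key=lambda word: dtw_distance(templates[word], test))
```

Because the grid admits vertical and horizontal moves as well as diagonal ones, templates and inputs of different lengths are compared directly; this is the time warping described above.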

Hidden Markov Model with Vector Quantization (HMM/VQ)

In the proposed method, speaker independent isolated word recognition is implemented using Vector Quantization (VQ) and a Hidden Markov Model (HMM), a combination suited to giving a higher accuracy rate as the number of samples grows. One of the most popular approaches to speaker independent speech recognition [8] is the combination of VQ, for encoding segments of speech, with HMM modelling, for classifying the resulting sequences of segments [9], as shown in Figure 4.

Fig. 4. Block Diagram of Hidden Markov Model with Vector Quantization

After feature extraction, the K-means clustering algorithm is used to iteratively build the vector quantizer codebook until the average distance falls below a preset threshold. A set of such vectors, corresponding to multiple utterances of the same word, is used to re-estimate the HMM [10] for that word, and this procedure is repeated for each word in the vocabulary. In the testing mode, the set of MFCC vectors of the unknown word is quantized by the vector quantizer into a vector of codebook indices. This index sequence is scored against each word HMM to give a probability score for each word model, and the decision rule chooses the word whose model gives the highest probability.

VQ is the process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and is represented by its center, called a centroid; the collection of all code words is called a codebook [11]. In speech recognition, vector quantization can be used to train HMMs.
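The HMM/VQ chain can be outlined as below. This is an illustrative sketch, not the authors' implementation: it runs K-means for a fixed number of iterations rather than until the average distance falls below a preset threshold, and it assumes each word model is supplied as log initial-state, log transition and log emission matrices (re-estimation of those matrices, e.g. by Baum-Welch, is not shown).

```python
import numpy as np

def train_codebook(features, n_codewords=64, n_iter=25, seed=0):
    """Build the vector quantizer codebook with K-means clustering."""
    rng = np.random.default_rng(seed)
    # initialize centroids from the training vectors (assumes enough data)
    codebook = features[rng.choice(len(features), n_codewords, replace=False)]
    for _ in range(n_iter):
        # assign every MFCC vector to its nearest centroid ...
        d = np.linalg.norm(features[:, None] - codebook[None, :], axis=2)
        labels = d.argmin(axis=1)
        # ... then move each centroid to the mean of its cluster
        for k in range(n_codewords):
            if np.any(labels == k):
                codebook[k] = features[labels == k].mean(axis=0)
    return codebook

def quantize(features, codebook):
    """Map MFCC vectors to codebook indices (the discrete symbols)."""
    d = np.linalg.norm(features[:, None] - codebook[None, :], axis=2)
    return d.argmin(axis=1)

def log_forward(obs, log_pi, log_A, log_B):
    """Log-likelihood of an index sequence under a discrete HMM."""
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return np.logaddexp.reduce(alpha)

def recognize_hmm(features, codebook, word_models):
    """Decision rule: the word whose HMM scores highest wins."""
    obs = quantize(features, codebook)
    return max(word_models, key=lambda w: log_forward(obs, *word_models[w]))
```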

RESULTS AND DISCUSSION

The proposed speaker independent isolated speech recognition system is implemented in MATLAB. The database contains the digits 0-9 and 8 words, Dial, Cut, On, Off, Hold, Read, Write and Loudspeaker, chosen for voice dialing applications. The digits dial a contact number assigned to speed dial. The words Dial and Cut control the voice dialing application; On and Off switch the mobile phone on and off; Hold puts the phone on hold. The words Read and Write enable the message reading and writing applications, and Loudspeaker enables hands-free mode. The database is collected from 25 speakers pronouncing each word 10 times, yielding 250 utterances per word and 180 utterances per speaker, 4500 samples in total. The first 8 utterances of each word are used for training and the remainder for testing. In the training phase, the uttered words are recorded using 8-bit Pulse Code Modulation (PCM) at a sampling rate of 8 kHz and converted to wave files using Total Audio Converter software.

The performance of the speech recognition systems is given as a recognition rate (%), calculated from eqn. (4),

W = (M / N) × 100%, (4)

where N is the total number of samples taken and M is the number of correctly recognized samples. In the confusion matrix, each uttered word is compared with all the other words, showing how many times each word was correctly recognized; the average recognition accuracy is calculated from this matrix.

Table 1 shows the results of the speech recognition system using the template generation model with MFCC features. As shown in Table 1, digits 0 and 2 have the highest accuracy rate of 60%, digits 4 and 9 and the word Cut have the lowest accuracy rate of 50%, and the word Off has the second highest accuracy rate of 58%.

In the HMM/VQ model, the codebook is generated by the VQ method; during codebook generation, n clusters are formed after a number of iterations, as shown in Table 2. As shown in Table 3, digit 0 has the highest accuracy rate of 88% and digit 2 the lowest of 78%. Digit 9 and the word Dial have the next highest accuracy rate of 87%; Hold and Off reach 85%; Loudspeaker and On reach 84%; digits 5 and 8 and the word Write reach 83%; digit 3 and the word Cut reach 82%; and digit 6 reaches 81%, followed by digit 4 and the word Read at 80%.

A comparison of the recognition rates of the two classification models with MFCC features is shown in Table 4, which implies that the HMM/VQ model has a higher recognition rate than the template generation model.

CONCLUSION

The proposed system designed using HMM/VQ classification modelling with MFCC features gives a recognition rate of 82.77% for the utterances, which is higher than the 53.88% of the template generation model using DTW. The system can further be realized in hardware on a DSP processor, making embedded voice enabling applications possible; a TMS320C2x DSP processor can be used to implement the voice enabled mobile phone.


References

[1] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall Inc., 1978; J. Iso-Sipila, Design and Implementation of a Speaker-Independent Voice Dialing System: A Multi-Lingual Approach, Ph.D. thesis, April 2008.
[2] T. Schurer, "An experimental comparison of different feature extraction methods for telephone speech," in Proc. 2nd IEEE Workshop on Interactive Voice Technology for Telecommunication Applications, 1994.
[3] J. Chong and R. Togneri, "Speaker Independent Recognition of Small Vocabulary," M.S. thesis, Centre for Intelligent Information Processing Systems, The University of Western Australia.
[4] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, New Jersey, 1993.
[5] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-28, no. 4, pp. 357-366, August 1980.
[6] F. Itakura, "Minimum prediction residual principle applied to speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-23, no. 1, pp. 67-72, 1975.
[7] R. W. Schafer, "Scientific bases of human-machine communication by voice," in Proceedings of the National Academy of Sciences, vol. 92, 1995.
[8] M. Hoshimi, M. Miyata and S. Hiraoka, "Speaker independent speech recognition method using training speech from a small number of speakers," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 469-472, 1992.
[9] M. Nilsson and M. Ejnarsson, "Speech Recognition Using HMM: Performance Evaluation in Noisy Environments," M.S. thesis, Blekinge Institute of Technology, Department of Telecommunications and Signal Processing, 2002.
[10] M. A. Ferrer, I. Alonso and C. Travieso, "Influence of initialization and stop criteria on HMM based recognizers," Electronics Letters, vol. 36, pp. 1165-1166, June 2000.
[11] L. R. Rabiner, S. E. Levinson, A. E. Rosenberg and J. G. Wilpon, "Speaker-independent recognition of isolated words using clustering techniques," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 336-349, 1979.