PERFORMANCE COMPARISON OF SPEECH RECOGNITION FOR VOICE ENABLING APPLICATIONS - A STUDY

V. Karthikeyan 1 and V. J. Vijayalakshmi 2
1 Department of ECE, VCEW, Thiruchengode, Tamilnadu, India, Karthick77keyan@gmail.com
2 Department of EEE, KPRCET, Coimbatore, Tamilnadu, India, vijik810@gmail.com

Abstract - A performance comparison of speech recognition methods for voice enabling applications is presented. This paper focuses on a speaker independent system that makes mobile phone applications available without touching the device, especially for visually challenged users. The proposed system is evaluated with two classification models, both using Mel Frequency Cepstral Coefficient (MFCC) features: template generation using Dynamic Time Warping (DTW), and Hidden Markov Model (HMM) / Vector Quantization (VQ). The two models are compared on average recognition accuracy. HMM/VQ classification with MFCC features gives a recognition rate of 82.77% for the utterances, which is higher than the 53.88% obtained with DTW template generation. A TMS320C2x DSP processor can be used to implement a voice enabled mobile phone.

Index Terms - Template generation, Dynamic Time Warping (DTW), Hidden Markov Model (HMM) / Vector Quantization (VQ), Mel Frequency Cepstral Coefficient (MFCC)

INTRODUCTION

Speech recognition is a field in which a system recognizes spoken words [1]. It enables the system to identify the words that a person speaks into a microphone and to convert them into written text. Automatic Speech Recognition (ASR) is the process of determining, by machine, the sequence of words spoken by a human. The goal of ASR is to make speech a medium of interaction between man and machine. ASR is commonly used in voice enabled mobile phones, which take voice as an input.
In these systems, the voice input must be recognized so that the appropriate action can be enabled. A voice enabled mobile phone allows a user to dial a contact in the device memory by saying the speed dial digit or the name of the contact [2], to send and receive messages, and to put the phone in hold mode or loudspeaker mode. The first step in a voice enabling application is speech recognition, which requires extracting features from the uttered word. The uttered word is recognized by comparing the test sample with the trained samples. The most commonly used feature extraction algorithm is the Mel Frequency Cepstral Coefficient (MFCC) front end. In this front end, the power spectrum is filtered with a filter bank that resembles the human ear's sensitivity curve, called the Mel scale filter bank. While the human ear is sensitive to frequency variations at lower frequencies, this sensitivity is reduced for higher frequency
signal components. Experimental results have shown that Mel Frequency Cepstral Coefficients (MFCC) are among the best acoustic features used in automatic speech recognition [3]. MFCCs are robust, contain much information about the vocal tract configuration regardless of the source of excitation, and can be used to represent all classes of speech sounds.

This paper is organized as follows. Section 2 discusses the general principles of an ASR system. Section 3 deals with the MFCC feature extraction technique used in the proposed system, and section 4 deals with the two classification models. Section 5 presents the results and conclusion, which show the accuracy of the two classification models.

AUTOMATIC SPEECH RECOGNITION SYSTEM

In general, an Automatic Speech Recognition (ASR) system [4] operates in two modes: training mode and testing mode. The input speech signal is first pre-processed and features are then extracted. Reference samples are created from the extracted features, and comparison and recognition are performed against them. The general block diagram of an ASR system is shown in Fig. 1.

Training Mode: In both speaker dependent and speaker independent systems, the system has to be trained by the speaker. This means that samples have to be collected from different speakers using a microphone as the input device. Recognition accuracy generally improves with the number of training samples. Preprocessing methods are used to extract the acoustic characteristics of the speech signal. The software analyses the extracted feature vectors, generates patterns from them, and stores each pattern in matrix form as a reference pattern. These reference patterns are matched with the input speech during the recognition mode.

Testing Mode: In this mode, the test sample is analyzed for its acoustic characteristics and the important features are extracted from the input speech sample.
These feature vectors are used to generate an input pattern, which is stored in matrix form. This unknown pattern is compared against the known reference patterns, element by element. Once the best match is found, the appropriate action is enabled.

Fig. 1. Block Diagram of Automatic Speech Recognition System

FEATURE EXTRACTION

Feature extraction is the key to the front-end process in speaker verification systems. The performance of a speaker verification system is highly dependent on the quality of the selected
speech features. The speech signal varies slowly and can be treated as stationary over short intervals. Therefore, short-time spectral analysis is the most common way to characterize a speech signal. Before the features are extracted, the speech signal is preprocessed in two steps: i) framing and ii) windowing.

In framing, the continuous speech signal is divided into short fixed-length frames, each consisting of M samples [5]. Successive frames usually overlap each other by N samples (N < M). For the proposed system, a frame size of M = 256 with an overlap of 50%, i.e. N = 128, has been used.

After frame segmentation, windowing is carried out to minimize spectral distortion: the window tapers the signal at both ends of each frame, reducing the side effects of the discontinuities that framing introduces at the beginning and end of the frame. A Hamming window is used because its spectral leakage is low. It is multiplied with each frame, and the window function is given in eqn. (1),

$$ w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{M-1}\right), \quad 0 \le n \le M-1 \qquad (1) $$

where M is the number of samples in each frame.

Several feature extraction techniques are available, such as Mel Frequency Cepstral Coefficient (MFCC), Linear Prediction Cepstral Coefficient (LPCC), Bark Frequency Cepstral Coefficient (BFCC), Perceptual Linear Prediction (PLP) and RASTA Perceptual Linear Prediction. Among these, MFCC provides good recognition accuracy. MFCC vectors are used to provide an estimate of the vocal tract filter [6].

The background noise energy level is evaluated at the beginning and the end of the speech signal, and energy thresholds are applied to find the speech beginning and end points. The pre-emphasized speech signal is blocked into frames of M = 256 samples, with adjacent frames separated by N = 128 samples. Windowing is then applied to minimize the speech signal discontinuities at the beginning and end of each analysis frame.
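The framing and windowing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MATLAB implementation: the function names are hypothetical, the Hamming window uses the standard definition w(n) = 0.54 - 0.46 cos(2 pi n / (M - 1)), and trailing samples that do not fill a complete frame are simply dropped, which is an assumption the paper does not state.

```python
import math


def hamming_window(frame_len):
    """Standard Hamming window of length frame_len (assumed form of eqn. 1)."""
    if frame_len == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]


def frame_signal(samples, frame_len=256, hop=128):
    """Split samples into frames of frame_len, advancing by hop samples.

    With frame_len=256 and hop=128 adjacent frames overlap by 50%.
    Assumption: samples left over after the last full frame are dropped.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


def windowed_frames(samples, frame_len=256, hop=128):
    """Frame the signal and multiply every frame by the Hamming window."""
    window = hamming_window(frame_len)
    return [[s * w for s, w in zip(frame, window)]
            for frame in frame_signal(samples, frame_len, hop)]


# Synthetic test input: one second of a 440 Hz tone sampled at 8 kHz.
signal = [math.sin(2.0 * math.pi * 440.0 * t / 8000.0) for t in range(8000)]
frames = windowed_frames(signal)
```

Each element of `frames` is one tapered 256-sample frame that would then be passed to the FFT stage of the MFCC front end.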
The general block diagram of the MFCC technique is shown in Fig. 2.

Fig. 2. Block Diagram of MFCC Technique

After windowing, the Fast Fourier Transform (FFT) is applied. The resulting spectrum is then passed through a bank of 20 Mel scale triangular filters. The Mel scale is a critical band frequency scale that takes into account frequency perception in the human auditory system. The Discrete Cosine Transform (DCT) is applied to the log Mel scale filter outputs, yielding 12 Mel Frequency Cepstral Coefficients (MFCC). Cepstral filtering is then performed with the help of eqn. (2),
(2)
where N is the number of filter bank channels. The Mel Frequency Cepstral Coefficients (MFCCs) are calculated from the log filter bank amplitudes {m_j} using eqn. (3),

$$ c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\left(\frac{\pi i}{N}\,(j - 0.5)\right) \qquad (3) $$

where N is the number of filter bank channels.

SPEAKER MODELLING

Because the extracted features require a large amount of storage memory, they are converted into a compact vector representation by classification modeling so that the storage requirement can be met. The proposed system is verified through the Template model and Hidden Markov Model (HMM) / Vector Quantization (VQ). The block diagram of the proposed system is shown in Fig. 3.

Fig. 3. Block Diagram of Proposed System

Template generation modeling uses Dynamic Time Warping (DTW) for speech pattern matching [7] in a speaker dependent system: it expands or contracts the time axis non-linearly to match the input speech with the reference template. DTW is used here because DTW-based recognition engines have been widely embedded inside Qualcomm MSM (Mobile Station Modem) chips, owing to their low computational complexity for phone dialing.

Template Generation using DTW

Templates are reference patterns that are derived from features recorded over the length of the whole word rather than at particular points. The procedure is as follows:
1. Choose one utterance from the training data as the reference.
2. Use DTW to align all the training data with the reference.
3. Once the training data are aligned, compute the reference pattern vector as the centroid of the feature vectors (of cepstral coefficients) corresponding to all the occurrences of the digit.

DTW is used both to create reference templates and to find the best match between the reference template and the input template derived from the test input speech sample. The matching process needs to compensate for length differences and take account of the non-linear
nature of the length differences within the words. A DTW grid is used to find the best match between the input data and the stored sequence: a path is found through the grid that minimizes the total distance between them. The input data is either stretched or compressed in order to match the reference. Once an overall path has been found, the total distance between the input sequence and the reference sequence can be calculated for this particular input template. When DTW compares sequences of different lengths, the sequence length is modified by repeating or omitting some frames so that both sequences have the same length. This modification of sequences is called time warping.

Hidden Markov Model with Vector Quantization (HMM/VQ)

In the proposed method, speaker independent isolated word recognition is implemented using Vector Quantization (VQ) and Hidden Markov Model (HMM), a combination suited to providing a higher accuracy rate with a larger number of samples. One of the most popular approaches to speaker independent [8] speech recognition is the combination of VQ for encoding segments of speech with HMM for classifying sequences of segments [9], as in Fig. 4.

Fig. 4. Block Diagram of Hidden Markov Model with Vector Quantization

After extracting the features, the K-means clustering algorithm is used to iteratively create the vector quantizer codebook until the average distance falls below a preset threshold.
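The codebook construction just described can be illustrated with a small K-means sketch in plain Python. It is only a sketch under stated assumptions: the function names are hypothetical, the centroids are initialised from evenly spaced training vectors, Euclidean distance is used, and training also stops when the centroids no longer change or after `max_iter` passes. None of these choices is specified in the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_index(vector, codebook):
    """Index of the code word closest to vector."""
    return min(range(len(codebook)), key=lambda i: euclidean(vector, codebook[i]))


def train_codebook(vectors, k, threshold=1e-3, max_iter=50):
    """Build a k-entry VQ codebook with K-means.

    Stops when the average distance to the nearest centroid falls below
    threshold, when the centroids stop changing, or after max_iter passes
    (the last two stop rules are assumptions added for safety).
    """
    if not 0 < k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    step = len(vectors) // k
    # Assumption: initial centroids are evenly spaced training vectors.
    codebook = [list(vectors[i * step]) for i in range(k)]
    for _ in range(max_iter):
        labels = [nearest_index(v, codebook) for v in vectors]
        avg_dist = sum(euclidean(v, codebook[l])
                       for v, l in zip(vectors, labels)) / len(vectors)
        if avg_dist < threshold:
            break
        new_codebook = []
        for i in range(k):
            members = [v for v, l in zip(vectors, labels) if l == i]
            if members:
                dim = len(members[0])
                new_codebook.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)])
            else:
                new_codebook.append(codebook[i])  # keep an empty cluster's centroid
        if new_codebook == codebook:
            break
        codebook = new_codebook
    return codebook


def quantize(vectors, codebook):
    """Map each feature vector to the index of its nearest code word."""
    return [nearest_index(v, codebook) for v in vectors]


# Toy 2-D "feature vectors" forming two well separated groups.
toy = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
toy_codebook = train_codebook(toy, k=2)
toy_indices = quantize(toy, toy_codebook)
```

In the HMM/VQ scheme, `quantize` plays the role of the vector quantizer: it turns the sequence of MFCC vectors of an utterance into a sequence of codebook indices that a discrete HMM can be trained on or scored against.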
A set of the resulting quantized vectors (corresponding to multiple utterances of the same word) is used to re-estimate the Hidden Markov Model [10] for that word. This procedure is repeated for each word in the vocabulary. In the testing mode, the set of MFCC vectors corresponding to the unknown word is quantized by the vector quantizer to give a vector of codebook indices. This vector is scored against each word's HMM to give a probability score for each word model. The decision rule chooses the word whose model gives the highest probability.

VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and is represented by its center, called a centroid or code word. The collection of all code words is called a codebook [11]. In speech recognition, vector quantization can be used to train HMMs.

RESULTS AND DISCUSSION

The proposed speaker independent isolated speech recognition system is implemented in
MATLAB software. A database containing the digits 0-9 and the 8 words Dial, Cut, On, Off, Hold, Read, Write and Loudspeaker is considered for voice dialing applications. The digits are used to dial a contact number assigned to speed dial. The words Dial and Cut are used to control the voice dialing application, and On, Off and Hold to control the mobile phone on and off operation. The words Read and Write enable the message read and write applications, and Loudspeaker enables hands free mode.

The database is collected from 25 speakers, each pronouncing every word 10 times, yielding 250 utterances per word and 180 utterances per speaker, 4500 samples in total. The first 8 utterances of each word are used for training and the remaining ones for testing. In the training phase, the uttered digits are recorded using 8-bit Pulse Code Modulation (PCM) at a sampling rate of 8 kHz and converted to wave files using Total Audio Converter software.

The performance of the speech recognition systems is given in terms of recognition accuracy (%), calculated from eqn. (4),

$$ W = \frac{M}{N} \times 100\% \qquad (4) $$

where N is the total number of samples taken and M is the total number of recognized samples.

In the confusion matrix, each uttered word is compared with all other words; the matrix shows how many times each word was correctly recognized, and the average recognition accuracy is calculated from it. Table 1 shows the result of the speech recognition system using Template generation modeling with MFCC features. As shown in Table 1, digits 0 and 2 have the highest accuracy rate of 60%, and digits 4 and 9 and the word Cut have the lowest accuracy rate of 50%. The word Off has the second highest accuracy rate of 58%.

In the HMM/VQ model, the codebook is generated using the Vector Quantization (VQ) method. During codebook generation, n clusters are produced after varying numbers of iterations, as shown in Table 2.
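The accuracy measure of eqn. (4) and the database sizes quoted above can be checked with a short Python sketch. The function names and the per-word counts in `example` are hypothetical and serve only to show the arithmetic; they are not values taken from the paper's tables.

```python
def recognition_accuracy(recognized, total):
    """Eqn. (4): W = (M / N) * 100, with M recognized samples out of N."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * recognized / total


def average_accuracy(per_word_accuracy):
    """Mean of the per-word accuracies, as read off a confusion matrix."""
    return sum(per_word_accuracy.values()) / len(per_word_accuracy)


# Database sizes stated in the text: 25 speakers, 18 words, 10 repetitions.
SPEAKERS, WORDS, REPETITIONS = 25, 18, 10
utterances_per_word = SPEAKERS * REPETITIONS          # 250
utterances_per_speaker = WORDS * REPETITIONS          # 180
total_utterances = SPEAKERS * WORDS * REPETITIONS     # 4500

# Hypothetical counts, for illustration only.
example = {
    "word_a": recognition_accuracy(30, 50),
    "word_b": recognition_accuracy(25, 50),
}
example_average = average_accuracy(example)
```

Averaging the per-word percentages in this way corresponds to the average recognition accuracy used to compare the two classification models.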
As shown in Table 3, digit 0 has the highest accuracy rate of 88% and digit 2 has the lowest accuracy rate of 78%. Digit 9 and the word Dial have the next highest accuracy rate of 87%. The words Hold and Off have an accuracy rate of 85%. The words Loudspeaker and On have an accuracy rate of 84%. Digits 5 and 8 and the word Write have an accuracy rate of 83%. Digit 3 and the word Cut have an accuracy rate of 82%. Digit 6 has an accuracy rate of 81%, followed by digit 4 and the word Write, which have an accuracy rate of 80%. A comparison between the recognition rates of the two classification models with MFCC features is shown in Table 4, which shows that HMM/VQ has a higher recognition rate than Template generation modeling.

CONCLUSION

The proposed system, designed using Hidden Markov Model (HMM) / Vector Quantization (VQ) classification modeling with Mel Frequency Cepstral Coefficient (MFCC) features, gives a recognition rate of 82.77% for the utterances, which is higher than the 53.88% obtained by Template generation modeling using Dynamic Time Warping (DTW). The system can further be implemented in hardware on a DSP processor so that voice enabling applications can be built. A TMS320C2x DSP processor can be used to implement a voice enabled mobile phone.

References

[1] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall Inc., 1978; Juha Iso-Sipila, Design and Implementation of a Speaker-Independent Voice Dialing System: A Multi-Lingual Approach, Ph.D. thesis, April 2008.
[2] Tilo Schurer, An Experimental comparison of different feature extraction for Telephone speech, in Proc. of 2nd IEEE Workshop on Interactive Voice Technology for Telecommunication Applications, 1994.
[3] Jason Chong and Roberto Togneri, Speaker Independent Recognition of Small Vocabulary, M.S. Thesis, Centre for Intelligent Information Processing Systems, The University of Western Australia.
[4] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, New Jersey, 1993.
[5] S. B. Davis and P. Mermelstein, Comparison of Parametric representations for Monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech, Signal Processing, vol. ASSP-28(4), pp. 357-366, August 1980.
[6] F. Itakura, Minimum prediction residual applied to speech recognition, IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-23(1), pp. 67-72, 1975.
[7] R. W. Schafer, Scientific bases of human-machine communication by voice, in Proceedings of the National Academy of Sciences, Vol. 92, 1995.
[8] M. Hoshimi, M. Miyata and S. Hiraoka, Speaker independent speech recognition method using training from a small number of speakers, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 469-472, 1992.
[9] M. Nilsson and M. Ejnarsson, Speech Recognition Using HMM: Performance Evaluation in Noisy Environments, MS Thesis, Blekinge Institute of Technology, Department of Telecommunications and Signal Processing, 2002.
[10] M. A. Ferrer, I. Alonso and C. Travieso, Influence of initialization and Stop Criteria on HMM based recognizers, Electronics Letters, Vol. 36, pp. 1165-1166, June 2000.
[11] L. R. Rabiner, S. E. Levinson, A. E. Rosenberg and J. G. Wilson, Speaker independent Recognition of isolated words using clustering techniques, IEEE Trans. Acoustics, Speech, Signal Process., Vol. 27, pp. 336-349, 1979.