Performance Analysis of Spoken Arabic Digits Recognition Techniques


JOURNAL OF ELECTRONIC SCIENCE AND TECHNOLOGY, VOL., NO., JUNE

Performance Analysis of Spoken Arabic Digits Recognition Techniques

Ali Ganoun and Ibrahim Almerhag

Abstract: A performance evaluation of sound recognition techniques in recognizing spoken Arabic words, namely the digits from zero to nine, is presented. A notable characteristic of the Arabic digits is that all of them, except for zero, are polysyllabic words. The performance analysis is based on different features of isolated spoken Arabic digits. The main aim of this paper is to compare, analyze, and discuss the outcomes of spoken Arabic digit recognition systems based on three recognition features: Yule-Walker spectrum features, Walsh spectrum features, and Mel-frequency cepstral coefficient (MFCC) features. The MFCC-based recognition system achieves the best average correct recognition, whereas the Yule-Walker-based system achieves the worst.

Index Terms: Arabic digits, spectrum analysis, speech recognition.

1. Introduction

Automatic speech recognition (ASR) is a technology that allows an electronic platform, such as a smartphone or a computer, to identify spoken words. Automatic recognition of spoken digits is one of the challenging tasks in the field of ASR. Spoken-digit recognition systems are used in many applications, such as recognizing telephone numbers, telephone dialing by speech, airline reservation, and automatic directories to retrieve or send information []. The main advantage of automatic spoken-digit recognition systems is the ease of speech input, which does not require any specialized skill. Another advantage is that the information can be recorded even while the user is engaged in other activities.

Manuscript received June; revised June; presented at the 2nd International Conference on Signal, Image Processing and Applications, Hong Kong, August.

A. Ganoun is with the Faculty of Engineering, University of Tripoli, Tripoli 75, Libya (e-mail: ali.ganoun@ee.edu.ly). I. Almerhag is with the Faculty of Information Technology, University of Tripoli, Tripoli 75, Libya (e-mail: almerhag@hotmail.com). Digital Object Identifier: .969/j.issn.67-86X...

However, the automatic recognition of spoken digits is not straightforward, because it involves a number of problems: different durations of the same word sound; redundancy in the speech signal, which makes discrimination between spoken digits difficult; temporal and frequency variability in the pronunciation of spoken digits; and signal degradation due to the different types of noise found in the signal.

The interest in this work is motivated by the minimal effort devoted so far to applying known speech recognition techniques to the Arabic language in comparison with other languages. In addition, we believe that the performance of recognition systems is language dependent; therefore, conclusions drawn from evaluating recognition techniques on other languages may not apply to the Arabic language [].

The main aim of this paper is to compare, analyze, and evaluate the accuracy of a single-speaker spoken Arabic digit recognition system using three features to represent the sound signals: Yule-Walker spectrum analysis, the Walsh spectrum, and Mel-frequency cepstral coefficient (MFCC) analysis. The performance evaluation of the recognition system is based on the overall system performance and the individual digit accuracy using two parameters: normalization of the sound feature vector and filtering of the sound feature vector [].

The rest of the paper is organized as follows. Section 2 presents a description of the database used by the system. Section 3 gives a brief description of the feature extraction processes. Section 4 discusses the experimental setup.
Section 5 presents the results of the comparisons obtained in this work. The paper concludes with Section 6.

2. Database Preparation

In order to evaluate the selected recognition techniques, a database of the sounds of the Arabic digits (0 to 9) was created. A male native Arabic speaker was asked to utter all the digits; each time, the speech was recorded in a single file which was approximately s long. This process was repeated 13 times, so that 13 speech files were collected, each containing all the Arabic digits. Every speech file contained both speech signals and non-speech signals. Then, each file was analyzed by a

detection program in order to locate and segment each spoken digit accurately. In this process, two measures were used in the segmentation of the sound signals: the zero-crossing rate and the signal energy. The set of recorded samples was divided into two groups: one group, consisting of ten samples, was chosen to form the dataset, while the remaining three samples were used as a test set.

3. Feature Extraction

Speech is a signal consisting of a finite number of samples, yet a direct comparison between signals is impossible because the amount of information they contain is large. Therefore, the most important features have to be extracted; this process is called feature extraction. The main objective of this step is to transform the original data into a dataset with a reduced number of variables that contain the most discriminatory information, providing a relevant set of features for a classifier and resulting in improved recognition performance []. An example of a recorded speech file with the segmented spoken digits is shown in Fig. 1. Another goal is to recover new, meaningful underlying variables or features, so that the data may be viewed with a reduced bandwidth compared with the input data. Most feature extraction methods use spectral analysis to extract meaningful components from the speech signal, and choosing effective features is important for achieving high recognition performance. In this paper, three features were used in the comparison: Yule-Walker spectrum analysis, Walsh spectrum analysis, and MFCC.

Fig. 1. Example of sound signals: (a) recorded sound signal of Arabic spoken digits and (b) segmentation of the sound signal.

The Yule-Walker algorithm estimates the spectral content of the sound signal by fitting an auto-regressive linear prediction filter model of a given order to the signal.
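As an illustration, the Yule-Walker spectrum estimate can be sketched as follows. This is a minimal NumPy implementation; the model order and frequency-grid size are illustrative choices, not parameters taken from the paper.

```python
import numpy as np

def yule_walker_spectrum(x, order=8, n_freqs=512):
    """Estimate a power spectrum by fitting an autoregressive (AR)
    model of the given order to the signal via the Yule-Walker
    equations, then evaluating the AR model's spectrum."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    # Solve the Toeplitz system R a = r[1:] for the AR coefficients
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    sigma2 = r[0] - np.dot(a, r[1:])  # driving-noise variance
    # PSD(w) = sigma2 / |1 - sum_k a_k e^{-jwk}|^2 on a grid over [0, pi]
    w = np.linspace(0.0, np.pi, n_freqs)
    k = np.arange(1, order + 1)
    denom = np.abs(1.0 - np.exp(-1j * np.outer(w, k)) @ a) ** 2
    return sigma2 / denom
```

For a signal dominated by a single tone, the estimated spectrum shows a sharp peak near the tone's frequency, which is what makes this representation usable as a per-digit feature vector.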
Cepstral-based features, such as MFCC, typically represent the magnitude of the frequency-band power for each speech window, and are widely used in speech processing. The comparison between the test signal and the signals stored in the database is based on the Euclidean distance between the two feature vectors: the smaller the distance, the better the match, so the minimum distance value corresponds to the best match. Fig. 2 to Fig. 4 show the spectra of the selected features for two spoken Arabic digits, One and Nine. For more details on these audio features and their application to audio analysis, one can refer to []-[7].

Fig. 2. Yule-Walker spectrum of the spoken digits: One and Nine.

Fig. 3. Samples of the Walsh spectrum of the Arabic spoken digits: One and Nine.
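The Euclidean-distance matching step can be sketched as follows. This is a minimal illustration that assumes fixed-length feature vectors stored under a digit label; in the paper, DTW is applied first so that utterances of different durations can be compared.

```python
import numpy as np

def best_matches(test_feat, database, k=5):
    """Rank the reference templates in `database` (label -> feature
    vector) by Euclidean distance to `test_feat`; the smallest
    distance is the best match."""
    dists = {label: np.linalg.norm(test_feat - ref)
             for label, ref in database.items()}
    return sorted(dists, key=dists.get)[:k]
```

Returning the five closest labels rather than only the single best match mirrors the paper's evaluation, which also scores whether the correct digit appears anywhere in the first five matches.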

Fig. 4. MFCC features of the Arabic spoken digits: One and Nine.

From Fig. 2 to Fig. 4, we can see that there is a clear difference between the features of the chosen Arabic spoken digits; in fact, the same observation holds for all Arabic spoken digits. In general, for the three selected features, the correlation between the features of different spoken digits is very low. On the other hand, even for the same spoken digit we noted variations in the features, as shown in Fig. 5.

Normalization and filtering of the sound feature vector are the parameters used for the comparison between the selected features. Normalization adjusts the feature level to the range from 0 to 1; we expect this to improve the comparison between features of the same spoken digit uttered at different volumes. The other parameter is filtering of the features in order to smooth the feature vectors. Fig. 6 shows the effect of these two parameters on the Walsh feature vector of the Arabic spoken digit Zero.

4. Experimental Setup

For each test sequence, every spoken sound is recognized independently. The performance of the selected techniques is evaluated on the recognition of Arabic spoken digits through 12 distinct experiments; each experiment concerns a specific feature with certain parameter settings, as shown in Table 1. For all experiments, we select the best five matches between the test signal and the signals stored in the database. The main stages of the comparison are shown in Fig. 7.
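The two feature-vector parameters can be sketched as follows: a min-max normalization to the range 0 to 1, and a moving-average filter for smoothing. The filter width is an illustrative choice, not a value given in the paper.

```python
import numpy as np

def normalize(feat):
    """Min-max scale a feature vector to the range 0..1 to reduce
    sensitivity to recording volume."""
    feat = np.asarray(feat, dtype=float)
    span = feat.max() - feat.min()
    return (feat - feat.min()) / span if span > 0 else np.zeros_like(feat)

def smooth(feat, width=5):
    """Moving-average filter that smooths the feature vector."""
    kernel = np.ones(width) / width
    return np.convolve(np.asarray(feat, dtype=float), kernel, mode='same')
```

Applying `smooth(normalize(feat))` corresponds to the "filtered and normalized feature" case shown for the Walsh features of the digit Zero.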
The dynamic time warping (DTW) step is a nonlinear process that expands or contracts the time axis to match the same landmark positions between the input speech signal and the reference signal in the database.

Table 1: Comparison experiments. Experiments 1 to 4 use Yule-Walker spectrum analysis, experiments 5 to 8 use Walsh spectrum analysis, and experiments 9 to 12 use MFCC analysis; within each group, the experiments differ in whether the feature vector is normalized and whether feature filtering is applied.

Fig. 5. Mean and variance of the Walsh features of the Arabic spoken digit Zero.

Fig. 6. Walsh features of the spoken digit Zero and the effect of normalization and filtering of the feature vector: (a) sound signal, (b) unnormalized feature, (c) normalized feature, and (d) filtered and normalized feature.
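The DTW step can be sketched as the classic dynamic-programming recurrence below. This is a textbook version for one-dimensional feature sequences with an absolute-difference local cost; the paper does not specify its exact local distance or path constraints, so those are assumptions here.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping: nonlinearly stretch or compress the time
    axis so that corresponding landmarks in the two feature sequences
    line up, and return the accumulated alignment cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])       # local distance (1-D features)
            D[i, j] = cost + min(D[i - 1, j],     # stretch reference
                                 D[i, j - 1],     # stretch input
                                 D[i - 1, j - 1]) # one-to-one step
    return D[n, m]
```

Because the alignment may stretch either axis, an utterance and a slower copy of the same utterance get a near-zero DTW cost even though their sample-by-sample Euclidean distance would be large.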

5. Results

In order to investigate the performance of the recognition approaches, the recognition of the Arabic spoken digits was evaluated for each experiment with three test sequences. The obtained results are summarized in Fig. 8, Fig. 9, and Table 2. Fig. 8 shows the best match of the three test sequences with each experiment. In general, it can be noted that the comparison based on MFCC features gives the best recognition results.

Another way to represent the recognition results is by calculating the percentage of the exact (best) match appearing within the first five matches. Fig. 9 shows the percentages of the correct match in the first five matches of the three test sequences with each experiment. Table 2 shows both the average score per experiment and the average score for each recognized digit. The results show that the spoken digit achieved the highest recognition rate (with accuracy equal to 8%), followed by the spoken digit (with accuracy equal to 76%). Again, MFCC analysis gives the best recognition results for the percentages of the first five correct recognition matches. Experiments 9 and 11, in the MFCC analysis without normalization of the feature vectors, can be considered the best approaches for the recognition of Arabic spoken digits (with accuracy equal to 87% in both cases). From the results shown in Table 2, we also remark that the recognition of the spoken digits 9 and 7 was the worst compared with the other spoken digits (with accuracy equal to 5% in both cases).

Fig. 7. Flowchart of the comparison tests: each segment of the input sound signal passes through feature calculation and DTW, is compared with the database, and the digit is recognized; the loop repeats until the last segment.

Fig. 8. Percentages of the best correct match of the three test sequences with each experiment.

Fig. 9. Percentages of the correct match in the first five matches of the three test sequences for each experiment.

6. Conclusions

In this paper, a comparison of three approaches for the recognition of Arabic spoken digits has been presented. As expected, it has been shown that recognition of Arabic spoken digits based on MFCC features outperforms recognition based on both Yule-Walker spectrum features and Walsh spectrum features. Further research will attempt to produce more comparisons based on other features and on larger databases with more than one speaker.
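The two scores used in the results, the best-match accuracy and the percentage of utterances whose correct digit appears anywhere in the first five matches, can be computed as in the sketch below. The `results` structure (ground-truth digit paired with a ranked match list) is a hypothetical representation, not one defined in the paper.

```python
def recognition_scores(results):
    """Compute two percentages from (truth, ranked_matches) pairs:
    how often the best (first) match is correct, and how often the
    correct digit appears within the first five matches."""
    n = len(results)
    best = sum(ranked[0] == truth for truth, ranked in results) / n * 100
    top5 = sum(truth in ranked[:5] for truth, ranked in results) / n * 100
    return best, top5
```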

Table 2: Recognition rate of Arabic spoken digits

Num.     0  1  2  3  4  5  6  7  8  9  Avg.
Exp. 1:  6 5 6 6
Exp. 2:  7 86 86 6 6 7
Exp. 3:  6 5 6 6
Exp. 4:  7 86 86 6 6 7
Exp. 5:  66 6 5 6 6
Exp. 6:  7 9 7 6 6 6 6 5 6 6
Exp. 7:  66 6 6 6 6
Exp. 8:  7 9 9 7 6 6 5 6 5 5 6
Exp. 9:  8 86 7 8 9 8 86 87
Exp. 10: 6 86 8 7 66 86 6 66 7 7
Exp. 11: 8 86 7 8 9 8 86 87
Exp. 12: 6 86 8 7 66 86 6 66 7 7
Avg.:    69 76 8 6 5 7 6 5 5 5

References

[1] S. Theodoridis and K. Koutroumbas, Pattern Recognition, 3rd ed. San Diego: Academic Press, 2006.
[2] J. Holmes and W. Holmes, Speech Synthesis and Recognition. London: Taylor & Francis.
[3] K. Saeed and M. Nammous, "A speech-and-speaker identification system: feature extraction, description, and classification of speech-signal image," IEEE Trans. on Industrial Electronics, vol. 5, no. , pp. 887-897, 7.
[4] Z. Hachkar, B. Mounir, A. Farchi, et al., "Comparison of MFCC and PLP parameterization in pattern recognition of Arabic alphabet speech," Canadian Journal on Artificial Intelligence, Machine Learning & Pattern Recognition, vol. , no. , pp. 56-6.
[5] M. Abushariah, R. Ainon, R. Zainuddin, et al., "Arabic speaker-independent continuous automatic speech recognition based on a phonetically rich and balanced speech corpus," The Int. Arab Journal of Information Technology, vol. 9, no. , pp. 8-9.
[6] M. Abdulfattah and R. El Awady, "Phonetic recognition of Arabic alphabet letters using neural networks," Int. Journal of Electric & Computer Sciences, vol. , no. , pp. 5-58.
[7] T. Ganchev, M. Siafarikas, and N. Fakotakis, "Evaluation of speech parameterization methods for speaker recognition," Proc. of the Acoustics, vol. 8-9, pp. 5, Sep. 6.

Ali Ganoun was born in Libya in 1966. He received the B.S. degree from the University of Benghazi in 1988 and the M.Sc. degree from the University of Tripoli in 1995, both in electrical engineering, and the Ph.D. degree from Orleans University, France, in 7.
He is currently a lecturer with the Electrical Engineering Department, Faculty of Engineering, University of Tripoli, Libya. His research interests include signal and image processing and computer vision.

Ibrahim Almerhag was born in Libya. He received his Ph.D. degree in computing and the MBA from Bradford University. He also holds the M.Sc. degree in electronics and computer engineering from the Technical University of Warsaw, received in 1995. He is currently an assistant professor with the Faculty of Information Technology, University of Tripoli, Libya. His research interests include networking, information security, and signal and image processing.