Speech Recognition with Indonesian Language for Controlling Electric Wheelchair


Speech Recognition with Indonesian Language for Controlling Electric Wheelchair

Daniel Christian Yunanto
Master of Information Technology, Sekolah Tinggi Teknik Surabaya, Surabaya, Indonesia
danielcy23411004@gmail.com

Endang Setyati
Master of Information Technology, Sekolah Tinggi Teknik Surabaya, Surabaya, Indonesia
endang@stts.edu

Abstract - The number of people with physical disabilities is increasing, and innovation is needed to provide tools that can help them. This study uses a speech recognition system to control an electric wheelchair, which is believed to help people with physical disabilities. Speech recognition in this study uses MFCC as the feature extraction method and HMM as the classification method. The system recognizes up to 17 words, and each recognized word is processed into an electric wheelchair command. The results of this study are very good; some words can be recognized perfectly.

Keywords - speech recognition; control electric wheelchair; mel-frequency cepstral coefficients; hidden markov model

I. INTRODUCTION

In this century, the number of people with physical disabilities is increasing. One such disability is the inability to stand up and move from one place to another independently. A wheelchair is a tool that helps people with physical disabilities move from one place to another. Over time, the wheelchair has continually been improved, from design changes for greater comfort to the appearance of the electric wheelchair. Research on wheelchairs is rare today, so further improvement is needed. One problem that remains unsolved is disability of both the hands and the feet: to move elsewhere, these users still need help from others. A solution to this problem is to control the electric wheelchair with human commands given through speech.
Thus, driving the electric wheelchair only requires saying a command, which can help those who have physical disabilities of the hands and feet. Speech recognition is one of the most frequent topics in research today. Some equipment has already been developed that can be controlled by human speech [1]. Speech recognition keeps improving over time, and it is predicted that in the future machines will be able to recognize human speech perfectly. There are two elements in speech recognition, namely feature extraction and classification. MFCC and HMM are methods commonly used in many speech recognition applications [2] [3]. Speech recognition has been widely used to control things that move, such as robots. Paper [4] controls a legged robot using speech commands in English, and paper [5] recognizes spoken digits in Kannada. This paper uses speech recognition to control an electric wheelchair. The advantage of this research lies in the number of vocabulary words that can be recognized and converted into electric wheelchair movements. In addition, the language used here is Indonesian, on which there is little speech recognition research.

II. MEL-FREQUENCY CEPSTRAL COEFFICIENTS

Mel-Frequency Cepstral Coefficients (MFCC) is one of the most widely used feature extraction methods in speech recognition studies [6]. Therefore, this study uses MFCC as its feature extraction method. Figure 1 shows a maju speech signal and the result of feature extraction using MFCC.

(a) Maju speech signal (b) After MFCC
Fig. 1. Maju speech signal and the result of MFCC

Figure 2 is a block diagram of MFCC. The block diagram shows that MFCC has 7 stages for performing feature

extraction. All of the stages are explained in the following discussion.

Fig. 2. MFCC Block Diagram

A. Pre-Emphasis
Pre-emphasis is a filter used to preserve the high frequencies contained in a spectrum. A sound signal with pre-emphasis has a more balanced energy distribution across frequencies than a sound signal without pre-emphasis. Equation (1) is the general equation of the pre-emphasis filter, where y(n) is the signal after the pre-emphasis filter, x(n) is the initial signal, and a is a constant between 0.9 and 1.

y(n) = x(n) - a * x(n-1)    (1)

B. Frame Blocking
Voice signals cannot be processed directly; they must first be divided into several short frames. The frame size must not be chosen arbitrarily: if the frame is too long, it has poor time resolution; conversely, if the frame is too short, it has poor frequency resolution. Frame lengths commonly used for speech recognition range from 10 to 30 milliseconds. When dividing the signal into multiple frames, frame n must overlap with frame n-1. Overlapping is done to avoid losing voice characteristics at the boundary between frames. The overlap length ranges from 30% to 50%.

C. Windowing
The frame blocking process in the previous step has a weakness that must be overcome: it can cause spectral leakage or aliasing. To overcome this weakness, the result of the frame blocking process must go through a windowing process. Equation (2) is the general equation for applying the window function to the input sound signal, where x'(n) is the sample value after windowing, x(n) is the sample value of frame i, w(n) is the window function, n = 0, 1, ..., N-1, and N is the frame size.

x'(n) = x(n) * w(n)    (2)

The window function commonly used in speech recognition is the Hamming window. Equation (3) is the general equation of the Hamming window, where n = 0, 1, ..., M-1 and M is the frame length.

w(n) = 0.54 - 0.46 * cos(2*pi*n / (M-1))    (3)

D. Fast Fourier Transform (FFT)
The next required step is to transform the signal from the time domain to the frequency domain.
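As a concrete illustration, the pre-emphasis, frame blocking, and windowing steps (A-C above) can be sketched in Python with NumPy. The parameter values here (a = 0.97, 25 ms frames, 50% overlap) are illustrative choices within the ranges stated above, not values taken from the paper:

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    # Pre-emphasis filter: y(n) = x(n) - alpha * x(n-1), alpha between 0.9 and 1.
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, sample_rate, frame_ms=25, overlap=0.5):
    # Split the signal into overlapping frames, then apply a Hamming window.
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples at 16 kHz
    step = int(frame_len * (1 - overlap))            # 50% overlap -> 200-sample hop
    n_frames = 1 + (len(x) - frame_len) // step
    window = np.hamming(frame_len)                   # 0.54 - 0.46*cos(2*pi*n/(M-1))
    return np.stack([x[i * step : i * step + frame_len] * window
                     for i in range(n_frames)])

# Toy usage on one second of a synthetic 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)
frames = frame_and_window(preemphasis(x), sr)
print(frames.shape)  # (79, 400)
```

The windowed frames are then passed to the FFT stage described next; in a full front end the mel filterbank, DCT, and liftering steps would follow.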
The usual method for this is Fourier analysis using the Discrete Fourier Transform (DFT) formula. Equation (4) is the general DFT equation, where X(k) is the DFT result (a complex number), x(n) is the value of the signal samples after windowing, N is the number of samples processed, and k is the discrete frequency variable.

X(k) = sum_{n=0}^{N-1} x(n) * e^(-j*2*pi*k*n/N),  k = 0, 1, ..., N-1    (4)

Over time, the DFT formula came to be considered too slow to use directly in a programming algorithm. The DFT formula was eventually developed into the FFT (Fast Fourier Transform) algorithm, which eliminates the duplicated calculations in the DFT formula.

E. Mel Filterbank
The next step in MFCC after the Fourier calculation is the mel filterbank. In some papers, this step is also called mel frequency wrapping. In this step, a filterbank is applied, where a filterbank is a set of filters used to determine the energy of a particular frequency band in a signal. Equation (5) is the general equation of the filterbank's filters, where Y(i) is the calculated value of filterbank channel i, S(j) is the magnitude spectrum at frequency j, H_i(j) is the filter coefficient at frequency j, and M is the number of channels in the filterbank.

Y(i) = sum_j S(j) * H_i(j),  i = 1, 2, ..., M    (5)

F. Discrete Cosine Transform (DCT)
This step is actually the last step of the feature extraction process in the MFCC method; the step after this DCT stage only improves the performance of the features obtained from the signal. The concept of DCT is to decorrelate the mel spectrum so as to produce a good representation of the local spectral properties. The results of DCT approach those of PCA (Principal Component Analysis), where PCA is a frequently used method for data analysis and compression. Equation (6) is the general equation of DCT, where Y(k) is the output of the mel filterbank, K is the number of filterbank channels, and C is the number of expected coefficients.

c(n) = sum_{k=1}^{K} log(Y(k)) * cos(pi*n*(k - 0.5)/K),  n = 0, 1, ..., C-1    (6)

The zeroth coefficient of DCT is usually eliminated, because studies have shown it is not reliable for speech recognition. Although it contains the energy of the frame signal, the zeroth coefficient is considered unnecessary in speech recognition.

G. Cepstral Liftering
Feature extraction with the MFCC method actually ends at the DCT step. However, the result still has some disadvantages: the low-order cepstral coefficients are very sensitive to spectral slope, and the high-order coefficients are very sensitive to noise. This step, cepstral liftering, is done to minimize that sensitivity. With cepstral liftering, the resulting MFCC spectrum becomes smoother and can be used better for speech recognition. Equation (7) is the general equation of the cepstral lifter, where L is the number of cepstral coefficients and n is the index of the cepstral coefficient.

w(n) = 1 + (L/2) * sin(pi*n/L)    (7)

III. HIDDEN MARKOV MODEL

The Hidden Markov Model (HMM) is a data modeling method with the Markov chain as its basis. This method uses probability theory as the basis of its knowledge. In HMM, there are two types of states [7, 8, 9]:
- Observable state: a state that contains the observation data.
- Hidden state: a state that contains the things to be recognized or guessed.

The HMM model is expressed in three main parts, namely the vector pi, the transition matrix A, and the observation matrix B. In general, the HMM equation is written in the following form:

lambda = (A, B, pi)    (8)

1) Vector pi
This vector contains the initial state probabilities of the observational data. It is a matrix of size N, where N is the number of states. The requirement for this vector is that its values must sum to 1.

2) Matrix A
This matrix contains the probability values of state changes.
This matrix is NxN, where N is the number of states. The requirement for this matrix is that the values in each row must sum to 1.

3) Matrix B
This matrix contains the probability values of an observation feature occurring in a state. It is NxM, where N is the number of states and M is the number of features in the observation data. The requirement for this matrix is that the values in each row must sum to 1.

There are 3 basic problems that arise when using HMM [7, 8, 9]:
- Evaluation Problem. This problem arises when determining which of several available HMM models fits an observed sequence of data. It can be solved with the forward or backward algorithm and the Viterbi algorithm.
- Decoding Problem. This problem arises when determining the most suitable sequence of hidden states for a given pair of observation data and an already-available HMM model. It can be solved with the Viterbi algorithm.
- Training Problem. This problem arises when creating an appropriate HMM model from a set of observation sequences and a set of hidden state sequences. It can be solved with the Baum-Welch algorithm and the segmental K-Means algorithm.

IV. SPEECH RECOGNITION USING MFCC AND HMM

This study involves 10 people: the speech of 7 people is used as training data and that of 3 people as testing data. The system recognizes 17 utterances, as described in Table 1. For voice recording, each word is pronounced 10 times per person, so there are 1,190 training recordings and 510 testing recordings. The movements listed in Table 1 are judged as definite movements: if the prototype's movement in response to an input differs from this list, the movement is immediately regarded as wrong.

Table 1. List of Inputs and Outputs
No.  Input (Speech)                    Output (Prototype Movement)
1.   Maju (Forward)                    Forward
2.   Mundur (Backward)                 Backward
3.   Kanan (Right)                     Rotate right
4.   Kiri (Left)                       Rotate left
5.   Stop (Stop)                       Stop
6.   Maju kanan (Forward then right)   Move forward, then rotate right 90 degrees, then move forward
7.   Maju kiri (Forward then left)     Move forward, then rotate left 90 degrees, then move forward
8.   Serong kanan (Right oblique)      Rotate right 45 degrees, then move forward
9.   Serong kiri (Left oblique)        Rotate left 45 degrees, then move forward
10.  Belok kanan (Turn right)          Rotate right 90 degrees, then move forward
11.  Belok kiri (Turn left)            Rotate left 90 degrees, then move forward
12.  Lurus (Forward)                   Forward
13.  Pelan (Slow)                      Decrease speed
14.  Lambat (Slow)                     Decrease speed
15.  Cepat (Fast)                      Increase speed
16.  Berhenti (Stop)                   Stop
17.  Putar balik (U-Turn)              U-turn
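To make the lambda = (A, B, pi) form and the evaluation problem from Section III concrete, here is a minimal sketch with a toy two-state HMM; all probability values are illustrative, not trained parameters from this study:

```python
import numpy as np

# A toy two-state HMM in the lambda = (A, B, pi) form of Section III.
pi = np.array([1.0, 0.0])                # initial state vector, sums to 1
A  = np.array([[0.7, 0.3],               # transition matrix, each row sums to 1
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],               # observation matrix, each row sums to 1
               [0.2, 0.8]])
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)

def forward_likelihood(obs, pi, A, B):
    # Evaluation problem: P(observation sequence | model), forward algorithm.
    alpha = pi * B[:, obs[0]]            # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate, then weight by the observation
    return alpha.sum()

print(forward_likelihood([0, 1, 0], pi, A, B))  # 0.14715
```

In the real system each word's model is trained from cepstra (the Training Problem, solved with Baum-Welch); the forward pass above only illustrates the Evaluation Problem.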

A. Data Training
The recorded sound signals used for training are processed with MFCC to obtain their features. The features obtained from MFCC are called the cepstrum. The cepstra obtained from MFCC are trained by HMM, producing a vector pi, a matrix A, and a matrix B for each input type. Since the number of inputs accepted by the system is 17 words, training produces 17 HMM sets (one set consisting of a vector pi, a matrix A, and a matrix B). All HMM sets are then stored in the database. Figure 3 is a block diagram explaining the training data flow.

Fig 3. Training data block diagram

B. Data Testing
The sound signal to be tested is processed with MFCC, producing a cepstrum. The cepstrum is then processed by HMM: the system computes the probability of the cepstrum under each of the 17 HMM sets obtained during training and searches for the greatest one. The index of the HMM set that produces the greatest probability for the incoming cepstrum is considered the class of the tested signal. Figure 4 is a block diagram describing data testing.

Fig 4. Testing data block diagram

V. EXPERIMENTAL RESULT

The experiment was performed using recorded speech. The speech of 7 people was trained using the proposed algorithm to generate the HMM model of each word. Then, all trained voices were tested to measure the system's capability to recognize the words. This study also tested voices that had never been trained before; these were previously defined as the test voices. Table 2 presents the percentage of each word the system recognized correctly.

Table 2. Percentage of System Success in Recognizing Each Word
No.  Word          Trained Voice   Non-Trained Voice
1.   Maju          100%            85%
2.   Mundur        100%            85%
3.   Kanan         98%             84%
4.   Kiri          98%             84%
5.   Stop          100%            90%
6.   Maju kanan    98%             83%
7.   Maju kiri     98%             83%
8.   Serong kanan  97%             83%
9.   Serong kiri   97%             83%
10.  Belok kanan   97%             84%
11.  Belok kiri    97%             84%
12.  Lurus         98%             85%
13.  Pelan         100%            85%
14.  Lambat        100%            85%
15.  Cepat         100%            85%
16.  Berhenti      95%             83%
17.  Putar balik   100%            84%

Table 2 shows that the system recognizes each word from trained voices very well. However, when the system was tested with different voices, the percentages decreased. From this fact, the system can be implemented very well in an electric wheelchair when the user's voice is trained first.

VI. CONCLUSION

Speech recognition is one of the topics currently being developed continuously. The use of MFCC as a feature extractor and HMM as a classifier has been widely applied in various studies on speech recognition. This research gives a new nuance to the field of speech recognition, showing that speech recognition can be used to control mobile devices, one of them being an electric wheelchair. In subsequent studies, it is hoped that speech recognition in other languages will be developed for controlling electric wheelchairs, so that this system can be used globally and can help people who have body weakness. The result of this study is that the system can recognize each word from a trained voice very well; however, when tested with a different voice, the percentage decreased. Our future work will focus on implementing this system in an electric wheelchair. The first thing that must be done is to design the recording system, which must automatically record the user's voice. After that, we must design the controller system based on the word output by the recognition system.
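The recognition step described in Section IV-B, choosing among the trained HMM sets the one with the greatest probability for an incoming cepstrum, can be sketched as follows. The two toy word models and the discrete observation sequence are hypothetical placeholders for the 17 trained sets and the real MFCC cepstra:

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    # P(obs | lambda) via the forward algorithm.
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Two toy word models standing in for the 17 trained HMM sets.
# All values are illustrative placeholders, not the paper's trained parameters.
models = {
    "maju": (np.array([1.0, 0.0]),
             np.array([[0.8, 0.2], [0.3, 0.7]]),
             np.array([[0.9, 0.1], [0.1, 0.9]])),
    "stop": (np.array([0.5, 0.5]),
             np.array([[0.5, 0.5], [0.5, 0.5]]),
             np.array([[0.2, 0.8], [0.8, 0.2]])),
}

def recognize(obs, models):
    # The class of the tested signal is the model with the greatest probability.
    return max(models, key=lambda w: forward_likelihood(obs, *models[w]))

print(recognize([0, 0, 1], models))  # prints: maju
```

The recognized word would then be mapped to a prototype movement through the input-output list in Table 1.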

REFERENCES
[1] T. Phanprasit, "Controlling Robot using Thai Speech Recognition Based on Eigen Sound," International Conference on Knowledge and Smart Technology, 2014, pp. 57-62.
[2] J. H. Im and S. Y. Lee, "Unified Training of Feature Extractor and HMM Classifier for Speech Recognition," IEEE Signal Processing Letters, 2012, pp. 111-114.
[3] C. T. Do, D. Pastor and A. Goalic, "On the Recognition of Cochlear Implant-Like Spectrally Reduced Speech with MFCC and HMM-Based ASR," IEEE Transactions on Audio, Speech, and Language Processing, 2010, pp. 1065-1068.
[4] D. D. Phal, D. K. D. Phal and P. S. Jacob, "Design, Implementation and Reliability Estimation of Speech-controlled Mobile Robot," International Conference on Emerging Technology Trends in Electronics, Communication and Networking, 2014, pp. 1-6.
[5] H. Muralikrishna, T. Ananthakrishna and Kumara Shama, "HMM Based Isolated Kannada Digit Recognition System using MFCC," International Conference on Advances in Computing, Communications, and Informatics, 2013, pp. 730-733.
[6] P. A. Shigli, D. K. S. Rao and I. Patel, "A Spectral Feature Process for Speech Recognition Using HMM with MFCC Approach," National Conference on Computing and Communication Systems, 2012.
[7] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, February 1989, pp. 257-286.
[8] E. Setyati, "Talking Head System in Indonesian Language with Affective Facial Expressions Synthesis," Surabaya: Institut Teknologi Sepuluh Nopember, Dissertation, 2017, pp. 1-172.
[9] E. Setyati, J. Santoso, S. Sumpeno and M. H. Purnomo, "Hidden Markov Models based Indonesian Viseme Model for Natural Speech with Affection," Kursor, Scientific Journal on Information Technology, University of Trunojoyo Madura, vol. 8, no. 3, July 2016, pp. 215-222.