Speech Recognition with Indonesian Language for Controlling Electric Wheelchair

Daniel Christian Yunanto
Master of Information Technology, Sekolah Tinggi Teknik Surabaya, Surabaya, Indonesia
danielcy23411004@gmail.com

Endang Setyati
Master of Information Technology, Sekolah Tinggi Teknik Surabaya, Surabaya, Indonesia
endang@stts.edu

Abstract — The number of people with physical disabilities is increasing, and innovation is needed to provide tools that can help them. This study uses a speech recognition system to control an electric wheelchair, which is believed to help people with physical disabilities. Speech recognition in this study uses MFCC as the feature extraction method and HMM as the classification method. The system recognizes up to 17 words, and each recognized word is translated into a movement command for the electric wheelchair. The results of this study are very good; some words are recognized perfectly.

Keywords — speech recognition; electric wheelchair control; mel-frequency cepstral coefficients; hidden Markov model

I. INTRODUCTION

In this century, the number of people with physical disabilities is increasing. One such disability is the inability to stand up and move from one place to another independently. A wheelchair is a tool that helps people with physical disabilities move from one place to another. Over time, the wheelchair has continually been improved, from design changes that make it more comfortable to use up to the appearance of the electric wheelchair. Research on wheelchairs is rare today, so further improvement is needed. One problem that remains unsolved is disability of both hands and feet: to move elsewhere, these users still need help from others. A promising solution to this problem is controlling the electric wheelchair through spoken human commands.
Thus, driving the electric wheelchair only requires saying a command, which helps those who have physical disabilities of the hands and feet. Speech recognition is one of the most frequently studied topics in research today. Some equipment has already been developed that can be controlled by human speech [1]. Speech recognition keeps improving over time, and it is predicted that in the future machines will recognize human speech perfectly. There are two elements in speech recognition, namely feature extraction and classification. MFCC and HMM are methods commonly used in many speech recognition applications [2][3]. Speech recognition has been widely used to control things that move, such as robots. Paper [4] controls a legged robot using speech commands in English. Paper [5] controls a mobile robot using speech commands in Kannada. This paper uses speech recognition to control an electric wheelchair. The advantage of this research lies in the number of vocabulary items that can be recognized and converted into electric wheelchair movements. In addition, the language used here is Indonesian, on which there is little speech recognition research.

II. MEL-FREQUENCY CEPSTRAL COEFFICIENTS

Mel-Frequency Cepstral Coefficients (MFCC) is one of the most widely used feature extraction methods in speech recognition studies [6]. Therefore, this study uses MFCC as its feature extraction method. Figure 1 shows the "maju" speech signal and the result of feature extraction using MFCC.

(a) "Maju" speech signal. (b) After MFCC.
Fig. 1. "Maju" speech signal and the result of MFCC

Figure 2 is a block diagram of MFCC. The block diagram shows that MFCC has 7 stages to perform feature
extraction. All of the stages are explained in the following discussion.

Fig. 2. MFCC Block Diagram

A. Pre-Emphasis
Pre-emphasis is a filter used to preserve the high frequencies contained in a spectrum. A sound signal with pre-emphasis has a more balanced energy distribution across frequencies than a sound signal without pre-emphasis. Equation (1) is the general equation of the pre-emphasis filter, where y(n) is the signal after the pre-emphasis filter, x(n) is the initial signal, and α is a constant between 0.9 and 1.

y(n) = x(n) − α · x(n − 1)   (1)

B. Frame Blocking
Voice signals cannot be processed directly; they must first be divided into several short frames. The frame size must not be chosen arbitrarily: if the frame is too long, it has poor time resolution; conversely, if the frame is too short, it has poor frequency resolution. Frame lengths commonly used in speech recognition range from 10 to 30 milliseconds. When dividing the signal into multiple frames, frame n must overlap with frame n − 1. Overlapping is done to avoid losing voice characteristics at the boundary between frames. The overlap length ranges from 30% to 50% of the frame length.

C. Windowing
The frame blocking process in the previous step has a weakness that must be overcome: it can cause spectral leakage or aliasing. To overcome this weakness, the result of frame blocking must go through a windowing process. Equation (2) is the general equation applying a window function to the input sound signal, where x'(n) is the sample value after windowing, x(n) is the sample value of frame i, w(n) is the window function, n = 0, 1, ..., N − 1, and N is the frame size.

x'(n) = x(n) · w(n)   (2)

The window function commonly used in speech recognition is the Hamming window. Equation (3) is the general equation of the Hamming window, where n = 0, 1, ..., M − 1 and M is the frame length.

w(n) = 0.54 − 0.46 · cos(2πn / (M − 1))   (3)

D. Fast Fourier Transform (FFT)
The next step is to transform the signal from the time domain to the frequency domain.
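Under the assumption of a NumPy implementation, the pre-emphasis, frame blocking, and windowing stages (Equations (1)-(3)) can be sketched as follows. The function names, the 16 kHz sampling rate, the 25 ms frame length, and the 50% overlap are illustrative choices of ours, not values prescribed by this paper:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Equation (1): y(n) = x(n) - alpha * x(n-1), with 0.9 <= alpha <= 1."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_blocking(x, frame_len, overlap=0.5):
    """Split the signal into overlapping frames (overlap fraction in the 30-50% range)."""
    step = int(frame_len * (1.0 - overlap))
    n_frames = 1 + max(0, (len(x) - frame_len) // step)
    return np.stack([x[i * step : i * step + frame_len] for i in range(n_frames)])

def windowing(frames):
    """Equations (2)-(3): multiply each frame by a Hamming window."""
    M = frames.shape[1]
    n = np.arange(M)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (M - 1))
    return frames * w

# Illustrative input: 1 s of a 440 Hz tone at 16 kHz, 25 ms frames (400 samples)
fs = 16000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 440 * t)
frames = windowing(frame_blocking(pre_emphasis(signal), frame_len=400))
print(frames.shape)  # (79, 400): 79 frames of 400 samples each
```

With a 50% overlap the frame step is half the frame length, so a 1-second signal yields 1 + (16000 − 400) // 200 = 79 frames.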
The usual method to do this is Fourier analysis using the Discrete Fourier Transform (DFT) formula. Equation (4) is the general DFT equation, where X(k) is the DFT result (a complex number), x(n) is the value of the signal samples after windowing, N is the number of samples processed, and k is the discrete frequency variable.

X(k) = Σ_{n=0}^{N−1} x(n) · e^(−j2πkn/N),  k = 0, 1, ..., N − 1   (4)

Over time, the DFT formula came to be considered too slow for direct use in a programming algorithm. The DFT formula was eventually developed into the FFT (Fast Fourier Transform) algorithm, which eliminates the duplicate calculations in the DFT formula.

E. Mel Filterbank
The next step in MFCC after the Fourier calculation is the mel filterbank. In some papers this step is also called mel frequency warping. The object computed in this step is a filterbank, a set of filters used to determine the energy of a particular frequency band in a signal. Equation (5) is the general equation of a filterbank filter, where Y(i) is the calculated value of the filterbank, S(j) is the magnitude spectrum at frequency j, H_i(j) is the filter coefficient at frequency j, and M is the number of channels in the filterbank.

Y(i) = Σ_j S(j) · H_i(j),  i = 1, 2, ..., M   (5)

F. Discrete Cosine Transform (DCT)
This step is actually the last step of the feature extraction process in the MFCC method; the stage that follows it only improves the performance of the obtained signal features. The concept of the DCT is to decorrelate the mel spectrum so as to produce a good representation of the local spectral properties.
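The FFT, mel filterbank, and DCT stages (Equations (4)-(6)) can be sketched as below. The 26 filter channels and 13 cepstral coefficients are common choices we assume for illustration, not values given in this paper:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters H_i(j) spaced evenly on the mel scale (Equation (5))."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):          # rising edge of triangle i
            H[i - 1, j] = (j - left) / (center - left)
        for j in range(center, right):         # falling edge of triangle i
            H[i - 1, j] = (right - j) / (right - center)
    return H

def mfcc_from_frame(frame, fs, n_filters=26, n_ceps=13):
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame))                        # Equation (4) via FFT
    energies = mel_filterbank(n_filters, n_fft, fs) @ spectrum   # Equation (5)
    log_e = np.log(energies + 1e-10)
    # Equation (6): DCT-II of the log filterbank energies
    n = np.arange(n_filters)
    cepstrum = np.array([np.sum(log_e * np.cos(np.pi * k * (n + 0.5) / n_filters))
                         for k in range(n_ceps)])
    return cepstrum[1:]    # drop the zeroth coefficient, as the paper suggests

fs = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(400) / fs)  # one illustrative frame
print(mfcc_from_frame(frame, fs).shape)                # (12,)
```

Discarding the zeroth coefficient leaves 12 cepstral coefficients per frame in this configuration.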
The output of the DCT approximates PCA (Principal Component Analysis), where PCA is a frequently used method for data analysis and compression. Equation (6) is the general equation of the DCT, where Y(k) is the output of the mel filterbank, M is the number of filterbank channels, and C is the number of expected coefficients.

c(n) = Σ_{k=1}^{M} log(Y(k)) · cos(πn(k − 0.5) / M),  n = 0, 1, ..., C − 1   (6)

The zeroth coefficient of the DCT is usually discarded because, based on previous studies, it is not reliable for speech recognition. Although it contains the energy of the frame signal, the zeroth coefficient is considered unnecessary for speech recognition.

G. Cepstral Liftering
Feature extraction with the MFCC method actually ends at the DCT step. However, the result still has some disadvantages: the low-order cepstral coefficients are very sensitive to spectral slope, and the high-order coefficients are very sensitive to noise. Cepstral liftering is performed to minimize this sensitivity. With cepstral liftering, the resulting MFCC spectrum becomes smoother and works better for speech recognition. Equation (7) is the general equation of cepstral liftering, where L is the number of cepstral coefficients and n is the index of a cepstral coefficient.

w(n) = 1 + (L / 2) · sin(πn / L)   (7)

III. HIDDEN MARKOV MODEL

The Hidden Markov Model (HMM) is a data modeling method based on the Markov chain. This method uses probability theory as its foundation. In an HMM there are two types of states [7, 8, 9]:

Observable state: a state that contains the observation data.
Hidden state: a state that contains the things to be recognized or guessed.

An HMM model is expressed in three main parts, namely the initial-state vector π, the transition matrix A, and the observation matrix B. In general, the HMM is written in the following form:

λ = (π, A, B)   (8)

1) Vector π
This vector contains the initial state probabilities of the observation data. It is a vector of size N, where N is the number of states. The requirement on this vector is that its values must sum to 1.

2) Matrix A
This matrix contains the probabilities of state transitions.
It is an N×N matrix, where N is the number of states. The requirement on this matrix is that the values in each row must sum to 1.

3) Matrix B
This matrix contains the probability of each observation feature in each state. It is an N×M matrix, where N is the number of states and M is the number of features in the observation data. The requirement on this matrix is that the values in each row must sum to 1.

There are 3 basic problems that arise when using an HMM [7, 8, 9]:

Evaluation problem. This problem arises when determining, among several available HMM models, the model that best matches a sequence of observation data. It can be solved with the forward or backward algorithm and the Viterbi algorithm.

Decoding problem. This problem arises when determining the most suitable sequence of hidden states for a given pair of observation data and an available HMM model. It can be solved with the Viterbi algorithm.

Training problem. This problem arises when creating an appropriate HMM model from a set of observation sequences and a set of hidden state sequences. It can be solved with the Baum-Welch algorithm and the segmental K-means algorithm.

IV. SPEECH RECOGNITION USING MFCC AND HMM

This study involves 10 people: the speech of 7 people is used as training data and that of 3 people as testing data. The recognized speech inputs consist of 17 utterances, as described in Table 1. For voice recording, each word is pronounced 10 times per person, so there are 1,190 training samples and 510 testing samples. The movements listed in Table 1 are judged as definite movements: if the prototype's movement in response to an input differs from this list, the movement is immediately regarded as a wrong movement.

Table 1. List of Inputs and Outputs
No. | Input (Speech) | Output (Prototype Movement)
1. | Maju (Forward) | Forward
2. | Mundur (Backward) | Backward
3. | Kanan (Right) | Rotate right
4. | Kiri (Left) | Rotate left
5. | Stop (Stop) | Stop
6. | Maju kanan (Forward then right) | Move forward, then rotate right 90°, then move forward
7. | Maju kiri (Forward then left) | Move forward, then rotate left 90°, then move forward
8. | Serong kanan (Right oblique) | Rotate right 45°, then move forward
9. | Serong kiri (Left oblique) | Rotate left 45°, then move forward
10. | Belok kanan (Turn right) | Rotate right 90°, then move forward
11. | Belok kiri (Turn left) | Rotate left 90°, then move forward
12. | Lurus (Forward) | Forward
13. | Pelan (Slow) | Decrease speed
14. | Lambat (Slow) | Decrease speed
15. | Cepat (Fast) | Increase speed
16. | Berhenti (Stop) | Stop
17. | Putar balik (U-turn) | U-turn
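Assuming the MFCC features are quantized to discrete symbols, the per-word HMM classification from Section III can be sketched with the forward algorithm: each command gets its own model λ = (π, A, B), and the model with the highest likelihood determines the recognized word. The two toy 2-state models and the observation sequence below are ours for illustration, not trained parameters from this study:

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """Forward algorithm: P(obs | lambda) for a discrete-observation HMM."""
    alpha = pi * B[:, obs[0]]          # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction
    return alpha.sum()                 # termination

# Two toy models standing in for two of the 17 command models.
# Note each row of A and B sums to 1, and pi sums to 1, as required.
model_maju = (np.array([1.0, 0.0]),
              np.array([[0.7, 0.3], [0.0, 1.0]]),
              np.array([[0.9, 0.1], [0.2, 0.8]]))
model_stop = (np.array([0.5, 0.5]),
              np.array([[0.5, 0.5], [0.5, 0.5]]),
              np.array([[0.1, 0.9], [0.8, 0.2]]))

obs = [0, 0, 1]  # a short quantized feature sequence
models = {"maju": model_maju, "stop": model_stop}
scores = {word: forward_likelihood(obs, *m) for word, m in models.items()}
print(max(scores, key=scores.get))  # prints "maju"
```

In the paper's setup this comparison would run over all 17 stored HMM sets rather than two, and the winning index would be mapped to the movement listed in Table 1.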
A. Data Training
The recorded sound signals used for training are processed with MFCC to obtain their features. The features obtained from MFCC are called the cepstrum. The cepstra obtained from MFCC are trained with HMM, producing a vector π, a matrix A, and a matrix B for each input type. Since the system accepts 17 input words, training produces 17 HMM sets (one set consisting of a vector π, a matrix A, and a matrix B). All HMM sets are then stored in the database. Figure 3 is a block diagram explaining the training data flow.

Fig. 3. Training data block diagram

B. Data Testing
The sound signal to be tested is processed with MFCC, producing a cepstrum. The cepstrum is then processed by HMM: the system searches for the greatest probability among the likelihoods of the cepstrum under the 17 HMM sets obtained during training. The index of the HMM set that yields the greatest probability for the incoming cepstrum is taken as the class of the tested signal. Figure 4 is a block diagram describing data testing.

V. EXPERIMENTAL RESULT

The experiment was performed using recorded speech. The speech of 7 people was trained with the proposed algorithm to generate the HMM model of each word. Then, all trained voices were tested to measure the capability of the system to recognize the words. This study also tested voices that had never been trained before; these were previously defined as the test voices. Table 2 reports the recognition rate of the system for each word.

Table 2. Percentage of System Success in Recognizing Each Word
No. | Word | Trained Voice | Non-Trained Voice
1. | Maju | 100% | 85%
2. | Mundur | 100% | 85%
3. | Kanan | 98% | 84%
4. | Kiri | 98% | 84%
5. | Stop | 100% | 90%
6. | Maju kanan | 98% | 83%
7. | Maju kiri | 98% | 83%
8. | Serong kanan | 97% | 83%
9. | Serong kiri | 97% | 83%
10. | Belok kanan | 97% | 84%
11. | Belok kiri | 97% | 84%
12. | Lurus | 98% | 85%
13. | Pelan | 100% | 85%
14. | Lambat | 100% | 85%
15. | Cepat | 100% | 85%
16. | Berhenti | 95% | 83%
17. | Putar balik | 100% | 84%

Table 2 shows that the system recognizes each word from trained voices very well. However, when the system is tested with different voices, the recognition rate decreases. From this, the system can be applied very well to an electric wheelchair when the user's voice is trained first.

Fig. 4. Testing data block diagram

VI. CONCLUSION

Speech recognition is a topic that is currently being developed continuously. The use of MFCC as a feature extractor and HMM as a classifier has been widely applied in speech recognition studies. This research gives a new nuance to the field of speech recognition, in which speech recognition is used to control mobile devices, one of which is the electric wheelchair. In subsequent studies, it is hoped that speech recognition in other languages will be developed to control electric wheelchairs, so that this system can be used globally and help people with physical weaknesses. The result of this study is that the system recognizes each word from trained voices very well; however, when the system is tested with different voices, the recognition rate decreases. Our future work will focus on implementing this system on an electric wheelchair. The first task is to design the recording system, which must automatically record the user's voice. After that, we must design the controller system based on the word produced by the recognition system.
REFERENCES
[1] T. Phanprasit, "Controlling Robot using Thai Speech Recognition Based on Eigen Sound," International Conference on Knowledge and Smart Technology, 2014, pp. 57-62.
[2] J. H. Im and S. Y. Lee, "Unified Training of Feature Extractor and HMM Classifier for Speech Recognition," IEEE Signal Processing Letters, 2012, pp. 111-114.
[3] C. T. Do, D. Pastor and A. Goalic, "On the Recognition of Cochlear Implant-Like Spectrally Reduced Speech with MFCC and HMM-Based ASR," IEEE Transactions on Audio, Speech, and Language Processing, 2010, pp. 1065-1068.
[4] D. D. Phal, D. K. D. Phal and P. S. Jacob, "Design, Implementation and Reliability Estimation of Speech-controlled Mobile Robot," International Conference on Emerging Technology Trends in Electronics, Communication and Networking, 2014, pp. 1-6.
[5] H. Muralikrishna, T. Ananthakrishna and K. Shama, "HMM Based Isolated Kannada Digit Recognition System using MFCC," International Conference on Advances in Computing, Communications, and Informatics, 2013, pp. 730-733.
[6] P. A. Shigli, D. K. S. Rao and I. Patel, "A Spectral Feature Process for Speech Recognition Using HMM with MFCC Approach," National Conference on Computing and Communication Systems, 2012.
[7] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, February 1989, pp. 257-286.
[8] E. Setyati, "Talking Head System in Indonesian Language with Affective Facial Expressions Synthesis," Dissertation, Institut Teknologi Sepuluh November, Surabaya, 2017, pp. 1-172.
[9] E. Setyati, J. Santoso, S. Sumpeno and M. H. Purnomo, "Hidden Markov Models based Indonesian Viseme Model for Natural Speech with Affection," Kursor, Scientific Journal on Information Technology, University of Trunojoyo Madura, vol. 8, no. 3, July 2016, pp. 215-222.