A Novel Fuzzy Approach to Speech Recognition


Ramin Halavati, Saeed Bagheri Shouraki, Mahsa Eshraghi, Milad Alemzadeh, Pujan Ziaie
Computer Engineering Department, Sharif University of Technology, Tehran, Iran
{halavati, sbagheri, eshraghi, alemzadeh, pujan}@ce.sharif.edu

Abstract

This paper presents a novel approach to speech recognition using fuzzy modeling. The task begins with conversion of the speech spectrogram into a linguistic description based on arbitrary colors and lengths. Phonemes are also described using these fuzzy measures, recognition is done by ordinary fuzzy reasoning, and a genetic algorithm optimizes the phoneme definitions so that samples are classified into the correct phonemes. The method is tested on a standard speech database and the results are presented.

1. Introduction

Recognition of human speech is a problem with many solutions, but it remains open because none of the current methods is fast and precise enough to be comparable with a human recognizer. Several methods exist for the recognition of human phonemes, such as Hidden Markov Models [1], Time Delay Neural Networks [2], Support Vector Classifiers with HMMs [4], Independent Component Analysis [6], HMM and neural-network hybrids [8], and more. Although all of these methods have scored well enough to be accepted as suitable approaches, they suffer to varying degrees from problems such as heavy processing requirements or low immunity to noise, and these weaknesses keep the road open for new approaches.

Zadeh proposes computation with words, instead of precise numbers, as a new paradigm for cognitive problems [9]. He insists that more precise computations do not necessarily yield more precise results in cognitive tasks and may even yield poorer answers. To reduce the computational effort while keeping the recognition precision, this paper presents a new approach to phoneme identification using fuzzy computations.
To do so, the spectrograms of the training speech samples are converted into a fuzzy representation using some arbitrary colors and lengths, and all computations take place on these linguistic descriptions. A standard genetic algorithm trains the fuzzy definitions to find the best description for each phoneme, and recognition is done by a very fast fuzzy reasoning step. The rest of this paper is organized as follows: the next section describes the fuzzy representation of speech data; section three deals with the recognition method; the training approach is described in section four; section five presents the experimental results; and the last section draws conclusions and outlines future work.

2. Linguistic Spectrogram Representation

A spectrogram is a 2-D image in which the vertical axis represents frequency and the horizontal axis represents time; the brightness of each point shows the amplitude of a certain frequency at a certain time. A sample is presented in Figure 1. The first step of both the recognition and training processes is conversion of the spectrogram of the speech signal into a fuzzy description. The fuzzification approach is based on four major ideas. First, a human recognizer does not read the spectrogram with full precision and pays attention only to local features. Second, a human does not decide based on precise speech amplitudes; a rough measure is sufficient. Third, we do not count speech frames, but use relative lengths such as long or short. Fourth, we are more sensitive to lower frequencies than to higher ones. Based on these ideas, the frequency axis is separated into 25 ranges according to MEL filter banks. The MEL filter bank frequencies are selected so that the ranges are narrower at lower frequencies and wider at higher ones. Figure 2 shows the spectrogram separated by horizontal lines based on the MEL bands and by vertical lines based on phoneme positions.
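The 25 MEL-based ranges can be sketched as follows. The paper does not give the exact filter-bank parameters, so this illustration assumes the common mel-scale formula and an 8 kHz upper edge; both are assumptions of this sketch, not values from the text:

```python
import math

def mel_band_edges(num_bands=25, f_min=0.0, f_max=8000.0):
    """Split [f_min, f_max] Hz into bands equally spaced on the mel scale,
    so that bands are narrower at low frequencies and wider at high ones."""
    def hz_to_mel(f):
        return 2595.0 * math.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    # num_bands + 1 edges, linearly spaced in mel, mapped back to Hz
    mels = [m_min + i * (m_max - m_min) / num_bands for i in range(num_bands + 1)]
    return [mel_to_hz(m) for m in mels]

edges = mel_band_edges()
# Each consecutive pair of edges bounds one of the 25 frequency ranges.
```

Because the edges are equally spaced in mel but mapped back through an exponential, the low-frequency ranges come out narrower than the high-frequency ones, matching the fourth idea above.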

Then, to make the local data reduction, each sample column in each MEL band is considered as one block and represented by one value, which is the average of the 10% highest-amplitude points in that block. Figure 3 shows the result of this stage. Note that the new image has only 25 values on each vertical line (the frequency axis), but the time axis is not altered.

As the next step of fuzzification, a gradient of colors is used to represent the different ranges of amplitudes, starting from black for the lowest amplitude, then blue, magenta, red, yellow, green, cyan, and finally fading to white for the highest value. To describe these eight colors, eight fuzzy sets are used, each defined by a trapezoid as in Figure 4. Also, as shown in Figures 2 and 3, different phonemes have different lengths. To express these differences, five fuzzy lengths (Very Short, Short, Average, Long and Very Long) are defined; trapezoidal shapes are used for their definitions, as for the colors.

Now, using the above conversions and definitions, each phoneme can be described by an expression stating its length and its probable colors for each band. For example, one can express a phoneme as stated in Table 1. Note that a disjunction of different colors can be used in expressing each band's color. Based on this type of phoneme expression, a phoneme recognizer must hold the definitions of the fuzzy colors (one trapezoid for each of the 8 colors), the definitions of the fuzzy lengths, and descriptions of all phonemes in terms of the previously stated colors and lengths, comprising a length value for each phoneme and color values for each frequency band of every phoneme.

Table 1, Sample phoneme definition
Phoneme:   /SH/
Range 1:   Black or Blue
Range 2:   Blue
Range 3:   Red or Yellow
...
Range 25:  Purple or Blue
Length:    Average

Figure 1, Spectrogram of a voice signal: the vertical axis represents frequency and the horizontal axis represents time. The color of each point shows the amplitude of a specific frequency at a specific time.
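The local data reduction above can be sketched as follows. The 10% figure and the one-value-per-block reduction follow the text; the input layout (one list of amplitude points per MEL band for a single spectrogram column) is an assumption of this illustration:

```python
def block_value(amplitudes, top_fraction=0.10):
    """Represent one block (the points of a single spectrogram column that
    fall inside one MEL band) by the average of its top-fraction highest
    amplitudes, as described in the text."""
    k = max(1, int(len(amplitudes) * top_fraction))
    top = sorted(amplitudes, reverse=True)[:k]
    return sum(top) / k

def fuzzify_column(column_by_band):
    """Reduce one time frame, given as 25 per-band amplitude lists,
    to the 25 values of the fuzzified spectrogram."""
    return [block_value(band) for band in column_by_band]
```

For small blocks the 10% rounds down to a single point, so the block value degenerates to the block's maximum amplitude, which is consistent with the intent of keeping only the strongest local energy.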

Figure 2, Spectrogram separated by MEL filter bank horizontal lines and phoneme-separator vertical lines.

Figure 3, Fuzzified spectrogram of Figure 2.

Figure 4, Color fuzzy sets (Black, Blue, ..., White): the horizontal axis is the amplitude, ranging from 0 to 100, and the vertical axis is the degree of compatibility with each set.
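The trapezoidal sets of Figure 4 can be sketched generically as below. The membership function itself is the standard trapezoid; the breakpoints given for the two example sets are illustrative placeholders on the 0-100 amplitude axis, not the paper's trained values:

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 at or below a, rising linearly to 1 on
    [b, c], falling linearly back to 0 at d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Hypothetical breakpoints for two of the eight color sets (illustrative only).
BLACK = (-1.0, 0.0, 20.0, 35.0)   # full membership over [0, 20]
BLUE = (20.0, 35.0, 45.0, 55.0)   # full membership over [35, 45]

membership = trapezoid(10.0, *BLACK)  # -> 1.0: amplitude 10 is fully "Black"
```

The same function serves for the five fuzzy lengths, with frame counts on the horizontal axis instead of amplitudes.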

Figure 5, Sample belongness computation for one frame (Recognition Step 1): the value of each of the 25 bands is matched against that band's color definition, and the minimum over all bands gives the frame's final value.

Figure 6, Sample computation of Step 3 of recognition (matching length 4, matching value 82.25, phoneme length "average" with 50% membership).

3. The Recognition Process

Assuming that a suitable set of fuzzy colors, fuzzy lengths and phoneme descriptors exists, this section presents the recognition approach, which classifies given phoneme samples. Each input sample may have an arbitrary number of frames, but each frame has exactly 25 values, computed as stated in the previous section. To recognize, a degree of belongness to each of the phonemes is computed for the given input, and the input is classified into the phoneme with the highest belongness value. This task is done in three steps.

3.1. Step One

The first step is to compute how much each frame of the given input belongs to the specified pattern of colors. For example, if the input data has 3 frames, the belongness of each frame to the current phoneme's colors is computed independently of the other frames. To compute this belongness, the value of each frequency band is compared with the described colors of that band and the corresponding membership values are computed; the minimum of all these values is then taken as the frame's total belongness to that color pattern. Figure 5 shows a sample of this computation.

3.2. Step Two

After the belongness measures of all frames have been computed, a small refinement filter is applied over the resulting values of neighboring frames. This filter increases low values that are surrounded by high values and decreases high values whose neighbors are all low. This is done to reduce the effect of noise and of dissimilarities in the identified patterns.

3.3. Step Three

The input of the last step is the refined belongness measure of each frame with respect to the specified pattern of colors. The longest sequence of frames whose belongnesses are above a certain threshold is found; its length is called the matching length, and the average of its values the matching value. The matching length is then compared with the fuzzy set specifying the phoneme's length, and the resulting membership value is multiplied by the matching value.
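As a sketch, steps one and three can be written as follows. The refinement filter of step two and the actual color and length definitions are omitted; the per-color membership functions are passed in as plain callables (the paper's trained trapezoids would plug in here), and the threshold value is an assumption of this illustration:

```python
def frame_belongness(frame, phoneme_colors, color_sets):
    """Step 1: for each of the bands, take the best match (max) among the
    colors allowed for that band, then take the minimum over all bands."""
    per_band = []
    for value, allowed in zip(frame, phoneme_colors):
        per_band.append(max(color_sets[c](value) for c in allowed))
    return min(per_band)

def match(belongness, threshold, length_membership):
    """Step 3: locate the longest run of frames whose belongness exceeds the
    threshold (the matching length), average that run (the matching value),
    and weight it by the fuzzy membership of the run's length."""
    best_start, best_len = 0, 0
    cur_start, cur_len = 0, 0
    for i, v in enumerate(belongness):
        if v > threshold:
            if cur_len == 0:
                cur_start = i
            cur_len += 1
            if cur_len > best_len:
                best_start, best_len = cur_start, cur_len
        else:
            cur_len = 0
    if best_len == 0:
        return 0.0
    run = belongness[best_start:best_start + best_len]
    matching_value = sum(run) / best_len
    return length_membership(best_len) * matching_value
```

The input would then be classified into whichever phoneme descriptor yields the highest `match` result.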
This final result specifies the degree of belongness of the given sample to that phoneme pattern. Figure 6 presents a sample of this step.

4. Training Algorithm

Training takes place as a normal genetic algorithm [5]. A complete recognizer, including the color definitions, the length definitions and all phoneme descriptions, is called a genome. The fitness of a genome is defined as how well it classifies the phoneme samples. The training process begins with a collection of random genomes. In each iteration, all genomes are sorted by fitness and the 50% lowest-ranking ones are discarded; the population is then replenished by creating new genomes from the remaining ones using normal genetic-algorithm cross-over and mutation operators. This cycle is repeated until the top-scoring genome passes a required limit.

5. Experimental Results

The above algorithm was applied to the TIMIT database with 62 phoneme classes and tested on single-speaker sample sets. In each test, some of one speaker's sentences are used for training and one sentence is used for testing, and the results for non-trained or non-tested phonemes are removed.² The results are presented in Table 2, which also compares them with Hidden Markov Model results, a widely used approach in speech recognition ([1], [3], [7]). It is worth noting that in our first trial the HMM could not create a model and learn the samples, so we used more samples to train it than for our fuzzy model.

Table 2, Experimental Results
                                     This Method   HMM
1st correct answers:                 85%           63%
3rd correct answers (out of 62)³:    95%           80%
6th correct answers (out of 62):     98%           87%

6. Conclusion and Future Work

Despite the existence of several methods for speech recognition, the problem is still open, as no algorithm is both fast and accurate enough to be an ultimate answer.
² Some sample sets or test sets did not include all phonemes.
³ One of the top three guesses was correct.

Based on fuzzy computations, a novel approach has been presented which is believed to be fast and accurate but still

needs work to be a complete solution. In this method, the speech signal data are converted into a representation with words and, using the same representation for the phonemes, their correct phoneme classes are identified. The phoneme descriptors are trained using normal genetic algorithms; while training takes 3 to 4 hours for a set of around 400 phoneme samples, the classification task itself is very fast because very little computation is needed. The algorithm was tested on the TIMIT database with single-speaker samples and achieved around 85% first correct answers and 95% third correct answers, all better than the results of the Hidden Markov Model, a widely used speech recognition approach. As next steps to improve the algorithm, the training algorithm can be altered to run faster with more samples, sequence data can be used to take the relative changes between frames into account, and word descriptors can be designed instead of phoneme descriptors.

7. References

[1] Babaali, B., and Sameti, H., The Sharif Speaker-Independent Large Vocabulary Speech Recognition System, The 2nd Workshop on Information Technology & Its Disciplines (WITID 2004), Feb. 24-26, Kish Island, Iran, 2004.
[2] Berthold, M.R., A Time Delay Radial Basis Function Network for Phoneme Recognition, Proceedings of the IEEE International Conference on Neural Networks, vol. 7, pp. 4470-4473, Orlando, 1994.
[3] Duchateau, J., Demuynck, K., and Compernolle, D.V., Fast and Accurate Acoustic Modelling with Semi-Continuous HMMs, Speech Communication, vol. 24, no. 1, pp. 5-17, April 1998.
[4] Golowich, S.E., and Sun, D.X., A Support Vector/Hidden Markov Model Approach to Phoneme Recognition, ASA Proceedings of the Statistical Computing Section, pp. 25-30.
[5] Holland, J., Adaptation in Natural and Artificial Systems, Ann Arbor, Michigan: University of Michigan Press, 1975.
[6] Kwon, O.W., and Lee, T.W., Phoneme recognition using ICA-based feature extraction and transformation, Signal Processing, vol. 84, no. 6, pp. 1005-1019, June 2004.
[7] Ohkawa, Y., Yoshida, A., Suzuki, M., Ito, A., and Makino, S., An optimized multi-duration HMM for spontaneous speech recognition, EUROSPEECH-2003, pp. 485-488, 2003.
[8] Schwarz, P., Cernocky, M., and Cernocky, J., Phoneme recognition based on TRAPs, Workshop on Multimodal Interaction and Related Machine Learning Algorithms, June 2004.
[9] Zadeh, L.A., From Computing with Numbers to Computing with Words: A New Paradigm, International Journal of Applied Mathematics and Computer Science, vol. 12, no. 3, pp. 307-324, 2002.