
JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 11/2007, ISSN 1642-6037

Marek WIŚNIEWSKI *, Wiesława KUNISZYK-JÓŹKOWIAK *, Elżbieta SMOŁKA *, Waldemar SUSZYŃSKI *

Keywords: HMM, recognition, speech, disorders

AUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH

The Hidden Markov Model (HMM) is a stochastic approach to the recognition of patterns appearing in an input signal. In this work the authors' implementation of HMMs was used to recognise speech disorders, namely prolonged fricative phonemes. To achieve the best recognition effectiveness while keeping the computation time reasonable, two problems need to be addressed: the choice of the HMM and the proper preparation of the input data. Test results for the recognition of the considered type of speech disorder are presented for HMM models with different numbers of states and for different codebook sizes.

1. INTRODUCTION

HMMs are stochastic models widely used for the recognition of various patterns. They have gained particular significance in speech recognition systems [1,2,3]. An HMM is an extension of the Markov model in which the current state of the model is hidden and only the output (an observation vector) is observed. Thus, by observing the output of an HMM, the probability of the model being in a given state can be determined. In speech recognition the observation is the acoustic signal (in the form of an observation vector), and the state of the model is associated with the generated word (or another speech unit, such as a phoneme) [4]. The classification of speech disorders distinguishes many cases, but their number is much lower than the number of words used in a given language. The prolonged fricative phoneme is a disturbance that often appears in non-fluent speech.
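As a concrete illustration of the idea above, the likelihood that a discrete HMM emitted a given observation sequence can be computed with the standard forward algorithm, and recognition then reduces to picking the model with the highest likelihood. The sketch below is a minimal illustration only; the model names, sizes and probability values are invented for the example and are not the paper's configuration.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Emission likelihood P(obs | model) of a discrete HMM (forward algorithm).

    pi  : (S,)  initial state probabilities
    A   : (S,S) state transition probabilities
    B   : (S,M) emission probabilities of each codebook symbol per state
    obs : sequence of codebook symbol indices
    """
    alpha = pi * B[:, obs[0]]          # initialise with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate through A, then emit o
    return alpha.sum()

def recognise(models, obs):
    """Pick the model (dysfluency) with the highest emission likelihood."""
    return max(models, key=lambda name: forward_likelihood(*models[name], obs))
```

Each model in `models` is a `(pi, A, B)` triple; since one model is trained per kind of dysfluency, the returned name identifies the detected disorder.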
Its proper detection is of major significance for the further selection of the therapy method [5,6]. The recognition process with the HMM approach is as follows. First, the number of states of the model and the size of the codebook must be determined. Next, given a sufficient number of samples, a database of models can be generated, one model per kind of disorder. Creating a model that recognises a given pattern is the learning step: starting from a base model and an appropriate number of encoded non-fluent utterances of the same kind, the model parameters are re-estimated so that the model achieves the maximum emission likelihood for that kind of pattern (observation vector). Once such a database of learned models has been created, any sample can be examined. The recognition process consists in finding the model that

* Maria Curie-Skłodowska University in Lublin, marek.wisniewski@umcs.lublin.pl

gives the highest probability. Since a particular dysfluency is associated with each model, that dysfluency can be detected in the acoustic signal.

The HMM is defined by three parameters, λ = (A, B, π), where A is the matrix of transition probabilities between particular states, B is the matrix of probabilities of emission of each codebook element in each model state, and π is the vector of probabilities of the model being in a particular state at time t = 0. The problem to be addressed is the choice of the optimal sizes of the matrices A and B: these values need to be chosen so that both the recognition efficiency and the computation time stay at an acceptable level.

2. SAMPLE PARAMETERISATION

The acoustic signal has to be parameterised before analysis. The most commonly used set of parameters in this case are the Mel Frequency Cepstral Coefficients (MFCC). In this work the MFCC parameters were determined as follows:
- splitting the signal into frames of 512 samples,
- FFT (Fast Fourier Transform) analysis of every frame,
- transition from the linear to the mel frequency scale according to the formula F_mel = 2595 * log10(1 + F / 700) [7,8],
- filtering of the signal spectrum by 20 triangular filters,
- calculation of the required number (20) of MFCC parameters.

The output of each filter is determined by summing the products of the power spectrum and the filter amplitude, according to the formula:

    S_k = sum_{j=0..J} P_j * A_{k,j}                                    (1)

where: S_k - output of the k-th filter, J - number of subsequent frequency ranges from the FFT analysis, P_j - average power of the input signal at the j-th frequency, A_{k,j} - j-th coefficient of the k-th filter.

With the S_k values given for each filter, the cepstral parameters in the mel scale can be determined [9]:

    MFCC_n = sum_{k=1..K} (log S_k) * cos(n * pi * (k - 0.5) / K),  for n = 1..N,   (2)

where: N - required number of MFCC parameters, S_k - filter outputs, K - number of filters.
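The MFCC computation steps above can be sketched as follows. This is a minimal sketch, not the authors' implementation; details the text does not specify (uniform filter spacing on the mel scale, unit-height triangles, the log base inside Eq. (2)) are assumptions of the example.

```python
import numpy as np

def hz_to_mel(f_hz):
    # F_mel = 2595 * log10(1 + F / 700), as in the text
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def mfcc_frame(frame, fs=22050, n_filters=20, n_coeffs=20):
    """MFCC parameters for one 512-sample frame, following the listed steps."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum from FFT
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # Triangular filters spaced uniformly on the mel scale (an assumption)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    S = np.zeros(n_filters)
    for k in range(n_filters):
        lo, mid, hi = edges[k], edges[k + 1], edges[k + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0.0, None) * (freqs <= mid)
        down = np.clip((hi - freqs) / (hi - mid), 0.0, None) * (freqs > mid)
        S[k] = np.sum(spectrum * np.clip(up + down, 0.0, 1.0))  # Eq. (1)
    # Eq. (2): cosine transform of the log filter outputs
    k_idx = np.arange(1, n_filters + 1)
    return np.array([np.sum(np.log(S + 1e-12) *
                            np.cos(n * np.pi * (k_idx - 0.5) / n_filters))
                     for n in range(1, n_coeffs + 1)])
```

With the paper's parameters (fs = 22050 Hz, 512-sample frames, 20 filters, 20 coefficients), each frame yields one 20-dimensional MFCC vector.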
The justification for the transition from the linear scale to the mel scale is that the latter better reflects the human perception of sounds.

3. CODEBOOK PREPARATION

The MFCC analysis of the acoustic signal yields too many parameters to be analysed with an HMM with a discrete output. At the same time, the number of

MFCC parameters cannot simply be decreased, since important information might be lost and the recognition effectiveness could deteriorate. In order to reduce the number of parameters, encoding with a proper codebook can be applied [9]. The codebook is prepared as follows. First, a proper sample of utterances has to be chosen that covers the entire acoustic space to be examined. The codebook can then be generated, for example, with the k-means algorithm. Three fragments of utterances were selected, each lasting 54 seconds and articulated by three different persons, and their MFCC coefficients were calculated. The obtained set of parameter vectors was divided into the appropriate number of regions and their centroids were found. Distances between vectors were computed with the Euclidean formula:

    d_{x,y} = sqrt( sum_{i=1..N} (x_i - y_i)^2 )                        (3)

where d_{x,y} is the Euclidean distance between the N-dimensional vectors X and Y. Following this method, codebooks of sizes 30, 38, 64, 128, 256 and 512 were prepared and used in the further tests.

4. TESTING PROCEDURE

The examination process was as follows. First, a sufficient number of prolonged fricative phonemes were chosen (^, s, z, x, , v, , f). For every phoneme, 5 fragments containing only the prolongation were prepared. The fragments came from different recordings of stuttering persons. Every group of fragments was encoded with each of the previously prepared codebooks. As a result, training vectors were acquired for the codebook sizes 30, 38, 64, 128, 256 and 512. These vectors were used for training the recognition models. Models with 5, 8, 10 and 15 states were used in the tests, so 24 models were prepared for every prolongation (192 models in total). Models with randomly generated probability values of the matrices A, B and π served as base models. For testing, an application named HMM, in which the appropriate algorithms were implemented, was used.
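The codebook preparation of Section 3 can be sketched as a plain k-means loop over MFCC vectors, with vector quantisation as the encoding step. This is an illustrative sketch; the initialisation strategy, iteration count and random seed below are assumptions, not the paper's settings.

```python
import numpy as np

def build_codebook(vectors, size, n_iter=50, seed=0):
    """k-means codebook over MFCC vectors; the distance is Euclidean, Eq. (3)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with randomly chosen input vectors (an assumption)
    centroids = vectors[rng.choice(len(vectors), size, replace=False)]
    for _ in range(n_iter):
        # d_{x,y} = sqrt(sum_i (x_i - y_i)^2) for every vector/centroid pair
        d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(size):
            members = vectors[labels == k]
            if len(members):                      # keep old centroid if region is empty
                centroids[k] = members.mean(axis=0)
    return centroids

def encode(vectors, codebook):
    """Replace each MFCC vector by the index of its nearest codeword."""
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)
```

An utterance then becomes a sequence of symbol indices in `0..size-1`, which is exactly the discrete observation sequence the HMMs consume; repeating `build_codebook` with `size` in {30, 38, 64, 128, 256, 512} reproduces the set of codebooks used in the tests.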
The parameters of the sound samples were as follows: sampling frequency 22050 Hz, amplitude resolution 16 bits. All recordings were normalised to the same dynamic range of 50 dB. The recognition effectiveness for fricative phonemes was examined on 22 fragments of utterances, each lasting several seconds. Every utterance contained exactly one disorder. Each test piece was encoded with every prepared codebook and then analysed by the appropriate group of learned models. From the sample, segments of 10 symbols (corresponding to approximately 232 ms) were taken with a step of one symbol (approximately 23 ms), and the emission probability for each model was computed. As a result of the recognition process, a probability distribution over time was acquired for each model. The time at which the maximum likelihood occurred (achieved by whichever model) was then compared with the time of the disorder appearance

(read out from the sample spectrogram). If both times were consistent, the result was counted as a successful recognition; otherwise, as a failure.

Fig. 1. The analysis result of the utterance "drzewka są ośśśśś ośnieżone" ( d efka son o^^^^^^^^^^ o^øe one ): a) probability distribution over successive frames for one of the 8-state models with a codebook of 256 elements; b) spectrogram [11].

Figure 1a shows the probability distribution of one of the models for this utterance. It is an example of a very good recognition. Additionally, the duration of the disorder can be estimated from the graph.

Table 1. Dependence of the recognition efficiency on the model size and the codebook size.

    Codebook size | Recognition ratio [%]
                  | 5 states | 8 states | 10 states | 15 states
    30            |    64    |    45    |    59     |    50
    38            |    64    |    59    |    45     |    68
    64            |    73    |    59    |    64     |    73
    128           |    64    |    68    |    68     |    73
    256           |    77    |    68    |    77     |    73
    512           |    82    |    77    |    82     |    77

From Table 1 it follows that the best results, approximately 80%, were achieved for the largest codebook of 512 elements, regardless of the number of states of the used model. As the codebook size decreased, the recognition ratio was also

decreasing. The number of states had only a minimal influence on the recognition.

5. SUMMARY

In the Polish language, 37 phonemes can be distinguished [12]. It might seem that this number would be an appropriate codebook size. However, there exists a large group of sounds that constitute inter-phoneme transitions. Prolongations of fricative phonemes include many sounds that are difficult to classify as real phonemes, which is why the results are better for larger codebooks.

BIBLIOGRAPHY

[1] http://cmusphinx.sourceforge.net
[2] http://htk.eng.cam.ac.uk/
[3] http://julius.sourceforge.jp
[4] DELLER J.R., HANSEN J.H.L., PROAKIS J.G., Discrete-Time Processing of Speech Signals, IEEE, New York, 2000.
[5] KUNISZYK-JÓŹKOWIAK W., SMOŁKA E., SUSZYŃSKI W., Akustyczna analiza niepłynności w wypowiedziach osób jąkających się [Acoustic analysis of non-fluency in the utterances of stuttering persons], Technologia mowy i języka, Poznań, 2001.
[6] SUSZYŃSKI W., Komputerowa analiza i rozpoznawanie niepłynności mowy [Computer analysis and recognition of speech non-fluency], doctoral dissertation, Gliwice, 2005.
[7] WAHAB A., SEE NG G., DICKIYANTO R., Speaker Verification System Based on Human Auditory and Fuzzy Neural Network System, Neurocomputing Manuscript Draft, Singapore.
[8] PICONE J.W., Signal modeling techniques in speech recognition, Proceedings of the IEEE, 1993, 81(9), pp. 1215-1247.
[9] SCHROEDER M.R., Recognition of complex acoustic signals, Life Science Research Report, T.H. Bullock, Ed., Abakon Verlag, Berlin, vol. 55, pp. 323-328, 1977.
[10] TADEUSIEWICZ R., Sygnał mowy [The speech signal], Warszawa, 1988.
[11] HORNE R.S., Spectrogram for Windows, ver. 3.2.1.
[12] BASZTURA CZ., Źródła, sygnały i obrazy akustyczne [Sources, signals and acoustic images], WKŁ, Warszawa, 1988.
