Implementation of Vocal Tract Length Normalization for Phoneme Recognition on TIMIT Speech Corpus

2011 International Conference on Information Communication and Management, IPCSIT vol. 16 (2011), IACSIT Press, Singapore

Jensen Wong Jing Lung+, Md. Sah Hj. Salam, Mohd Shafry Mohd Rahim and Abdul Manan Ahmad
Department of Computer Graphics & Multimedia, Faculty of Computer Science and Information System, University Technology Malaysia, 81310 UTM Skudai, Johor, Malaysia

+ Corresponding author. Tel.: +60128802126. E-mail address: jensen_wg@yahoo.com.

Abstract. Inter-speaker variability, one of the problems faced by speech recognition systems, degrades performance when recognizing speech spoken by different speakers. The Vocal Tract Length Normalization (VTLN) method is known to improve recognition performance by compensating the speech signal with a specific warping factor. Experiments are conducted on the TIMIT speech corpus with the Hidden Markov Model Toolkit (HTK) and an implementation of VTLN in order to show the improvement in speaker-independent phoneme recognition. The results show better recognition performance with a bigram language model than with a unigram language model: the best phoneme error rate (PER) is 28.80% for the bigram model and 38.09% for the unigram model. The best warp factor for normalization in this experiment is 1.40.

Keywords: VTLN, inter-speaker variability, speech signal, warp factor, phoneme recognition.

1. Introduction

Differences in human voices are caused by the different sizes of the vocal tract (VT), so the speech signals that different speakers generate do not contain the same frequencies. This variation in the acoustic speech signal across speakers, combined with differences in accent, dialect, speaking rate and style, makes it harder for a system to match trained speech models accurately. These physiological and linguistic differences between speakers are known as inter-speaker variability [8, 11], and they affect the overall performance of a continuous Automatic Speech Recognition (ASR) system.

One physical source of inter-speaker variability is the vocal tract length (VTL). The model in Figure 1 represents the human vocal apparatus, the main source of human speech generation. The speech spectrum is shaped by the VT, marked within the dotted box, which starts at the opening of the vocal cords (the glottis) and ends at the lips and nostrils [2]. Just as bottles filled to different water levels resonate at different frequencies, the size and length of the VT determine the frequency content of the speech signal.

The physical difference in VTL is most noticeable between male and female speakers. Male speakers have a longer VT, which generates a lower-frequency speech spectrum; female speakers have a shorter VT, which generates a higher-frequency one. According to Lee & Rose [3, 4], VTL varies from approximately 13 cm for adult females to over 18 cm for adult males. These VTL differences shift the positions of the spectral formant frequencies by as much as 25% between adult speakers, and the resulting mismatched formant frequencies decrease recognition performance.
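To see why VTL differences of this size translate into formant shifts of roughly this magnitude, a back-of-the-envelope check with the uniform-tube (quarter-wavelength) resonator model is helpful. This model is a standard textbook approximation [2], not a computation taken from the paper:

```python
# Uniform-tube approximation: a neutral vowel has formants at F_n = (2n - 1) * c / (4 * L)
c = 34300.0                      # speed of sound in cm/s
for L in (13.0, 18.0):           # adult female vs. adult male VTL, from Lee & Rose [3, 4]
    print(f"L = {L:.0f} cm -> F1 ~ {c / (4.0 * L):.0f} Hz")
# L = 13 cm -> F1 ~ 660 Hz;  L = 18 cm -> F1 ~ 476 Hz
print(f"frequency-axis ratio: {18.0 / 13.0:.2f}")   # ~1.38
```

Under this rough model the two spectra differ by a frequency-axis scaling of about 1.38, which is of the same order as the best warp factors reported later in Section 3 (1.38 and 1.40).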

Due to these VT differences, a speaker-independent ASR system, which is trained on many different speakers, is generally worse in recognition performance than a speaker-dependent ASR system. ASR modelling efficiency is dramatically reduced without an appropriate alignment of the speech spectrum along the frequency axis [12]; hence, the frequency spectra need to be approximately linearly scaled.

Fig. 1: The model of the vocal tract [7].

Since the VTLN method is used to eliminate inter-speaker variability, this paper focuses on the effect of the warp factor and the warping frequency cutoffs on phoneme recognition performance. Section 2 describes the experimental setup. The recognition results are presented in Section 3, followed by a discussion of the results in Section 4 and the conclusion in Section 5.

2. Experimental Setup

2.1. Preparation

The experiment begins with the preparation of the speech corpus and the toolkit for phoneme recognition. Phoneme recognition is a very delicate recognition task that focuses on recognizing the phonemes in every file of the speech corpus. This approach makes it possible to observe the actual recognition performance of every phoneme in a sentence.

The TIMIT Acoustic-Phonetic Continuous Speech Corpus contains a total of 6300 sentences: 10 sentences spoken by each of 630 speakers of both sexes from 8 major dialect regions of the United States. The dialect sentences (SA sentences) carry more dialectal variation than the other sentences in TIMIT [1, 9, 10], and they are removed from the experimental setup to keep the experiment free of dialectal variants. After this exclusion, the remaining 5040 sentences are divided into 3696 training sentences and 1344 testing sentences. The TIMIT transcriptions, which use a set of 61 phonemes, are included with the speech data.

Because the TIMIT speech corpus is stored as waveforms, it must be converted into a parametric form. The Mel-Frequency Cepstral Coefficient (MFCC) representation is widely used because it discriminates well between speech variations [5] and takes the frequency-dependent sensitivity of human perception into account [14], which makes MFCCs less sensitive to pitch. MFCCs also suppress insignificant spectral variation in the higher frequency bands and preserve sufficient information for speech and phoneme recognition with a small number of coefficients [13]. The conversion from waveform to MFCC (Figure 2) is done with HTK, a command-line toolkit selected as the medium for training and testing on the TIMIT speech corpus to obtain the highest possible recognition performance for each setting. The MFCC conversion also enables HTK to read and process the input properly [5]. Each HTK run depends on its configuration setup for transforming, training and testing the speech, and VTLN is implemented in the mel filterbank analysis (Figure 3) through this configuration setup.
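In the experiment this conversion is handled entirely inside HTK according to its configuration file, and the paper does not list the exact settings. Purely as an illustration of the Figure 2 pipeline (frame blocking, windowing, FFT, mel filterbank, cepstrum), here is a minimal numpy/scipy sketch; the frame sizes, filter count and coefficient count below are common defaults, not the paper's values:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=12):
    # Frame blocking: 25 ms frames every 10 ms at 16 kHz, then Hamming windowing
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # FFT -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Mel-frequency warping: triangular filters equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[i, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    # Log mel spectrum -> cepstrum via DCT; keep the first n_ceps coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]

if __name__ == "__main__":
    y = np.random.randn(16000)   # stand-in for one second of 16 kHz speech
    print(mfcc(y).shape)         # (98, 12): one coefficient vector per frame
```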

Fig. 2: MFCC conversion process flow (frame blocking, windowing, FFT, mel-frequency warping, cepstrum computation).

Fig. 3: Conversion from waveform to MFCC with VTLN (derived from [3], [4] and [6]).

Fig. 4: Experimental training and testing flow (derived from Young et al., 2006 [5]).

2.2. Training and Testing

An experimental flow diagram briefly presents the way this experiment is conducted, as shown in Figure 4. This flow is repeated for each VTLN setting. VTLN normalizes the speech signal and attempts to reduce inter-speaker variability by compensating for vocal tract length variation among speakers [8]. In the HTK configuration, the VTLN setting consists of a warp factor parameter and lower and upper warping frequency cutoff parameters. These three parameters control the minimum and maximum frequency range subjected to warping at factor α.

The main step of this VTLN implementation is to rescale the frequency axis of the speech spectrum, within the defined frequency boundaries, according to the specified warp factor α. This readjustment, called piecewise linear warping, either stretches or compresses the speech spectrum by the factor α.

Since the suitable warp factor is unknown in this experiment, factors from 0.5 to 2.0 are tried in a trial-and-error approach, with an increment of 0.02 between experiments. This range limit is reasonable because the spectrum loses its information once it is compressed to half its size (factor 0.5) or stretched to twice its size (factor 2.0). The trial-and-error approach requires considerable computational resources to obtain all the recognition performance results. Properly running the experiment for each VTLN setting applied to the TIMIT speech corpus makes it possible to evaluate the results and identify the best setting for this corpus.
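The warping itself is configured inside HTK, so the sketch below is not HTK's internal code; it is a minimal illustration of piecewise linear warping in the style of Lee & Rose [3, 4], shown here with the paper's best bigram setting (α = 1.40, upper cutoff 5000 Hz, 8 kHz Nyquist for 16 kHz TIMIT audio). For simplicity it models only the upper cutoff; the paper fixes the lower cutoff at 300 Hz (see Section 4). Note that this form of the warp requires α · f_cut < f_nyq to stay monotone:

```python
import numpy as np

def piecewise_linear_warp(f, alpha, f_cut=5000.0, f_nyq=8000.0):
    """Scale frequencies by alpha up to f_cut, then follow a linear
    segment chosen so that the Nyquist endpoint maps to itself.
    Requires alpha * f_cut < f_nyq for a monotone warping function."""
    f = np.asarray(f, dtype=float)
    knee = alpha * f_cut
    slope = (f_nyq - knee) / (f_nyq - f_cut)   # slope of the upper segment
    return np.where(f <= f_cut, alpha * f, knee + slope * (f - f_cut))

if __name__ == "__main__":
    # Warp some mel filterbank edge frequencies at the paper's best setting.
    edges = np.array([300.0, 1000.0, 3000.0, 5000.0, 6500.0, 8000.0])
    print(piecewise_linear_warp(edges, alpha=1.40))
    # -> [ 420. 1400. 4200. 7000. 7500. 8000.]

    # Trial-and-error sweep as in Section 2.2 (0.50 to 2.00, step 0.02);
    # each step re-runs the Figure 4 train/test flow with this warp factor.
    for alpha in np.arange(0.50, 2.001, 0.02):
        pass  # placeholder for feature extraction, training and scoring in HTK
```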

3. Results

The experiment focuses on phoneme recognition, so recognition performance is measured by the Phoneme Error rate (PER). Both language models, unigram and bigram, are used in this experiment, and the recognition performance of each is recorded for comparison.

The best recognition performance for each language model is selected from all the individual runs, as shown in Tables 1 and 2. PER is calculated from the number of substitution errors (S), deletion errors (D), insertion errors (I), the total number of phonemes (N) and the number of correctly recognized phonemes (H). The values in the tables below follow these equations:

    H = N − S − D                      (1)
    Corr = (H / N) × 100%              (2)
    Acc = ((H − I) / N) × 100%         (3)
    PER = 100% − Acc                   (4)

Tables 3 and 4 summarize the lowest PER achieved with the two language models after VTLN implementation. With warp factor 1.00, representing the non-VTLN baseline, the initial PER is 38.83% for the unigram language model and 29.57% for the bigram language model.

Table 1: Phoneme recognition result with warp factor 1.38 for the unigram language model

    N       H       D      S       I      Acc      Corr
    51681   38473   2393   10815   6477   61.91%   74.44%

Table 2: Phoneme recognition result with warp factor 1.40 for the bigram language model

    N       H       D      S       I      Acc      Corr
    51681   38230   4580   8871    1431   71.20%   73.97%

Table 3: Recognition performance for the unigram language model

    Warp factor   Upper warp cutoff frequency (Hz)   Acc      PER
    1.00          —                                  61.17%   38.83%
    1.38          3800                               61.91%   38.09%

Table 4: Recognition performance for the bigram language model

    Warp factor   Upper warp cutoff frequency (Hz)   Acc      PER
    1.00          —                                  70.43%   29.57%
    1.40          5000                               71.20%   28.80%
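Equations (1)–(4) transcribe directly into a small scoring helper, checked here against the Table 2 entries. This is a minimal sketch of the arithmetic, not HTK's own scoring tool:

```python
def phoneme_error_rate(N, S, D, I):
    """Eqs. (1)-(4): H = N - S - D, Corr = H/N, Acc = (H - I)/N, PER = 100% - Acc."""
    H = N - S - D
    corr = 100.0 * H / N
    acc = 100.0 * (H - I) / N
    return H, corr, acc, 100.0 - acc

# Reproduces Table 2 (bigram language model, warp factor 1.40):
H, corr, acc, per = phoneme_error_rate(N=51681, S=8871, D=4580, I=1431)
print(H, f"{corr:.2f}% {acc:.2f}% {per:.2f}%")   # 38230 73.97% 71.20% 28.80%
```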

4. Discussion

A warp factor of 1.00 is equivalent to no VTLN, which makes this value suitable as the control condition. The result at warp factor 1.00 serves as the reference for observing the performance changes at other warp factors. Throughout the experiment, the lower warp cutoff frequency is fixed at 300 Hz, as it has little effect on recognition performance.

The bigram language model shows better phoneme recognition performance than the unigram language model, a relative improvement of more than 24%. This is attributed to the better matching reference, which compares the trained HMM models against two successive states of the phoneme test data instead of one. Another noticeable similarity between the two language models is the better accuracy at warp factors above 1.00. Since the experiment is set up in speaker-independent mode, the accuracy achieved is the averaged recognition performance regardless of speaker gender.

5. Conclusion

This experiment shows that phoneme recognition performs well on the TIMIT speech corpus when the warp factor is greater than 1.00. HTK performed best with the bigram language model at warp factor 1.40, with a PER of 28.80%. Although the trial-and-error approach identifies the best warp factor precisely, further experiments on word-level recognition performance are needed, applying the same settings used for phoneme recognition.

6. References

[1] J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, N.L. Dahlgren. DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus. U.S. Department of Commerce, NIST, Gaithersburg, MD, 1993.
[2] L. Rabiner, B.-H. Juang. Fundamentals of Speech Recognition. Prentice-Hall International, 1993.
[3] L. Lee, R.C. Rose. Speaker normalization using efficient frequency warping procedures. Proc. IEEE ICASSP '96, vol. 1, pp. 353-356, 1996.
[4] L. Lee, R.C. Rose. A frequency warping approach to speaker normalization. IEEE Transactions on Speech and Audio Processing, 6(1), pp. 49-60, 1998.
[5] S. Young et al. The HTK Book (8th ed.). Cambridge University Engineering Department, 2006.
[6] P. Zhan, A. Waibel. Vocal Tract Length Normalization for Large Vocabulary Continuous Speech Recognition. Technical Report CMU-CS-97-148, Carnegie Mellon University, Pittsburgh, PA, May 1997.
[7] J.L. Flanagan. Speech Analysis, Synthesis and Perception (2nd ed.). Berlin: Springer-Verlag, 1965.
[8] D. Giuliani, M. Gerosa, F. Brugnara. Improved automatic speech recognition through speaker normalization. Computer Speech & Language, 20(1), pp. 107-123, Jan. 2006.
[9] K.F. Lee, H.W. Hon. Speaker-independent phone recognition using hidden Markov models. IEEE Trans. Acoustics, Speech and Signal Processing, 37(11), pp. 1641-1648, 1989.
[10] F. Müller, A. Mertins. Robust speech recognition based on a certain class of translation-invariant transformations. LNCS, vol. 5933, pp. 111-119, 2010.
[11] F. Müller, A. Mertins. Invariant integration features combined with speaker-adaptation methods. Proc. Interspeech 2010, 2010.
[12] M. Liu, X. Zhou, M. Hasegawa-Johnson, T.S. Huang, Z.Y. Zhang. Frequency domain correspondence for speaker normalization. Proc. Interspeech 2007, pp. 274-277, 2007.
[13] S.B. Davis, P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 28, pp. 357-366, 1980.
[14] J.S. Roger Jang. Audio Signal Processing and Recognition. Available via the on-line course links at the author's homepage: http://www.cs.nthu.edu.tw/~jang.