Vowel place detection for a knowledge-based speech recognition system


S. Lee and J.-Y. Choi
Yonsei University, 134 Sinchon-dong, Seodaemun-gu, 120-749 Seoul, Republic of Korea
pooh390@dsp.yonsei.ac.kr

This work aims to detect vowel place as part of a knowledge-based speech recognition system. Vowel place was classified into 6 groups based on tongue advancement [Front/Back] and height [High/Mid/Low]. Experiments were performed using 660 /hvd/ utterances from Hillenbrand [J. Acoust. Soc. Am. 97, 3099-3111] and 6600 TIMIT vowels. The features used were fundamental frequency (F0) and formant values (F1~F3), where the formant measurements were divided into separate groups using the F0 measurements. The nearest class was found using a simple Mahalanobis distance measure, yielding a 92.0% classification rate for the /hvd/ data. The result for the TIMIT data was 65.7%, and an error analysis with regard to adjacent segment manner and place was carried out to observe the effects of coarticulation, which were not present in the /hvd/ data.

1 Introduction

A knowledge-based speech recognition procedure can be considered a type of distinctive-feature-based speech recognition, as studied by Stevens [10], and of event-based speech recognition, as studied by Espy-Wilson [4]. In a knowledge-based approach, the primary purpose is to model the human perception process. Current statistically based recognition methods suffer performance degradation under mismatched conditions, and a knowledge-based approach offers an alternative. Because knowledge sources are constructed in a directed, meaningful manner, if they can be made to work well, they should be more robust against variability. From this point of view, the goal of this work is to detect vowel place as part of a knowledge-based speech recognition system. Numerous efforts have been made to analyze vowels. Peterson and Barney (PB) [8] studied the acoustic characteristics of vowels using formant frequencies (F1~F3) and fundamental frequency (F0), and Hillenbrand et al. [2] extended this study of vowel acoustics in a similar way.
In addition, Stevens [9] examined the acoustic correlation between formant frequency and vocal tract shape using resonator models. These results show that vocal tract shapes produced by tongue movements are strongly related to vowel production and perception. Meng et al. [6] also attempted to classify vowels using distinctive features. Although Meng reported good performance, that study used many spectral and cepstral coefficients in its knowledge-based approach. The purpose of this study, then, is to detect vowel place using primary acoustic features, namely formant and fundamental frequencies. Vowel place is represented using the distinctive features [high, low, back], and a minimum distance measure was used to detect vowel place in 6600 vowels extracted from the TIMIT corpus and in Hillenbrand's 660 /hvd/ data [2]. This paper reports preliminary work on vowel place detection using formant and fundamental frequencies. First, we describe the experimental methodology in detail. We then present and discuss the results of vowel place detection. Finally, we summarize the paper and consider future work.

2 Experimental methodology

2.1 Test signals

Two different databases were used for these experiments. The first set of test signals consisted of 660 /hvd/ utterances recorded by Hillenbrand et al. [2]; 6600 vowels from the TIMIT corpus were also used. The vowels chosen for these experiments are 11 American English monophthongs. The talkers of the /hvd/ data consist of 30 men and 30 women, so each vowel has 30 tokens per gender (30 x 2 x 11 = 660). The 6600 vowels were randomly selected from the TIMIT corpus equally from each gender, with 150 male and 150 female tokens for each vowel (300 x 2 x 11 = 6600). Diphthongs and schwa are excluded here.
2.2 Acoustic measurements

The formant tracking method for F1~F3 was similar to the Entropic ESPS formant program: linear predictive coding (LPC) resonances were computed every 10 ms to find the formant frequencies, with dynamic programming for track continuity. The formant frequencies were taken at 50% of vowel duration, which was determined from the labels. Fundamental frequency (F0) was measured with a conventional autocorrelation method every 10 ms using a 25 ms Hamming window; it was also sampled at 50% of vowel duration.

Fig. 1: Vowel place detection process. The input signal is divided into two groups by F0; vowel place is then detected by minimum distance over F1~F3.
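The autocorrelation-based F0 measurement described above can be sketched as follows. This is a minimal illustration under stated assumptions: the 25 ms Hamming window is from the text, but the `estimate_f0` helper, its search range, and the peak-picking rule are hypothetical, not the authors' implementation.

```python
import numpy as np

def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate F0 of one frame by picking the autocorrelation peak
    within a plausible pitch-period range."""
    x = frame * np.hamming(len(frame))
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags
    lo = int(fs / fmax)                   # shortest period searched
    hi = min(int(fs / fmin), len(ac) - 1) # longest period searched
    lag = lo + int(np.argmax(ac[lo:hi]))  # lag of the strongest peak
    return fs / lag

fs = 16000
t = np.arange(int(0.025 * fs)) / fs            # one 25 ms frame
frame = np.sin(2 * np.pi * 120.0 * t)          # synthetic "voiced" frame
print(f"{estimate_f0(frame, fs):.1f} Hz")      # roughly 120 Hz
```

In the paper's setting this estimate, sampled at 50% of vowel duration, is what routes each token into one of the two F0 groups.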

Table 1: The distinctive feature set of the 11 vowels.

2.3 Feature analysis

The features chosen to detect vowel place are [±high], [±low], and [±back], which are tongue body features [10]. For each tongue body feature, the tokens are divided into two classes, [+feature] and [−feature]. In this paper, every vowel is classified into one of 6 groups depending on its tongue body features. The vocal tract shape can be approximated roughly as a concatenation of tubes. Articulator movement, modeled as changes to these concatenated tubes, leads to formant-frequency changes resulting from perturbations (local constrictions) of a tube resonator. The frequency of F1 is inversely related to tongue height (e.g., high vowels have a low F1 frequency), and the frequency of F2 is related to tongue advancement (e.g., front vowels have a high F2 frequency). The [±high, ±low] features encode tongue height: [+high, −low], [−high, −low], and [−high, +low] represent vowels pronounced with high, mid, and low tongue positions, respectively. Fig. 2 shows the Gaussian distribution of F1 of high/mid/low vowels for the TIMIT data and the /hvd/ data. As expected, F1 is inversely related to tongue height (high/mid/low). Tongue advancement corresponds to the [±back] feature: front and back vowels, pronounced with front and back tongue positions, are represented as [−back] and [+back], respectively. Fig. 3 shows the Gaussian distribution of F2 of front/back vowels for the TIMIT data and the /hvd/ data. As expected, tongue advancement (front/back) is related to F2. The feature values for these vowels are summarized in Table 1.

Fig. 3: Gaussian distribution of F2 of front/back vowels for TIMIT data and /hvd/ data.

2.4 Classification strategy

The classification strategy for vowel place is divided into two steps, grouping and vowel place classification, as shown in Fig. 1. The grouping process separates input signals into two sets depending on F0.
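The mapping from tongue-body feature values to the six place classes can be sketched as below. The four vowels and their feature assignments are an illustrative subset following the conventions just described, not a reproduction of the paper's Table 1.

```python
# Six place classes from tongue-body features (height from [±high, ±low],
# advancement from [±back]). The vowel entries are a hypothetical
# four-vowel subset for illustration only.
FEATURES = {
    "iy": {"high": +1, "low": -1, "back": -1},  # high front, as in "heed"
    "uw": {"high": +1, "low": -1, "back": +1},  # high back,  as in "who'd"
    "ae": {"high": -1, "low": +1, "back": -1},  # low front,  as in "had"
    "aa": {"high": -1, "low": +1, "back": +1},  # low back,   as in "hod"
}

def place(feat):
    """Map a tongue-body feature bundle to one of the six place classes."""
    height = "High" if feat["high"] > 0 else ("Low" if feat["low"] > 0 else "Mid")
    advancement = "Back" if feat["back"] > 0 else "Front"
    return f"{height}/{advancement}"

for vowel, feat in FEATURES.items():
    print(vowel, "->", place(feat))  # e.g. iy -> High/Front
```

Mid vowels fall out of the same rule as the [−high, −low] case, so no extra class logic is needed.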
This process compensates for differences in formant frequency due to the difference in vocal tract length between male and female speakers. The nearest class is found using a simple Mahalanobis distance measure in the vowel place classification process, with F1 and F2. The Mahalanobis distance is defined as

    D_M(f) = (f − μ)^T Σ^(−1) (f − μ)    (1)

where f = (F1, F2) is a formant vector, μ = (μ_F1, μ_F2) is the formant mean, and Σ is the covariance matrix of f. In addition, retroflexed vowels are processed using F3. The reference means and covariance matrices were calculated from the training set of each database.

3 Results

Fig. 2: Gaussian distribution of F1 of high/mid/low vowels for TIMIT data and /hvd/ data.

Fig. 4: The overall detection results for TIMIT data and /hvd/ data.
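The minimum-distance rule of Eq. (1) can be sketched as follows. The class means and covariances here are invented toy values over the (F1, F2) plane for illustration, not statistics trained from either database.

```python
import numpy as np

def mahalanobis_sq(f, mu, cov):
    """Squared Mahalanobis distance (f - mu)^T Sigma^-1 (f - mu), as in Eq. (1)."""
    d = f - mu
    return float(d @ np.linalg.inv(cov) @ d)

def classify(f, classes):
    """Return the place class whose (mean, covariance) model minimizes
    the Mahalanobis distance to the formant vector f."""
    return min(classes, key=lambda name: mahalanobis_sq(f, *classes[name]))

# Toy (mu, Sigma) models in Hz; diagonal covariances for simplicity.
classes = {
    "High/Front": (np.array([300.0, 2300.0]), np.diag([50.0**2, 200.0**2])),
    "Low/Back":   (np.array([750.0, 1100.0]), np.diag([80.0**2, 150.0**2])),
}
print(classify(np.array([320.0, 2200.0]), classes))  # High/Front
```

In the paper's two-step strategy, a token would first be routed by F0 to the male- or female-trained models before this distance computation.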

Table 2: The confusion matrix of tongue height features of (a) TIMIT data and (b) /hvd/ data.

Table 5: The confusion matrix of vowel place of /hvd/ data.

The vowel place detector was evaluated on both the /hvd/ data and the TIMIT data. The detection rate was determined by comparing the output of the detector with the labeled data. The overall detection results for the databases are summarized in Fig. 4. The detection results for tongue height are 76.2% and 93.0% for the TIMIT data and /hvd/ data, respectively. Tongue height, which can be represented as [±high, ±low], is determined by F1 and is classified into three classes: high/mid/low. Table 2 shows the confusion matrix for tongue height. With a few exceptions, most of the errors are between high/low and mid. The detection results for tongue advancement are 85.3% and 99.0% for the TIMIT data and /hvd/ data, respectively. Tongue advancement, which can be represented as [±back], is determined by F2 and is classified into two classes: front/back. Table 3 shows the confusion matrix for tongue advancement. Comparing the results for tongue height and advancement, the [±back] feature shows better performance than the [±high, ±low] features. The overall detection results for vowel place are 65.7% and 92.0% for the TIMIT and /hvd/ databases, respectively. Every vowel is classified into one of six classes based upon vowel place: High/Front, High/Back, Mid/Front, Mid/Back, Low/Front, and Low/Back. Tables 4 and 5 show the confusion matrices of vowel place for the TIMIT data and /hvd/ data, respectively. With a few exceptions, most errors are made between adjacent classes.

Table 3: The confusion matrix of tongue advancement features of (a) TIMIT data and (b) /hvd/ data.

4 Discussion

To summarize briefly, the main purpose of this study was to detect vowel place using formant frequency and fundamental frequency as part of a knowledge-based speech recognition system.
The nearest class was found using a simple Mahalanobis distance measure, yielding a 92.0% classification rate for the /hvd/ data from Hillenbrand. The result for the TIMIT data was 65.7%. Our research was partly motivated by the use of distinctive features in a knowledge-based approach. From this point of view, the acoustic characteristics we chose are formant frequency and fundamental frequency. These features carry no guarantee of detection performance, but they are intuitively reasonable and directly measurable. Fig. 4 shows that the overall detection rate for the TIMIT data is worse than for the /hvd/ data; this is mainly due to coarticulation effects on the formant patterns. Hillenbrand et al. have already pointed out that vowel formant patterns are strongly affected by phonetic environment [3]. Since the /hvd/ data was recorded as h-v-d syllables, coarticulation effects are not observed in its formant patterns. The TIMIT vowels, however, were extracted from various phonetic environments, so their formant frequencies were affected by adjacent segments. This result suggests that the consonant environment of a vowel also provides significant cues for detecting vowel place. Important areas for further study comprise three issues. First, the phonetic environment of the vowel should be considered to overcome formant pattern changes due to coarticulation. Second, the temporal movement of the formant frequencies would give more information; in this paper only one point, sampled at 50% of vowel duration, was used. Third, speaker normalization to compensate for inter-speaker variability could be applied beyond the fundamental-frequency grouping used here. This work has shown that the use of acoustic attributes for vowel place detection is feasible. Although a previous study [6] demonstrated vowel detection with spectral coefficients, this paper limits itself to detecting vowel place with acoustic parameters.
With a modification of the detection strategy to use contextual information, performance would be expected to improve.

Table 4: The confusion matrix of vowel place of TIMIT data.

References

[1] Chomsky, N. and Halle, M., The Sound Pattern of English, Harper and Row (1968)
[2] Hillenbrand, J. M., Getty, L. A., Clark, M. J., and Wheeler, K., "Acoustic characteristics of American English vowels", J. Acoust. Soc. Am. 97, 3099-3111 (1995)
[3] Hillenbrand, J. M., Clark, M. J., and Nearey, T. M., "Effects of consonant environment on vowel formant patterns", J. Acoust. Soc. Am. 109, 748-763 (2001)
[4] Juneja, A. and Espy-Wilson, C. Y., "An event-based acoustic-phonetic approach for speech segmentation and E-set recognition", Proceedings of the International Congress of Phonetic Sciences, 1333-1336 (2003)
[5] Kent, R. D. and Read, C., The Acoustic Analysis of Speech, 2nd edition, Thomson Learning (2001)
[6] Meng, H. and Zue, V., "Signal representation comparison for phonetic classification", ICASSP, 285-288 (1991)
[7] Park, C., "Recognition of English vowels using top-down method", Master's Thesis, M.I.T. (2004)
[8] Peterson, G. E. and Barney, H. L., "Control methods used in a study of the vowels", J. Acoust. Soc. Am. 24, 175-184 (1952)
[9] Stevens, K. N., Acoustic Phonetics, The MIT Press (1991)
[10] Stevens, K. N., "Toward a model for lexical access based on acoustic landmarks and distinctive features", J. Acoust. Soc. Am. 111, 1872-1891 (2002)