The Use of Dynamic Vocal Tract Model for Constructing the Formant Structure of the Vowels

Vera V. Evdokimova
Department of Phonetics, Saint-Petersburg State University, Saint-Petersburg, Russia
postmaster@phonetics.pu.ru

Abstract

This paper discusses a new method of constructing a dynamic vocal tract model. The model consists of two dynamic parts: the voice source and the filter component. Each of these parts has its own dynamic features and resonant frequencies, and their interaction leads to short-term phonetic effects. A method of obtaining the frequency characteristic of the filter component by processing real speech data is suggested. It allows the formant structure of the vowels and its variations to be constructed. The formant structure obtained with the new method is shown on the example of a realization of the stressed vowel /a/.

1. Introduction

The traditional approach to phonetic research of the vocal tract assumes dividing it into two parts: the source component (the vocal apparatus) and the filter component (the system of articulation). The vocal apparatus consists of the vocal cords (folds), trachea, bronchi, and larynx. It is the primary source of the glottal wave. This multifrequency acoustic signal includes the fundamental frequency and its higher harmonics [1, 2, 3]. The voice signal then passes through the filter component: the set of the pharynx and the nasal and oral cavities.

For a long time the filter component was the main object of analysis in studies of the process of speech generation. In the first physically based acoustic model, by G. Fant [3], the filter component is considered a dynamic system with a set of resonant frequencies (the formant frequencies in the case of vowels), and the voice source signal is fed to the input of this system.

The voice source signal is the strongest acoustic signal in the human vocal tract. Almost all the internal organs take part in the biomechanical oscillating system that generates the voice signal; this signal is individual and optimized by nature. The periodic sequence of lung pressure differences in the larynx is called the glottal wave [4, 5]. The frequency of these pulses corresponds to the fundamental frequency of the speech signal. The shape of a glottal pulse can be similar for different people, but it can also differ because of the size, shape, and flexibility of the vocal cords. The glottal wave generates the acoustic voice signal. The plot of the spectral density of the voice source shows a set of peaks: the fundamental frequency is the lowest in frequency and the strongest in power, and all the other peaks are its higher harmonics (timbre frequencies). These frequencies vary slowly over words and phrases according to the intonation contour (except in tonal languages, where changes of the fundamental frequency within one vowel or syllable are semantically distinctive, whereas in other languages they are not) [6, 7, 8].

In order to provide the input excitation to the filter component in a speech synthesis model, it is important to have a good description of the voice source signal. It was suggested to replace the voice source model by a description of its output signal, the glottal wave. Physiological and acoustic experiments gave an opportunity to determine the shape of a glottal pulse. The LF-model of the voice source was developed in the 1980s by G. Fant [4, 5]. It describes the glottal wave as a sequence of pulses of a given shape, whose repetition frequency is the fundamental frequency.
Their shape is similar to the experimentally measured shape of glottal pulses. The spectral density of the voice source obtained from the experiments served as the pattern for the choice of the pulse shape; the voice source constituents were obtained from the signal by inverse filtering. Comparing the model with this pattern has shown that the voice signal can be modeled successfully by the derivative of the glottal wave function. The glottal wave curve differs greatly from an ideal sinusoid because of the high harmonics of the pitch.

The glottal flow is described with four parameters. Three of them pertain to the frequency, amplitude, and exponential growth constant of a sinusoid; the fourth is the time constant of an exponential recovery. The four parameters are interrelated by a condition on the net flow gain within a fundamental period, which is usually set to zero. The choice of these four parameters provides for the production of an individual voice source characteristic. The gain in quality of speech synthesis systems using the LF-model lies in the property that not only the pitch but also its higher harmonics are taken into account. The basis of the interaction of the voice and filter components is maintained in the model. The intensity of the glottal flow, the phoneme durations, and the fundamental frequency are set as time functions for phoneme production. The LF-model imitates the voice signal and works well for text-to-speech synthesis systems. However, it is more complicated to use it for the analysis of real speech data: an inverse problem has to be solved to determine the LF-model parameters from the characteristics of the real speech, which is a very complicated task with a large amount of calculation.

2. Modeling

For the problem of analyzing real speech data it can be suggested to use the method of elaborating a united model of the human vocal tract [10]. This model consists of two parts: the voice source model and the well-known model of the filter component. It is suggested to extend G. Fant's method of elaborating the filter component model to the elaboration of a dynamic model of the voice source and of the vocal tract as a whole. The voice source can be described as a dynamic filtering operation (Fig. 1):

Fig. 1. The voice source represented as a dynamic part.

U_S(t) = W_S(jω) Λ(t),    (1)

where Λ(t) is the input signal, U_S(t) is the glottal wave, and W_S(jω) is the equivalent frequency-domain transfer function of the voice source. Λ(t) is supposed to be the impact of the muscular and pulmonary systems (the lungs) and can be set as white noise of a given frequency bandwidth. All the particular qualities of the source are concentrated in the filtering operation and are determined by the frequency-domain transfer function.

The use of the LF-model gives the basis for describing the structure of the dynamic voice source part. The glottal wave of the LF-model is presupposed to consist of the fundamental frequency constituent and its higher harmonics. Therefore it can be suggested that the voice source has several resonant frequencies of its own. The process of glottal wave generation can then be regarded as forced oscillation arising at the resonant frequencies of the fundamental frequency constituent and of its higher harmonic constituents under the influence of the air flow fluctuations. Therefore the human vocal tract can be regarded as a united dynamic system consisting of two concatenated parts, the voice source and the filter component, which have their own dynamic characteristics. The two parts are non-separable and interact (Fig. 2).

Fig. 2. Dynamic system of the human vocal tract consisting of two parts: Λ(t) is the air flow pressure from the respiratory apparatus (the lungs); W_S(jω) is the transfer function of the source component, which includes the trachea, larynx and vocal cords; U_S(t) is the output acoustic signal of the source component, which includes the pitch and its higher harmonics as well as many other frequencies that are attenuated at that stage; W_V(jω) is the transfer function of the articulation; U(t) is the speech signal.
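The chain of Fig. 2 can be illustrated numerically. The following sketch (Python with NumPy/SciPy; the sampling rate, fundamental frequency, formant frequencies and bandwidths are illustrative assumptions, not values from this study) passes band-limited white noise Λ(t) through a narrow-band "source" stage resonating at an assumed fundamental frequency and its first harmonics, and then through a broader-band "articulation" stage resonating at assumed formant frequencies of /a/.

# Minimal numerical sketch of the two-part dynamic system of Fig. 2.
# All numeric values (fs, F0, formant frequencies, bandwidths) are illustrative assumptions.
import numpy as np
from scipy.signal import butter, lfilter

fs = 16000                       # sampling rate, Hz (assumed)
duration = 0.5                   # seconds of signal
rng = np.random.default_rng(0)

def resonator(f0, bw, fs):
    """Second-order all-pole resonator centred at f0 Hz with bandwidth bw Hz."""
    r = np.exp(-np.pi * bw / fs)                  # pole radius set by the bandwidth
    theta = 2.0 * np.pi * f0 / fs                 # pole angle set by the centre frequency
    a = [1.0, -2.0 * r * np.cos(theta), r * r]    # denominator of the transfer function
    b = [1.0 - r]                                 # rough gain normalisation
    return b, a

# Lambda(t): band-limited white noise standing in for the lung pressure fluctuations
lam = rng.standard_normal(int(fs * duration))
b_lp, a_lp = butter(4, 7000.0 / (fs / 2.0))       # limit the band of the excitation
lam = lfilter(b_lp, a_lp, lam)

# Source part W_S(j*omega): narrow resonances at an assumed F0 and its first harmonics,
# plus a weakened copy of the broadband excitation ("passes all other frequencies").
F0 = 120.0
u_s = 0.05 * lam
for k in range(1, 6):
    b, a = resonator(k * F0, 15.0, fs)
    u_s = u_s + lfilter(b, a, lam)

# Filter part W_V(j*omega): broader resonances at assumed formant frequencies of /a/.
formants = [(700.0, 90.0), (1200.0, 110.0), (2500.0, 170.0)]   # (centre Hz, bandwidth Hz)
u = u_s
for f, bw in formants:
    b, a = resonator(f, bw, fs)
    u = lfilter(b, a, u)
# u plays the role of the output speech-like signal U(t).

A spectral estimate of the resulting signal u shows both the harmonic peaks contributed by the source stage and the formant peaks contributed by the articulation stage, which is the situation described above for the output speech signal U(t).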

This dynamic system, consisting of two parts, can be presented using the following relations [9]:

U_S(t) = W_S(jω) Λ(t),
U(t) = W_V(jω) U_S(t).    (2)

Let us, for instance, consider the work of the vocal tract during a vowel. The signal Λ(t) at the input of the source component exists during the whole vowel because of the air flow from the lungs. It provides for the generation of all the frequencies in the speech signal but has no typical spectrum of its own. The standard way of describing such a signal is to present it as a random function: white noise of a limited frequency bandwidth [3]. The limits of this band, ω1 and ω2, are chosen to cover all the frequencies of the speech signal (in this research, up to ω2 = 2π·10^4 s^-1).

The voice source, with the transfer function W_S(jω), amplifies the fundamental frequency and its higher harmonics. It also passes all the other frequencies, but weakens them at the same time. Some of them are amplified in the filter component with transfer function W_V(jω) (for example, at the formants of the vowels). Therefore the output speech signal contains constituents of both the source component and the filter component. Let us find the spectral densities of the signals:

S_US(ω) = |W_S(jω)|^2 · S_Λ(ω),    (3)
S_U(ω) = |W_V(jω)|^2 · S_US(ω),    (4)

where S_U(ω), S_US(ω) and S_Λ(ω) are the spectral densities of the U, U_S and Λ signals, defined up to a scale factor of the experiment. We shall consider the procedure of detecting the parameters of the equivalent transfer functions by processing real speech data, that is, by processing the obtained spectral densities of the U signal for the vowels.

The co-processing of several acoustic realizations helps to elaborate the methods of discriminating and modeling the transfer functions of the voice source and filter components of the vocal tract. It is important to process speech signals with different levels of influence of the two parts of the vocal tract. Examples are the processing of the periods of a vowel, where the formant frequencies can be found, and of a rather long utterance, where the influence of the filter component is statistically reduced but the influence of the voice source is higher. The transfer functions of the vocal apparatus and of the system of articulation can then be obtained from the experimental speech data using the ratios

|W_S(jω)|^2 = k_П · S_UП(ω) / |W_П(jω)|^2,    (5)
|W_V(jω)|^2 = k_a · S_Ua(ω) / S_UП(ω),    (6)

where S_UП(ω) is the spectral density of the output speech signal obtained by processing the long utterance, S_Ua(ω) is the spectral density of the output speech signal obtained by processing several fundamental frequency periods of the vowel, k_П and k_a are scale gain factors for S_UП(ω) and S_Ua(ω), and W_П(jω) is the transfer function of the filter component smoothed by the statistical processing of the long speech signal.

In order to get an adequate division into the voice source and filter components, the described method must take into account not only the main phonetic laws but also the particular qualities of the mathematical procedures applied. The use of non-parametric methods of spectral density estimation, in particular the standard periodogram estimate, leads to irregularity of the estimated spectral density S_U(ω), which can lead to mistakes in the calculations.
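Before turning to the parametric estimates preferred below, relation (6) can be illustrated with a simple averaged-periodogram (Welch) estimate of the two spectral densities. In the sketch below (Python/SciPy) the file names, the assumption of mono recordings and the analysis parameters are placeholders rather than the data of this study.

# Sketch of estimating the filter-component characteristic |W_V(j*omega)|^2 from
# relation (6): the ratio of the spectral density of a short stressed-vowel segment
# to the spectral density of a long utterance by the same speaker.
# File names and analysis parameters are placeholders, not the data of this study.
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

fs1, long_utt = wavfile.read("long_utterance.wav")        # several minutes of one speaker (mono)
fs2, vowel_seg = wavfile.read("stressed_a_segment.wav")   # a few fundamental periods of /a/ (mono)
assert fs1 == fs2
fs = fs1
long_utt = long_utt.astype(np.float64)
vowel_seg = vowel_seg.astype(np.float64)

nperseg = 512                                             # same frequency grid for both estimates
# S_U_Pi(omega): long utterance, formant structure statistically reduced
f, S_long = welch(long_utt, fs=fs, nperseg=nperseg)
# S_U_a(omega): short segment of the stressed vowel
_, S_vowel = welch(vowel_seg, fs=fs, nperseg=nperseg)

# Relation (6), up to the scale factor k_a
W_V_sq = S_vowel / (S_long + 1e-12)

# The peaks of W_V_sq in the speech band are read off as formant candidates.
band = (f > 200.0) & (f < 3500.0)
peak_hz = f[band][np.argmax(W_V_sq[band])]
print("strongest peak of |W_V|^2 in the band, Hz:", peak_hz)

Segment averaging in the Welch estimate already reduces the irregularity mentioned above; the parametric (autoregressive) estimates discussed next go further in the same direction.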

It seems more convenient to use parametric methods of signal processing for this task. In this case the spectral analysis becomes an optimization task: the search for the parameters of a model that bring it as close as possible to the real speech signal [7]. The autoregression and LPC methods are used to detect the coefficients of the model (a sketch of such an estimate is given after the figure captions below). This approach is known to give good results when the spectrum of the signal has distinct peaks and a high-frequency noise part.

These ratios give an opportunity to model the amplitude-frequency characteristics of the source and filter components of the vocal tract. These amplitude-frequency characteristics describe the dynamics of the system and can be used as the starting material for solving the problem of modeling these parts of the vocal tract.

Fig. 3. Spectral density of the speech signal obtained by processing a rather long utterance (5 minutes) of a male voice. The fundamental frequency peak is present; the formant structure is statistically reduced.

Fig. 4. Spectral density of the speech signal obtained by processing a short segment of the stressed vowel /a/.

Fig. 5. Amplitude-frequency characteristic obtained from ratio (6): the filter component transfer function |W_V(jω)| of the stressed vowel /a/. The formant structure is well-defined.

Fig. 6. Diagram of the variations of the first three formants of the stressed vowel /a/ in the word /ina da/. The method allows the formant variations through the vowel to be described.
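A sketch of a parametric spectral estimate of the kind just described (Python/SciPy): linear-prediction coefficients are obtained by the autocorrelation method, the all-pole envelope is evaluated, and its peaks are taken as formant candidates. The model order, frame length, sampling rate and the peak-picking band are illustrative assumptions; applying the same procedure to successive frames across a vowel gives formant tracks of the kind summarised in Fig. 6.

# Sketch of a parametric (autoregressive / LPC) spectral estimate of a vowel frame and
# of simple formant picking from the all-pole envelope. The model order, frame length,
# sampling rate and frequency band are illustrative assumptions.
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC: solve the Toeplitz normal equations for the predictor."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))             # denominator A(z) = 1 - sum(a_k z^-k)

def formant_candidates(a, fs, fmax=4000.0, n_freq=2048):
    """Local maxima of the all-pole envelope 1/|A(e^{j*omega})| below fmax, in Hz."""
    w, h = freqz([1.0], a, worN=n_freq, fs=fs)
    env = 20.0 * np.log10(np.abs(h) + 1e-12)
    peaks = [w[i] for i in range(1, len(w) - 1)
             if env[i] > env[i - 1] and env[i] > env[i + 1] and w[i] < fmax]
    return peaks[:3]                                # first three candidates: F1, F2, F3

# Hypothetical use on a vowel realization sampled at fs (one frame of roughly 30 ms per estimate):
# fs = 16000
# order = 14                                        # a common choice for fs = 16 kHz
# a = lpc_coefficients(vowel_frame, order)
# print(formant_candidates(a, fs))
# Repeating the estimate over successive frames of the vowel yields the variation of
# F1-F3 through the vowel, i.e. the kind of formant ranges reported in Section 3.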

There is no doubt that the obtained frequency characteristic does not only describe the transfer function of the filter component of the vocal tract but also contains some influence of the voice source part. There are two reasons for this. Firstly, in the short segment of one vowel the voice source signal is stronger than in the rather long utterance, where its influence is statistically reduced. Secondly, the fundamental frequency of the vowel segment is well-defined, whereas in the processing of the rather long utterance it is statistically reduced.

3. Formant analysis

The calculations carried out show that, despite some assumptions, the suggested method allows the formant structure of the vowels to be fully described. The amplitude-frequency characteristics of the transfer functions of the parts of the human vocal tract are given as an example. The results of the calculations show that the frequency of each of the first three formants changes during the vowel. The set of these three frequency ranges can be a distinctive feature of the vowel. The calculations support the phenomenon that the same phoneme can be produced by different sets of frequencies [3, 4, 5].

Fig. 7. Table of the frequency ranges of the first three formants (F1, F2, F3) of the stressed vowel /a/ in the contexts /slab j/, /ina da/, /prar valis/, /zahad as iva/ and /nas/.

4. Conclusions

1. The proposed method of describing the human vocal tract differs essentially from the well-known descriptions that use the LF-model. Firstly, it presents the voice source as an independent dynamic part with its own resonant frequencies. Secondly, the co-processing of the acoustic realizations of one person helps to elaborate the methods of discriminating and modeling the transfer functions of the voice source and filter components of the vocal tract.

2. The proposed method gives the opportunity of automatic discrimination of the formant structure of the vowels by processing real speech data.

3. The constructed model of the filter part of the vocal tract fully corresponds to the basic phonetic statements and can be used for solving specific problems of speech technologies, such as automatic speech recognition and the elaboration of high-quality speech synthesis systems.

5. References

1. Bondarko L.V. Phonetics of the Modern Russian Language. SPbU, 1998 (in Russian).
2. Kodzasov S.V., Krivnova O.F. General Phonetics. Moscow, 2001.
3. Fant G. Acoustic Theory of Speech Production. Moscow, 1964 (in Russian).
4. Fant G. The voice source in connected speech. Speech Communication, 1997, v. 22.
5. Fant G., Liljencrants J., Lin Q. A four-parameter model of glottal flow. STL-QPSR, 1-13, 1985.
6. Bondarenko V., Kotsubinski V., Mescheriakov R. Peculiarities of vocal generation at speech synthesis by rules. SPECOM 2004, S.-Pb., 2004.
7. Sorokin V. The Theory of Speech Production. Moscow, 1985 (in Russian).
8. Sorokin V. Speech Synthesis. Moscow, 1992 (in Russian).
9. Besekersky V.A., Popov E.P. Theory of Automatic Control Systems. Moscow, Nauka, 1972.
10. Evdokimova V.V. Selection of the method of human vocal tract model construction // Integral modeling of the sound form of natural languages. SPb, 2005, p. 74-88.
11. Hallahan W.I. DECtalk Software: Text-to-Speech Technology and Implementation // COMPAQ DIGITAL Technical Journal, 1996.
12. Sergienko A.B. Digital Signal Processing. Moscow, 2003.
13. Phonetics of Spontaneous Speech. SPb., 1988.
14. Skrelin P.A. Phonetic Aspects of Speech Technologies. SPb., 1999.
15. Carlson R., Granstrom B., Karlsson I. Experiments with voice modeling in speech synthesis. Speech Communication, 1991, 10, p. 481-489.