
ANALYSIS OF VOICE REGISTER TRANSITION FOCUSED ON THE RELATIONSHIP BETWEEN PITCH AND FORMANT FREQUENCY

Yasufumi Uezu and Tokihiko Kaburagi
Kyushu University, Fukuoka, Japan
3DS146W@s.kyushu-u.ac.jp, kabu@design.kyushu-u.ac.jp

ABSTRACT

When a voice register transition (VRT) occurs, the vocal-fold motion becomes unstable and the voice pitch jumps abruptly. In this article, we examine the relationship between the fundamental frequency f and the first-formant frequency F1 during the VRT to reveal the influence of the source-filter interaction (SFI) on the VRT. Five Japanese male speakers produced rising glissandos with the vowels /a/ and /i/. The vibratory state of the vocal folds and the vocal tract resonances were measured simultaneously with an electroglottograph (EGG) device and an external acoustic excitation method. We analyzed the temporal change of f from the EGG signals and of F1 from the acoustic response signals. The relationship between f and F1 was then analyzed to determine the cause of the VRT and of the abrupt f jump. As a result, f was very close to F1 when the VRT arose in /i/, indicating the influence of the SFI as a cause of the VRT.

Keywords: voice register transition, source-filter interaction

1. INTRODUCTION

Voice register transition (VRT) is the phenomenon in which the voice register suddenly switches from the chest to the falsetto register, with a discontinuous jump in voice pitch, when the pitch is raised gradually from a lower value. Whenever the voice register changes from chest to falsetto or from falsetto to chest, the voice pitch jumps discontinuously, irrespective of how smoothly the vocal fold tension changes. Two mechanisms may cause the voice register transition: one is a change in the tension and the effective vibrating mass of the vocal folds; the other is the acoustic interaction between the voice-source system in the larynx and the acoustic filter of the vocal tract.

The source-filter interaction (SFI) can be interpreted as an extension and generalization of Fant's source-filter theory [2]. The voice-source system and the vocal tract filter in vivo are not independent; they influence each other, and the voice-source system in the larynx is affected by the acoustic load of the vocal tract. This acoustic interaction can make the vocal fold motion unstable, producing acoustically induced vocal fold instabilities. Ishizaka and Flanagan [4] showed the effects of the SFI by using a two-mass model of the vocal folds in a speech-generation simulation. Titze [6] studied the SFI during phonation by simulating belting and high-pitched operatic male singing with a speech-production model. In other studies, vocal fold motion and voice production were simulated while the fundamental frequency was varied in time so that the modal and falsetto registers were connected under the influence of the SFI: Tokuda et al. [7] used a four-mass model of the vocal folds, and Kaburagi [5] performed a computer simulation with a voice-production model that integrated a boundary-layer analysis of the glottal flow and the mechanism of the SFI. Results from these studies suggest that the SFI can cause voice register transition and unstable phonation when the fundamental frequency approaches the first-formant frequency. Thus, both source-induced and acoustically induced instabilities of the in vivo vocal folds are suggested as causes of voice register transition. Zañartu et al. [8] showed acoustically induced instabilities for the vowel /i/ and source-induced instabilities for the vowel /æ/ with one subject performing upward and downward pitch glides. Moreover, they showed that acoustically induced instabilities appeared abruptly and caused a greater frequency jump than source-induced ones. This suggestion, however, has not been confirmed sufficiently, because such experiments have not been conducted with different subjects producing glissandos on a variety of vowels. Furthermore, such measurements are difficult: the high fundamental frequency during voice register transition makes the harmonic components of speech sparse, which hinders the accurate measurement of formant frequencies with speech-signal processing such as linear predictive coding analysis.

In this study, we investigate the relationship between the fundamental and first-formant frequencies to study the influence of the SFI on the voice register transition. We simultaneously measure vocal fold motion and vocal tract resonances while subjects perform glissandos with the vowels /a/ and /i/. In addition, we statistically analyze the fundamental and first-formant frequencies at the voice register transition and the pitch jump width.

2. EXPERIMENT

2.1. Subjects and task

Five Japanese male speakers, untrained in singing techniques, participated in this study. Table 1 shows each subject's overlap range, that is, the pitch range over which he could phonate in both the chest and falsetto registers.

Table 1: Subject number, age, and overlap range (the pitch range over which both the chest and falsetto registers can be phonated).

Subject  Age  Overlap range
S1       27   B3–C5
S2       26   C4–F4
S3       24   A3–F4
S4       23   C4–E5
S5       23   D4–E4

The measurements were performed in a soundproof booth. Each subject was instructed to produce a rising glissando from the chest to the falsetto register, following a chirp signal fed to his ear as a guide sound. The chirp signal was designed so that its instantaneous frequency rose from 100 Hz to 500 Hz in two seconds. Each subject repeated such glissando trials more than twenty times for each of the Japanese vowels /a/ and /i/, while the vibratory state of the vocal folds and the acoustic characteristics of the vocal tract were measured simultaneously to obtain the fundamental frequency and the first-formant frequency.
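The guide chirp can be synthesized in a few lines. The sketch below is only an illustration: it assumes a linear sweep and a 16 kHz sampling rate (matching the excitation signal described in the next section); the exact sweep shape and playback level used in the experiment are not specified here.

import numpy as np
from scipy.signal import chirp

fs = 16_000                        # assumed sampling rate
t = np.arange(0, 2.0, 1.0 / fs)    # two-second glissando guide

# Instantaneous frequency rises from 100 Hz to 500 Hz over two seconds
# (a linear sweep is assumed; only the frequency range and duration are given above).
guide = chirp(t, f0=100.0, t1=2.0, f1=500.0, method="linear")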
2.2. Measurement method

Figure 1: Block diagram of the measurement system used in this study.

Fig. 1 shows the block diagram of the measurement system. The vocal tract acoustic characteristics were measured with the external acoustic vocal tract excitation (EAVE) method described by Epps et al. [1]. The vocal tract has specific acoustic characteristics that comprise the formants. In the EAVE method, the vocal tract is excited by an external excitation signal, such as broadband white noise, which is input through the mouth while the subject is uttering sounds. The acoustic response to the excitation signal is then emitted from the vocal tract together with the subject's own speech, and these signals are recorded by a microphone placed in front of the subject's mouth. Formant frequencies are derived by analyzing the frequency characteristics of the response signal.

The EAVE device used in this study was built from a speaker unit (FF165WK; Fostex) and an exponential horn of 195 mm length connected to a flexible tube of 300 mm length and 7 mm inner radius. The excitation signal was amplified by a power amplifier (TA-V55ES; Sony) and fed to the EAVE device to drive the vocal tract; it then traveled through the vocal tract and radiated from the mouth as the response. A half-inch condenser microphone (Type 4191; Brüel & Kjær), a preamplifier (Type 2669; Brüel & Kjær), and a conditioning amplifier (Nexus 2690; Brüel & Kjær) were used to record the output acoustic signals.

In preparation for the measurements, the excitation signal was generated by a computer as follows. First, an M-sequence signal with a bandwidth from 17 Hz to 6,000 Hz was generated; the sampling frequency was 16,000 Hz. Next, the frequency characteristics of the EAVE device were calibrated. The M-sequence signal was fed into the EAVE device, and the output signal from the flexible tube was recorded by a microphone placed 5 mm away from the tube. The frequency characteristics of the EAVE device, which include those of the speaker, the exponential horn, and the tube, were obtained from this recording. A linear filter having the inverse frequency characteristics of the output signal was then determined with the LPC method so as to cancel the undesired peaks and dips of the device response. Finally, the external excitation signal was generated by filtering the M-sequence signal with this inverse linear filter.
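The calibration step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the M-sequence length, the LPC order, and the placeholder array standing in for the recorded device response are all assumptions.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import max_len_seq, lfilter

fs = 16_000  # sampling frequency of the excitation signal

def lpc(x, order):
    """LPC coefficients [1, a1, ..., ap] via the autocorrelation method."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), -r[1:order + 1])
    return np.concatenate(([1.0], a))

# Broadband excitation: a maximum-length sequence mapped to +/-1.
mls = max_len_seq(13)[0].astype(float) * 2.0 - 1.0

# Placeholder for the microphone recording of the speaker/horn/tube chain driven
# by 'mls'; in practice this would be the measured calibration signal.
device_response = np.random.randn(len(mls))

# All-pole model of the device response; its inverse A(z) is an FIR filter that
# flattens the peaks and dips of the device before the signal enters the vocal tract.
inverse_fir = lpc(device_response, order=30)
excitation = lfilter(inverse_fir, [1.0], mls)

Filtering the M-sequence with the inverse filter pre-compensates the device so that the signal reaching the vocal tract is approximately flat over the band of interest.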

In the experiment, the microphone was set 1 cm away from the outlet of the flexible tube, and approximately 3 cm of the flexible tube was inserted into the subject's mouth. While the subject performed the tasks, the EGG and acoustic signals were recorded simultaneously and stored on the computer. The acoustic signal contained both the vocal tract response to the excitation signal and the subject's own speech. Vocal fold motion was measured as an electric EGG signal by means of an EGG device (Model EG-2; Glottal Enterprises), with a pair of EGG electrodes fixed on both sides of the subject's larynx. The EGG and acoustic signals were acquired by the computer through an audio interface (Fast Track Ultra; M-AUDIO), which was also used to provide the broadband excitation signal to the EAVE device.

2.3. Analysis of the fundamental frequency

The fundamental frequency f was obtained by applying the DECOM method to DEGG signals, as described by Henrich et al. [3]. First, the DEGG signal was generated by filtering the EGG signal with a differentiator filter that attenuated frequency components above its stopband frequency of 7,000 Hz. Glottal closure instants (GCIs) were detected from the positive peaks of the DEGG signal; the interval between adjacent GCIs corresponds to a fundamental period T. Next, the DEGG signal was separated into its positive and negative parts, and T was estimated by computing the autocorrelation of the positive part. Finally, f was calculated as the reciprocal of the estimated T. The length of the Hamming window was set adaptively to four times the T estimated in the previous analysis frame, and the frame shift was set to twice T. If T could not be estimated in the previous frame, the window length and frame shift were set to 40 ms and 5 ms, respectively.
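A rough, self-contained version of this f analysis can be sketched as follows. It simplifies the DECOM procedure described above: a first difference stands in for the designed differentiator filter, the frame length and shift are fixed at 40 ms and 5 ms instead of being adapted to the previous period estimate, and no explicit GCI refinement is performed.

import numpy as np
from scipy.signal import find_peaks

def f0_from_egg(egg, fs, frame_len=0.040, hop=0.005):
    """Frame-wise f estimate from an EGG signal in the spirit of the DECOM method."""
    degg = np.diff(egg, prepend=egg[0])   # first difference as a simple differentiator
    pos = np.maximum(degg, 0.0)           # positive part: glottal-closure peaks
    n, h = int(frame_len * fs), int(hop * fs)
    f0 = []
    for start in range(0, len(pos) - n, h):
        seg = pos[start:start + n] * np.hamming(n)
        ac = np.correlate(seg, seg, mode="full")[n - 1:]   # autocorrelation, lags >= 0
        peaks, _ = find_peaks(ac)          # first peak away from lag 0 gives the period T
        f0.append(fs / peaks[0] if len(peaks) else np.nan)
    return np.array(f0)

In the actual analysis, the window and shift track the previously estimated period (four times and twice T), which keeps the autocorrelation peak well resolved as the pitch rises during the glissando.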

2.4. Analysis of the vocal tract acoustic characteristics and the first-formant frequency

The vocal tract acoustic characteristics were analyzed from the measured acoustic signal. This signal, however, also contained the subject's own speech, an undesired component that had to be eliminated. Cepstrum analysis and a liftering process were therefore applied to the acoustic signal to remove it. First, the logarithm of the power spectrum was calculated from a windowed segment of the acoustic signal, and the cepstral parameters were computed. Next, the vocal tract acoustic characteristics were calculated from the quefrency components below a threshold value. Here, the length of the Hamming window was 30 ms, the frame shift was 5 ms, and the liftering threshold was 2.5 ms. Finally, the temporal pattern of the first-formant frequency was estimated from the vocal tract acoustic characteristics of each frame with a peak-picking method.
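The liftering and peak-picking steps can be sketched per analysis frame as below. The Hamming window and the 2.5 ms liftering threshold follow the description above; the F1 search range and the simple "lowest peak" rule are assumptions made for illustration.

import numpy as np
from scipy.signal import find_peaks

def vocal_tract_envelope(frame, fs, lifter_ms=2.5):
    """Low-quefrency (liftered) log-spectral envelope of one windowed frame."""
    w = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(w)) + 1e-12)
    ceps = np.fft.irfft(log_mag)                 # real cepstrum of the frame
    cut = int(lifter_ms * 1e-3 * fs)             # liftering threshold in quefrency samples
    ceps[cut:len(ceps) - cut] = 0.0              # keep only the low-quefrency part
    return np.fft.rfft(ceps).real                # smoothed log-magnitude spectrum

def first_formant(envelope, fs, fmin=150.0, fmax=1000.0):
    """Estimate F1 as the lowest spectral peak in a plausible range (peak picking)."""
    freqs = np.linspace(0.0, fs / 2.0, len(envelope))
    peaks, _ = find_peaks(envelope)
    candidates = [p for p in peaks if fmin <= freqs[p] <= fmax]
    return freqs[candidates[0]] if candidates else np.nan

Liftering at 2.5 ms discards the rahmonic components associated with fundamental frequencies below roughly 400 Hz, which is how the subject's own voice is suppressed while the formant envelope excited by the EAVE signal is retained.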
3. RESULTS AND DISCUSSION

Figure 2: Temporal variation of the vocal-tract acoustic characteristics from 2 ms before the VRT to 5 ms after the VRT when subject S3 performed a rising glissando with the vowel /a/.

Figure 3: Temporal variation of the vocal-tract acoustic characteristics from 2 ms before the VRT to 5 ms after the VRT when subject S4 performed a rising glissando with the vowel /i/.

Fig. 2 and Fig. 3 show the temporal variation of the vocal-tract acoustic characteristics from 2 ms before the VRT to 5 ms after the VRT. Fig. 2 shows the result for subject S3 performing a rising glissando with the vowel /a/, and Fig. 3 the result for subject S4 with the vowel /i/. In Fig. 2, the peaks near 700 Hz shifted continuously over time, which means that they were F1 of the vowel /a/. In Fig. 3, the peaks near 300 Hz shifted continuously over time, that is, they were F1 of the vowel /i/.

Table 2: Mean and standard deviation of F1 just before the VRT, mean and standard deviation of the pitch f_pre just before and f_post just after the VRT, and the f jump width, for all combinations of subjects and vowels.

Subject  Vowel  N   F1 (Hz)         f_pre (Hz)      f_post (Hz)     Jump width
                    mean    S.D.    mean    S.D.    mean    S.D.    (Cent)
S1       /a/    9   673.5   27.4    330.5   18.6    415.0   1.3     394.2
S2       /a/    7   721.3   15.3    281.1   15.7    333.8   16.7    297.5
S3       /a/    15  698.2   27.6    269.0   13.0    331.8   13.6    363.2
S4       /a/    15  688.8   3.7     288.7   33.0    375.1   27.3    453.2
S5       /a/    4   645.1   27.2    229.7   19.1    279.6   14.1    340.3
S1       /i/    10  270.4   16.4    301.4   18.8    378.0   21.0    392.0
S2       /i/    9   261.4   17.9    281.3   19.3    340.8   19.7    332.2
S3       /i/    15  268.1   15.0    278.3   17.9    366.0   19.9    474.2
S4       /i/    15  256.2   19.4    239.4   1.9     319.0   11.5    497.0
S5       /i/    15  267.5   7.0     272.7   8.7     329.5   19.3    327.6

Table 2 lists the mean and standard deviation of F1 just before the VRT, the mean and standard deviation of the pitch f_pre just before and f_post just after the VRT, and the f jump width for all combinations of subjects and vowels. The f jump width was computed in Cent as

(1)  1200 log2(f_post / f_pre).

It was found that F1 ranged from 640 Hz to 720 Hz for the vowel /a/ and from 250 Hz to 270 Hz for the vowel /i/ across subjects. It was also found that the frequency range in which the f jump occurred was from 200 Hz to 400 Hz, and that the frequency extent of the f jump was from 50 Hz to 90 Hz.

Two different types of relationship between f and F1 at the voice register transition were evident. In one case, f was clearly lower than F1; from Table 2, the data for the vowel /a/ correspond to this case. In the other case, f was very close to F1; this tendency was found for the vowel /i/. In addition, the f jump width for /i/ was from 40 to 100 Cent larger than that for /a/, except for subjects S1 and S5.

From previous studies [5, 6, 7, 8], it is known that the influence of the SFI is particularly strong and causes vocal fold instabilities when f is very close to F1. Such instabilities bring about a greater frequency jump than the instabilities caused by variation of the vocal fold tension. The present results show that the f jump width indeed tended to be larger for the vowel /i/ than for the vowel /a/ in most subjects, so the effect of the SFI appears to depend on the vowel. These experimental results suggest that the voice register transition is caused not only by source-induced instability but also by acoustically induced instability due to the SFI, which intensifies the frequency jump. Hence, the SFI causes voice register transitions in real speech, which supports the previous studies [5, 6, 7, 8].
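As a quick numerical check of Eq. (1) and of the closeness of f to F1 discussed above, the following lines recompute two of the tabulated jump widths and compare the pre-jump pitch with F1, using the means from Table 2.

import math

def jump_width_cents(f_post, f_pre):
    # Eq. (1): pitch jump width in Cent
    return 1200.0 * math.log2(f_post / f_pre)

print(round(jump_width_cents(333.8, 281.1), 1))  # S2, /a/: 297.5 Cent, as in Table 2
print(round(jump_width_cents(319.0, 239.4), 1))  # S4, /i/: 497.0 Cent, as in Table 2

# Proximity of the pre-jump pitch to F1 (Table 2 means):
print(round(281.1 / 721.3, 2))  # S2, /a/: f_pre is about 0.39 of F1 (well below it)
print(round(281.3 / 261.4, 2))  # S2, /i/: f_pre is about 1.08 of F1 (very close to it)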

4. CONCLUSIONS

In this study, we investigated the relationship between the fundamental frequency f and the first-formant frequency F1 at the voice register transition through simultaneous vocal fold and acoustic measurements. f was obtained from the EGG signal with the DECOM method, and F1 was obtained with the EAVE method; cepstral analysis was used to eliminate the subject's own speech. The relationship between the f and F1 values was then analyzed to determine the cause of the voice register transition and of the abrupt f jump. Two patterns of the relationship between f and F1 at the voice register transition were found. Furthermore, f was very close to F1 and the f jump width tended to be larger when the voice register transition took place for the vowel /i/, indicating the influence of the SFI as a cause of the voice register transition.

5. REFERENCES

[1] Epps, J., Smith, J., Wolfe, J. 1997. A novel instrument to measure acoustic resonances of the vocal tract during phonation. Measurement Science and Technology 8(10), 1112–1121.
[2] Fant, G. 1960. Acoustic Theory of Speech Production. The Hague: Mouton.
[3] Henrich, N., d'Alessandro, C., Doval, B., Castellengo, M. 2004. On the use of the derivative of electroglottographic signals for characterization of nonpathological phonation. J. Acoust. Soc. Am. 115(3), 1321–1332.
[4] Ishizaka, K., Flanagan, J. L. 1972. Synthesis of voiced sounds from a two-mass model of the vocal cords. Bell System Technical Journal 51(6), 1233–1268.
[5] Kaburagi, T. 2011. Voice production model integrating boundary-layer analysis of glottal flow and source-filter coupling. J. Acoust. Soc. Am. 129(3), 1554–1567.
[6] Titze, I. R., Worley, A. S. 2009. Modeling source-filter interaction in belting and high-pitched operatic male singing. J. Acoust. Soc. Am. 126(3), 1530–1540.
[7] Tokuda, I. T., Zemke, M., Kob, M., Herzel, H. 2010. Biomechanical modeling of register transitions and the role of vocal tract resonators. J. Acoust. Soc. Am. 127(3), 1528–1536.
[8] Zañartu, M., Mehta, D. D., Ho, J. C., Wodicka, G. R., Hillman, R. E. 2011. Observation and analysis of in vivo vocal fold tissue instabilities produced by nonlinear source-filter coupling: A case study. J. Acoust. Soc. Am. 129(1), 326–339.