VOQUAL Brad Story Dept. of Speech and Hearing Sciences University of Arizona

Physical Modeling of Voice and Voice Quality VOQUAL 2003 Brad Story Dept. of Speech and Hearing Sciences University of Arizona

Acknowledgements NIH R01 DC04789-03

Physical Modeling
1. Voice source: mechanics of vocal fold vibration, pitch control, tremor & vibrato, source-tract interaction.
2. Vocal tract: area function modeling based on volumetric imaging, relation between tract shape and acoustics (static & time-varying cases).

Simple, physiologically relevant control parameters -> Model -> Realistic output signals

Low-dimensional model of vocal fold vibration
[Figure: coronal view of the vocal folds; three-mass model of the cover-body structure of the vocal folds]

Control of Phonation
Control Parameters: normalized activation levels of laryngeal muscles (a_CT, a_TA, P_L)
-> Parameter Transformation ->
Model Parameters: mass, stiffness, damping, length, thickness, depth (k_u, m_u, k_l, m_l, k_b, m_b)

Muscle Activation: Normalized activation levels of the cricothyroid (CT) muscle and the thyroarytenoid (TA) muscle Model Parameters: mass, stiffness, damping, length, thickness, depth.

Mechanics of Cartilage Motion

[Figure: cartilage rest position versus rotation and translation, labeled with lengths L_0, L_1, L_2, the TA muscle, the slip joint, and CT_1 and CT_2]

Assume that the length change due to rotation is larger than that due to translation. Vocal fold strain = fractional length change
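Written out as a formula, the strain definition above reads as follows. The symbol ε is an assumption introduced here for illustration; L_0 is the resting length and L the current vocal fold length, following the figure labels.

```latex
\varepsilon = \frac{L - L_0}{L_0}
\qquad\Longleftrightarrow\qquad
L = L_0\,(1 + \varepsilon)
```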

Vocal fold strain is based on activation levels of the CT and TA muscles (Titze et al., 1988).

Muscle Activation Plot (MAP)
Allows a specific quantity to be plotted as a function of the CT and TA activation levels.

Vocal Fold Length MAP (L_0 = 1.6 cm)
[Figure annotations: maximum length change; lines of constant length; increasing a_TA decreases vocal fold length]
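A minimal sketch of how a length MAP could be tabulated. It assumes a linear strain rule of the form ε = G(R·a_CT − a_TA), in the style of Titze and Story (2002), with any other muscle terms omitted. The values of G and R below are illustrative placeholders and are not taken from the slides; only L_0 = 1.6 cm comes from the slide. The function names are hypothetical.

```python
import numpy as np

L0_CM = 1.6  # resting vocal fold length, from the slide
G = 0.2      # ASSUMED gain of strain per unit activation (illustrative only)
R = 3.0      # ASSUMED CT-to-TA ratio (illustrative only)


def vocal_fold_strain(a_ct, a_ta, g=G, r=R):
    """Assumed linear rule: CT activation lengthens, TA activation shortens."""
    return g * (r * np.asarray(a_ct) - np.asarray(a_ta))


def vocal_fold_length(a_ct, a_ta, l0=L0_CM):
    """Length from strain: L = L0 * (1 + strain)."""
    return l0 * (1.0 + vocal_fold_strain(a_ct, a_ta))


def length_map(n=5):
    """Tabulate length on an n-by-n grid of (a_CT, a_TA) activation levels."""
    levels = np.linspace(0.0, 1.0, n)
    a_ct, a_ta = np.meshgrid(levels, levels, indexing="ij")
    return levels, vocal_fold_length(a_ct, a_ta)


if __name__ == "__main__":
    levels, lengths = length_map()
    # Rows: a_CT level, columns: a_TA level.
    for a_ct, row in zip(levels, lengths):
        print(f"a_CT={a_ct:.2f}: " + " ".join(f"{v:.3f}" for v in row))
```

Contour lines of the resulting grid would correspond to the "constant length" lines on the MAP; with this assumed rule they are straight lines in the (a_CT, a_TA) plane.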

From length change to stress (stiffness) Passive stress-strain curves (based on Alipour and Titze, 1991 & Min et al. 1995)

Stress in the muscle has both a passive component and an active component.
Total muscle stress = passive stress + active stress
Stress is converted to the equivalent three-mass model parameters (based on Titze and Story, JASA, 2002).

Model's Output for a_CT = 0.25, a_TA = 0.30, P_L = 8 cmH2O

Fundamental Frequency (F0) MAP (Tracheal pressure = 8 cmH2O)
Each line represents a continuum of CT and TA activation pairs that produce the same F0.
Note: Stress-strain curves, G, and R are likely to be speaker dependent.

Simulation along the F0 = 115 Hz line

Glottal Airflow at two points along the 115 Hz line

Voice Tremor (Tracheal pressure = 8 cmH2O)
Tremor can be produced by modulating CT or TA activity, or lung pressure.
CT modulation: tremor frequency = 5.2 Hz, extent = 0.25

Change in F0: Multiple routes to achieve a goal

Model of vocal tract shape (Area function)

Static Speech Sounds
1. Vocal tract imaging
2. Characteristics/modifications to the vocal tract relevant to voice quality

Imaging

3-D reconstruction of the vocal tract shape
[Figure labels: soft tissue and bone; vocal tract]
CT images used for demo

CT: Vowel [a], male 1
[Figure labels: lips, mouth, pharynx, piriform sinus, epilarynx, vocal folds, trachea]

CT: Vowel [a], male 2
[Figure labels: leakage into nasal tract, lips, mouth, pharynx, valleculae, piriform sinus, epilarynx, vocal folds, trachea]

CT: Vowel [a], female 1
[Figure labels: lips, mouth, pharynx, valleculae, piriform sinus, epilarynx, vocal folds, trachea]

Speakers*:
1996: Male: 10 vowels, 12 consonants
1998: Female: 10 vowels, 12 consonants
2001: Male and Female, 4 vowels, 4 voice qualities
2002: 3 Females, 11 vowels each
2003: 3 Males, 11 vowels each
*Nasal tract & trachea for all speakers

MRI VT shape inventory for one male speaker
[Figure: vocal tract shapes labeled with phonetic symbols, some lost in extraction: i æ o u r l p t k m n s f]

If the phonetic fonts are not readable on the previous slide, example words are given here that correspond to each vocal tract shape:
heed hid head had hut hot haw hoe hood who earth lead p t k m n sing s shout think f
MRI VT shape inventory for one male speaker

Tube geometry analysis: cross-sectional area

3-D shape -> Area Function
[Figure labels: vocal tract, trachea, glottis]

Tube models of the vocal tract shape

Images Models

Source (glottal flow; vocal fold models, source models) -> Filter -> Output pressure signal

Source spectrum (glottal flow) combined with the filter transfer function = Output pressure spectrum
[Figure labels: fundamental frequency, harmonics, F1, F2]
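The spectral picture on this slide corresponds to the usual linear source-filter relation. A sketch in formula form follows; the symbols U_g, T, and P_out are assumptions introduced here and do not appear on the slide, and source-tract interaction is ignored.

```latex
P_{\mathrm{out}}(f) = U_g(f)\,T(f)
\qquad\Longrightarrow\qquad
20\log_{10}\lvert P_{\mathrm{out}}(f)\rvert
  = 20\log_{10}\lvert U_g(f)\rvert + 20\log_{10}\lvert T(f)\rvert
```

Here U_g(f) is the source spectrum (harmonics at multiples of the fundamental frequency) and T(f) is the filter transfer function, whose peaks are the formants F1, F2, and so on. In decibels the two contributions simply add.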

Where to from here?
Vocal tract modifications, voice quality, vowel quality, source-tract interactions, etc.
Time-varying (dynamic) vocal tract shape to produce connected speech
Generate stimuli for perceptual experiments

Contributions of the Vocal Tract to Voice Quality
Large deformations of the vocal tract shape move F1 and F2 for appropriate vowel identification.
[Figure labels: phonetic/voice quality; vowel space]

Upper formant frequencies may carry information concerning timbre.
[Figure labels: phonetic/voice quality; voice quality (timbre)]

Example: Transformation of a speaker into a singer by creating a Singing Formant
[Figure label: epilarynx]
Nasal leakage and piriform sinuses are ignored for this example.

Singing Formant (Sundberg, 1974) - Cluster of upper formant frequencies whose purpose is to enhance the harmonic amplitudes near 3000 Hz. From Sundberg (Science of the Singing Voice)

Conditions for a Singing Formant:
1. Need a tube-like epilarynx that produces a resonance in the 2800-4000 Hz range.
2. Cross-sectional area of the epilarynx tube should be about 6 times smaller than the lowest part of the pharynx (i.e. 6:1 ratio).
L_e = 2 cm, A_p = 3 cm², A_e = 0.5 cm²

Approximate closed-open epilarynx tube: approx. 4375 Hz
[Figure: frequency response with F4 and F5 marked]

What would this person sound like as a singer?
All simulated sounds are produced with:
1. Parametric glottal area model based on Rosenberg (1973). Simple aerodynamic equations determine glottal flow.
2. Wave propagation through the vocal tract computed with a wave-reflection (Liljencrants, 1984) or digital waveguide (Smith, Stanford) approach.
3. Losses due to yielding walls, viscosity, and radiation are included.
4. Tracheal area function included.
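A minimal sketch of a Rosenberg-style pulse, used here as a generic glottal area waveform. It uses one commonly cited trigonometric form (raised-cosine opening, quarter-cosine closing). How the talk actually parameterized its glottal area model is not stated, so the function name, the opening and closing fractions, and the sampling rate are all assumptions for illustration.

```python
import numpy as np


def rosenberg_pulse(f0_hz, fs_hz, open_frac=0.40, close_frac=0.16, peak=1.0):
    """One period of a trigonometric Rosenberg-style pulse.

    open_frac and close_frac are the ASSUMED fractions of the period spent
    opening and closing; the remainder of the period is fully closed (zero).
    """
    period = int(round(fs_hz / f0_hz))
    n_open = int(round(open_frac * period))
    n_close = int(round(close_frac * period))
    pulse = np.zeros(period)
    n = np.arange(n_open)
    # Opening phase: raised cosine rising from 0 toward the peak.
    pulse[:n_open] = 0.5 * (1.0 - np.cos(np.pi * n / n_open))
    m = np.arange(n_close)
    # Closing phase: quarter cosine falling from the peak toward 0.
    pulse[n_open:n_open + n_close] = np.cos(np.pi * m / (2.0 * n_close))
    return peak * pulse


if __name__ == "__main__":
    # 115 Hz matches the F0 line simulated earlier in the talk.
    one_period = rosenberg_pulse(f0_hz=115.0, fs_hz=44100.0)
    train = np.tile(one_period, 10)  # ten identical periods
    print(len(one_period), train.shape, one_period.max())
```

In a full simulation this waveform would scale a maximum glottal area, and the aerodynamic equations mentioned in item 1 would convert area to flow; that step is not shown here.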

[Figure panels: fundamental frequency (F0) contour; amplitude contour (glottal area)]

Singer's Formant too high?
[Figure: F4 and F5 marked]

Attempt to lower the Singing Formant by lengthening the epilarynx tube (usually by lowering the larynx).
L_e = 3 cm: approx. 2916 Hz

Build the formant cluster with three formants instead of two.
Need to modify cross-sectional areas. Modification is guided by sensitivity functions (Fant and Pauli, 1974).
Sensitivity functions indicate the possible change in each formant frequency due to a small perturbation of cross-sectional area along the distance of the VT.
KE = kinetic energy, PE = potential energy

To get F3, F4, and F5 clustered together, F5 needs to decrease in frequency.
An iterative minimization technique was used that modified the area function based on sensitivity functions until the desired formants were achieved.
[Figure: F3, F4, F5 marked]

[Figure: original with lengthened epilarynx versus new modification; F3, F4, F5 marked]
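The two frequencies quoted for the epilarynx tube are consistent with the quarter-wavelength resonance of a uniform tube closed at one end and open at the other, F = c / (4 L). A small sketch follows; the speed of sound of 35000 cm/s is an assumption, chosen because it reproduces the slide values, and the function name is hypothetical.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # ASSUMED; reproduces the values on the slides


def quarter_wave_resonance(length_cm, mode=1, c_cm_s=SPEED_OF_SOUND_CM_S):
    """Resonance of a uniform closed-open tube: F = (2*mode - 1) * c / (4 * L)."""
    return (2 * mode - 1) * c_cm_s / (4.0 * length_cm)


if __name__ == "__main__":
    for length_cm in (2.0, 3.0):
        f = quarter_wave_resonance(length_cm)
        in_range = 2800.0 <= f <= 4000.0
        print(f"L_e = {length_cm} cm -> {f:.1f} Hz, within 2800-4000 Hz: {in_range}")
    # Area ratio check for condition 2: A_p / A_e = 3 / 0.5
    print("pharynx-to-epilarynx area ratio:", 3.0 / 0.5)
```

With L_e = 2 cm the resonance falls above the 2800-4000 Hz target range, and lengthening to 3 cm brings it inside the range, which is the motivation for lowering the larynx.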
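The slide does not give the equations. A commonly used formulation of sensitivity functions, stated here as an assumption about the notation, defines for formant n and tube section i:

```latex
S_n(i) = \frac{KE_n(i) - PE_n(i)}{TE_n},
\qquad
TE_n = \sum_{i}\bigl[KE_n(i) + PE_n(i)\bigr]
```

```latex
\frac{\Delta F_n}{F_n} \approx \sum_{i} S_n(i)\,\frac{\Delta A(i)}{A(i)}
```

Under this convention, enlarging the area where S_n is positive (kinetic energy dominates) raises F_n, and enlarging it where S_n is negative (potential energy dominates) lowers F_n. The approximation holds only for small area perturbations, which is why an iterative procedure is needed for large formant shifts.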

Example: move cluster down in frequency Example: move cluster up in frequency

Example: detune the cluster

Summary
[Figure labels: F1, F2, F3, F4, F5; speech]

Dynamic Speech (Real Speech!)

Control of Vocal Tract Shape
Control Parameters: coefficients of orthogonal shaping functions, location and degree of consonantal constriction, length variation
-> Parameter Transformation ->
Vocal Tract Area Function (glottis to lips)
[Figure symbols: q_1, q_2, l_c, s_c]

Parametric representation of the area function: Principal Components Analysis
Similar approaches:
Meyer, P., Wilhelms, R., & Strube, H. W. (1989). A quasiarticulatory speech synthesizer for German language running in real time, J. Acoust. Soc. Am., 86(2), 523-539.
Harshman, R., Ladefoged, P., & Goldstein, L. (1977). Factor analysis of tongue shapes, J. Acoust. Soc. Am., 62(3), 693-707.
Maeda, S. (1990). Compensatory articulation during speech: evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model. In Speech Production and Speech Modeling, W.J. Hardcastle and A. Marchal, eds., 131-149.
Ru, P., Chi, T., & Shamma, S. (2003). The synergy between speech production and perception, J. Acoust. Soc. Am., 113, 498-515.

10 vowels

10 vowel area functions -> Convert areas to equivalent diameters & normalize length -> Principal Components Analysis
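The pipeline on this slide can be sketched as follows. This is an illustration on synthetic data, not the speaker data from the talk: the ten "vowel" area functions are generated from two made-up shapes, and length normalization is assumed to have been done already by resampling every area function to the same number of sections. All function names are hypothetical.

```python
import numpy as np


def area_to_diameter(area_cm2):
    """Equivalent diameter of a circular section: d = sqrt(4 A / pi)."""
    return np.sqrt(4.0 * np.asarray(area_cm2, dtype=float) / np.pi)


def pca_modes(areas_cm2, n_modes=2):
    """PCA on equivalent diameters. Rows are vowels, columns are sections."""
    d = area_to_diameter(areas_cm2)
    omega = d.mean(axis=0)                  # mean (neutral) diameter function
    resid = d - omega
    _, s, vt = np.linalg.svd(resid, full_matrices=False)
    phi = vt[:n_modes]                      # orthonormal modes
    q = resid @ phi.T                       # per-vowel mode coefficients
    explained = (s ** 2 / np.sum(s ** 2))[:n_modes]
    return omega, phi, q, explained


def reconstruct_area(omega, phi, q):
    """A = pi/4 * (omega + sum_k q_k * phi_k)^2 for each vowel."""
    return (np.pi / 4.0) * (omega + q @ phi) ** 2


def synthetic_areas(n_vowels=10, n_sections=44, seed=0):
    """SYNTHETIC stand-in for measured area functions (not real data)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_sections)
    base = 1.8 + 0.0 * x
    shapes = np.vstack([np.sin(np.pi * x), np.cos(2.0 * np.pi * x)])
    coeffs = rng.uniform(-0.4, 0.4, size=(n_vowels, 2))
    return (np.pi / 4.0) * (base + coeffs @ shapes) ** 2


if __name__ == "__main__":
    areas = synthetic_areas()
    omega, phi, q, explained = pca_modes(areas)
    print("variance explained by two modes:", explained.sum())
    print("max reconstruction error (cm^2):",
          np.abs(reconstruct_area(omega, phi, q) - areas).max())
```

Because the synthetic data are built from exactly two shapes, two modes reconstruct them essentially perfectly; with measured area functions the two leading modes would capture only part of the variance.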

Mode Weights
Frequency response of $\frac{\pi}{4}\,\Omega^2(x)$

[Figure panels: q_2 vs. q_1; F2 vs. F1]

Articulatory-to-Acoustic Mapping: Coefficient Space -> F1-F2 Space

Transformation of formant frequencies to time-varying commands for deforming the tube shape
Example: Ohio

Time-varying area function:
$V(x,t) = \frac{\pi}{4}\left[\Omega(x) + q_1(t)\,\phi_1(x) + q_2(t)\,\phi_2(x)\right]^2$
[Audio examples: original; simulation; flared epilarynx]

Area function model
$V(x,t) = \frac{\pi}{4}\left[\Omega(x) + q_1(t)\,\phi_1(x) + q_2(t)\,\phi_2(x)\right]^2$
$\Omega(x)$: Speaker-specific: contains properties and/or settings unique to the speaker? (e.g. Laver, 1980)
$q_1(t)\,\phi_1(x) + q_2(t)\,\phi_2(x)$: Common across speakers?? Superimposed on the underlying $\Omega(x)$

$V(x,t) = \frac{\pi}{4}\left[q_1(t)\,\phi_1(x) + q_2(t)\,\phi_2(x) + \Omega(x)\right]^2$
Example: Ohio
Substitute a different neutral shape
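A minimal sketch of evaluating the area function model and substituting a different neutral shape while keeping the same time-varying coefficients. The neutral shape, the modes, and the coefficient trajectories below are made-up placeholders, not the speaker data from the talk; the function names are hypothetical, and index 0 is assumed to be the glottal end of the tract.

```python
import numpy as np


def area_function(omega, phi1, phi2, q1, q2):
    """V(x,t) = pi/4 * [Omega(x) + q1(t) phi1(x) + q2(t) phi2(x)]^2.

    omega, phi1, phi2: arrays over x (N sections).
    q1, q2: arrays over t (T frames). Returns an array of shape (T, N).
    """
    omega = np.asarray(omega, dtype=float)[None, :]
    phi1 = np.asarray(phi1, dtype=float)[None, :]
    phi2 = np.asarray(phi2, dtype=float)[None, :]
    q1 = np.atleast_1d(np.asarray(q1, dtype=float))[:, None]
    q2 = np.atleast_1d(np.asarray(q2, dtype=float))[:, None]
    diameter = omega + q1 * phi1 + q2 * phi2
    return (np.pi / 4.0) * diameter ** 2


def demo_shapes(n_sections=44):
    """PLACEHOLDER neutral shape and modes (illustrative, not measured)."""
    x = np.linspace(0.0, 1.0, n_sections)
    omega = 1.5 + 0.3 * np.sin(np.pi * x)
    phi1 = np.sin(np.pi * x)
    phi2 = np.cos(2.0 * np.pi * x)
    return x, omega, phi1, phi2


if __name__ == "__main__":
    x, omega, phi1, phi2 = demo_shapes()
    t = np.linspace(0.0, 1.0, 5)
    q1 = 0.3 * np.sin(2.0 * np.pi * t)   # placeholder coefficient trajectories
    q2 = 0.2 * np.cos(2.0 * np.pi * t)

    original = area_function(omega, phi1, phi2, q1, q2)

    # Substitute a different neutral shape: narrow the sections nearest index 0.
    omega_mod = omega.copy()
    omega_mod[:8] *= 0.7
    modified = area_function(omega_mod, phi1, phi2, q1, q2)

    print(original.shape, modified.shape)
    print("area change at section 0, frame 0:", modified[0, 0] - original[0, 0])
```

The design point is that q_1(t) and q_2(t) carry the phonetic movement while Ω(x) carries the setting, so swapping Ω(x) changes the quality of the whole utterance without re-deriving the coefficient trajectories.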

[Audio examples: original; modified]

BrianNormal5.wav

$V(x,t) = \frac{\pi}{4}\left[\Omega(x) + q_1(t)\,\phi_1(x) + q_2(t)\,\phi_2(x)\right]^2$
Voice Source: Glottal area model based on Rosenberg's flow model.
[Audio examples: original recording; area function synthesis; fricatives from original recording]

Modification of Voice Quality: pharyngealized
Modify $\Omega(x)$ to be constricted in the pharynx and expanded in the oral cavity.

Modification of Voice Quality: twangy
Modify $\Omega(x)$ to be slightly constricted in the middle part of the tract and expanded at the lips.

Modification of Voice Quality: velarized
Modify $\Omega(x)$ to be slightly constricted in the middle part of the tract.

BrianClos1 BrianSmil1

The End