Frequency shifts and vowel identification

Peter F. Assmann (School of Behavioral and Brain Sciences, Univ. of Texas at Dallas, Box 830688, Richardson TX 75083)
Terrance M. Nearey (Dept. of Linguistics, University of Alberta, Edmonton, Alberta, Canada T6E 2G2)

Introduction

Listeners can understand frequency-shifted speech across a wide frequency range (Fu & Shannon, 1999). We hypothesize that this ability can be explained in terms of listeners' sensitivity to statistical variation across talkers in natural speech. The aims of the present study were:

1. To study the effects of frequency shifts on the identification of vowels spoken by 2 men, 2 women and 2 children (age 7).
2. To test the predictions of a model of vowel perception that incorporates measures of fundamental frequency (F0) and formant frequencies (FF) associated with size differences in larynx and vocal tract across talkers.

Co-variation of formant frequencies and F0 in natural speech

Figure: co-variation of F0 and mean log FF for more than 3000 vowels in hvd words (Assmann & Katz, 2000). Mean log FF: geometric mean of the formant frequencies F1, F2, F3.
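As a rough illustration of the mean log FF measure, the sketch below computes the geometric mean of F1, F2 and F3 for a single vowel token by averaging their logarithms. The function name and the formant values are hypothetical and chosen only for illustration; they are not measurements from Assmann & Katz (2000).

```python
import math

def geometric_mean_ff(f1_hz, f2_hz, f3_hz):
    """Geometric mean of three formant frequencies, via the mean of their logs."""
    mean_log_ff = (math.log(f1_hz) + math.log(f2_hz) + math.log(f3_hz)) / 3.0
    return math.exp(mean_log_ff)

# Hypothetical formant values (Hz) for one vowel token.
example_ff = geometric_mean_ff(500.0, 1500.0, 2500.0)
print(round(example_ff, 1))
```

Because the measure is an average on a log scale, multiplying all three formants by a common scale factor multiplies the geometric mean by the same factor.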

Pattern recognition model

Hillenbrand & Nearey (1999) dual-target model.
Parameters: duration, mean F0, and F1, F2, F3 sampled at the 20% and 80% points.
Training data: 3000+ vowels spoken by 10 men, 10 women and 30 children from the N. Texas region (Assmann & Katz, 2000).
A posteriori probabilities were derived from linear discriminant analysis for each stimulus vowel.
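The sketch below shows one way a posteriori probabilities can be obtained from linear discriminant analysis: class means and a pooled covariance matrix are estimated from training vectors, and posteriors follow from the linear discriminant scores. It is a generic illustration, not the authors' implementation. The 8-dimensional feature vectors stand in for duration, mean F0 and the six formant samples, and all data here are randomly generated placeholders rather than speech measurements.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate class means, priors, and the pooled within-class covariance."""
    classes, idx = np.unique(y, return_inverse=True)
    means = np.array([X[idx == k].mean(axis=0) for k in range(len(classes))])
    priors = np.array([(idx == k).mean() for k in range(len(classes))])
    centered = X - means[idx]
    pooled_cov = centered.T @ centered / (len(X) - len(classes))
    return classes, means, priors, pooled_cov

def posterior_probabilities(X, means, priors, pooled_cov):
    """A posteriori class probabilities under equal-covariance Gaussian classes."""
    inv_cov = np.linalg.inv(pooled_cov)
    # Linear discriminant score: x' S^-1 m - 0.5 m' S^-1 m + log(prior)
    scores = (X @ inv_cov @ means.T
              - 0.5 * np.sum(means @ inv_cov * means, axis=1)
              + np.log(priors))
    scores -= scores.max(axis=1, keepdims=True)  # guard against overflow in exp
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)

# Placeholder training data: 3 hypothetical vowel categories, 8 features each.
rng = np.random.default_rng(0)
class_centers = rng.normal(scale=2.0, size=(3, 8))
X_train = np.vstack([rng.normal(loc=c, size=(100, 8)) for c in class_centers])
y_train = np.repeat(np.arange(3), 100)

classes, means, priors, pooled_cov = fit_lda(X_train, y_train)
posteriors = posterior_probabilities(X_train, means, priors, pooled_cov)
predicted = classes[np.argmax(posteriors, axis=1)]
```

Each row of `posteriors` sums to 1, so a predicted identification rate for a stimulus can be read off as the probability assigned to its intended category.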

Frequency shifts and vowel identification

In a previous study (Assmann, Nearey & Scott, 2002) we confirmed that upward shifts in F0 or formant frequencies (FF) resulted in lower vowel identification accuracy. However, combining upward shifts in F0 with upward shifts in FF led to improved identification accuracy. The finding that vowel identification accuracy is higher with coordinated shifts in F0 and FF is well predicted by the pattern recognition model of vowel identification, and supports the idea that listeners are sensitive to the pattern of co-variation of F0 and FF in natural speech.

Vowel Identification Accuracy

Figure: identification accuracy (%) as a function of spectrum envelope scale factor (1.00 to 2.00) for F0 scale factors x1, x2 and x4. Two panels: means and standard errors of 11 listeners; predicted means (Assmann, Nearey, and Scott, ICSLP 2002).

Vowel Identification Experiment

The present study examined effects of upward and downward frequency shifts on vowel identification.
Stimuli: 11 vowels (/i/, / /, /e/, / /, /æ/, / /, / /, / /, /o/, / /, /u/) in hvd context spoken by 3 men, 3 women, and 3 children from the N. Texas region.
Upward and downward frequency shifts were introduced by means of the STRAIGHT vocoder (Kawahara, 1997).

STRAIGHT vocoder

High-resolution analysis of the time-varying spectrum envelope
Wavelet-based instantaneous-frequency F0 extraction
Spectrum envelope (FF) scaling
Fundamental frequency (F0) scaling
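The two scaling operations can be pictured with a minimal sketch: spectrum envelope scaling as a warp of the frequency axis, and F0 scaling as multiplication of the F0 contour by a constant. This is not STRAIGHT's own code or API; the function names, the linear interpolation, and the toy single-peak envelope are assumptions made for illustration.

```python
import numpy as np

def scale_envelope(freqs_hz, envelope_db, ff_scale):
    """Warp an envelope so that a peak at frequency f moves to ff_scale * f."""
    # The scaled envelope at f takes the original value at f / ff_scale.
    # np.interp holds the edge value where f / ff_scale exceeds the analysis range.
    source_freqs = freqs_hz / ff_scale
    return np.interp(source_freqs, freqs_hz, envelope_db)

def scale_f0(f0_contour_hz, f0_scale):
    """Multiply an F0 contour by a constant; frames coded as 0 Hz stay at 0."""
    return np.asarray(f0_contour_hz, dtype=float) * f0_scale

# Toy envelope with a single peak at 1000 Hz (hypothetical, not a real vowel).
freqs = np.linspace(0.0, 8000.0, 801)
envelope = -((freqs - 1000.0) / 200.0) ** 2
shifted = scale_envelope(freqs, envelope, ff_scale=1.5)
peak_hz = freqs[np.argmax(shifted)]  # location of the peak after scaling

raised_f0 = scale_f0([100.0, 0.0, 120.0], f0_scale=2.0)
```

With a scale factor of 1.5 the toy peak moves from 1000 Hz to 1500 Hz, which is the sense in which the scale factor shifts formant frequencies upward.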

Scale Factors

FF scale factors: 0.6, 0.8, 1.0, 1.5, 2.0
F0 scale factors: 0.5, 1.0, 4.0

For females and children, downward shifts tend to produce male-like voices; for adults, upward shifts tend to be heard as child-like voices.

Method

Listeners were 14 Psychology undergraduates participating for partial course credit. Since the majority had no phonetics training, they first completed 3 practice sets:

Set 1: passive listening with feedback (24 resynthesized but not frequency-shifted vowels; no response required).
Set 2: practice identification (a different set of 24 vowels presented for identification; repeated until a score of 21/24 or better was obtained).
Set 3: passive listening with feedback (24 frequency-shifted vowels; shift factors randomly chosen from the 15 conditions of the experiment; no response required).

Method

Main experiment: 990 syllables (11 vowels x 2 talkers per group x 3 talker groups x 3 F0 scale factors x 5 FF scale factors). All conditions were randomly interspersed. Vowels were presented diotically over headphones in a double-walled sound booth. Listeners identified the vowels using an 11-button response box displayed on a computer screen, with each button labeled with a keyword for the vowel category.
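The trial count follows from the factorial design, and the sketch below enumerates it as a check on the arithmetic. The vowel labels and talker indices are placeholders, the scale factors are those reported for the experiment, and the fixed random seed is an assumption added only to make the shuffled order reproducible.

```python
import random
from itertools import product

vowels = [f"vowel_{i}" for i in range(1, 12)]  # 11 categories (placeholder labels)
talker_groups = ["men", "women", "children"]
talkers_per_group = [1, 2]                     # placeholder talker indices
f0_scales = [0.5, 1.0, 4.0]
ff_scales = [0.6, 0.8, 1.0, 1.5, 2.0]

# Every combination of vowel, group, talker, F0 scale and FF scale is one trial.
trials = list(product(vowels, talker_groups, talkers_per_group, f0_scales, ff_scales))
shift_conditions = list(product(f0_scales, ff_scales))

# All conditions are interspersed in a single randomized order.
random.Random(0).shuffle(trials)

print(len(trials), len(shift_conditions))  # 990 15
```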

Effects of FF shifts

Figure: identification accuracy (%) as a function of spectrum envelope scale factor (0.6, 0.8, 1, 1.5, 2) for men, women and children.

Interaction of FF and F0 shifts

Figure: identification accuracy (%) as a function of spectrum envelope (FF) scale factor (0.6, 0.8, 1.0, 1.5, 2.0), in separate panels for men, women and children, with F0 scale factor (x 0.5, x 1.0, x 4.0) as the parameter.

For men's vowels, accuracy is higher when an upward shift in FF is accompanied by an upward shift in F0.
For women and children, there is a recovery from downward shifts in FF when F0 is also shifted down.

Conclusions

Identification accuracy drops significantly when vowels are shifted upward in formant frequency by a factor of 1.5 or more, or downward by a factor of 0.6 or less.
Adult males are less susceptible to upward shifts than females and children, while children are less affected by downward shifts.
In several conditions, the drop in intelligibility was reduced by combining formant shifts with corresponding changes in F0.
Pattern recognition models predicted the effects of frequency shifts on vowel identification, including the synergistic link between F0 and formant frequency.
A plausible account is that learned relationships between F0 and spectral envelope cues are responsible for this interaction.

References

1. Assmann, P.F., & Katz, W.F. (2000). Time-varying spectral change in the vowels of children and adults. J Acoust Soc Am. 108(4): 1856-1866.
2. Assmann, P.F., Nearey, T.M., & Scott, J.M. (2002). Modeling the perception of frequency-shifted vowels. Proceedings of the 7th International Conference on Spoken Language Processing, pp. 425-428.
3. Fu, Q-J., & Shannon, R.V. (1999). Recognition of spectrally degraded and frequency-shifted vowels in acoustic and electric hearing. J Acoust Soc Am. 105: 1889-1900.
4. Hillenbrand, J.M., & Nearey, T.M. (1999). Identification of resynthesized /hvd/ utterances: effects of formant contour. J Acoust Soc Am. 105(6): 3509-3523.
5. Kawahara, H. (1997). Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited. Proc. IEEE Int. Conf. on Acoustics, Speech & Signal Processing (ICASSP '97), vol. 2, pp. 1303-1306.

Figure: average correct ID per synthetic voice. Legend: Lowered Male, Base Male, Raised Male, Lowered F&C, Base F&C, Raised F&C.

Figure: basketball plot with legend.

Disk and spoke plot: Disks = Observed ID

The colored disks represent listeners' correct identification rate.
Blue: male speakers' synthesized voices (scaled and unscaled); red: female speakers; green: child speakers.
The position of the center of each disk indicates the average F0 and formant frequencies of the voice.
The area of each disk is proportional to listeners' average percent correct identification for that voice.
The circles in the legend box indicate the correct identification rate.

Disk and spoke plot: Spokes = Predicted ID

The length of the spokes indicates the ID rate predicted by LDFA.
The model was trained on the natural measurements of Assmann and Katz; predictions were made on the scaled values of this experiment.
Patterns:
Accurate predictions: "basketballs", where spoke length matches disk radius.
Under-predictions: asterisks inside disks; listeners do better than the model.
Over-predictions: spiked disks; the model does better than listeners.

Figure labels: bad voice, good voice.

Percent correct ID is well predicted by a smooth function of mean F0 and mean FF, which accounts for 88% of the variance.