Intra-speaker variation and units in human speech perception and ASR


SRIV - ITRW on Speech Recognition and Intrinsic Variation
May 20, 2006, Toulouse

Intra-speaker variation and units in human speech perception and ASR

Richard Wright
University of Washington, Dept. of Linguistics
rawright@u.washington.edu

Talk Outline
- Word recognition task
- 2 types of variation
  - Sources of inter-speaker variation
  - Sources of intra-speaker variation
- Human speech perception and variation
- Importance of features for perception
- Implications for ASR

Word recognition task: spontaneous speech
Buckeye corpus:
- The Ohio State University Depts. of Psychology, Computer Science, and Linguistics collaboration
- Conversational speech (informal interviews)
- high quality recordings
- 40 speakers, all from Columbus, Ohio (2 accent groups)
- stratified for age (under 30, over 40) and sex
- class not controlled for
- orthographically transcribed and phonetically labeled
- freely available (with reasonable restrictions)
- http://buckeyecorpus.osu.edu/

Word recognition task: spectrograms of increasingly long excerpts (frequency axis 0-5000 Hz)
- single word: 0-127 ms
- two words: 0-149 ms
- three words: 0-240 ms
- four words: 0-682 ms
- six words: 0-1276 ms
- whole sentence: 0-2035 ms

Word recognition task: summary
The task demonstrates 3 aspects of human speech perception and word recognition that are still difficult for ASR to emulate:
1) Humans are able to use partial information to entertain a set of possible word candidates simultaneously without introducing confusions
2) Humans can recover gracefully from errors
3) Humans adapt dynamically to variation
The task also demonstrates that humans use a combination of top-down and bottom-up strategies in recognizing words

Types of variation
Important advances in speech perception research:
- Variation in input is an integral part of perceptual category formation: variation is information, not noise [1]
- Inter-speaker and intra-speaker variation are quite different in their causes and in their acoustic characteristics

Inter-speaker variation
Results from two types of factors:
- Physiologic and anatomic factors [2] [3]
  - size of vocal tract
  - vocal fold mass and morphology
  - mass and movement characteristics of articulators
- Social and experiential factors [4]
  - gender (as opposed to sex)
  - regional accent
  - class affiliation
  - native language, dialect, exposure to other languages

Inter-speaker variation
Example: sex and gender
- differences in male and female speech are the results of both physiologic (sex) and sociologic (gender) factors
- While male-female acoustic differences are predicted from vocal tract differences, they:
  - emerge in children's speech well before the onset of puberty produces differences in vocal tract size [5]
  - are greater than predicted by vocal tract differences [6]
  - vary systematically by language [6]

Inter-speaker variation
- Largely static over the duration of a conversation
  - Talkers generally don't change their gender, age, accent over the duration of a conversation
- Most of the dimensions are not unique to a single speaker but represent large sections of the population
  - Larger corpora with appropriate samples of the population have brought improvements
  - Better language models have also brought improvements (more appropriate phone sets, multiple word pronunciations, etc.) [e.g. 7]

Inter-speaker variation
In speech perception:
- we are able to understand a wide variety of accents that we have no experience with as long as they are similar to ones we know
- the greater the similarity to speakers we are familiar with, the lower the latency and the higher the accuracy
- an abrupt change from one talker to the next (even within accents) [8]

Inter-speaker variation
In speech perception:
- as our experience with intra-speaker variables decreases (or as environmental noise increases) we rely on a coarser coding of the input
- implies a featural rather than strictly phone-based lexical representation [9] [10]
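One way to picture "coarser coding" is to collapse each phone into a broad class and see which words become indistinguishable. The sketch below is only an illustration: the broad classes, the phone labels, and the toy lexicon are all assumptions made here, not material from the talk or from the cited studies.

```python
from collections import defaultdict

# Hypothetical mapping from phones to broad manner classes (illustrative only).
BROAD_CLASS = {
    "p": "STOP", "b": "STOP", "t": "STOP", "d": "STOP", "k": "STOP",
    "m": "NASAL", "n": "NASAL",
    "s": "FRIC", "z": "FRIC",
    "ae": "V",
}

# Hypothetical toy lexicon: word -> phone sequence.
TOY_LEXICON = {
    "pat": ["p", "ae", "t"],
    "bad": ["b", "ae", "d"],
    "cat": ["k", "ae", "t"],
    "man": ["m", "ae", "n"],
    "sat": ["s", "ae", "t"],
}


def coarse_code(phones):
    """Replace every phone by its broad class."""
    return tuple(BROAD_CLASS[p] for p in phones)


def equivalence_classes(lexicon):
    """Group words that share the same coarse code."""
    groups = defaultdict(list)
    for word, phones in lexicon.items():
        groups[coarse_code(phones)].append(word)
    return {code: sorted(words) for code, words in groups.items()}


if __name__ == "__main__":
    for code, words in equivalence_classes(TOY_LEXICON).items():
        print(code, words)
```

Under this coarse code "pat", "bad", and "cat" fall into one class, so a listener relying on it would need context or later evidence to choose among them.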

Intra-speaker variation
Multiple factors
- sociolinguistic [4]
  - style shifts
    - task: spontaneous speech, read sentences, etc.
    - attitude of the speaker to the audience
  - accent shifts
    - as group affiliation shifts, the accent may as well

Intra-speaker variation
Continuous relationship between formality of the task and reduction in speech [11]
least formal, most reduced
- casual conversation with a friend
- conversation with a stranger
- interviews
- formal speaking
- read texts
- read words in isolation
most formal, most hyperarticulated

Intra-speaker variation
Multiple factors
- Information based: the more predictable, the more reduced
  - discourse: as a word's information load decreases, it becomes more reduced [12] [13]
    - first introduced into the discourse, least reduced
    - focus construction, less reduction
  - lexical: base probabilities
    - function words bear a much lower informational load than content words
    - word frequency/familiarity
    - confusability (perceptual similarity to other words)
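The lexical "base probability" idea can be made concrete with surprisal, -log2 p(word), which is one common way to quantify informational load. This is a minimal sketch under stated assumptions: the unigram model and the counts are invented for illustration and are not taken from the talk or from the Buckeye corpus.

```python
import math

# Hypothetical toy counts, invented for illustration only.
TOY_COUNTS = {"the": 500, "of": 300, "and": 150, "speech": 40, "spectrogram": 10}


def surprisal_bits(word, counts):
    """Return -log2 p(word) under a unigram model estimated from counts."""
    total = sum(counts.values())
    return -math.log2(counts[word] / total)


def rank_by_information(counts):
    """Return the words sorted from lowest to highest information load."""
    return sorted(counts, key=lambda w: surprisal_bits(w, counts))


if __name__ == "__main__":
    for w in rank_by_information(TOY_COUNTS):
        print(f"{w:12s} {surprisal_bits(w, TOY_COUNTS):.2f} bits")
```

In this toy table the frequent function words carry the fewest bits, which matches the slide's claim that they are the most likely to be reduced. Discourse-level predictability would need a context-dependent model, which this sketch does not attempt.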

Intra-speaker variation
Multiple factors
- Information based factors interact with levels of formality
  - at any one level of formality there are varying degrees of reduction based on informational factors
  - the less formal the speech, the greater the effect of discourse and lexical factors
  - greater variability in pronunciation in spontaneous conversations than in read texts
  - most of the variation is sub-phone in nature (reduction, increased coarticulation, etc.)

Intra-speaker variation
Human speech perception
- Sociolinguistic and informational variation isn't noise, it's information:
  - Humans rely on it to interpret the meaning of the utterance in its social context
  - Humans use it to understand which words are important to the overall meaning of the utterance
  - it encodes higher level syntactic and semantic structure
- Humans adapt dynamically to the variation
  - delayed decisions
  - underspecified inputs to lexical decision

Speech Perception
- Perceptual constancy in the face of highly variable input [1]
  - the invariance is in the behavioral response, not in the signal
- listeners use partial, feature-based information to make lexical decisions: phones are filled in later
- listeners gain significant advantages from experience with specific talkers
  - highly detailed representations include both linguistic information and non-linguistic information in the representation of lexical items
  - indexical information
- listeners use both top-down and bottom-up information in interpreting utterances

Speech Perception Human perceptual behavior is best modeled by feature-based representation at the lexical level
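To illustrate what feature-based lexical access with underspecified input could look like, the sketch below keeps every word whose segments do not conflict with the features observed so far. It is a toy under stated assumptions: the feature names, the segment specifications, and the lexicon are hypothetical simplifications, and the matching rule is not a model proposed in the talk.

```python
# Hypothetical, heavily simplified feature specifications (illustrative only).
SEGMENTS = {
    "p": {"syllabic": False, "voiced": False, "nasal": False, "place": "labial"},
    "b": {"syllabic": False, "voiced": True, "nasal": False, "place": "labial"},
    "m": {"syllabic": False, "voiced": True, "nasal": True, "place": "labial"},
    "t": {"syllabic": False, "voiced": False, "nasal": False, "place": "coronal"},
    "d": {"syllabic": False, "voiced": True, "nasal": False, "place": "coronal"},
    "n": {"syllabic": False, "voiced": True, "nasal": True, "place": "coronal"},
    "ae": {"syllabic": True},
}

# Hypothetical toy lexicon: word -> segment sequence.
LEXICON = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mat": ["m", "ae", "t"],
    "bad": ["b", "ae", "d"],
    "pan": ["p", "ae", "n"],
}


def compatible(observed, segment):
    """True if every observed feature value matches the segment's value."""
    return all(segment.get(feat) == val for feat, val in observed.items())


def candidates(observed_bundles, lexicon=LEXICON):
    """Return the words whose segments are all compatible with the input.

    Each element of observed_bundles is a possibly partial feature bundle;
    an empty dict means nothing is known yet about that position.
    """
    matches = []
    for word, phones in lexicon.items():
        if len(phones) != len(observed_bundles):
            continue
        if all(compatible(obs, SEGMENTS[p])
               for obs, p in zip(observed_bundles, phones)):
            matches.append(word)
    return sorted(matches)


if __name__ == "__main__":
    coarse = [{"place": "labial", "nasal": False},
              {"syllabic": True},
              {"place": "coronal"}]
    print(candidates(coarse))
```

With only place and nasality known for the first consonant, several words stay active at once. Adding voicing information narrows the set without any phone having been committed to early.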

Speculation on ASR
- Use of phonological features at the acoustic modeling stage provides [14]
  - increased robustness in noise
  - ability to adapt to highly variable inputs
- Use of phonological features at the lexical level [15]
  - more efficient pronunciation modeling
  - necessary for adaptation to variation in the signal

But which features?
SPE features (and typical descendants from linguistics)
- hybrid system: acoustic, articulatory, phonological
- developed in mid 50s for 2 purposes (Jakobson, Fant & Halle, 1957)
  - universal system (language independent) of phoneme classification
    - lexical contrasts (aspects of sounds that minimally differentiate words)
  - describing allophonic variation within languages (at the time, phoneme variants predicted by phonetic environment), grouping of sounds by patterns of variation (Natural Classes)
- good for abstract classification tasks: part of language model for grouping words by coarse similarity
- not particularly realistic model of human speech perception, and probably not the ideal features for ASR feature extraction
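The grouping of sounds into natural classes can be sketched as a lookup over a feature table: a class is the set of segments that share a given feature specification. The inventory and feature values below are a simplified illustrative set assumed for this example, not the SPE or Jakobson, Fant and Halle feature systems.

```python
# Hypothetical, simplified feature table (illustrative only).
INVENTORY = {
    "p": {"voiced": False, "nasal": False, "continuant": False, "place": "labial"},
    "b": {"voiced": True, "nasal": False, "continuant": False, "place": "labial"},
    "m": {"voiced": True, "nasal": True, "continuant": False, "place": "labial"},
    "t": {"voiced": False, "nasal": False, "continuant": False, "place": "coronal"},
    "d": {"voiced": True, "nasal": False, "continuant": False, "place": "coronal"},
    "n": {"voiced": True, "nasal": True, "continuant": False, "place": "coronal"},
    "s": {"voiced": False, "nasal": False, "continuant": True, "place": "coronal"},
    "z": {"voiced": True, "nasal": False, "continuant": True, "place": "coronal"},
}


def natural_class(spec, inventory=INVENTORY):
    """Return the sorted segments whose features match every entry in spec."""
    return sorted(
        seg for seg, feats in inventory.items()
        if all(feats.get(f) == v for f, v in spec.items())
    )


def shared_features(segments, inventory=INVENTORY):
    """Return the feature values common to all the given segments."""
    first, *rest = (inventory[s] for s in segments)
    return {f: v for f, v in first.items()
            if all(other.get(f) == v for other in rest)}


if __name__ == "__main__":
    print(natural_class({"nasal": True}))
    print(natural_class({"voiced": False, "continuant": False}))
    print(shared_features(["s", "z"]))
```

A table like this suits the abstract classification role described above. It says nothing about which acoustic cues a recognizer should extract, which is the caution the slide raises.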

Conclusions
- Features play an important role in human speech perception
- Hold promise for ASR
- Caution about which features one chooses: let the features fit the task

References
1. L. C. Nygaard, M. Sommers, and D. B. Pisoni. Effects of stimulus variability on perception and representation of spoken words in memory. Perception and Psychophysics, vol. 57, pp. 989-1001, 1995.
2. D. H. Klatt and L. Klatt. Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America, vol. 87, pp. 820-857, 1990.
3. J. Gonzáles. Formant frequencies and body size of speaker: a weak relationship in adult humans. Journal of Phonetics, vol. 32, pp. 277-287, 2004.
4. P. Foulkes and G. Docherty. The social life of phonetics and phonology. Journal of Phonetics, in press, 2006.
5. C. Hasek, S. Singh, and T. Murry. Acoustic attributes of preadolescent voices. Journal of the Acoustical Society of America, vol. 68, pp. 1262-1265, 1980.
6. K. Johnson. Resonance in an exemplar-based lexicon: The emergence of social identity and phonology. Journal of Phonetics, in press, 2006.
7. D. Jurafsky, W. Ward, Z. Jianping, K. Herold, Y. Xiuyang, and Z. Sen. What Kind of Pronunciation Variation is Hard for Triphones to Model? Proceedings ICASSP 2001, Salt Lake City, USA, vol. 1, pp. 577-580, 2001.
8. R. E. Remez, J. M. Fellowes, and P. E. Rubin. Talker identification based on phonetic information. Journal of Experimental Psychology: Human Perception and Performance, vol. 23, no. 3, pp. 651-666, 1997.
9. R. Herman and D. B. Pisoni. Perception of elliptical speech by an adult hearing impaired listener with a cochlear implant: some preliminary findings on coarse-coding in speech perception. Research on Spoken Language Processing: Progress Report, vol. 24. Bloomington, IN: Indiana University, 2000.
10. G. Webster and R. Wright. Noise, attention and context: some problems for a cue-based approach to speech perception. In N. Niedzielski (Ed.), Speech perception in context: Beyond acoustic pattern matching. New Jersey: LEA (46 pages), forthcoming.
11. A. Bell. Language style as audience design. Language in Society, vol. 13, no. 2, 1984.
12. H. P. Grice. Presupposition and conversational implicature. In Radical Pragmatics, ed. P. Cole, pp. 183-98. New York: Academic Press, 1981. Reprinted in Studies in the Ways of Words, ed. H. P. Grice, pp. 269-282. Cambridge, MA: Harvard University Press, 1989.
13. A. Bell, D. Jurafsky, E. Fosler-Lussier, C. Girand, M. Gregory, and D. Gildea. Effects of disfluencies, predictability, and utterance position on word form variation in English conversation.
14. K. Kirchhoff. Robust speech recognition using articulatory features. PhD thesis, University of Bielefeld, Germany, 1999.
15. R. Bates. Speaker Dynamics as a Source of Pronunciation Variability for Continuous Speech Recognition Models. Ph.D. dissertation, University of Washington, Seattle, Washington, USA, 2004.
16. R. Jakobson, G. Fant, and M. Halle. Preliminaries to speech analysis. The distinctive features and their correlates. Acoustics Laboratory, Massachusetts Inst. of Technology, Technical Report No. 13. MIT Press, seventh edition, 1967.