Intra-speaker variation and units in human speech perception and ASR

SRIV - ITRW on Speech Recognition and Intrinsic Variation, May 20, 2006, Toulouse
Richard Wright, University of Washington, Dept. of Linguistics, rawright@u.washington.edu

Talk Outline
  Word recognition task
  2 types of variation
  Sources of inter-speaker variation
  Sources of intra-speaker variation
  Human speech perception and variation
  Importance of features for perception
  Implications for ASR

Word recognition task: spontaneous speech
  Buckeye corpus: a collaboration of The Ohio State University Depts. of Psychology, Computer Science, and Linguistics
  Conversational speech (informal interviews)
  High-quality recordings
  40 speakers, all from Columbus, Ohio (2 accent groups)
  Stratified for age (under 30, over 40) and sex; class not controlled for
  Orthographically transcribed and phonetically labeled
  Freely available (with reasonable restrictions): http://buckeyecorpus.osu.edu/

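Word recognition task: single word
  [Figure: frequency axis 0-5000 Hz, time axis 0-127 ms]

Word recognition task: two words
  [Figure: frequency axis 0-5000 Hz, time axis 0-149 ms]

Word recognition task: three words
  [Figure: frequency axis 0-5000 Hz, time axis 0-240 ms]

Word recognition task: four words
  [Figure: frequency axis 0-5000 Hz, time axis 0-682 ms]

Word recognition task: six words
  [Figure: frequency axis 0-5000 Hz, time axis 0-1276 ms]

Word recognition task: whole sentence
  [Figure: frequency axis 0-5000 Hz, time axis 0-2035 ms]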

Word recognition task: summary
  The task demonstrates 3 aspects of human speech perception and word recognition that are still difficult for ASR to emulate:
    1) Humans are able to use partial information to entertain a set of possible word candidates simultaneously without introducing confusions
    2) Humans can recover gracefully from errors
    3) Humans adapt dynamically to variation
  The task also demonstrates that humans use a combination of top-down and bottom-up strategies in recognizing words
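Point 1 above (holding several word candidates at once while only part of the input has arrived) can be sketched as a candidate set that shrinks as each phone comes in. This is a minimal illustration, not a model from the talk: the toy lexicon, the phone labels, and the function names candidates and narrow are all assumptions made for the example, and exact prefix matching stands in for whatever partial matching listeners actually do.

```python
# Toy lexicon: word -> phone sequence. All entries are illustrative assumptions.
LEXICON = {
    "cat": ("k", "ae", "t"),
    "cap": ("k", "ae", "p"),
    "can": ("k", "ae", "n"),
    "cot": ("k", "aa", "t"),
    "dog": ("d", "ao", "g"),
}


def candidates(heard, lexicon=LEXICON):
    """Return the words whose pronunciation starts with the phones heard so far."""
    n = len(heard)
    return sorted(w for w, phones in lexicon.items() if phones[:n] == tuple(heard))


def narrow(phones, lexicon=LEXICON):
    """Yield the candidate set after each successive phone of the input."""
    for i in range(1, len(phones) + 1):
        yield candidates(phones[:i], lexicon)


if __name__ == "__main__":
    # Print how the candidate set shrinks as the phones of "cat" arrive.
    for step, words in enumerate(narrow(("k", "ae", "t")), start=1):
        print(step, words)
```

No decision is forced until the set has a single member, which is one simple way to picture delayed commitment. Recovery from errors (point 2) is not covered by this sketch, since a single mismatching phone removes a word for good.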

Types of variation
  Important advances in speech perception research:
    Variation in input is an integral part of perceptual category formation: variation is information, not noise [1]
    Inter-speaker and intra-speaker variation are quite different in their causes and in their acoustic characteristics

Inter-speaker variation
  Results from two types of factors:
    Physiologic and anatomic factors [2] [3]
      size of vocal tract
      vocal fold mass and morphology
      mass and movement characteristics of articulators
    Social and experiential factors [4]
      gender (as opposed to sex)
      regional accent
      class affiliation
      native language, dialect, exposure to other languages

Inter-speaker variation
  Example: sex and gender
  Differences between male and female speech are the result of both physiologic (sex) and sociologic (gender) factors
  While male-female acoustic differences are predicted from vocal tract differences, they:
    emerge in children's speech well before the onset of puberty produces differences in vocal tract size [5]
    are greater than predicted by vocal tract differences [6]
    vary systematically by language [6]

Inter-speaker variation
  Largely static over the duration of a conversation
    Talkers generally don't change their gender, age, or accent over the duration of a conversation
  Most of the dimensions are not unique to a single speaker but characterize large sections of the population
    Larger corpora with appropriate samples of the population have brought improvements
    Better language models (more appropriate phone sets, multiple word pronunciations, etc.) have also brought improvements [e.g. 7]

Inter-speaker variation
  In speech perception:
    we are able to understand a wide variety of accents that we have no experience with, as long as they are similar to ones we know
    the greater the similarity to speakers we are familiar with, the lower the latency and the higher the accuracy
    an abrupt change from one talker to the next (even within accents) [8]

Inter-speaker variation
  In speech perception:
    as our experience with intra-speaker variables decreases (or as environmental noise increases), we rely on a coarser coding of the input
    this implies a featural rather than strictly phone-based lexical representation [9] [10]
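One way to picture coarser coding over a featural representation: describe each phone as a small bundle of features, then compare words using only some of those features. The sketch below is an assumption-laden toy, not the representation proposed in the talk. The three-way feature table, the toy lexicon, and the names coarse_code and matches are all hypothetical, and dropping place of articulation is chosen only for illustration.

```python
# Hypothetical, simplified feature table: phone -> (manner, place, voicing).
FEATURE_NAMES = ("manner", "place", "voicing")
FEATURES = {
    "p": ("stop", "labial", "voiceless"),
    "b": ("stop", "labial", "voiced"),
    "t": ("stop", "coronal", "voiceless"),
    "d": ("stop", "coronal", "voiced"),
    "k": ("stop", "dorsal", "voiceless"),
    "g": ("stop", "dorsal", "voiced"),
    "ae": ("vowel", "low-front", "voiced"),
}

# Toy lexicon (illustrative assumption): word -> phone sequence.
LEXICON = {
    "pat": ("p", "ae", "t"),
    "bat": ("b", "ae", "t"),
    "cat": ("k", "ae", "t"),
    "cap": ("k", "ae", "p"),
    "bad": ("b", "ae", "d"),
}


def coarse_code(phones, keep=FEATURE_NAMES):
    """Describe each phone using only the features named in `keep`."""
    idx = [FEATURE_NAMES.index(name) for name in keep]
    return tuple(tuple(FEATURES[p][i] for i in idx) for p in phones)


def matches(heard, keep=FEATURE_NAMES, lexicon=LEXICON):
    """Return the words that are indistinguishable from `heard` under the coding."""
    target = coarse_code(heard, keep)
    return sorted(w for w, ph in lexicon.items() if coarse_code(ph, keep) == target)


if __name__ == "__main__":
    heard = ("k", "ae", "t")
    print(matches(heard))                              # all three features kept
    print(matches(heard, keep=("manner", "voicing")))  # place dropped
```

With every feature kept, the input matches one word. With place dropped, it matches a larger equivalence class of words that share manner and voicing, so a lexicon stored in features can still return a usable candidate set when fine detail is unavailable.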

Intra-speaker variation
  Multiple factors
    Sociolinguistic [4]
      style shifts
        task: spontaneous speech, read sentences, etc.
        attitude of the speaker to the audience
      accent shifts
        as group affiliation shifts, the accent may as well

Intra-speaker variation
  Continuous relationship between formality of the task and reduction in speech [11]
  From least formal (most reduced) to most formal (most hyperarticulated):
    casual conversation with a friend
    conversation with a stranger
    interviews
    formal speaking
    read texts
    read words in isolation

Intra-speaker variation
  Multiple factors
    Information-based: the more predictable, the more reduced
      discourse: as a word's information load decreases, it becomes more reduced [12] [13]
        first introduced into the discourse: least reduced
        focus construction: less reduction
      lexical: base probabilities
        function words bear a much lower informational load than content words
        word frequency/familiarity
        confusability (perceptual similarity to other words)
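The slide does not define information load formally. A common stand-in, used here purely as an assumption, is unigram surprisal, -log2 p(word), estimated from relative frequency. The function name information_load and the sample sentence are hypothetical; this covers only the lexical base probabilities, since discourse predictability would need probabilities conditioned on context.

```python
import math
from collections import Counter


def information_load(tokens):
    """Map each word to its unigram surprisal in bits: -log2(count / total)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: -math.log2(count / total) for word, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical sample; a real estimate would use a large corpus.
    tokens = "the cat saw the dog and the bird".split()
    load = information_load(tokens)
    # Print words from least to most informative.
    for word in sorted(load, key=load.get):
        print(f"{word}: {load[word]:.2f} bits")
```

Under this measure a frequent function word such as "the" carries fewer bits than a content word that occurs once, which matches the direction of the claim above: lower information load, more reduction.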

Intra-speaker variation
  Multiple factors
    Information-based factors interact with levels of formality
      at any one level of formality there are varying degrees of reduction based on informational factors
      the less formal the speech, the greater the effect of discourse and lexical factors
      greater variability in pronunciation in spontaneous conversations than in read texts
      most of the variation is sub-phone in nature (reduction, increased coarticulation, etc.)

Intra-speaker variation
  Human speech perception
    Sociolinguistic and informational variation isn't noise, it's information:
      Humans rely on it to interpret the meaning of the utterance in its social context
      Humans use it to understand which words are important to the overall meaning of the utterance
      it encodes higher-level syntactic and semantic structure
    Humans adapt dynamically to the variation
      delayed decisions
      underspecified inputs to lexical decision

Speech Perception
  Perceptual constancy in the face of highly variable input [1]
    the invariance is in the behavioral response, not in the signal
    listeners use partial, feature-based information to make lexical decisions: phones are filled in later
    listeners gain significant advantages from experience with specific talkers
      highly detailed representations include both linguistic information and non-linguistic information in the representation of lexical items
      indexical information
    listeners use both top-down and bottom-up information in interpreting utterances

Speech Perception
  Human perceptual behavior is best modeled by a feature-based representation at the lexical level

Speculation on ASR
  Use of phonological features at the acoustic modeling stage provides [14]
    increased robustness in noise
    ability to adapt to highly variable inputs
  Use of phonological features at the lexical level [15]
    more efficient pronunciation modeling
    necessary for adaptation to variation in the signal

But which features?
  SPE features (and typical descendants from linguistics)
    hybrid system: acoustic, articulatory, phonological
    developed in the mid 50s for 2 purposes (Jakobson, Fant & Halle, 1957)
      universal (language-independent) system of phoneme classification: lexical contrasts (aspects of sounds that minimally differentiate words)
      describing allophonic variation within languages (at the time, phoneme variants predicted by phonetic environment); grouping of sounds by patterns of variation (Natural Classes)
    good for abstract classification tasks: part of language model for grouping words by coarse similarity
    not a particularly realistic model of human speech perception, and probably not the ideal features for ASR feature extraction

Conclusions
  Features play an important role in human speech perception
  Hold promise for ASR
  Caution about which features one chooses: let the features fit the task

References
1. L. C. Nygaard, M. Sommers, and D. B. Pisoni. Effects of stimulus variability on perception and representation of spoken words in memory. Perception and Psychophysics, vol. 57, pp. 989-1001, 1995.
2. D. H. Klatt and L. Klatt. Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America, vol. 87, pp. 820-857, 1990.
3. J. González. Formant frequencies and body size of speaker: a weak relationship in adult humans. Journal of Phonetics, vol. 32, pp. 277-287, 2004.
4. P. Foulkes and G. Docherty. The social life of phonetics and phonology. Journal of Phonetics, in press, 2006.
5. C. Hasek, S. Singh, and T. Murry. Acoustic attributes of preadolescent voices. Journal of the Acoustical Society of America, vol. 68, pp. 1262-1265, 1980.
6. K. Johnson. Resonance in an exemplar-based lexicon: The emergence of social identity and phonology. Journal of Phonetics, in press, 2006.
7. D. Jurafsky, W. Ward, Z. Jianping, K. Herold, Y. Xiuyang, and Z. Sen. What Kind of Pronunciation Variation is Hard for Triphones to Model? Proceedings ICASSP 2001, Salt Lake City, USA, vol. 1, pp. 577-580, 2001.
8. R. E. Remez, J. M. Fellowes, and P. E. Rubin. Talker identification based on phonetic information. Journal of Experimental Psychology: Human Perception and Performance, vol. 23, no. 3, pp. 651-666, 1997.
9. R. Herman and D. B. Pisoni. Perception of elliptical speech by an adult hearing impaired listener with a cochlear implant: some preliminary findings on coarse-coding in speech perception. Research on Spoken Language Processing: Progress Report, vol. 24. Bloomington, IN: Indiana University, 2000.
10. G. Webster and R. Wright. Noise, attention and context: some problems for a cue-based approach to speech perception. In N. Niedzielski (Ed.), Speech perception in context: Beyond acoustic pattern matching. New Jersey: LEA (46 pages), forthcoming.
11. A. Bell. Language style as audience design. Language in Society, vol. 13, no. 2, 1984.
12. H. P. Grice. Presupposition and conversational implicature. In Radical Pragmatics, ed. P. Cole, pp. 183-98. New York: Academic Press, 1981. Reprinted in Studies in the Ways of Words, ed. H. P. Grice, pp. 269-282. Cambridge, MA: Harvard University Press, 1989.
13. A. Bell, D. Jurafsky, E. Fosler-Lussier, C. Girand, M. Gregory, and D. Gildea. Effects of disfluencies, predictability, and utterance position on word form variation in English conversation.
14. K. Kirchhoff. Robust speech recognition using articulatory features. PhD thesis, University of Bielefeld, Germany, 1999.
15. R. Bates. Speaker Dynamics as a Source of Pronunciation Variability for Continuous Speech Recognition Models. Ph.D. dissertation, University of Washington, Seattle, Washington, USA, 2004.
16. R. Jakobson, G. Fant, and M. Halle. Preliminaries to speech analysis. The distinctive features and their correlates. Acoustics Laboratory, Massachusetts Inst. of Technology, Technical Report No. 13. MIT Press, seventh edition, 1967.