Speech Perception NACS 642 01 April 2009
[Figure: sinusoids of different frequencies and amplitudes (axes: power/amplitude vs. frequency) summing into a complex wave]
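The "+" build in this figure (component sinusoids summing into a complex wave) can be made concrete with a few lines of numpy; the frequencies and amplitudes below are arbitrary illustrative choices, not values from the slides:

```python
import numpy as np

fs = 16000                       # sampling rate (Hz)
t = np.arange(0, 0.05, 1 / fs)   # 50 ms of time

# Two sinusoids with different frequencies and amplitudes...
low = 1.0 * np.sin(2 * np.pi * 100 * t)   # 100 Hz component
high = 0.4 * np.sin(2 * np.pi * 700 * t)  # 700 Hz component

# ...sum into a complex periodic wave; a Fourier analysis of
# `complex_wave` would recover the two (frequency, amplitude) peaks.
complex_wave = low + high
```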
Tonotopic Organization
Speech...
Source-Filter Model
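A rough sketch of the source-filter idea: a periodic source is shaped by vocal-tract resonances (formants). All numeric values here are illustrative assumptions, and the impulse-train source is a crude stand-in for glottal pulses:

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
f0 = 120                                   # source: glottal pulse rate (Hz)
t = np.arange(0, 0.3, 1 / fs)

# Source: impulse train at f0
source = np.zeros_like(t)
source[::int(fs / f0)] = 1.0

def formant_filter(x, freq, bw, fs):
    """Second-order resonator: one formant at `freq` Hz, bandwidth `bw` Hz."""
    r = np.exp(-np.pi * bw / fs)           # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs          # pole angle from center frequency
    a = [1.0, -2 * r * np.cos(theta), r * r]
    return lfilter([1.0], a, x)

# Filter: cascade of formant resonators, e.g. rough values for [a]
vowel = source
for freq, bw in [(700, 80), (1200, 90), (2600, 120)]:
    vowel = formant_filter(vowel, freq, bw, fs)
```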
[Spectrogram: frequency vs. time]
Stop Consonants: [p b t d k g]
Fricatives: [θ ð f v s z ʃ ʒ]
The Problem of Speech Perception
Hypothesized Representational Format
How do we get from here to there?
The simplest theory
Hypothesis: there is a one-to-one relationship between pieces of acoustic information and the segmental information stored in our heads.
Different Acoustic Input: Same percept!
[Figure: vowel space, Front-Back by High-Low]
Peterson & Barney (1952)
Obscured by phonetic context and speaker differences...
A simple one-to-one mapping between acoustic cue and phoneme doesn't seem to exist...
From vibrations in the ear to abstractions in the brain (sounds → words)
A continuously varying waveform, with information on multiple time and frequency scales, must be encoded and decoded to make contact with the long-term linguistic representations (words) in memory.
sincetherearenowordboundarysignsinspokenlanguagethedifficultywefeelinreadingandunderstandingtheaboveparagraphprovidesasimpleillustrationofoneofthemaindifficultieswehavetoovercomeinordertounderstandspeechratherthananeatlyseparatedsequenceofletterstringscorrespondingtothephonologicalformofwordsthespeechsignalisacontinuousstreamofsoundsthatrepresentthephonologicalformsofwordsinadditionthesoundsofneighboringwordsoftenoverlapwhichmakestheproblemofidentifyingwordboundariesevenharder
Why speech perception should not work
- Linearity: no straightforward mapping between stretches of sound and phonemes
- Invariance: no (obvious) invariant features identify a given phoneme in all contexts
- Perceptual constancy: we reliably identify speech despite tremendous variation across speakers (pitch, rate, accent, affect, ...)
(Halle and Stevens 1962; Chomsky and Miller 1963)
What set of perceptual/neural mechanisms mediates the mapping between acoustic input and long-term memory representations? The input varies across speakers, phonetic context, rate, etc.; the stored representations are stable across all of these.
The Problem of Speech Perception
[+ voiced] [+ continuant]: what's involved in this mapping?
The Problem of Speech Perception
[Figure: waveforms of three utterances, amplitude vs. time (s)]
Questions Cognitive Neuroscience can help answer:
1. What is the nature of stored mental representations?
2. What types of mechanisms are involved in mapping from acoustics to memory?
3. What brain areas are implicated in the perception of speech?
Levels of Representation
Acoustics: variation in air pressure; analog input to the auditory system
Phonetics: language-specific categorization of different acoustic tokens (phonetic tokens); discriminability of different acoustic tokens relatively preserved
Phonology: abstract symbolic representations (phonemes); fine-grained distinctions irrelevant; all-or-nothing category membership
English: [pʰat] 'pot' vs. [spat] 'spot' → one phoneme /p/
Hindi: [pʰəl] 'fruit' vs. [pəl] 'moment' → two phonemes /pʰ/ and /p/
Phonetic Categories
Map acoustic tokens into a multidimensional space. Representations are not discrete or abstract; fine phonetic detail is stored. But there may still be speech-specific processing.
[Figure: clouds of /t/ and /d/ tokens in phonetic space]
(Dennis Klatt, Stephen Goldinger, Peter Jusczyk, Jessica Maye, Keith Johnson)
Voice Onset Time
"The dot" (short VOT) vs. "The tot" (long VOT)
[Histogram: number of tokens produced per 10 ms VOT bin (0-120 ms); /da/ productions cluster at short VOTs ([d]), /ta/ productions at long VOTs ([t])]
Categorical Perception
Discrimination task: hear [da] (VOT: 20 ms) and [ta] (VOT: 80 ms); respond "same" or "different"
Identification task: hear [da] (VOT: 20 ms); label it /d/ or /t/
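Categorical perception makes a quantitative prediction: discrimination should be only as good as the labels allow. A minimal sketch of the classic Haskins-style prediction, assuming a logistic identification function (the boundary and slope values below are invented for illustration):

```python
import numpy as np

vot = np.arange(0, 81, 10)               # VOT continuum (ms)
boundary, slope = 40.0, 0.3              # assumed /d/-/t/ boundary

# Identification: probability of labeling each step /t/ (logistic guess)
p_t = 1 / (1 + np.exp(-slope * (vot - boundary)))

# Haskins-style prediction for two-step ABX pairs: correct
# discrimination only when the two stimuli get different labels:
# P(correct) = 0.5 + 0.5 * (p1 - p2)**2
pred = [0.5 + 0.5 * (p1 - p2) ** 2 for p1, p2 in zip(p_t[:-2], p_t[2:])]
# `pred` peaks for pairs straddling the boundary and sits near
# chance (0.5) for within-category pairs.
```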
Voice Onset Time
[Figure: identification function and identification RT along the VOT continuum, from Phillips et al. (2000)]
MMN = Mismatch Negativity
An ERP (event-related potential) component that reflects sensory discrimination. Elicited by repeated presentation of a sound stimulus (the standard) which is sometimes replaced by a different sound (the deviant): X X X X X X Y X X X X X Y X X X X Y
Elicited pre-attentively!
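A sketch of how such a many-to-one stimulus sequence might be generated (the 15% deviant probability and the minimum-spacing rule are illustrative assumptions, not the parameters of any particular study):

```python
import random

def oddball_sequence(n_trials=500, p_deviant=0.15):
    """Standard/deviant sequence for an MMN oddball block.

    Keeps deviants apart so each one is preceded by several standards
    (the many-to-one relation the MMN depends on).
    """
    seq, since_deviant = [], 99
    for _ in range(n_trials):
        if since_deviant >= 3 and random.random() < p_deviant:
            seq.append("Y")          # deviant
            since_deviant = 0
        else:
            seq.append("X")          # standard
            since_deviant += 1
    return seq

print("".join(oddball_sequence(40)))  # e.g. XXXXXYXXXXXXYXXXX...
```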
from Näätänen (1999)
N1 (N100) and the difference wave (= deviant - standard). NOTICE: negative voltage plotted up, positive voltage down. From Näätänen (1999)
N1: an obligatory ERP component; reflects sensory encoding of auditory stimulus attributes
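In practice the MMN is measured from averaged epochs. A minimal numpy sketch with placeholder data standing in for real recordings (the deviant-minus-standard convention follows the difference wave described above):

```python
import numpy as np

# Epochs: EEG trials time-locked to stimulus onset (trials x samples);
# random placeholder arrays stand in for real recordings
standard_epochs = np.random.randn(400, 300)
deviant_epochs = np.random.randn(60, 300)

# ERPs are averages over trials; averaging cancels activity that is
# not time-locked to the stimulus
erp_standard = standard_epochs.mean(axis=0)
erp_deviant = deviant_epochs.mean(axis=0)

# Difference wave: the MMN shows up as a negativity (deviant minus
# standard) roughly 100-250 ms after the point of deviance
difference_wave = erp_deviant - erp_standard
```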
Discriminability of phones by VOT (methods)
- Behavioral level: categorical perception
- Electrophysiological level: MMN
Looking at VOT: [dæ] vs. [tæ]
- Behavioral data
- EEG: N1 (sensory encoding)
- EEG: MMN (sensory discrimination)
Sharma & Dorman (1999): behavioral experiment
Discrimination: AX task; performing at chance level (within-category pairs)
Sharma & Dorman (1999): MMN experiment
Standard/deviant VOT pairs: 30 vs. 50 ms and 60 vs. 80 ms
Levels of Representation
Acoustics: variation in air pressure; analog input to the auditory system
Phonetics: language-specific categorization of different acoustic tokens; discriminability of different acoustic tokens relatively preserved
Phonology: abstract symbolic representations; fine-grained distinctions irrelevant; all-or-nothing category membership
Questions
What kinds of representation is the MMN sensitive to? Acoustic? Phonetic? Phonemic? How can we be sure it's not just acoustics?
Potential problem
How can we be sure it's not just acoustics? There seems to be a difference between the 30-50 and the 60-80 MMN responses; but what if this difference has nothing to do with the phonetic category people perceive? Could it be that there is something special about the 30-50 ms gap, for instance?
Perception of VOT
[Figure: identification function and identification RT along the VOT continuum, from Phillips et al. (2000)]
Potential problem
If chinchillas show the same categorical perception behavior on the VOT continuum, the response is probably not based on phonetics.
Potential problem
Neuroscience evidence: VOTs below 30 ms and above 60 ms are encoded by different neuronal populations in the mammalian auditory system than VOTs in the 30-60 ms range.
Potential problem
Could it be that there is something special about the 30-50 ms gap? There is, apparently. How can we be sure it's not just acoustics? With these results alone, we can't.
Suggestions?
Can we come up with ways to test whether the MMN response is sensitive to the phonetic and phonological levels of representation? Requirement: a many-to-one ratio (X X X X Y)
Look at sounds that are phonemic in one language but not in the other.
- Näätänen et al. (1997)
Näätänen et al. (1997)
Looking for language-dependent memory traces for sounds. Vowels: Finnish vs. Estonian
Vowels varying only in F2
Estonian has an extra vowel (/õ/); the stimuli differ only in their F2 values
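A sketch of how such a continuum could be synthesized, reusing `source`, `fs`, and `formant_filter` from the source-filter sketch above. Apart from the 1,311 Hz value quoted in the conclusions, the formant values are illustrative guesses, not the published stimulus parameters:

```python
# Synthetic vowels that differ only in F2, holding the source and the
# other formants constant, in the spirit of Näätänen et al. (1997)
f2_steps = [1000, 1100, 1200, 1311, 1400]      # only F2 varies
continuum = []
for f2 in f2_steps:
    v = source
    for freq, bw in [(450, 80), (f2, 90), (2500, 120)]:  # F1, F2, F3
        v = formant_filter(v, freq, bw, fs)
    continuum.append(v)
```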
MMN difference wave (= deviant - standard). NOTICE: negative voltage plotted up, positive voltage down. From Näätänen (1999)
Pure Tones with freq = F2
F2 pure tones vs. vowels
Vowels: nonmonotonic increase (a drop); pure tones: linear increase (no drop)
Vowels: Finns vs. Estonians
Finns: drop; Estonians: no drop
Finns vs. Estonians
MMN peak amplitude at Fz: Finns (blue), Estonians (purple); the Finnish curve shows the drop
MEG data - dipole model; the drop appears here as well
Conclusions
Tone and vowel data are dissimilar for Finnish speakers, even though what is being varied in the two conditions is the exact same acoustic quantity.
- Finns judge a vowel with an F2 of 1,311 Hz to be a very bad instance of /ö/
- Estonians have the vowel /õ/ and judge a vowel with an F2 of 1,311 Hz to be a good instance of /õ/
- The Estonian vowel MMN data are more in line with the Finnish tone data
Any problems? Are you convinced? Does this show that the MMN is indeed sensitive to phonemic categories? Could these results be explained on a purely acoustic basis? Could these results be explained on a purely phonetic basis?
Levels of Representation
Acoustics: variation in air pressure; analog input to the auditory system
Phonetics: language-specific categorization of different acoustic tokens; discriminability of different acoustic tokens relatively preserved
Phonology: abstract symbolic representations; fine-grained distinctions irrelevant; all-or-nothing category membership
Phillips et al. (2000)
Question: is the MMN sensitive to phonological categories?
- Abstract symbolic representations
- Fine-grained distinctions irrelevant
- All-or-nothing category membership
Phillips et al. (2000)
Template of MMN design: X X X X X X Y X X X X X Y X X X X Y
Sharma & Dorman (1999) VOT values: 30 30 30 30 30 50 30 30 30 30 30 50 30 30 / 60 60 60 60 60 80 60 60 60 60 60 80 60 60
Here the many-to-one ratio holds at all levels. Let's construct a many-to-one ratio only at the phonological level:
Phillips et al. (2000) VOT values: 8 16 0 24 16 48 0 24 16 0 24 8 64 16 8 56 0
Perception of VOT
[Figure: identification function and identification RT along the VOT continuum, from Phillips et al. (2000)]
Many-to-one only at the phonological (P) level
Phillips et al. (2000) VOT values:
A: 8 16 0 24 16 48 0 24 16 0 24 8 64 16 8 56 0
P: D D  D D  D  T  D D  D  D D  D T  D  D T  D
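A sketch of how such a sequence can be built. The VOT sets are taken from the slide; the boundary value and the deviant probability are illustrative assumptions:

```python
import random

BOUNDARY = 30   # approximate /d/-/t/ VOT boundary (ms), per the identification data

def phillips_sequence(n=200, p_deviant=0.15):
    """Token sequence in the spirit of Phillips et al. (2000).

    Acoustically every token differs (VOT varies within category), so
    the many-to-one standard/deviant relation holds only at the
    phonological level (/d/ vs. /t/).
    """
    standards = [0, 8, 16, 24]     # all /d/: below the boundary
    deviants = [48, 56, 64]        # all /t/: above the boundary
    seq = []
    for _ in range(n):
        pool = deviants if random.random() < p_deviant else standards
        seq.append(random.choice(pool))
    return seq

vots = phillips_sequence()
labels = ["T" if v > BOUNDARY else "D" for v in vots]
```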
Results: Exp. 1
What if not PhonCat, but...
What if the results are not due to phonological categories, but to something as prosaic as the VOT difference between adjacent sounds? From standard to standard, the VOT difference could span 0 to 24 ms (mean 12); from standard to deviant, it could go from 14 to 72 ms (mean 40). How can we address this?
Exp. 2 - Acoustic control
Add 20 ms of VOT to all sounds, so that the relative distances between them remain the same but the proportion of sounds falling on each side of the boundary changes: there are no longer many-to-one relations at the phonological level.
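Continuing the sketch above, the acoustic-control manipulation is just a constant shift:

```python
# Exp. 2 control: shift every VOT up by 20 ms. Relative distances are
# unchanged, but tokens now straddle the category boundary, so the
# many-to-one relation at the phonological level is destroyed.
shifted = [v + 20 for v in vots]                    # 0->20, 8->28, 48->68, ...
shifted_labels = ["T" if v > BOUNDARY else "D" for v in shifted]
# Standards 16+20=36 and 24+20=44 now fall on the /t/ side, so the
# "standard" set is no longer phonologically uniform; no MMN is
# predicted on the phonological account, and none was found.
```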
No MMN for acoustic condition
Phillips et al. (2000) conclusion: the MMN here is driven by phonological category membership, not acoustics.
Question Are you convinced? Can we be sure this result does not stem from acoustics? What about phonetic categories?
No Abstract Categories
Simply map acoustic tokens into a multidimensional space. Representations are not discrete or abstract; fine phonetic detail is stored. But there may still be speech-specific processing.
[Figure: clouds of /t/ and /d/ tokens in phonetic space]
(Dennis Klatt, Stephen Goldinger, Peter Jusczyk, Jessica Maye, Keith Johnson)
VOT Distribution
[Histogram: frequency of tokens by VOT bin, 5-145+ ms]
Do We Even Have Categories?
"Perhaps we should not even be asking if infants have well-formed phonetic categories, separated by boundaries, but rather if any language users do. In other words, the very concept of categories, and even more so of boundaries, needs to be reconsidered... we have no evidence that boundaries exist in the natural world, or any account of how or why they may have evolved by natural selection. To extend to them any degree of psychological reality is unsupportable, and deleterious to efforts to understand how phonetic structure is indeed instantiated and retrieved from the speech signal." Nittrouer (2001)
A Phonetic Explanation for Phillips et al. (2000)
The MMN could be induced by sampling from the statistical distributions of phonetic categories. No need to rely on abstract phonological categories if this is how we conceive of the phonetic space: sampling from (mapping into) a different distribution could itself elicit the MMN.
[Figure: standards sampled from the /d/ token cloud; a deviant landing in the /t/ cloud elicits the MMN]
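A sketch of this distributional story; the category means and standard deviations are invented for illustration, and only the deviant VOT value comes from the slide:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical VOT distributions for the two phonetic categories
d_cat = norm(loc=10, scale=8)     # /d/-like tokens: short VOT
t_cat = norm(loc=70, scale=15)    # /t/-like tokens: long VOT

standards = d_cat.rvs(size=200)   # standards sampled from the /d/ cloud
deviant_vot = 56.0                # a deviant token from the slide

# On this story, the MMN arises because the deviant is (nearly)
# impossible under the distribution the standards were drawn from;
# no abstract /d/ vs. /t/ phoneme labels are needed.
p_under_standards = d_cat.pdf(deviant_vot)   # vanishingly small
p_under_t = t_cat.pdf(deviant_vot)           # much larger
```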
Kazanina et al. (2006)
Dupoux et al. (1999)
Phonotactics seems to influence how people perceive phonetic sounds. Look at Japanese borrowed words.
Japanese has a restricted syllabic inventory compared to languages such as English and French: V, CV, CV+nasal, CVQ (Q = first half of a geminate consonant).
In these borrowed words, is the inserted vowel a matter of production, perception, or orthography?
Dupoux et al. (1999), Exp. 1
Use unambiguous stimuli and manipulate native language. Hypothesis: vowel epenthesis is a perceptual phenomenon. When presented with items like "ebzo", native French speakers should be fine, but Japanese speakers should report hearing a [u].
Dupoux et al. (1999), Exp. 1
A Japanese speaker recorded pseudowords of the structure VCuCV; the middle [u] was spliced out to different degrees (from virtually erased to only slightly reduced). Subjects heard the stimuli and reported whether or not they heard a [u].
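A sketch of the splicing manipulation; the helper function, sample indices, and step values are hypothetical, and real stimuli would use hand-marked vowel boundaries in recorded waveforms:

```python
import numpy as np

def splice_out_u(signal, u_start, u_end, keep_fraction):
    """Remove part of the medial [u] from a VCuCV pseudoword.

    Hypothetical helper: u_start/u_end are hand-marked sample indices
    of the vowel; keep_fraction=1.0 leaves it intact, 0.0 removes it
    entirely, yielding a VCCV token.
    """
    u_len = u_end - u_start
    keep = int(u_len * keep_fraction)
    # keep the initial portion of the vowel, drop the remainder
    return np.concatenate([signal[:u_start + keep], signal[u_end:]])

# Placeholder waveform standing in for a recorded "ebuzo"-type item
signal = np.zeros(8000)
steps = [1.0, 0.75, 0.5, 0.25, 0.0]   # from full [u] down to none
continuum = [splice_out_u(signal, 3000, 4200, k) for k in steps]
```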
Dupoux et al. (1999), Exp. 1
Japanese participants reported many more [u]s than French speakers did when there was little or no [u] information in the signal. BUT: the talker was a Japanese speaker; could there be a coarticulation cue in the preceding consonant? [u] is often reduced or devoiced in Japanese, and Japanese listeners might be extra sensitive to subtle coarticulation cues indicating [u].
Dupoux et al. (1999), Exp. 2
Coarticulation cue from a Japanese talker? Get a French speaker! Have the French speaker articulate true VCCVs as well as VCuCV items. The rest is the same as in Exp. 1.
Dupoux et al. (1999), Exp. 2
Even with no coarticulation cue, Japanese speakers reported hearing [u] in VCCV nonwords.
How Early? (ERPs)
Dehaene-Lambertz et al. (2000)
[ERP effects at 164 ms, 315 ms, and 531 ms]
Dehaene-Lambertz et al. (2000)
A quick word on cortical connectivity in speech perception...
Geschwind Model
Hickok & Poeppel (2007)
To wrap up...
1. Speech perception involves a complex mapping between acoustic input and long-term memory.
2. Cognitive neuroscience methods can be used to ascertain the representational nature of speech segments.
3. These methods help us understand how the brain encodes speech representations.
4. Auditory cortex seems to store speech segments in phonemic form (at least in addition to phonetic representations).