Speech Perception NACS 642 01 April 2009
[Figure: sinusoids of different frequencies and amplitudes (axes: power/amplitude vs. frequency) summing into a complex wave]
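The "+" build in this figure (component sinusoids summing into a complex wave) can be made concrete with a few lines of numpy; the frequencies and amplitudes below are arbitrary illustrative choices, not values from the slides:

```python
import numpy as np

fs = 16000                       # sampling rate (Hz)
t = np.arange(0, 0.05, 1 / fs)   # 50 ms of time

# Two sinusoids with different frequencies and amplitudes...
low = 1.0 * np.sin(2 * np.pi * 100 * t)   # 100 Hz component
high = 0.4 * np.sin(2 * np.pi * 700 * t)  # 700 Hz component

# ...sum into a complex periodic wave; a Fourier analysis of
# `complex_wave` would recover the two (frequency, amplitude) peaks.
complex_wave = low + high
```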
Tonotopic Organization
Speech...
Source-Filter Model
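A rough sketch of the source-filter idea: a periodic source is shaped by vocal-tract resonances (formants). All numeric values here are illustrative assumptions, and the impulse-train source is a crude stand-in for glottal pulses:

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
f0 = 120                                   # source: glottal pulse rate (Hz)
t = np.arange(0, 0.3, 1 / fs)

# Source: impulse train at f0
source = np.zeros_like(t)
source[::int(fs / f0)] = 1.0

def formant_filter(x, freq, bw, fs):
    """Second-order resonator: one formant at `freq` Hz, bandwidth `bw` Hz."""
    r = np.exp(-np.pi * bw / fs)           # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs          # pole angle from center frequency
    a = [1.0, -2 * r * np.cos(theta), r * r]
    return lfilter([1.0], a, x)

# Filter: cascade of formant resonators, e.g. rough values for [a]
vowel = source
for freq, bw in [(700, 80), (1200, 90), (2600, 120)]:
    vowel = formant_filter(vowel, freq, bw, fs)
```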
[Spectrogram: frequency vs. time]
Stop Consonants: [p b t d k g]
Fricatives: [θ ð f v s z ʃ ʒ]
The Problem of Speech Perception
Hypothesized Representational Format
How do we get from here to there?
The simplest theory
Hypothesis: there is a one-to-one relationship between pieces of acoustic information and the segmental information stored in our heads.
Different Acoustic Input: Same percept!
[Figure: vowel space, Front-Back by High-Low]
Peterson & Barney (1952)
Obscured by phonetic context and speaker differences...
A simple one-to-one mapping between acoustic cue and phoneme doesn't seem to exist...
From vibrations in the ear to abstractions in the brain (sounds → words)
A continuously varying waveform, with information on multiple time and frequency scales, must be encoded and decoded to make contact with the long-term linguistic representations (words) in memory.
sincetherearenowordboundarysignsinspokenlanguagethedifficultywefeelinreadingandunderstandingtheaboveparagraphprovidesasimpleillustrationofoneofthemaindifficultieswehavetoovercomeinordertounderstandspeechratherthananeatlyseparatedsequenceofletterstringscorrespondingtothephonologicalformofwordsthespeechsignalisacontinuousstreamofsoundsthatrepresentthephonologicalformsofwordsinadditionthesoundsofneighboringwordsoftenoverlapwhichmakestheproblemofidentifyingwordboundariesevenharder
Why speech perception should not work
- Linearity: no straightforward mapping between stretches of sound and phonemes
- Invariance: no (obvious) invariant features identify a given phoneme in all contexts
- Perceptual constancy: we reliably identify speech despite tremendous variation across speakers (pitch, rate, accent, affect, ...)
(Halle and Stevens 1962; Chomsky and Miller 1963)
What set of perceptual/neural mechanisms mediates the mapping between acoustic input and long-term memory representations? The input varies across speakers, phonetic context, rate, etc.; the stored representations are stable across all of these.
The Problem of Speech Perception
[+ voiced] [+ continuant]: what's involved in this mapping?
The Problem of Speech Perception
[Figure: waveforms of three utterances, amplitude vs. time (s)]
Questions Cognitive Neuroscience can help answer:
1. What is the nature of stored mental representations?
2. What types of mechanisms are involved in mapping from acoustics to memory?
3. What brain areas are implicated in the perception of speech?
Levels of Representation
Acoustics: variation in air pressure; analog input to the auditory system
Phonetics: language-specific categorization of different acoustic tokens (phonetic tokens); discriminability of different acoustic tokens relatively preserved
Phonology: abstract symbolic representations (phonemes); fine-grained distinctions irrelevant; all-or-nothing category membership
English: [pʰat] 'pot' vs. [spat] 'spot' → one phoneme /p/
Hindi: [pʰəl] 'fruit' vs. [pəl] 'moment' → two phonemes /pʰ/ and /p/
Phonetic Categories
Map acoustic tokens into a multidimensional space. Representations are not discrete or abstract; fine phonetic detail is stored. But there may still be speech-specific processing.
[Figure: clouds of /t/ and /d/ tokens in phonetic space]
(Dennis Klatt, Stephen Goldinger, Peter Jusczyk, Jessica Maye, Keith Johnson)
Voice Onset Time
"The dot" (short VOT) vs. "The tot" (long VOT)
[Histogram: number of tokens produced per 10 ms VOT bin (0-120 ms); /da/ productions cluster at short VOTs ([d]), /ta/ productions at long VOTs ([t])]
Categorical Perception
Discrimination task: hear [da] (VOT: 20 ms) and [ta] (VOT: 80 ms); respond "same" or "different"
Identification task: hear [da] (VOT: 20 ms); label it /d/ or /t/
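Categorical perception makes a quantitative prediction: discrimination should be only as good as the labels allow. A minimal sketch of the classic Haskins-style prediction, assuming a logistic identification function (the boundary and slope values below are invented for illustration):

```python
import numpy as np

vot = np.arange(0, 81, 10)               # VOT continuum (ms)
boundary, slope = 40.0, 0.3              # assumed /d/-/t/ boundary

# Identification: probability of labeling each step /t/ (logistic guess)
p_t = 1 / (1 + np.exp(-slope * (vot - boundary)))

# Haskins-style prediction for two-step ABX pairs: correct
# discrimination only when the two stimuli get different labels:
# P(correct) = 0.5 + 0.5 * (p1 - p2)**2
pred = [0.5 + 0.5 * (p1 - p2) ** 2 for p1, p2 in zip(p_t[:-2], p_t[2:])]
# `pred` peaks for pairs straddling the boundary and sits near
# chance (0.5) for within-category pairs.
```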
Voice Onset Time
[Figure: identification function and identification RT along the VOT continuum, from Phillips et al. (2000)]
MMN = Mismatch Negativity
An ERP (event-related potential) component that reflects sensory discrimination. Elicited by repeated presentation of a sound stimulus (the standard) which is sometimes replaced by a different sound (the deviant): X X X X X X Y X X X X X Y X X X X Y
Elicited pre-attentively!
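A sketch of how such a many-to-one stimulus sequence might be generated (the 15% deviant probability and the minimum-spacing rule are illustrative assumptions, not the parameters of any particular study):

```python
import random

def oddball_sequence(n_trials=500, p_deviant=0.15):
    """Standard/deviant sequence for an MMN oddball block.

    Keeps deviants apart so each one is preceded by several standards
    (the many-to-one relation the MMN depends on).
    """
    seq, since_deviant = [], 99
    for _ in range(n_trials):
        if since_deviant >= 3 and random.random() < p_deviant:
            seq.append("Y")          # deviant
            since_deviant = 0
        else:
            seq.append("X")          # standard
            since_deviant += 1
    return seq

print("".join(oddball_sequence(40)))  # e.g. XXXXXYXXXXXXYXXXX...
```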
from Näätänen (1999)
N1 (N100) and the difference wave (= deviant - standard). NOTICE: negative voltage plotted up, positive voltage down. From Näätänen (1999)
N1: an obligatory ERP component; reflects sensory encoding of auditory stimulus attributes
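In practice the MMN is measured from averaged epochs. A minimal numpy sketch with placeholder data standing in for real recordings (the deviant-minus-standard convention follows the difference wave described above):

```python
import numpy as np

# Epochs: EEG trials time-locked to stimulus onset (trials x samples);
# random placeholder arrays stand in for real recordings
standard_epochs = np.random.randn(400, 300)
deviant_epochs = np.random.randn(60, 300)

# ERPs are averages over trials; averaging cancels activity that is
# not time-locked to the stimulus
erp_standard = standard_epochs.mean(axis=0)
erp_deviant = deviant_epochs.mean(axis=0)

# Difference wave: the MMN shows up as a negativity (deviant minus
# standard) roughly 100-250 ms after the point of deviance
difference_wave = erp_deviant - erp_standard
```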
Discriminability of phones by VOT (methods)
- Behavioral level: categorical perception
- Electrophysiological level: MMN
Looking at VOT: [dæ] vs. [tæ]
- Behavioral data
- EEG: N1 (sensory encoding)
- EEG: MMN (sensory discrimination)
Sharma & Dorman (1999): behavioral experiment
Discrimination: AX task; performing at chance level (within-category pairs)
Sharma & Dorman (1999): MMN experiment
Standard/deviant VOT pairs: 30 vs. 50 ms and 60 vs. 80 ms
Levels of Representation
Acoustics: variation in air pressure; analog input to the auditory system
Phonetics: language-specific categorization of different acoustic tokens; discriminability of different acoustic tokens relatively preserved
Phonology: abstract symbolic representations; fine-grained distinctions irrelevant; all-or-nothing category membership
Questions
What kinds of representation is the MMN sensitive to? Acoustic? Phonetic? Phonemic? How can we be sure it's not just acoustics?
Potential problem
How can we be sure it's not just acoustics? There seems to be a difference between the 30-50 and the 60-80 MMN responses; but what if this difference has nothing to do with the phonetic category people perceive? Could it be that there is something special about the 30-50 ms gap, for instance?
Perception of VOT
[Figure: identification function and identification RT along the VOT continuum, from Phillips et al. (2000)]
Potential problem
If chinchillas show the same categorical perception behavior on the VOT continuum, the response is probably not based on phonetics.
Potential problem
Neuroscience evidence: VOTs below 30 ms and above 60 ms are encoded by different neuronal populations in the mammalian auditory system than VOTs in the 30-60 ms range.
Potential problem
Could it be that there is something special about the 30-50 ms gap? There is, apparently. How can we be sure it's not just acoustics? With these results alone, we can't.
Suggestions?
Can we come up with ways to test whether the MMN response is sensitive to the phonetic and phonological levels of representation? Requirement: a many-to-one ratio (X X X X Y)
Look at sounds that are phonemic in one language but not in the other.
- Näätänen et al. (1997)
Näätänen et al. (1997)
Looking for language-dependent memory traces for sounds. Vowels: Finnish vs. Estonian
Vowels varying only in F2
Estonian has an extra vowel (/õ/); the stimuli differ only in their F2 values
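A sketch of how such a continuum could be synthesized, reusing `source`, `fs`, and `formant_filter` from the source-filter sketch above. Apart from the 1,311 Hz value quoted in the conclusions, the formant values are illustrative guesses, not the published stimulus parameters:

```python
# Synthetic vowels that differ only in F2, holding the source and the
# other formants constant, in the spirit of Näätänen et al. (1997)
f2_steps = [1000, 1100, 1200, 1311, 1400]      # only F2 varies
continuum = []
for f2 in f2_steps:
    v = source
    for freq, bw in [(450, 80), (f2, 90), (2500, 120)]:  # F1, F2, F3
        v = formant_filter(v, freq, bw, fs)
    continuum.append(v)
```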
MMN difference wave (= deviant - standard). NOTICE: negative voltage plotted up, positive voltage down. From Näätänen (1999)
Pure Tones with freq = F2
F2 pure tones vs. vowels
Vowels: nonmonotonic increase (a drop); pure tones: linear increase (no drop)
Vowels: Finns vs. Estonians
Finns: drop; Estonians: no drop
Finns vs. Estonians
MMN peak amplitude at Fz: Finns (blue), Estonians (purple); the Finnish curve shows the drop
MEG data - dipole model; the drop appears here as well
Conclusions
Tone and vowel data are dissimilar for Finnish speakers, even though what is being varied in the two conditions is the exact same acoustic quantity.
- Finns judge a vowel with an F2 of 1,311 Hz to be a very bad instance of /ö/
- Estonians have the vowel /õ/ and judge a vowel with an F2 of 1,311 Hz to be a good instance of /õ/
- The Estonian vowel MMN data are more in line with the Finnish tone data
Any problems? Are you convinced? Does this show that the MMN is indeed sensitive to phonemic categories? Could these results be explained on a purely acoustic basis? Could these results be explained on a purely phonetic basis?
Levels of Representation
Acoustics: variation in air pressure; analog input to the auditory system
Phonetics: language-specific categorization of different acoustic tokens; discriminability of different acoustic tokens relatively preserved
Phonology: abstract symbolic representations; fine-grained distinctions irrelevant; all-or-nothing category membership
Phillips et al. (2000)
Question: is the MMN sensitive to phonological categories?
- Abstract symbolic representations
- Fine-grained distinctions irrelevant
- All-or-nothing category membership
Phillips et al. (2000)
Template of MMN design: X X X X X X Y X X X X X Y X X X X Y
Sharma & Dorman (1999) VOT values: 30 30 30 30 30 50 30 30 30 30 30 50 30 30 / 60 60 60 60 60 80 60 60 60 60 60 80 60 60
Here the many-to-one ratio holds at all levels. Let's construct a many-to-one ratio only at the phonological level:
Phillips et al. (2000) VOT values: 8 16 0 24 16 48 0 24 16 0 24 8 64 16 8 56 0
Perception of VOT
[Figure: identification function and identification RT along the VOT continuum, from Phillips et al. (2000)]
Many-to-one only at the phonological (P) level
Phillips et al. (2000) VOT values:
A: 8 16 0 24 16 48 0 24 16 0 24 8 64 16 8 56 0
P: D D  D D  D  T  D D  D  D D  D T  D  D T  D
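A sketch of how such a sequence can be built. The VOT sets are taken from the slide; the boundary value and the deviant probability are illustrative assumptions:

```python
import random

BOUNDARY = 30   # approximate /d/-/t/ VOT boundary (ms), per the identification data

def phillips_sequence(n=200, p_deviant=0.15):
    """Token sequence in the spirit of Phillips et al. (2000).

    Acoustically every token differs (VOT varies within category), so
    the many-to-one standard/deviant relation holds only at the
    phonological level (/d/ vs. /t/).
    """
    standards = [0, 8, 16, 24]     # all /d/: below the boundary
    deviants = [48, 56, 64]        # all /t/: above the boundary
    seq = []
    for _ in range(n):
        pool = deviants if random.random() < p_deviant else standards
        seq.append(random.choice(pool))
    return seq

vots = phillips_sequence()
labels = ["T" if v > BOUNDARY else "D" for v in vots]
```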
Results: Exp. 1
What if not PhonCat, but...
What if the results are not due to phonological categories, but to something as prosaic as the VOT difference between adjacent sounds? From standard to standard, the VOT difference could span 0 to 24 ms (mean 12); from standard to deviant, it could go from 14 to 72 ms (mean 40). How can we address this?
Exp. 2 - Acoustic control
Add 20 ms of VOT to all sounds, so that the relative distances between them remain the same but the proportion of sounds falling on each side of the boundary changes: there are no longer many-to-one relations at the phonological level.
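Continuing the sketch above, the acoustic-control manipulation is just a constant shift:

```python
# Exp. 2 control: shift every VOT up by 20 ms. Relative distances are
# unchanged, but tokens now straddle the category boundary, so the
# many-to-one relation at the phonological level is destroyed.
shifted = [v + 20 for v in vots]                    # 0->20, 8->28, 48->68, ...
shifted_labels = ["T" if v > BOUNDARY else "D" for v in shifted]
# Standards 16+20=36 and 24+20=44 now fall on the /t/ side, so the
# "standard" set is no longer phonologically uniform; no MMN is
# predicted on the phonological account, and none was found.
```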
No MMN for acoustic condition
Phillips et al. (2000) conclusion: the MMN here is driven by phonological category membership, not acoustics.
Question Are you convinced? Can we be sure this result does not stem from acoustics? What about phonetic categories?
No Abstract Categories
Simply map acoustic tokens into a multidimensional space. Representations are not discrete or abstract; fine phonetic detail is stored. But there may still be speech-specific processing.
[Figure: clouds of /t/ and /d/ tokens in phonetic space]
(Dennis Klatt, Stephen Goldinger, Peter Jusczyk, Jessica Maye, Keith Johnson)
VOT Distribution
[Histogram: frequency of tokens by VOT bin, 5-145+ ms]
Do We Even Have Categories?
"Perhaps we should not even be asking if infants have well-formed phonetic categories, separated by boundaries, but rather if any language users do. In other words, the very concept of categories, and even more so of boundaries, needs to be reconsidered... we have no evidence that boundaries exist in the natural world, or any account of how or why they may have evolved by natural selection. To extend to them any degree of psychological reality is unsupportable, and deleterious to efforts to understand how phonetic structure is indeed instantiated and retrieved from the speech signal." Nittrouer (2001)
A Phonetic Explanation for Phillips et al. (2000)
The MMN could be induced by sampling from the statistical distributions of phonetic categories. No need to rely on abstract phonological categories if this is how we conceive of the phonetic space: sampling from (mapping into) a different distribution could itself elicit the MMN.
[Figure: standards sampled from the /d/ token cloud; a deviant landing in the /t/ cloud elicits the MMN]
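A sketch of this distributional story; the category means and standard deviations are invented for illustration, and only the deviant VOT value comes from the slide:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical VOT distributions for the two phonetic categories
d_cat = norm(loc=10, scale=8)     # /d/-like tokens: short VOT
t_cat = norm(loc=70, scale=15)    # /t/-like tokens: long VOT

standards = d_cat.rvs(size=200)   # standards sampled from the /d/ cloud
deviant_vot = 56.0                # a deviant token from the slide

# On this story, the MMN arises because the deviant is (nearly)
# impossible under the distribution the standards were drawn from;
# no abstract /d/ vs. /t/ phoneme labels are needed.
p_under_standards = d_cat.pdf(deviant_vot)   # vanishingly small
p_under_t = t_cat.pdf(deviant_vot)           # much larger
```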
Kazanina et al. (2006)
Dupoux et al. (1999)
Phonotactics seems to influence how people perceive phonetic sounds. Look at Japanese borrowed words.
Japanese has a restricted syllabic inventory compared to languages such as English and French: V, CV, CV+nasal, CVQ (Q = first half of a geminate consonant).
In these borrowed words, is the inserted vowel a matter of production, perception, or orthography?
Dupoux et al. (1999), Exp. 1
Use unambiguous stimuli and manipulate native language. Hypothesis: vowel epenthesis is a perceptual phenomenon. When presented with items like "ebzo", native French speakers should be fine, but Japanese speakers should report hearing a [u].
Dupoux et al. (1999), Exp. 1
A Japanese speaker recorded pseudowords of the structure VCuCV; the middle [u] was spliced out to different degrees (from virtually erased to only slightly reduced). Subjects heard the stimuli and reported whether or not they heard a [u].
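A sketch of the splicing manipulation; the helper function, sample indices, and step values are hypothetical, and real stimuli would use hand-marked vowel boundaries in recorded waveforms:

```python
import numpy as np

def splice_out_u(signal, u_start, u_end, keep_fraction):
    """Remove part of the medial [u] from a VCuCV pseudoword.

    Hypothetical helper: u_start/u_end are hand-marked sample indices
    of the vowel; keep_fraction=1.0 leaves it intact, 0.0 removes it
    entirely, yielding a VCCV token.
    """
    u_len = u_end - u_start
    keep = int(u_len * keep_fraction)
    # keep the initial portion of the vowel, drop the remainder
    return np.concatenate([signal[:u_start + keep], signal[u_end:]])

# Placeholder waveform standing in for a recorded "ebuzo"-type item
signal = np.zeros(8000)
steps = [1.0, 0.75, 0.5, 0.25, 0.0]   # from full [u] down to none
continuum = [splice_out_u(signal, 3000, 4200, k) for k in steps]
```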
Dupoux et al. (1999), Exp. 1
Japanese participants reported many more [u]s than French speakers did when there was little or no [u] information in the signal. BUT: the talker was a Japanese speaker; could there be a coarticulation cue in the preceding consonant? [u] is often reduced or devoiced in Japanese, and Japanese listeners might be extra sensitive to subtle coarticulation cues indicating [u].
Dupoux et al. (1999), Exp. 2
Coarticulation cue from a Japanese talker? Get a French speaker! Have the French speaker articulate true VCCVs as well as VCuCV items. The rest is the same as in Exp. 1.
Dupoux et al. (1999), Exp. 2
Even with no coarticulation cue, Japanese speakers reported hearing [u] in VCCV nonwords.
How Early? (ERPs)
Dehaene-Lambertz et al. (2000)
[ERP effects at 164 ms, 315 ms, and 531 ms]
Dehaene-Lambertz et al. (2000)
A quick word on cortical connectivity in speech perception...
Geschwind Model
Hickok & Poeppel (2007)
To wrap up...
1. Speech perception involves a complex mapping between acoustic input and long-term memory.
2. Cognitive neuroscience methods can be used to ascertain the representational nature of speech segments.
3. These methods help us understand how the brain encodes speech representations.
4. Auditory cortex seems to store speech segments in phonemic form (at least in addition to phonetic representations).