How does the brain acquire phonetic (and phonological) knowledge and where is it stored?
Bernd J. Kröger
Thank you for the invitation!

Bernd J. Kröger
Neurophonetics Group, Department of Phoniatrics, Pedaudiology, and Communication Disorders, RWTH Aachen University, Germany
and School of Computer Science and Technology, Tianjin University, China

Preliminary Note
- This talk is mainly based on computer simulation experiments.
- The experiments use a neurocomputational model of speech production, perception, and acquisition (Kröger et al. 2009).
- Three working modes of the model:
  - Speech acquisition: babbling and imitation (Kröger et al. 2012)
  - Speech production
  - Speech perception
- Hypotheses concerning brain regions
[Figure labels: phonetics; physics, computer science; cognitive sciences, neuroscience]

Outline
- The Structure of the Model
- Speech Acquisition: How to feed in Knowledge?
- Related brain regions
- Further Work

Assumptions for Structure of the Model
- Four neural maps (layers), i.e. four different assemblies of model neurons.
- Three state maps (motor plan, auditory, somatosensory) are part of working memory (distributed motor and sensory representations).
- One SOM (Kohonen) is part of long-term memory.
- The SOM learns sensorimotor associations; training leads to an adjustment of synaptic weights.
- A random pattern generator provides the babbling training set.
- Timing: working memory t = 250 msec; execution t = 12.5 msec; sensory processing t = 12.5 msec, then temporal storage.
- Lower level: production-perception loop via the vocal tract model (Birkholz et al. 2007).
[Figure labels: random pattern generator (babbling training set); motor plan map; long-term memory SOM (Kohonen); working memory; auditory map; somatosensory map; vocal tract model]

Structure of the Model
- SOM states are local: one model neuron represents a syllabic state.
- For production, synaptic connections with the same link weights are needed back from the SOM to the state maps.
[Figure labels: long-term memory SOM (Kohonen); working memory t = 250 msec; motor plan map; auditory map; somatosensory map; execution t = 12.5 msec; sensory processing t = 12.5 msec, then temporal storage; vocal tract model (Birkholz et al. 2007)]

Structure of the Model
One further extension of the model is needed: a connection between the sensorimotor and the cognitive modules -> four state maps.
- Babbling: exploring one's own vocal tract (learning sensorimotor relations); uses the three state maps introduced earlier.
- Imitation: acoustic data from an external speaker plus linguistic information (communication).
[Figure labels: face-to-face communication, triangulation; motor plan map; long-term memory SOM (Kohonen); phonemic map; working memory t = 200 msec; auditory map; somatosensory map; execution; sensory processing; vocal tract model; external speaker]

Structure of the Model
After training, the synaptic link weights represent the different states for each SOM neuron.
- Production: activate a SOM neuron (from the top); motor plan and auditory states are co-activated.
- Perception: calculate a winner neuron (from the bottom); the phonemic state is co-activated.
- Knowledge is stored in the neural links.
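The two readout directions can be sketched in a few lines of Python. This is only an illustration under assumptions: the dictionary layout, the neuron names, all numbers, and the squared-distance winner criterion are hypothetical and not taken from the model's implementation.

```python
def produce(som, neuron):
    """Production: activate one SOM neuron from the top and read its link
    weights as the co-activated motor plan and auditory states."""
    links = som[neuron]
    return links["motor"], links["auditory"]

def perceive(som, auditory_input):
    """Perception: find the winner neuron for an auditory state from the
    bottom and return it with the phonemic state it co-activates."""
    def dist2(neuron):
        # Assumption: the winner is the neuron with the closest auditory weights.
        return sum((a - w) ** 2
                   for a, w in zip(auditory_input, som[neuron]["auditory"]))
    winner = min(som, key=dist2)
    return winner, som[winner]["phonemic"]

# Hypothetical trained map with two neurons; all values are made up.
som = {
    "n0": {"motor": [1.0, 0.0], "auditory": [0.2, 0.9],
           "phonemic": {"/i/": 0.95, "/a/": 0.02}},
    "n1": {"motor": [0.0, 1.0], "auditory": [0.8, 0.3],
           "phonemic": {"/i/": 0.05, "/a/": 0.90}},
}

motor_plan, auditory_state = produce(som, "n1")
winner, phonemic_state = perceive(som, [0.75, 0.35])
```

In both directions the only stored quantities are the link weights, which is the sense in which knowledge is stored in the neural links.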

Model Neurons
- Neural activation is quantified by mean activation rates within a specific time period (here: 250 msec, the duration of a syllable).
- Activation rate models are simple but capable of modeling important aspects of working and long-term memory (Oberauer 2009: memory capacity).
- In addition, a model neuron summarizes the activity of an assembly of real neurons that are near in space (e.g. a cortical column?).
- Thus our model neurons average over space and time.
- Cortical model neurons are ordered in 2D maps.
[Figure labels: map 1; map 2 (SOM) (Spitzer 2000, after Mumford 1992)]
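What averaging over space and time means can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that the underlying activity is given as spike times for each real neuron of an assembly; the slides only state that mean activation rates are used, so the spike representation, the function name, and all numbers are hypothetical.

```python
def mean_activation_rate(spike_times_per_neuron, window_start, window_length):
    """Average firing rate (spikes per second) over all neurons of an
    assembly and over one time window; times are given in seconds."""
    window_end = window_start + window_length
    total_spikes = sum(
        sum(1 for t in spikes if window_start <= t < window_end)
        for spikes in spike_times_per_neuron
    )
    return total_spikes / (len(spike_times_per_neuron) * window_length)

# Hypothetical assembly of three neurons; spike times are made up.
assembly = [
    [0.010, 0.060, 0.120, 0.200],
    [0.030, 0.150],
    [0.050, 0.100, 0.240, 0.300],  # the spike at 0.300 s lies outside the window
]

# One 250 msec window, i.e. the duration of a syllable in the model.
rate = mean_activation_rate(assembly, window_start=0.0, window_length=0.250)
```

The single number `rate` is what one model neuron would carry for this assembly and this syllable-length window.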

Learning
- First, a winner neuron is identified for each training item.
- Hebbian learning: within a neighborhood kernel (center = winner neuron), the synaptic weights w_ij between the SOM and the state maps are updated:

  $$ w_{ij}(t+1) - w_{ij}(t) = N_j(t) \, L(t) \, \bigl( s_i - w_{ij}(t) \bigr) $$

  with
  - N_j(t): neighborhood kernel around the best matching unit (BMU) for a specific training stimulus S = (s_1, s_2, ..., s_n); steadily decreasing with time during learning
  - L(t): learning factor; steadily decreasing with time during learning
  - i = 1, ..., n over all state maps (input); j = 1, ..., M over the SOM neurons
- The synaptic weights w_ij approach the (generalized) stimulus activation pattern s_i -> unsupervised learning.
[Figure: SOM receiving the input s across all state maps]
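A single learning step of this kind can be sketched as follows. The update line follows the rule on the slide; the Gaussian shape of the neighborhood kernel, the squared-distance winner criterion, the 3x3 map, and all numbers are assumptions made for the example.

```python
import math

def gaussian_kernel(grid_pos, winner_pos, radius):
    """Neighborhood value N_j: 1 at the winner, falling off with grid distance.
    The Gaussian shape is an assumption; the slide only names a kernel."""
    d2 = (grid_pos[0] - winner_pos[0]) ** 2 + (grid_pos[1] - winner_pos[1]) ** 2
    return math.exp(-d2 / (2.0 * radius ** 2))

def find_winner(weights, stimulus):
    """Grid position of the best matching unit (smallest squared distance)."""
    def dist2(pos):
        return sum((s - w) ** 2 for s, w in zip(stimulus, weights[pos]))
    return min(weights, key=dist2)

def som_update(weights, stimulus, learning_factor, radius):
    """Apply w_ij += N_j * L * (s_i - w_ij) to every SOM neuron j, in place."""
    winner = find_winner(weights, stimulus)
    for pos, w in weights.items():
        n_j = gaussian_kernel(pos, winner, radius)
        weights[pos] = [w_i + n_j * learning_factor * (s_i - w_i)
                        for w_i, s_i in zip(w, stimulus)]
    return winner

# Tiny 3x3 map with a 2-dimensional input; all weights start at 0.5.
weights = {(r, c): [0.5, 0.5] for r in range(3) for c in range(3)}
weights[(0, 0)] = [0.9, 0.1]  # make one neuron the clear winner
stimulus = [1.0, 0.0]

winner = som_update(weights, stimulus, learning_factor=0.5, radius=1.0)
```

In a full training run, `learning_factor` and `radius` would be lowered from step to step, which is what lets the map first order itself globally and then settle locally.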

Outline
- Introduction: Speech is Movement!
- The Structure of the Model
- Speech Acquisition: How to feed in Knowledge?
- Related brain regions
- Further Work

Speech Acquisition
- Six simulation experiments for speech acquisition
- Later: three simulation experiments testing performance (speech production, speech perception)

List of simulation experiments: Speech Acquisition
Prelinguistic:
1. Protovocalic babbling: 1076 training items; 15x15 SOM
2. Protoconsonantal babbling: 279 training items; 15x15 SOM
Linguistic, artificial language:
3. Vocalic imitation (model language; 5 vowels [i, e, a, o, u]): 500 training items; 15x15 SOM
4. Consonantal imitation (model language; 15 CV syllables with [b, d, g]): 465 training items; 15x15 SOM
5. Imitation of a symmetrical model language (60 syllables: V, CV, CCV; consonants [b, d, g, p, t, k, m, n, l]; clusters [bl, gl, pl, kl]): 600 training items; 25x25 SOM (no generalization)
Linguistic, natural language:
6. Imitation of natural language (200 most frequent syllables of Standard German): 703 training items; 25x25 SOM (no generalization)
500 to 150 training cycles per experiment (babbling to imitation); one cycle = random application of all training items.
The main results: (a) association of sensory and motor states; (b) ordering of states (syllables) with respect to phonetic features; (c) emergence of phoneme regions at SOM level.

Experiment 1 and 3: Vowel Babbling and Imitation: Training Items
- Babbling items (red points): 1076 items.
- Imitation items (blue squares, green diamonds): 100 realizations of each phoneme; the variability of phoneme realizations is adapted from natural data (overlap).
[Figure: training items, with regions labeled /i/, /e/, /a/, /o/, /u/]

Training Results: Phonetic Map for Vowels
The phonetic map now associates motor plan, sensory, and phonemic states.
After babbling:
1) An ordering occurs with respect to the vocalic dimensions back-front and low-high.
2) An association of sensory and motor states occurs (grey bars, red lines).
After imitation, in addition:
3) An association with phonemic states occurs; phoneme regions occur (variation: exemplars). A neuron (box) is outlined if its phonemic link weight value for a phoneme is > 0.8 (80%).
[Figure: phonetic map with dimensions low-high and back-front; phoneme regions /u/, /o/, /i/, /e/, /a/]

Training Results: Phonetic Map of CV-items
15x15 phonetic map: each box represents one neuron.
Grey bars and red lines represent neural link weights to state maps:
- Auditory link weights: formant transitions
- Motor plan link weights: 5 bars (grey)
  - first three: closure: lab/api/dors
  - last two: proto-vowel: back-front, low-high
An association of motor plan and sensory states occurs for each neuron.
An ordering occurs with respect to
1) place of articulation: lab/api/dor
2) proto-vocalic dimensions: low-high / front-back (for each consonantal place)
[Figure: phonetic map with regions labeled lab, api, dor, each annotated with front, back, low]

Phonetic Map of Model Language
Model language:
- V = /i, e, a, o, u/
- C = plosives /p, t, k, b, d, g/, nasals /m, n/, and lateral /l/
- CCV: first C = plosives /b, g, p, k/; second C = lateral
- 60 syllables with 10 realizations per syllable: 600 stimuli, exposed to the network 10 times each
Result: strong phonetic ordering: 1) V-, CV-, and CCV-regions are separated; 2) place and manner of articulation; 3) vowels; 4) voice.
Strong phonetic ordering results because the model language is completely symmetric, i.e. all syllables / phoneme combinations have the same frequency.
[Figure: phonetic map with regions V, CV, CCV; vowels i, e, a, u, o; place lab, api, dor; manner plos, nas, lat; voiced, voiceless]

Experiment 6: Training a Natural Language
- Children's book database: 40 books (up to age 6), Standard German (transcription)
- 6513 sentences; 70512 words; 8217 different words; 4763 different syllables
- The 200 most frequent syllables were realized by one speaker (in sentences), between 27 times and once each: 703 realizations (proportional to frequency)
- Articulatory resynthesis: 703 motor plan states and appropriate sensory states
- 300 exposures to the network per training item: 210900 training steps

Rank of syllable | Frequency in corpus | Number of training items
1                | 2367                | 27
20               | 692                 | 8
50               | 390                 | 4
100              | 193                 | 2
200              | 88                  | 1
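The numbers on this slide can be checked with a few lines of arithmetic. The sketch assumes that the flattened table reads as rank, frequency in corpus, and number of training items; the variable names are illustrative only.

```python
# Values taken from the slide.
training_items = 703
exposures_per_item = 300
training_steps = training_items * exposures_per_item

# (rank, frequency in corpus, number of training items) for the listed ranks.
rows = [(1, 2367, 27), (20, 692, 8), (50, 390, 4), (100, 193, 2), (200, 88, 1)]

# Corpus occurrences per training item; this ratio stays roughly constant
# if training items are allocated in proportion to syllable frequency.
ratios = {rank: freq / items for rank, freq, items in rows}

print("training steps:", training_steps)
for rank, ratio in ratios.items():
    print(f"rank {rank:>3}: {ratio:.1f} corpus occurrences per training item")
```

Under this reading of the table, every listed rank comes out at roughly one training item per 90 corpus occurrences, which is consistent with the statement that realizations are proportional to frequency.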

Hypermodal Phonetic Map
Phonetic map (25x25 SOM) after training, zoomed in. Neurons are marked if they represent a phonemic state (excitatory synaptic weight > 80%).
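The marking rule can be written down directly. The sketch below assumes that each neuron's phonemic link weights are available as a mapping from syllable to weight; the function name, the data layout, and the example weights are hypothetical.

```python
THRESHOLD = 0.8  # from the slide: a neuron is marked above 80%

def mark_neurons(phonemic_weights, threshold=THRESHOLD):
    """Map each SOM neuron to the syllable it represents, or to None if no
    phonemic link weight exceeds the threshold."""
    labels = {}
    for neuron, links in phonemic_weights.items():
        best = max(links, key=links.get)
        labels[neuron] = best if links[best] > threshold else None
    return labels

# Hypothetical link weights for three neighboring neurons of the map.
phonemic_weights = {
    (0, 0): {"da": 0.93, "di": 0.04},
    (0, 1): {"da": 0.55, "di": 0.40},  # between two syllables: stays unmarked
    (0, 2): {"da": 0.02, "di": 0.88},
}

labels = mark_neurons(phonemic_weights)
```

Unmarked neurons such as `(0, 1)` correspond to the gaps that remain visible in the displayed map.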

Hypermodal Phonetic Map
- More than one SOM neuron may represent a syllable (different realizations).
- A weak ordering of syllables occurs with respect to phonetic features.
- Phonetic features occur at the level of this SOM.
[Figure: map regions labeled @-cluster, CV-region, CVC-region, CCV-region; C1: place of articulation; C1: manner: plosive, fricative, nasal]

Hypermodal Phonetic Map
Display of link weights to the auditory state map: stores how a syllable sounds (auditory memory).
[Figure: map regions labeled @-cluster, CV-region, CVC-region, CCV-region; C1: place of articulation; C1: manner: plosive, nasal, fricative; annotations: stronger [e]-f2, less phonation]

Hypermodal Phonetic Map
Display of link weights to the motor plan state map for the same SOM neurons (motor plan repository).
[Figure: map regions labeled @-cluster, CV-region, CVC-region, CCV-region; C1: place of articulation; C1: manner: plosive, nasal, fricative; annotations: longer [e]-activation, less phonation]

Hypermodal Phonetic Map
Phonetic map (25x25 SOM) after a 2nd training: the learning parameters (learning rate and neighborhood kernel factor) are slightly changed in order to get fewer gaps in the map.
[Figure: link weights to the phonemic map]

Training Results: Exemplar Representation
- The number of SOM neurons representing a syllable is proportional to the number of training items for that syllable (syllable frequency in the target language).
- Neural plasticity: more exemplars are stored for frequent syllables (these require more space in the brain).
[Figure: plots with axes "number of neurons" and "number of training items"; label "number of syllables" (Kannampuzha 2012)]

Learning Curve
- Number of syllables already learned by the SOM as a function of training cycles.
- The increase should be less abrupt: growing SOMs are needed.
[Figure: number of syllables vs. number of cycles (Kannampuzha 2012)]

Training Results: Performance
- Training is stopped once production and perception are learned (i.e. each syllable is represented in the SOM by a phonemic link weight > 0.8).
- Production (states represented by SOM neurons): identification rate of 96% for the 50 most frequent syllables (done by one subject).
- Perception (done by the model itself): test items that differ from the training items (same speaker) are identified with a rate of 92% for the 50 most frequent syllables; the identification rate drops for less frequent syllables (time normalization needs to be included).
- These results come from optimal training data sets.
- Perception replicates important behavioral phenomena: categorical perception is stronger for CV than for V (Kröger et al. 2009, Speech Communication 51: 793-809). This requires training 20 different instances of the model, i.e. 20 different virtual listeners.
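The stopping criterion can be expressed as a simple check over the phonemic link weights. The sketch assumes the weights are stored per neuron as a mapping from syllable to weight; function names, data layout, and the example values are hypothetical.

```python
def learned_syllables(phonemic_weights, syllables, threshold=0.8):
    """Syllables that have at least one SOM neuron whose phonemic link
    weight for that syllable exceeds the threshold."""
    return [
        syl for syl in syllables
        if any(links.get(syl, 0.0) > threshold
               for links in phonemic_weights.values())
    ]

def training_complete(phonemic_weights, syllables, threshold=0.8):
    """True once every syllable of the training set counts as learned."""
    return len(learned_syllables(phonemic_weights, syllables, threshold)) == len(syllables)

syllables = ["da", "di", "ma"]

# Hypothetical snapshots of the map early and late in training.
early = {(0, 0): {"da": 0.85, "di": 0.10},
         (0, 1): {"di": 0.60, "ma": 0.30}}
late = {(0, 0): {"da": 0.95}, (0, 1): {"di": 0.90},
        (1, 0): {"ma": 0.82}, (1, 1): {"da": 0.88}}

early_learned = learned_syllables(early, syllables)
early_done = training_complete(early, syllables)
late_done = training_complete(late, syllables)
```

Counting the entries of `learned_syllables` after each training cycle would give a learning curve of the kind described in the talk.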

Categorical Perception
- Nonlinear relation between the acoustic and the perceptual domain: regions with perceptual constancy are preferred as phoneme regions.
- Phoneme regions are identified by identification experiments.
- Phoneme boundaries show better discrimination of equidistant stimuli (peaks); this is measured in discrimination experiments (e.g. ABX experiments: is A = X or B = X?).
[Figure labels: ba, da, ga]

Categorical Perception
Basis for the experiments:
- an acoustically equidistant stimulus set / continuum (for V and CV)
- a pool of around 20 listeners for performing the experiments (we trained 20 instances of the model!)
Modeling:
- Identification: a SOM winner neuron activates a phonemic state.
- Discrimination is assumed to increase with the physical distance between the activated states within the SOM.
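The two modeling assumptions can be illustrated on a toy map. Everything concrete in the sketch is hypothetical: the one-dimensional auditory values, the neuron positions, the squared-distance winner criterion, and the use of Euclidean grid distance as the discrimination score.

```python
import math

def winner_position(som, stimulus):
    """Grid position of the neuron whose auditory weights are closest."""
    def dist2(pos):
        return sum((s - w) ** 2 for s, w in zip(stimulus, som[pos]["auditory"]))
    return min(som, key=dist2)

def identify(som, stimulus):
    """Identification: the winner neuron activates its phonemic state."""
    return som[winner_position(som, stimulus)]["phoneme"]

def discrimination(som, stimulus_a, stimulus_b):
    """Discrimination score: distance on the map between the two winners."""
    (r1, c1) = winner_position(som, stimulus_a)
    (r2, c2) = winner_position(som, stimulus_b)
    return math.hypot(r1 - r2, c1 - c2)

# Hypothetical map strip with two clusters that lie far apart on the grid.
som = {
    (0, 0): {"auditory": [0.0], "phoneme": "/ba/"},
    (0, 1): {"auditory": [0.1], "phoneme": "/ba/"},
    (0, 4): {"auditory": [0.6], "phoneme": "/da/"},
    (0, 5): {"auditory": [0.7], "phoneme": "/da/"},
}

# Acoustically equidistant continuum of eight stimuli.
continuum = [[i / 10] for i in range(8)]
labels = [identify(som, s) for s in continuum]
scores = [discrimination(som, a, b) for a, b in zip(continuum, continuum[1:])]
```

With these made-up values the discrimination score peaks for the stimulus pair that straddles the category boundary, which is the signature of categorical perception.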

Categorical Perception
- Two stimulus continua: for V from /i/ to /a/ (V = /i e a/) and for CV from /ba/ to /ga/ (CV = /ba da ga/).
- Typical result: stronger categorical perception for CV than for V (see the phoneme boundaries from measured discrimination!).
[Figure: identification, discrimination, and calculated discrimination for 13 V- and CV-stimuli, interpolated between [i], [e], [a] and between [ba], [da], [ga]]

Categorical Perception
- Behavioral data (Pompino-Marschall 1995), adapted from Stevens et al. (1969)
- Modeling (Kröger et al. 2009)

Why? Answer from modeling (supramodal phonetic map including phoneme regions; display of one of 20 "brains"):
- The V-stimuli are continuously distributed within the V-SOM space: one big cluster (V = /i e a o u/).
- The CV-stimuli are more clustered within the CV-SOM space: three small clusters (CV = /ba da ga/).
- This may result from the topological ordering of phonetic features: fitting 3 feature dimensions into 2 anatomical brain dimensions is difficult for CV.
[Figure: 13 V- and CV-stimuli, [i] [e] [a] and [ba] [da] [ga], shown on the maps]

Outline
- Introduction: Speech is Movement!
- The Structure of the Model
- Speech Acquisition: How to feed in Knowledge?
- Related brain regions (not published thus far!)
- Further Work

Related brain regions: Where are the maps located? A hypothesis!
- 4 state maps (working memory)
- One SOM (long-term memory)
[Figure labels: face-to-face communication, triangulation; motor plan map; long-term memory SOM (Kohonen); phonemic map; working memory t = 200 msec; auditory map; somatosensory map; execution; sensory processing; vocal tract model; external speaker]

Hypothetical Cortical Regions associated with specific neural maps in our model
- Primary maps (in PAs).
- State maps (in UAAs and HAAs); following Guenther (2006): error maps.
- Phonetic map (SOM): needs to be close to all state maps (complex mappings); only solution: two hubs.
- Neural mappings: each to each.
- Neural pathway over a long distance (arcuate fasciculus): just copying / mirroring activation patterns one to one.
[Figure: cortex with lobes labeled frontal, parietal, occipital, temporal; maps labeled motor plan, somatosensory, auditory, phonemic (Prosiegel & Paulig 2002)]
Abbreviations: LA: limbic area; PA: primary area; UA: unimodal association; HA: heteromodal association.

Structure of the Model
Goal: verification of the model structure by imaging experiments.
[Figure labels: mirroring pathway (long distance); heteromodal cortical areas; phonemic map; unimodal association areas; motor plan map; frontal; phonetic map; temporal; auditory map; primary cortical; somatosensory map; parietal; peripheral; vocal tract model; external speaker]


Further Work
- More realistic (non-ideal) settings for obtaining training data: including imperfect imitation; including different speakers for imitation (how does speaker normalization take place in the model?)
- Growing SOM approach (acquisition -> maps grow with input)
- Underpinning the model with more behavioral and brain imaging data (e.g. imaging studies: Eckers, Heim, Kröger)

Acknowledgements
- Jim Kannampuzha (Dipl.-Inf.): programming. Dept. of Phoniatrics, Pedaudiology, and Communication Disorders, RWTH Aachen University; now: Head Acoustics GmbH, Aachen
- Cornelia Eckers (M.Sc.): fMRI experiments. Dept. of Phoniatrics, Pedaudiology, and Communication Disorders, RWTH Aachen University

Please add a realistic brain model! Thank you! Literature: www.speechtrainer.eu