Neural Processing of What and Who Information during Spoken Language Processing



Bharath Chandrasekaran1, Alice H. D. Chan2, and Patrick C. M. Wong3,4

1University of Texas at Austin; 2Nanyang Technological University, Singapore; 3Communication Neural Systems Group; 4Northwestern University, Evanston, IL

Abstract

Human speech is composed of two types of information, related to content (lexical information, i.e., what is being said, e.g., words) and to the speaker (indexical information, i.e., who is talking, e.g., voices). The extent to which lexical versus indexical information is represented separately or integrally in the brain is unresolved. In the current experiment, we use short-term fMRI adaptation to address this issue. Participants performed a loudness judgment task during which single or multiple sets of words/pseudowords were repeated with single (repeat) or multiple talkers (speaker-change) conditions while BOLD responses were collected. As reflected by adaptation fMRI, the left posterior middle temporal gyrus, a crucial component of the ventral auditory stream performing sound-to-meaning computations (the "what" pathway), showed sensitivity to lexical as well as indexical information. Previous studies have suggested that speaker information is abstracted during this stage of auditory word processing. Here, we demonstrate that indexical information is strongly coupled with word information. These findings are consistent with a plethora of behavioral results demonstrating that changes to speaker-related information can influence lexical processing.

INTRODUCTION

How the brain represents objects such as faces and words has been a topic of intense research in cognitive neuroscience. According to abstractionist models of object representation, only features necessary for immediate object recognition are extracted in regions crucial for object recognition, leading to insensitivity to exemplars of the object (Pallier, Colome, & Sebastian-Galles, 2001; Bowers, 2000; Medin & Smith, 1984; Reed, 1972). In contrast, exemplar-based models posit that features are stored integrally as a unique memory trace, leading to fine-grained, gradient sensitivity to exemplars of the object (Nosofsky & Zaki, 2002; Nosofsky & Johansen, 2000; Goldinger, 1996; Tenpenny, 1995; Palmeri, Goldinger, & Pisoni, 1993). For example, processing in the fusiform face area in humans, a region involved in face recognition, is argued to be broad, abstract, and selective to faces irrespective of viewpoint, location, size, and illumination (Grill-Spector & Sayres, 2008), suggesting abstract representation of faces. Recent studies have, however, demonstrated narrow tuning to face exemplars in the fusiform face area, suggesting integral processing of object features and, therefore, sensitivity to exemplars (Gilaie-Dotan & Malach, 2007). The fundamental issue of abstractionist versus exemplar-based models is relevant for all forms of object and conceptual representation in the brain, not just face perception.

Words are objects we commonly encounter. In spoken language, information related to content ("what," e.g., words or lexical information) and information related to the speaker ("who," e.g., gender, emotional status, accent, collectively called indexical information) are conveyed simultaneously. To what extent "what" and "who" information are integrally represented in our nervous system has been a subject of research over the past decades (Formisano, De Martino, Bonte, & Goebel, 2008; Bradlow, Nygaard, & Pisoni, 1999; Nygaard & Pisoni, 1998; Goldinger, 1996; Nygaard, Sommers, & Pisoni, 1994; Pisoni, 1993; Mullennix, Pisoni, & Martin, 1989). Although behavioral and neuropsychological methods have been used to address this question, neurophysiological evidence remains sparse.
Neuropsychological evidence suggests that pure word deafness can exist independent of deficits in speaker identification (Gazzaniga, Glass, Sarno, & Posner, 1973). Similarly, deficits in voice identification can exist without concomitant issues in word recognition (Van Lancker, Cummings, Kreiman, & Dobkin, 1988). Extant neural models of speech processing argue that indexical information may not be retained when sound interfaces with meaning (Poeppel, Idsardi, & van Wassenhove, 2008). For example, processing in the posterior middle temporal gyrus (pMTG), a region known to be involved in lexical processing (Lau, Phillips, & Poeppel, 2008), is argued to be invariant to speaker-related information. However, behavioral research on speech perception has consistently demonstrated that speaker-related information, in fact, influences processing of linguistic information (Goldinger, 1996; Palmeri et al., 1993) and vice versa (e.g., Perrachione & Wong, 2007), suggesting that lexical and indexical information are represented integrally in the brain. Thus, the extent to which lexical and indexical information are represented independently or integrally in the brain remains an unresolved question.

Massachusetts Institute of Technology, Journal of Cognitive Neuroscience X:Y, pp. 1-11

A previous study found fMRI adaptation (i.e., reduced BOLD activity when stimuli are repeated) exclusively for words in the left pMTG, a region known to be important for lexical-semantic processing (Gagnepain et al., 2008). In fMRI adaptation paradigms, a stimulus is repeated multiple times as BOLD responses are recorded. Repetition of identical stimuli results in a reduction in BOLD activity, called repetition suppression, in certain regions of the brain. Repetition suppression is argued to reflect the nature of the neural representation of the repeating stimulus (Grill-Spector, Henson, & Martin, 2006). Features of the stimulus are then varied to examine the extent to which the underlying neural representation is sensitive to the change in stimulus properties. If repetition suppression occurs despite a change in a feature, we can conclude that the feature change is irrelevant to the neural representation. In contrast, if repetition suppression does not occur following a change in a feature, we can conclude that the feature is relevant to the neural representation of the stimulus. The purpose of the current study was to use an fMRI adaptation paradigm to examine the neural representation of lexical information and to understand how changes in indexical information affect neurophysiological responses to lexical information. A 2 (word/pseudoword) × 3 (repeat, item-change, speaker-change) factorial design was used to address this issue. Participants listened to sets of four words or pseudowords produced in one of three possible conditions as BOLD responses were measured. In one condition, the same word or pseudoword was repeated by a single talker (repeat).
In the second condition, different items (words or pseudowords) were presented by a single talker (item-change). Finally, in the third condition, the same word or pseudoword was produced by different talkers (speaker-change). By using words and pseudowords in the same design, we can examine whether repeating words, relative to pseudowords, results in a reduction in BOLD activity in regions that are important for lexical processing (i.e., the left pMTG). Consistent with a previous study, we expect repetition suppression (a reduction in BOLD response for repeat trials relative to item-change trials) exclusively for words in regions that are crucial for word processing (the left pMTG). If a change in speaker information does not interfere with the neural representation of words (as per abstractionist models), we expect repetition suppression even when the speaker information changes (i.e., a reduction in BOLD response for speaker-change trials relative to item-change trials). In contrast, if speaker information is strongly coupled with words (as per exemplar models), we expect that a change in speaker would not result in repetition suppression for speaker-change trials relative to item-change trials. Our findings demonstrate that indexical information is coupled with words in the region of the brain that is involved in word processing (the left pMTG). These findings provide a basis for understanding the integration of "what" and "who" information demonstrated in the previous speech perception literature and argue for a need to revise current neural models of speech perception that posit a loss of the surface details that encode a speaker during word processing.

METHODS

Participants

Participants were 14 adult native speakers of American English (mean age = 22.5 years, SD = 2.3 years; seven men).
All participants were college graduates, reported no neurological or speech-language deficits, and passed a hearing screening (<25 dB HL hearing threshold at 500, 1000, 2000, and 4000 Hz). In addition, participants were right-handed, as assessed by the Edinburgh Handedness Inventory (Oldfield, 1971). Participants provided informed consent, and all experimental procedures were approved by the Northwestern University Institutional Review Board.

Stimuli

The stimulus set consisted of 180 monosyllabic English words and 180 monosyllabic pseudowords. Pseudowords were selected to closely match the words in terms of syllables, phoneme length, and bigram and trigram frequency. The 360 stimuli were further subdivided into 12 groups of 30 stimuli each (six groups of words and six groups of pseudowords). The word groups were additionally matched on concreteness and imageability using the MRC psycholinguistic database (Coltheart, 1981). An ANOVA with group as a fixed factor and each variable as a dependent measure confirmed no significant differences between groups for words or pseudowords (F < 1). The stimuli were recorded on digital audio tape in a soundproof booth by eight male and eight female native speakers of American English at a sampling rate of 44.1 kHz and edited using Praat (Boersma, 2001). An experienced phonetician checked the stimuli for any significant deviation in pronunciation across talkers. All stimuli were RMS normalized using Level 16 to 70, 65, or 75 dB (Tice & Carrell, 1998) and then duration normalized to 500 msec using Praat (Boersma, 2001).

Design

Participants performed a loudness judgment task in the scanner while listening to words or pseudowords presented in six different conditions using a sparse-sampling design (see Figure 1).

Figure 1. fMRI adaptation using a sparse-sampling design: four stimuli were presented within a 10-sec silent period. The first and the third sounds were always normalized at 70 dB. Participants judged whether the following sound (presented 5 dB louder or softer) was louder or softer than the first. Scanning occurred during the 2-sec interval following stimulus presentation.

In repeat conditions, the same token (word or pseudoword), produced by either a male or a female talker, was repeated four times within the TR of 12 sec (e.g., "chair, chair, chair, chair"). In item-change conditions (word item-change/pseudoword item-change), four different words/pseudowords produced by the same talker (either a male or a female talker) were presented within the TR of 12 sec (e.g., "croog, slove, glick, keet"). Thus, in the item-change conditions, the "who" information was constant, whereas the "what" information varied within the TR. In speaker-change conditions (word speaker-change/pseudoword speaker-change), the same word/pseudoword was repeated by four different talkers (from a pool of 14 talkers, excluding the two talkers used for the adapt conditions; e.g., "plub"-talker3, "plub"-talker4, "plub"-talker6, "plub"-talker8). In these conditions, "what" information was constant, whereas "who" information varied. Irrespective of condition, the task was to indicate whether the second stimulus was louder or softer than the first stimulus using a response pad. Participants were asked to respond as quickly and as accurately as possible by pressing button A to indicate "softer" and button B to indicate "louder." The first and third stimuli in a trial were always 70 dB (normalized); the second and fourth stimuli were either 5 dB softer or louder, and louder and softer trials had equal probability of occurrence. Stimuli were presented binaurally via headphones custom made for MRI experiments, and participants indicated their loudness judgments by pressing one of the two buttons.
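The level normalization described under Stimuli (tokens scaled to 70, 65, or 75 dB RMS, so that probes sit 5 dB above or below the 70-dB standard) amounts to scaling each waveform to a target RMS amplitude. Below is a minimal Python sketch; the digital reference level and function names are illustrative assumptions, not the Level 16 calibration used in the study:

```python
import numpy as np

def rms(x):
    """Root-mean-square amplitude of a waveform."""
    return np.sqrt(np.mean(x ** 2))

def scale_to_db(x, target_db, ref_rms=1e-4):
    """Scale waveform x so its RMS level is target_db dB re: ref_rms.

    The reference (0 dB == ref_rms) is an arbitrary assumption here;
    real calibration depends on the playback chain.
    """
    target_rms = ref_rms * 10.0 ** (target_db / 20.0)
    return x * (target_rms / rms(x))

# A standard at "70 dB" and a probe 5 dB louder, as in the trials.
tone = np.sin(2 * np.pi * 440 * np.linspace(0, 0.5, 22050))
standard = scale_to_db(tone, 70.0)
probe = scale_to_db(tone, 75.0)
# A 5-dB step corresponds to an RMS ratio of 10 ** (5 / 20), about 1.778.
print(round(float(rms(probe) / rms(standard)), 3))
```

Duration normalization to 500 msec is a separate step; in Praat it is typically done with time-scale modification rather than truncation, so this sketch covers only the level adjustment.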
The stimuli for the six conditions were derived from the word/pseudoword lists described in Stimuli.

MRI Acquisition

Stimulus and Recording

MRIs were acquired using a Siemens 3T Trio MRI scanner. For each participant, a high-resolution, T1-weighted 3-D volume was acquired axially (MP-RAGE; TR/TE = 2300 msec/3.36 msec, flip angle = 9°, TI = 900 msec, matrix size = , field of view = 22 cm, slice thickness = 1 mm). The T1-weighted images were used in conjunction with the functional activation maps to localize the anatomical regions involved. The T2*-weighted images were acquired axially using a susceptibility-weighted EPI pulse sequence while the participants performed the behavioral loudness judgment task (TE = 20 msec, TR = 12 sec, flip angle = 90°, in-plane resolution = mm × mm; 38 slices with a slice thickness of 3 mm, without gap between slices, were acquired in an interleaved measurement). A sparse-sampling method was used, which allowed stimulus presentation in silence (Wong, Perrachione, & Parrish, 2007; Backes & van Dijk, 2002; Belin, Zatorre, Lafaille, Ahad, & Pike, 2000; Hall et al., 1999). This design allowed participants to hear the stimuli without the interference of scanner noise. In addition, the long TR (12 sec) provided adequate time for the scanner-noise-induced hemodynamic response to subside, such that its peak did not overlap with the response to the speech stimuli. For each condition, there were 30 blocks, the order of which was randomized throughout the experiment. In addition to the six conditions, there were 30 blocks of silent trials wherein no stimuli were presented. The null trials were used to establish a baseline and to create a mask (all conditions > silence) that was used to establish corrected thresholds for multiple comparisons. These null trials were randomly distributed throughout the experiment.

fMRI Data Analysis

The fMRI time series were analyzed using AFNI (Cox, 1996), similar to previous research (Margulis, Mlsna, Uppunda, Parrish, & Wong, 2009; Wong et al., 2009; Wong et al., 2007). First, the images were corrected for motion and slice timing. Two participants were excluded because of excessive head motion (>2.5 mm); all subsequent analysis procedures were performed on the remaining 12 participants. After motion and slice-time correction, spatial smoothing (FWHM = 6 mm) was performed, followed by linear detrending and resampling to a resolution of 3 mm³. After these preprocessing steps, hemodynamic responses were estimated. Square waves modeling the events of interest were created as extrinsic model waveforms of the task-related hemodynamic response. Although the TR is 12 sec long, images were acquired only during the initial 2 sec of the TR rather than across the entire TR. Thus, the images reflect either a stimulus event occurring at one of these time points or a null event (no stimulus presented). This procedure removed the need to convolve the task-related extrinsic waveforms with a hemodynamic response function before statistical analyses. The waveforms of the modeled events were then used as regressors in a multiple linear regression of the voxel-based time series. Normalized beta values signifying the fit of the regressors to the functional scanning series, voxel by voxel for each condition, were used for group analyses. Anatomical and functional images from each subject were normalized to a standard stereotaxic template (ICBM 152). Activation from each participant was entered into an omnibus 2 (Stimulus-type: word, pseudoword) × 3 (Condition: repeat, speaker-change, item-change) ANOVA implemented using the AFNI function 3dANOVA3. Only activation clusters exceeding the corrected threshold of p < .05 were considered significant. The corrected threshold was determined from a Monte Carlo simulation (using the AFNI function AlphaSim) of significant voxels within a task-versus-baseline mask (Figure 4). The contrast of all listening conditions minus silent conditions was used to create the task-versus-baseline mask.
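The estimation step just described (square-wave regressors fit to each voxel's time series by multiple linear regression, with the beta values carried to the group ANOVA) can be sketched on a toy time series. This is a minimal numpy illustration with simulated data and made-up effect sizes, not the AFNI pipeline used in the study:

```python
import numpy as np

rng = np.random.default_rng(0)
n_scans = 120  # one volume per 12-sec TR under sparse sampling

# Square-wave (boxcar) regressors: 1 when a volume follows a trial of
# that condition, 0 otherwise. With sparse sampling, each volume
# reflects a single event, so no HRF convolution is needed.
X = np.zeros((n_scans, 3))
X[0::4, 0] = 1          # toy condition A trials
X[2::4, 1] = 1          # toy condition B trials
X[:, 2] = 1             # baseline/constant term

# Simulated voxel time series: condition A drives a larger response.
true_betas = np.array([1.5, 0.2, 10.0])
y = X @ true_betas + rng.normal(0, 0.1, n_scans)

# Voxelwise multiple linear regression: the beta values quantify the
# fit of each regressor and are carried forward to the group analysis.
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(betas, 2))
```

In the actual analysis this regression runs independently at every voxel, and the per-condition betas, not the raw signal, enter the 2 × 3 group ANOVA.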
This mask was thresholded at a whole-brain α level of less than 0.05, determined by Monte Carlo simulation using the AFNI function AlphaSim, and negative values (i.e., deactivations) were removed. The deactivation network was excluded to avoid artifacts generated from significant deactivation during the listening task, consistent with a previous study that examined repetition suppression of auditory word information (Gagnepain et al., 2008). The pattern of activity in regions that showed a significant Stimulus-type × Condition interaction effect was explored further by extracting the mean percent signal change for the six experimental conditions from the entire significant cluster and conducting statistical analyses using standard statistical software. Tukey HSD tests were conducted on the planned comparisons to determine the nature of the interaction effect(s) (Figure 2).

Figure 2. Experimental design. ANOVA model examining fMRI adaptation to "what" information and "who" information. Examples of the stimuli are shown in gray boxes.

Behavioral Data Analyses

Participants completed loudness judgments on a custom button box held in their right hand by pressing a button under their index finger to indicate that the second sound was softer or a button under their middle finger to indicate that the second sound was louder. Accuracy and RTs (measured from the onset of the word/pseudoword) were recorded. Similar to the fMRI statistical analyses, a repeated measures omnibus ANOVA (Stimulus-type × Condition) was conducted separately on accuracy and RT measures.

RESULTS

Behavioral Results

Overall, participants performed at ceiling levels on the loudness judgment task. A 2 × 3 repeated measures ANOVA was conducted separately on the accuracy (% correct) and RT data collected from the scanner. The ANOVA on the accuracy measure yielded a main effect of Condition (F(2, 22) = , p < .001) but no main effect of Stimulus-type (F(1, 11) = 0.27, p = .65) or interaction (F(2, 22) = 0.732, p = .485). Post hoc Tukey tests revealed significant differences between the repeat and speaker-change conditions (p < .001) as well as between the repeat and item-change conditions (p < .001). No differences were found between the item-change and speaker-change conditions (p = .768). The ANOVA on RT revealed similar results, with a main effect of Condition (F(2, 22) = 4.11, p = .021) and no significant effect of Stimulus-type (F(1, 11) = 0.01, p = .91) or interaction (F(2, 22) = 0.65, p = .54). Post hoc Tukey tests revealed significant differences between the repeat and speaker-change conditions (p = .036) as well as between the repeat and item-change conditions (p = .046). In contrast, no differences were found between the two change conditions (p = .994). Mean accuracy and RTs for each condition, across words and pseudowords, are shown in Figure 3.

Figure 3. Behavioral results from the loudness judgment task performed in the scanner. Participants showed near-ceiling performance across conditions. No differences between words and pseudowords were found for either accuracy or RT measures. A significant main effect of Condition was present for both accuracy and RT measures. Participants were more accurate (left column) and faster (right column) during the repeat conditions for words and pseudowords.

Thus, participants were slower and less accurate during the item-change and speaker-change trials relative to the repeat trials, suggesting that they were aware of changes in "who" and "what" information during each trial. Crucially, accuracy and RTs did not differ by stimulus-type (word/pseudoword), demonstrating that the loudness judgment task was not biased toward a particular stimulus-type. Taken together, these results suggest that performance was not affected by lexicality (word vs. pseudoword).
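The cluster-level measure used in the imaging analyses that follow (mean percent signal change per condition, averaged over every voxel of a significant cluster, as described under fMRI Data Analysis) can be sketched as below. The mask, beta values, and baseline of 100 are illustrative toy numbers:

```python
import numpy as np

def cluster_percent_signal_change(betas, baseline, mask):
    """Mean percent signal change within a cluster.

    betas    : (n_voxels, n_conditions) fitted condition amplitudes
    baseline : (n_voxels,) constant/baseline signal per voxel
    mask     : (n_voxels,) boolean cluster membership
    """
    psc = 100.0 * betas[mask] / baseline[mask, None]
    return psc.mean(axis=0)  # one value per condition

# Toy 5-voxel cluster, three conditions (repeat, item-change,
# speaker-change), baseline signal of 100 in every voxel.
betas = np.tile([[0.2, 0.5, 0.5]], (5, 1))
baseline = np.full(5, 100.0)
mask = np.ones(5, dtype=bool)
# Smaller signal change for repeat trials than for either change
# condition is the word-like adaptation pattern reported below.
print(np.round(cluster_percent_signal_change(betas, baseline, mask), 3))
```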
Imaging Results

We report voxel-wise random effects analyses and statistical analyses on significant clusters to clarify the nature of significant interaction effects. In particular, the interaction term is the most relevant, as the experiment was designed to examine the interaction between "who" and "what" information. As per our hypotheses, if there is integration of "who" and "what" information, changing "who" information should affect fMRI adaptation for words even though "what" information remains constant.

Figure 4. fMRI activations (p < .05, corrected for multiple comparisons) for the main effect of the task (all conditions > silence), rendered onto a template brain. Analyses were conducted within the positive activation network (negative voxels were excluded).

Interaction between Stimulus-type and Condition

A single region, the left pMTG (Table 1), showed a significant interaction effect between Stimulus-type (word vs. pseudoword) and Condition (repeat, item-change, speaker-change). To examine the nature of the interaction, percent signal change from the entire cluster was calculated for each participant across Stimulus-type and Condition. A 2 × 3 repeated measures ANOVA conducted on the percent signal change measure revealed a significant interaction effect (F(2, 22) = 8.298, p = .002, Figure 4) but no main effect of Stimulus-type (F(1, 11) = 0.55, p = .47) or Condition (F(2, 22) = 2.88, p = .08). Percent signal change for each stimulus-type (word, pseudoword) and condition (repeat, item-change, speaker-change) is displayed in Figure 5. Post hoc Tukey HSD tests were conducted to determine the nature of the two-way interaction. For words, there was a significant difference between the repeat and item-change conditions (p = .002) and between the repeat and speaker-change conditions (p = .009), but none between the item-change and speaker-change conditions (p = .84). In contrast, there were no differences between conditions for pseudowords (p values for all comparisons were greater than .9). Thus, for words, changing "who" information modulated activity in the left pMTG, a region known to be responsive to word information. In contrast, changing talker information did not significantly modulate BOLD activity for pseudowords in this region.

Main Effect of Condition

Several clusters revealed a significant main effect of Condition (Figure 6 and Table 1). For the item-change > repeat comparison, two large clusters were found: one in the left STG, extending from posterior to anterior STG, and a second in the right mid STS/STG. The speaker-change relative to repeat comparison yielded three clusters: one including the left STG and extending to the left pMTG, a second in the right STG extending from mid to anterior STG, and a third in the left inferior frontal gyrus. The increased BOLD activity for the two change conditions (item-change, speaker-change) relative to the repeat conditions revealed nonspecific (to stimulus-type) response suppression in these regions (Table 1).

Main Effect of Stimulus-type

A single cluster showed a main effect of Stimulus-type (pseudoword > word). This cluster was found in the right MTG. The results are presented in Table 1.

Table 1. GLM Stimulus-type × Condition ANOVA (columns: activation peak; Talairach coordinates x, y, z; cluster size, mm³; peak F/t value; cluster-specific statistics/post hoc tests)

(a) Interaction (Stimulus-type × Condition)
Left middle temporal gyrus (62, 41, ). Interaction term: F(2, 22) = 8.298, p = .002. Post hoc tests: Word: item-change > repeat, p = .002; Word: speaker-change > repeat, p = .009; Word: item-change > speaker-change, p = .84, ns; Pseudoword: item-change > repeat, p > .90, ns; Pseudoword: speaker-change > repeat, p > .90, ns; Pseudoword: item-change > speaker-change, p > .9, ns.

(b) Main effect of Condition: Item-change > Repeat
Left superior temporal gyrus (61, 4, ); Right mid superior temporal gyrus (50, 30, )

(c) Main effect of Condition: Speaker-change > Repeat
Left superior temporal gyrus/middle temporal gyrus (65, 29, ); Right superior temporal gyrus (58, 1, ); Left inferior frontal gyrus (53, 16, )

(d) Main effect of Stimulus-type: Pseudoword > Word
Right middle temporal gyrus (65, 30, )

DISCUSSION

We examined the extent to which change in indexical information modulated the neural representation of words, as evidenced by fMRI adaptation. Abstractionist models argue that the neural representation of words is devoid of speaker-related details. On the other hand, exemplar models argue that indexical information forms a part of the neural representation of words. We found repetition suppression, exclusively for words, in the left pMTG, a region known to be important for lexical processing.
When speaker information was changed (during the speaker-change condition), no repetition suppression was found in this region, suggesting that speaker-related information forms an integral part of the neural representation of words. Although our results do not rule out abstraction at a later stage of processing, these results appear consistent with exemplar models of speech processing that argue for a role of speaker-related information in word representation.

Integral Processing of Lexical and Indexical Information in the Left pMTG

Previous studies have demonstrated a role for the left pMTG in lexical retrieval (Damasio, Grabowski, Tranel, Hichwa, & Damasio, 1996), and a meta-analysis of neuroimaging studies reveals that this region is important for lexical processing (Lau et al., 2008). In current neural models of speech processing, speech sounds are transformed into a spectrotemporal representation at early stages of auditory processing and then mapped onto an intermediate abstract representation devoid of surface details related to the speaker. For words, this abstract representation is argued to interface with long-term stored representations located in the posterior left MTG (and, to a lesser extent, the right MTG) along the ventral auditory stream (Poeppel et al., 2008). In the current study, we find that the left pMTG showed a significant reduction in BOLD response to repeating words (repetition suppression) relative to a change in word information (the item-change condition). Crucially, for pseudowords, no differential activation between conditions was found. These data reveal word-specific tuning in the left pMTG, consistent with a previous study that found repetition suppression for words in this region (Gagnepain et al., 2008). Taken together, this provides strong evidence for the growing viewpoint that the left pMTG is an integral part of the ventral stream that performs sound-to-meaning computations (Poeppel et al., 2008; Hickok & Poeppel, 2007).

Figure 5. fMRI adaptation to "what" and "who" information: The contrast between item-change and repeat conditions (when "what" information changes and "who" information is constant) revealed two regions showing greater activity for item-change relative to repeat conditions: left STG (top row) and right mid STG (top row). When the speaker-change conditions were contrasted with the repeat conditions ("who" information changes, "what" information is constant), three significant clusters showed greater activation for speaker-change relative to repeat conditions: left STG/MTG, right STG (extending to the anterior portion of STG), and left IFG.
Interestingly, our study shows that the left pMTG is also activated when speaker information is changed while words are repeated, suggesting that tuning in this region is sensitive to indexical information as well. These results parallel a recent fMRI adaptation study in the visual domain that found greater sensitivity to changes in a single letter for words relative to pseudowords in the left visual word form area (Glezer, Jiang, & Riesenhuber, 2009). Our results suggest a tight coupling of word and speaker information in the left pMTG, such that both types of information need to be constant for neural adaptation in this region (see Figure 5).

Figure 6. Interaction between Stimulus-type and Condition in the left pMTG. A cluster in the left pMTG revealed a significant interaction effect between Stimulus-type (word, pseudoword) and Condition (item-change, speaker-change, repeat). The right column shows percent signal change (means and standard errors) for the three conditions across the two stimulus-types (words/pseudowords). Arrows show the significant cluster plotted in the bar chart. Statistical analyses of the significant cluster demonstrated differences between item-change and repeat conditions as well as between speaker-change and repeat conditions, exclusively for words. For pseudowords, there were no differences between the three conditions. These data demonstrate that changes in "who" and "what" information impact neural processing in the left pMTG exclusively for words.

Repetition Effects and Spoken Language Processing

A number of studies, especially in the visual domain, have explored the nature and plasticity of cortical representation using repetition priming (Grill-Spector, 2006). Whereas most studies have examined repetition effects using visual stimuli, recent studies have demonstrated consistent suppression/adaptation for auditory stimuli (Kouider, de Gardelle, Dehaene, Dupoux, & Pallier, 2009; Gagnepain et al., 2008; Dehaene-Lambertz et al., 2006; Orfanidou, Marslen-Wilson, & Davis, 2006; Belin & Zatorre, 2003). Examining the effect of stimulus repetition on the BOLD response to single versus multiple talkers, Belin and Zatorre (2003) found a single region in the anterior right STG that showed repetition suppression. This region corresponded with an area previously reported by Belin and colleagues that is selectively involved in processing human voices (Belin et al., 2000). In an experiment examining word-specific repetition effects in young adults, Gagnepain et al. (2008) found several regions in the temporal cortex that showed repetition suppression exclusively for words. In particular, a cluster in the left pMTG showed a reduction in BOLD response to previously encountered words but did not significantly change its response characteristics to previously encountered nonwords (Gagnepain et al., 2008). In contrast, several regions showed repetition enhancement for nonwords, that is, an increase in the BOLD response to nonwords that had been encountered before, relative to unprimed nonwords. Few studies, however, have varied lexical and indexical information in the same design. In the study that found word-specific repetition suppression in the left pMTG, indexical information was not varied (Gagnepain et al., 2008). Studies that have examined lexical and indexical processing in the same design have found inconsistent results with respect to priming and neural adaptation effects (Kouider et al., 2009; Orfanidou et al., 2006). Orfanidou and colleagues found that speaker changes do not impact neural response suppression.
This study also did not find repetition effects that were specific to words (Orfanidou et al., 2006). In contrast, another study found evidence of voice-dependent as well as voice-independent repetition suppression in key auditory areas (Kouider et al., 2009). The differences between these findings could be because of the tasks involved. Orfanidou et al. used a lexical decision task that could have forced participants to focus on lexical rather than indexical information. In contrast, Kouider et al. used a subliminal priming task (participants were unaware of the fact that a word/pseudoword was repeated) and, consistent with the current study, found voice-dependent, word-specific response suppression. In the current study, we find clear evidence for integral processing of lexical and indexical information. The task used in the current study (loudness judgment) is not biased toward lexical or indexical processing or toward words/pseudowords (as reflected by the lack of differences between words and pseudowords in accuracy and RT measures) and does not interfere with word priming (Church & Schacter, 1994). Previous studies examining processing of "what" and "who" information have required participants to specifically attend to lexical content (e.g., make lexicality judgments; Gagnepain et al., 2008; Orfanidou et al., 2006) or to speaker information (von Kriegstein, Eger, Kleinschmidt, & Giraud, 2003). The loudness judgment task required participants to focus on the acoustic information and ensured continuous attention while providing no particular advantage for "who" or "what" processing. Despite the orthogonal nature of the loudness judgment task, we find word-specific neural encoding in the left pMTG, suggesting that the ongoing task does not interfere with repetition effects. The neural bases of the reduced signal following repetition are unclear and much debated in the literature (Grill-Spector et al., 2006).
Several models have been proposed; among them, the sharpening and facilitation models are the most prominent (Grill-Spector et al., 2006). In the sharpening model, with stimulus repetition, only neurons that encode key properties of the stimulus are recruited (Desimone, 1996). In the facilitation model, repetition causes faster processing of stimuli, resulting in a reduction in BOLD activity (Grill-Spector et al., 2006). Both models could potentially explain our results. The left pMTG shows repetition suppression for words, but not pseudowords, suggesting that it may house neurons that are finely tuned to words. As per the sharpening model, repetition of words activates only the neurons that are essential for representing the repeated word. If indexical information is an essential part of the word representation, changing either the word or the speaker could result in additional neurons being recruited, producing an increase in BOLD signal. As per the facilitation model, repeating words results in faster processing, causing a latency shift in the hemodynamic response. If either lexical or indexical information is changed, additional processing may be required (matching the input to the stored representation), thereby increasing the latency of the hemodynamic response. In the current study, we could not examine repetition latency shifts because of the nature of the sparse-sampling design. A previous study did demonstrate both latency and amplitude effects on the BOLD response for word repetition (Gagnepain et al., 2008). Future studies could use clustered acquisition methods (Edmister, Talavage, Ledden, & Weisskoff, 1999), which can capture HRF shifts and hence could adjudicate between the two models (sharpening vs. facilitation).
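The confound described above can be illustrated with a toy simulation (not part of the original study; the HRF parameters and sampling time are illustrative assumptions): the sharpening account predicts a smaller-amplitude response to a repeated word, whereas the facilitation account predicts an earlier response. When a sparse design samples the HRF at a single time point near its expected peak, both alternatives simply yield a lower measured value.

```python
# Toy sketch: amplitude reduction (sharpening) vs. latency shift (facilitation)
# are indistinguishable from a single sparse sample near the HRF peak.
import numpy as np
from math import gamma

def hrf(t):
    """Canonical double-gamma hemodynamic response function (arbitrary units)."""
    pos = t**5 * np.exp(-t) / gamma(6)        # main peak near 5 s
    neg = t**15 * np.exp(-t) / gamma(16) / 6  # small post-stimulus undershoot
    return pos - neg

t = np.arange(0, 20, 0.1)      # time since stimulus onset, in seconds
novel = hrf(t)                 # response to an unprimed word
suppressed = 0.6 * hrf(t)      # sharpening: same shape, reduced amplitude
facilitated = hrf(t + 1.0)     # facilitation: same shape, ~1 s earlier

# A sparse design samples the response once, near the expected peak (~5 s):
sample = np.argmin(np.abs(t - 5.0))
print(novel[sample], suppressed[sample], facilitated[sample])
# Both alternatives give a lower value than the novel word at this one
# sample point, so the two models cannot be told apart.
```

Clustered acquisitions, by sampling several points along the rising and falling edges of the response, would separate the two curves.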
Implications for Current Neural Models of Speech Processing

Using a short-term fMRI adaptation design, we demonstrate that speaker information is in fact retained during lexical processing in the left pMTG, a region in the ventral stream that exhibits word-specific encoding. Our data support exemplar models, which argue for narrow tuning, gradient effects, and sensitivity to exemplars in regions crucial for word recognition. These results can explain the close link between speaker information and words found in the previous behavioral literature on speech perception.

Journal of Cognitive Neuroscience Volume X, Number Y

Speaker variability impairs word identification in noise (Mullennix et al., 1989). Serial recall of words from a list spoken by 10 speakers is poorer than recall from a single-talker list (Martin, Mullennix, Pisoni, & Summers, 1989). In a recognition memory task, in which participants had to judge whether words they heard were repeated or novel, participants performed more poorly when there was a speaker change between the first and second presentations (Palmeri et al., 1993). There are multiple benefits to retaining speaker information in the neural representation of words. First, speaker-related information plays an important role in conveying communicative intent (e.g., conveying emotion) directly relevant to lexical-semantic processing. Second, speaker-related information can benefit the processing of verbal content in challenging listening environments. For example, tagging the speaker is a useful strategy for extracting the predictable signal from extraneous background noise (Chandrasekaran, Hornickel, Skoe, Nicol, & Kraus, 2009; Brokx & Nooteboom, 1982). Alternatively, indexical effects in regions that encode words may reflect a hybrid neural representation composed of a core abstract construct directly relating to word processing and a periphery that retains speaker-related information (Poeppel et al., 2008). Such a representation may be efficient, yet the surface details related to the speaker benefit lexical processing during challenging listening conditions. A behavioral auditory priming study demonstrated that both abstract representations and episodes contribute to the extent of repetition priming for words, with the role of exemplar information decreasing with time (Kouider & Dupoux, 2009). We examined short-term neural adaptation, which may maximally reflect exemplar effects. A previous study that examined long-term fMRI adaptation found no effect of speaker change and concluded that word processing is essentially abstract in nature (Orfanidou et al., 2006).
Thus, it is possible that the neural representation of words is composed of relatively abstract features as well as features related to individual exemplars, with differential temporal decay (however, see Goldinger, 1996, for evidence that voice details can be retained for a week). Future parametric studies are needed to address the exact nature of the computations in this region, as well as the extent to which exemplar information decays with time. Our results also have implications for our understanding of the neural bases of indexical processing. A previous study found that a single cluster in the right anterior STS/STG showed fMRI adaptation to speaker-related information (Belin & Zatorre, 2003). We find multiple regions that show adaptation to speaker-related information, including bilateral STG and the left IFG (Figure 6). Our findings are consistent with the emerging view that speaker information processing is carried out by a network of regions that includes the right anterior STG, among others (von Kriegstein, Smith, Patterson, Kiebel, & Griffiths, 2010; Dehaene-Lambertz et al., 2006). This network may convey information about the speaker beyond voice pitch (which is processed preferentially in the right anterior STG; Belin, 2006; Belin & Zatorre, 2003; Belin et al., 2000) and may reflect other details specific to the speaker, including vocal tract size (von Kriegstein et al., 2010; von Kriegstein, Smith, Patterson, Ives, & Griffiths, 2007) and articulatory habits (Remez, Fellowes, & Nagel, 2007).

Conclusions

Speaker information is considered to be represented separately from abstract lexical information in current neural models of speech processing. As reflected by short-term fMRI adaptation, we demonstrate that indexical information is retained in the neural representation of words in the left pMTG.
Acknowledgments

This work was funded by the National Science Foundation (R01DC01510) and the National Institutes of Health (R01DC and R21DC009652). Reprint requests should be sent to Patrick Wong, 2240 Campus Dr., Evanston, IL 60208, or via

REFERENCES

Backes, W. H., & van Dijk, P. (2002). Simultaneous sampling of event-related BOLD responses in auditory cortex and brainstem. Magnetic Resonance in Medicine, 47.
Belin, P. (2006). Voice processing in human and non-human primates. Philosophical Transactions of the Royal Society B: Biological Sciences, 361.
Belin, P., Fecteau, S., & Bedard, C. (2004). Thinking the voice: Neural correlates of voice perception. Trends in Cognitive Sciences, 8.
Belin, P., & Zatorre, R. J. (2003). Adaptation to speaker's voice in right anterior temporal lobe. NeuroReport, 14.
Belin, P., Zatorre, R. J., Lafaille, P., Ahad, P., & Pike, B. (2000). Voice-selective areas in human auditory cortex. Nature, 403.
Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5.
Bowers, J. S. (2000). In defense of abstractionist theories of repetition priming and word identification. Psychonomic Bulletin & Review, 7.
Bradlow, A. R., Nygaard, L. C., & Pisoni, D. B. (1999). Effects of talker, rate, and amplitude variation on recognition memory for spoken words. Perception & Psychophysics, 61.
Brokx, J. P. L., & Nooteboom, S. G. (1982). Intonation and the perceptual separation of simultaneous voices. Journal of Phonetics, 10.

Chandrasekaran, B., Hornickel, J., Skoe, E., Nicol, T., & Kraus, N. (2009). Context-dependent encoding in the human auditory brainstem relates to hearing speech in noise: Implications for developmental dyslexia. Neuron, 64.
Church, B. A., & Schacter, D. L. (1994). Perceptual specificity of auditory priming: Implicit memory for voice intonation and fundamental frequency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20.
Coltheart, M. (1981). The MRC psycholinguistic database. Quarterly Journal of Experimental Psychology, 33.
Cox, R. W. (1996). AFNI: Software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical Research, 29.
Damasio, H., Grabowski, T. J., Tranel, D., Hichwa, R. D., & Damasio, A. R. (1996). A neural basis for lexical retrieval. Nature, 380.
Dehaene-Lambertz, G., Dehaene, S., Anton, J. L., Campagne, A., Ciuciu, P., Dehaene, G. P., et al. (2006). Functional segregation of cortical language areas by sentence repetition. Human Brain Mapping, 27.
Desimone, R. (1996). Neural mechanisms for visual memory and their role in attention. Proceedings of the National Academy of Sciences, U.S.A., 93.
Edmister, W. B., Talavage, T. M., Ledden, P. J., & Weisskoff, R. M. (1999). Improved auditory cortex imaging using clustered volume acquisitions. Human Brain Mapping, 7.
Formisano, E., De Martino, F., Bonte, M., & Goebel, R. (2008). Who is saying what? Brain-based decoding of human voice and speech. Science, 322.
Gagnepain, P., Chetelat, G., Landeau, B., Dayan, J., Eustache, F., & Lebreton, K. (2008). Spoken word memory traces within the human auditory cortex revealed by repetition priming and functional magnetic resonance imaging. Journal of Neuroscience, 28.
Gazzaniga, M. S., Glass, A. V., Sarno, M. T., & Posner, J. B. (1973). Pure word deafness and hemispheric dynamics: A case history. Cortex, 9.
Gilaie-Dotan, S., & Malach, R. (2007). Sub-exemplar shape tuning in human face-related areas. Cerebral Cortex, 17.
Glezer, L. S., Jiang, X., & Riesenhuber, M. (2009). Evidence for highly selective neuronal tuning to whole words in the visual word form area. Neuron, 62.
Goldinger, S. D. (1996). Words and voices: Episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22.
Grill-Spector, K., Henson, R., & Martin, A. (2006). Repetition and the brain: Neural models of stimulus-specific effects. Trends in Cognitive Sciences, 10.
Grill-Spector, K., & Sayres, R. (2008). Object recognition: Insights from advances in fMRI methods. Current Directions in Psychological Science, 17.
Hall, D. A., Haggard, M. P., Akeroyd, M. A., Palmer, A. R., Summerfield, A. Q., Elliott, M. R., et al. (1999). Sparse temporal sampling in auditory fMRI. Human Brain Mapping, 7.
Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8.
Kouider, S., de Gardelle, V., Dehaene, S., Dupoux, E., & Pallier, C. (2009). Cerebral bases of subliminal speech priming. Neuroimage, 49.
Kouider, S., & Dupoux, E. (2009). Episodic accessibility and morphological processing: Evidence from long-term auditory priming. Acta Psychologica, 130.
Lau, E. F., Phillips, C., & Poeppel, D. (2008). A cortical network for semantics: (De)constructing the N400. Nature Reviews Neuroscience, 9.
Margulis, E. H., Mlsna, L. M., Uppunda, A. K., Parrish, T. B., & Wong, P. C. M. (2009). Selective neurophysiologic responses to music in instrumentalists with different listening biographies. Human Brain Mapping, 30.
Martin, C. S., Mullennix, J. W., Pisoni, D. B., & Summers, W. V. (1989). Effects of talker variability on recall of spoken word lists. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15.
Medin, D. L., & Smith, E. E. (1984). Concepts and concept formation. Annual Review of Psychology, 35.
Mullennix, J. W., Pisoni, D. B., & Martin, C. S. (1989). Some effects of talker variability on spoken word recognition. Journal of the Acoustical Society of America, 85.
Narain, C., Scott, S. K., Wise, R. J., Rosen, S., Leff, A., Iversen, S. D., et al. (2003). Defining a left-lateralized response specific to intelligible speech using fMRI. Cerebral Cortex, 13.
Nosofsky, R. M., & Johansen, M. K. (2000). Exemplar-based accounts of multiple-system phenomena in perceptual categorization. Psychonomic Bulletin & Review, 7.
Nosofsky, R. M., & Zaki, S. R. (2002). Exemplar and prototype models revisited: Response strategies, selective attention, and stimulus generalization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28.
Nygaard, L. C., & Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception & Psychophysics, 60.
Nygaard, L. C., Sommers, M. S., & Pisoni, D. B. (1994). Speech perception as a talker-contingent process. Psychological Science, 5.
Oldfield, R. C. (1971). The assessment and analysis of handedness: The Edinburgh inventory. Neuropsychologia, 9.
Orfanidou, E., Marslen-Wilson, W. D., & Davis, M. H. (2006). Neural response suppression predicts repetition priming of spoken words and pseudowords. Journal of Cognitive Neuroscience, 18.
Pallier, C., Colome, A., & Sebastian-Galles, N. (2001). The influence of native-language phonology on lexical access: Exemplar-based versus abstract lexical entries. Psychological Science, 12.
Palmeri, T. J., Goldinger, S. D., & Pisoni, D. B. (1993). Episodic encoding of voice attributes and recognition memory for spoken words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19.
Perrachione, T. K., & Wong, P. C. M. (2007). Learning to recognize speakers of a non-native language: Implications for the functional organization of human auditory cortex. Neuropsychologia, 45.
Pisoni, D. B. (1993). Long-term memory in speech perception: Some new findings on talker variability, speaking rate and perceptual learning. Speech Communication, 13.
Poeppel, D., Idsardi, W. J., & van Wassenhove, V. (2008). Speech perception at the interface of neurobiology and linguistics. Philosophical Transactions of the Royal Society B: Biological Sciences, 363.
Rauschecker, J. P., & Scott, S. K. (2009). Maps and streams in the auditory cortex: Nonhuman primates illuminate human speech processing. Nature Neuroscience, 12.

Reed, S. K. (1972). Pattern recognition and categorization. Cognitive Psychology, 3.
Remez, R. E., Fellowes, J. M., & Nagel, D. S. (2007). On the perception of similarity among talkers. Journal of the Acoustical Society of America, 122.
Schneider, W., Eschman, A., & Zuccolotto, A. (2002). E-Prime user's guide. Pittsburgh, PA: Psychology Software Tools.
Tenpenny, P. L. (1995). Abstractionist versus episodic theories of repetition priming and word identification. Psychonomic Bulletin & Review, 2.
Tice, R., & Carrell, T. (1998). Level 16 (version ) [Computer software]. Lincoln, NE: University of Nebraska.
Van Lancker, D. R., Cummings, J. L., Kreiman, J., & Dobkin, B. H. (1988). Phonagnosia: A dissociation between familiar and unfamiliar voices. Cortex, 24.
von Kriegstein, K., Eger, E., Kleinschmidt, A., & Giraud, A. L. (2003). Modulation of neural responses to speech by directing attention to voices or verbal content. Cognitive Brain Research, 17.
von Kriegstein, K., Smith, D. R., Patterson, R. D., Ives, D. T., & Griffiths, T. D. (2007). Neural representation of auditory size in the human voice and in sounds from other resonant sources. Current Biology, 17.
von Kriegstein, K., Smith, D. R. R., Patterson, R. D., Kiebel, S. J., & Griffiths, T. D. (2010). How the human brain recognizes speech in the context of changing speakers. Journal of Neuroscience, 30.
Warren, J. D., Scott, S. K., Price, C. J., & Griffiths, T. D. (2006). Human brain mechanisms for the early analysis of voices. Neuroimage, 31.
Wong, P. C., Jin, J. X., Gunasekera, G. M., Abel, R., Lee, E. R., & Dhar, S. (2009). Aging and cortical mechanisms of speech perception in noise. Neuropsychologia, 47.
Wong, P. C., Perrachione, T. K., & Parrish, T. B. (2007). Neural characteristics of successful and less successful speech and word learning in adults. Human Brain Mapping, 28.