Proceedings of Meetings on Acoustics, Volume 19, 2013
http://acousticalsociety.org/
ICA 2013 Montreal, Montreal, Canada, 2-7 June 2013
Speech Communication Session 4pSCa: Auditory Feedback in Speech Production II

4pSCa4. Intentionality and categories in speech motor control

Takashi Mitsuya* and Kevin Munhall

*Corresponding author's address: Psychology, Queen's University, Kingston, K7L 3E6, Ontario, Canada, takashi.mitsuya@queensu.ca

Actions are organized around goals or intentions. In speech production, there has been no agreement on how best to discuss speech goals. However, the auditory feedback perturbation methodology provides a window into the nature of speech goals. To the extent that subjects are sensitive to variation in an acoustic attribute, this attribute must be part of the controlled intention of articulation. In this presentation, we will review a series of studies that speak to this issue. In one study, we examined how the intentionality of speech production influences compensatory formant production by instructing subjects to use a cognitive strategy in order to make the feedback sound consistent with the intended vowel. In other studies, we have explored the specificity of vowel formant compensation by comparing cross-language differences. The results indicate that speech goals are 1) very specific, defined by a phonemic category and its relationship with neighboring categories, and 2) multivariate. We will discuss these results by contrasting compensatory behaviors in reaching and limb movements with those observed in speech studies. The presence of a system of categories in speech may result in differences in the way speech goals are represented.

Published by the Acoustical Society of America through the American Institute of Physics. © 2013 Acoustical Society of America [DOI: 10.1121/1.4800727]. Received 22 Jan 2013; published 2 Jun 2013.

INTRODUCTION

The speech production process begins with the speaker's intention to communicate. In most accounts, the process of articulation includes a phase where a phonological representation in the mind is transformed into physical form (i.e., articulatory gestures and consequent sounds). The transformation of such categorical mental sound representations into movements is fundamentally different from other motor behaviors, such as reaching, whose goals are defined in the environment (e.g., specified in a visual plane) and are usually not categorical. These differences in the nature of motoric targets might be reflected in how people control behaviors.

One way to examine how motoric goals are defined and achieved is to see how erroneous behaviors are corrected, using a real-time perturbation paradigm. In both visuomotor and auditory speech perturbations, subjects generally compensate by moving opposite to the direction of the perturbation. Recently, Mitsuya et al. (2011) have shown that vowel compensations in speech may be unique in that they are produced with respect to the vowel category and its local neighbors.

In the present study we replicate an experiment carried out with visuomotor adaptation. Mazzoni and Krakauer (2006) reported that when subjects were given a strategic target to cancel the perturbation all at once, they were able to aim at the given target initially, but they slowly began to overshoot the correction. Taylor and Ivry (2012) observed that this overshoot did not persist; however, they too reported the same behavior shortly after the subjects started aiming at the strategy target. Here, we test whether the use of an explicit cognitive strategy to overcome a perturbation of formant frequencies shows a pattern similar to that observed for reaching. Mazzoni and Krakauer (2006) suggested that the overshoot observed in a reaching experiment might be due to the motor system trying to resolve the difference between the predicted and observed trajectories of movements. Given that speech goals seem to be represented differently, as a system of targets, explicit strategies using different vowel categories to overcome the perturbation may result in different patterns of compensation.

METHODS

Participants

Nineteen female students of Queen's University participated in the current experiment. A single gender was used to reduce differences in formant structure across participants. The average age was 19.6 years (range 18-21), and all participants learned English as their first language. Each participant was tested in a single session. No participants reported speech or language impairments, and all had normal audiometric hearing thresholds over a range of 500-4000 Hz.

Equipment

The equipment used in this experiment was the same as that reported in Munhall et al. (2009), MacDonald et al. (2010, 2011), and Mitsuya et al. (2011). Speakers were tested in a sound-attenuated booth in front of a computer monitor, wearing a headset microphone (Shure WH20) and headphones (Sennheiser HD 265). The microphone signal was amplified (Tucker-Davis Technologies MA3 microphone amplifier), low-pass filtered with a cutoff frequency of 4.5 kHz (Krohn-Hite 3384 filter), digitized at 10 kHz, and filtered in real time to produce formant shifts (National Instruments PXI-8106 embedded controller), as sketched below. The manipulated speech signal was then amplified and mixed with speech noise (Madsen Midimate 622 audiometer), and this signal was presented through the headphones that the speakers wore. The speech and noise were presented at approximately 80 and 50 dBA SPL, respectively.
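The cited papers implement the formant shift with real-time IIR filtering. As a rough offline illustration of one standard way such a shift can be realized (not necessarily the authors' implementation), the sketch below cancels an estimated resonance with a pair of zeros and re-introduces it as a pole pair at the shifted frequency. The 10 kHz sample rate comes from the text; the 80 Hz bandwidth and the example frequencies are assumptions.

```python
# Minimal offline sketch of pole-shifting formant perturbation.
# Assumed (not from the paper): 80 Hz formant bandwidth, example F1 of 620 Hz.
import numpy as np
from scipy.signal import lfilter

FS = 10_000  # sampling rate from the paper (10 kHz)

def resonance_coeffs(freq_hz, bw_hz, fs=FS):
    """Second-order (biquad) denominator for a resonance at freq_hz."""
    r = np.exp(-np.pi * bw_hz / fs)      # pole radius from bandwidth
    theta = 2 * np.pi * freq_hz / fs     # pole angle from frequency
    return np.array([1.0, -2 * r * np.cos(theta), r * r])

def shift_formant(x, f_orig, f_new, bw_hz=80.0, fs=FS):
    """Cancel the resonance at f_orig (zeros) and add one at f_new (poles)."""
    b = resonance_coeffs(f_orig, bw_hz, fs)   # zeros remove the original formant
    a = resonance_coeffs(f_new, bw_hz, fs)    # poles insert the shifted formant
    return lfilter(b, a, x)

# Example: shift a hypothetical F1 of 620 Hz up by 200 Hz,
# matching the F1 manipulation described in the Procedure below.
x = np.random.randn(FS)                       # stand-in for a speech frame
y = shift_formant(x, f_orig=620.0, f_new=820.0)
```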

Acoustic processing

Voicing detection was done using a statistical amplitude-threshold technique, and the real-time formant shifting was done using an IIR filter. An iterative Burg algorithm (Orfanidis, 1988) estimated formant frequencies every 900 μs; a minimal offline sketch of this estimation follows Figure 1. Prior to the experimental data collection, the model order (the number of coefficients used in the auto-regressive analysis) was estimated for each speaker. Seven English vowels /i, ɪ, e, ɛ, æ, ɔ, u/ were collected in an /hVd/ context ("heed", "hid", "hayed", "head", "had", "hawed", "who'd"). These words were randomly presented on a computer screen in front of the speakers, who were instructed to say the prompted word without gliding in pitch. The utterances were analyzed with model orders ranging from 8 to 12, and for each speaker the best model order was selected based on minimum variance in formant frequency over a 25 ms segment in the middle portion of the vowel (MacDonald et al., 2010). For offline formant analysis, an automated process estimated the vowel boundaries in each utterance based on the harmonicity of the power spectrum. These estimates were then manually inspected and corrected if required.

Procedure

Speakers produced 100 utterances of the word "head" (/hɛd/) with a visual prompt on the screen in front of them. The prompt lasted 2.5 s, with an inter-trial interval of approximately 1.5 s. The 100-utterance session consisted of three experimental phases. In the first phase, Baseline (utterances 1-20), speakers received normal feedback through the headphones (i.e., amplified and noise-added, but with no change in formant frequency). In the second phase, Perturbation (utterances 21-60), speakers received altered feedback in which F1 was increased by 200 Hz and F2 was decreased by 250 Hz. This perturbation made the feedback sound more like "had" (/hæd/). Immediately after the 23rd trial, the experiment was paused and the experimenter instructed the speaker to say "hid" (/hɪd/) to make the sound heard through the headphones more consistent with the word "head". The experiment was then resumed. In the third phase, Return (utterances 61-100), the perturbation was removed abruptly and the feedback returned to normal.

FIGURE 1: Feedback shift applied to the first formant (dotted line) and second formant (solid line). The vertical dashed lines denote the boundaries of the three phases: Baseline, Perturbation, and Return.
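The text names an iterative Burg algorithm for formant estimation and a per-speaker model-order search over orders 8-12. The snippet below is a minimal offline sketch of that pipeline, not the real-time implementation; the Hamming window, the 0.98 pre-emphasis coefficient, and the 400 Hz bandwidth cutoff are assumptions.

```python
# Offline sketch of Burg-based formant estimation (assumed parameters noted).
import numpy as np

FS = 10_000  # 10 kHz, as in the paper

def burg_ar(x, order):
    """AR polynomial coefficients via Burg's lattice recursion."""
    a = np.array([1.0])
    f, b = x[1:].astype(float), x[:-1].astype(float)  # forward/backward errors
    for _ in range(order):
        k = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))  # reflection coeff.
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([[0.0], a[::-1]])
        f, b = f[1:] + k * b[1:], b[:-1] + k * f[:-1]  # update and trim errors
    return a

def formants(frame, order, fs=FS, max_bw=400.0):
    """Formant frequencies from AR pole angles; max_bw cutoff is an assumption."""
    frame = frame * np.hamming(len(frame))                        # assumed window
    frame = np.append(frame[0], frame[1:] - 0.98 * frame[:-1])    # assumed pre-emphasis
    roots = np.roots(burg_ar(frame, order))
    roots = roots[np.imag(roots) > 0]        # one root per complex-conjugate pair
    freq = np.angle(roots) * fs / (2 * np.pi)
    bw = -np.log(np.abs(roots)) * fs / np.pi
    keep = (freq > 90) & (bw < max_bw)       # discard non-formant poles
    return np.sort(freq[keep])               # F1, F2, ... in ascending order

def best_model_order(mid_vowel_frames, orders=range(8, 13)):
    """Pick the order minimizing F1 variance over 25 ms mid-vowel frames,
    mirroring the selection criterion described in the text."""
    def f1_var(order):
        return np.var([formants(fr, order)[0] for fr in mid_vowel_frames])
    return min(orders, key=f1_var)
```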

RESULTS

The baseline average of F1 was calculated for each speaker from the last 15 utterances of the Baseline phase (i.e., utterances 6-20); the raw F1 value in Hz of each utterance was then normalized by subtracting the speaker's baseline average. Figure 2 shows the overall average of the normalized formants. As can be seen, speakers started to adjust their formant production immediately after the perturbation was introduced (utterances 22 and 23). When the instruction was given after the 23rd utterance, 17 speakers correctly followed the instruction and produced "hid" at utterance 24 when the experiment was resumed; the remaining 3 speakers started saying "hid" at utterance 25.

FIGURE 2: Averaged normalized F1 (solid circles) and F2 (open circles). The vertical dashed lines denote the boundaries of the three phases: Baseline, Perturbation, and Return.

The question we were examining was whether speakers would change their production of the strategy vowel. To verify this, the average magnitude of compensation was compared across three points in the experiment: 1) the Perturbation phase after the cognitive strategy was given (utterances 25-40), 2) the last part of the Perturbation phase (utterances 46-60), and 3) the last part of the Return phase (utterances 86-100). An analysis of variance (ANOVA) was conducted with the three time points as a within-subject factor, and it was significant for both F1 (F[2, 36] = 12.34, p < 0.05) and F2 (F[2, 36] = 15.17, p < 0.05). This significance is due solely to speakers' production differing between the Perturbation and Return phases, because post hoc analyses revealed no difference between the two points within the Perturbation phase (F1: t[18] = 1.15, p > 0.05; F2: t[18] = 1.02, p > 0.05).

The results of Mazzoni and Krakauer (2006) and Taylor and Ivry (2012) imply that adaptation to the visual rotation was implicitly global: the introduction of a cognitive strategy to resolve the discrepancy still resulted in a perturbation-and-compensation situation. In our experiment, it is possible that the cognitive strategy of producing "hid" during the Perturbation phase might have been affected by the introduction of the perturbation. To examine whether the vowel /ɪ/ was produced differently from the speaker's resting state, we compared the formant values of /ɪ/ produced in the Perturbation phase with those collected during the prescreening session. The analysis revealed that speakers' production did not differ for either F1 (t[18] = 1.12, p > 0.05) or F2 (t[18] = 1.53, p > 0.05), indicating that the perturbation did not induce implicit learning on the vowel /ɪ/. It is important to note that the group average formant values did not return to the resting point in the Return phase because some speakers continued to say "hid" until the end of the experiment, failing to make the feedback consistent with the sound of "head".
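A minimal sketch of the baseline normalization and the within-subject comparison described above. The analysis windows and the normalization rule come from the text; the long-format data layout and the use of pandas, statsmodels' AnovaRM, and scipy are tooling assumptions, not the authors' analysis code.

```python
# Sketch: per-speaker baseline normalization and comparison of the three
# analysis windows named in the text.
import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM

WINDOWS = {                       # utterance indices, 1-based as in the text
    "early_pert": range(25, 41),
    "late_pert": range(46, 61),
    "late_return": range(86, 101),
}

def normalize_f1(f1_by_utterance):
    """Subtract the speaker's baseline mean (utterances 6-20, the last
    15 Baseline trials) from every utterance's raw F1."""
    baseline = np.mean(f1_by_utterance[5:20])    # 0-based slice for 6-20
    return np.asarray(f1_by_utterance) - baseline

def window_means(f1_by_utterance):
    norm = normalize_f1(f1_by_utterance)
    return {w: norm[np.array(idx) - 1].mean() for w, idx in WINDOWS.items()}

def analyze(f1_data):
    """f1_data: dict speaker_id -> list of 100 raw F1 values (hypothetical)."""
    rows = [
        {"speaker": s, "window": w, "f1": m}
        for s, f1 in f1_data.items()
        for w, m in window_means(f1).items()
    ]
    df = pd.DataFrame(rows)
    # Omnibus repeated-measures ANOVA over the three windows
    print(AnovaRM(df, depvar="f1", subject="speaker", within=["window"]).fit())
    # Post hoc: the two Perturbation-phase windows, as in the paper
    wide = df.pivot(index="speaker", columns="window", values="f1")
    print(ttest_rel(wide["early_pert"], wide["late_pert"]))
```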

We separated these speakers from those who switched back to saying "head" in the Return phase. This yielded 11 speakers who kept saying "hid" (Stay group) and 8 speakers who switched back (Switch group).

FIGURE 3: Normalized F1 (solid symbols) and F2 (open symbols) production averaged across speakers in the Switch group (circles) and Stay group (diamonds).

Clearly, some speakers were cognizant of the task of making the auditory feedback consistent with a particular vowel by producing another vowel, while others ignored the auditory feedback altogether and simply produced the cognitive target as a new target, regardless of its relationship with the feedback target. All of these results indicate no overshoot or implicit global adaptation to the perturbation. However, the difference between the two types of observed behavior might have interacted with the overshoot effect, so we separated the groups and compared the group averages of the magnitude of compensation during the Perturbation phase; the groups did not differ (F1: t[17] = .69, p > 0.05; F2: t[17] = 1.88, p > 0.05). Moreover, the Stay group's formant values during the Perturbation and Return phases did not differ (F1: t[10] = -.87, p > 0.05; F2: t[10] = .87, p > 0.05), indicating that there was no change 1) in the way the speakers adapted to the perturbation with the strategy vowel, regardless of whether they were attending to the feedback or just focusing on producing the strategy vowel, and 2) in the production of the strategy vowel, regardless of the introduction and removal of the perturbation.
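For completeness, a sketch of the Stay/Switch group comparisons just described. The group-assignment rule and data layout are hypothetical representations; the two tests mirror the independent-samples and paired comparisons reported in the text.

```python
# Sketch of the Stay/Switch group comparisons (hypothetical data layout).
from scipy.stats import ttest_ind, ttest_rel

def split_groups(return_vowels):
    """return_vowels: dict speaker -> vowel produced at the end of the
    Return phase ('hid' or 'head'); a hypothetical encoding of the split."""
    stay = [s for s, v in return_vowels.items() if v == "hid"]
    switch = [s for s, v in return_vowels.items() if v == "head"]
    return stay, switch

def compare_compensation(comp_pert, stay, switch):
    """Independent-samples t-test of mean compensation magnitude (Hz)
    during the Perturbation phase, Stay vs. Switch."""
    return ttest_ind([comp_pert[s] for s in stay],
                     [comp_pert[s] for s in switch])

def stay_phase_comparison(pert_vals, return_vals):
    """Paired t-test of the Stay group's formant values in the Perturbation
    vs. Return phases (arrays ordered by speaker)."""
    return ttest_rel(pert_vals, return_vals)
```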

indicates that 1) the representation of vowels is more than the acoustic attributes, at least those perturbed in the current study, and 2) the speaker's intention to produce a phonological category plays an important role in the stable production of that category, rather than the acoustic attributes being controlled independently.

ACKNOWLEDGMENTS

This research was supported by the Natural Sciences and Engineering Research Council of Canada.

REFERENCES

MacDonald, E. N., Goldberg, R., and Munhall, K. G. (2010). "Compensation in response to real-time formant perturbations of different magnitude," The Journal of the Acoustical Society of America 127, 1059-1068.

MacDonald, E. N., Purcell, D. W., and Munhall, K. G. (2011). "Probing the independence of formant control using altered auditory feedback," The Journal of the Acoustical Society of America 129, 955-966.

Mazzoni, P. and Krakauer, J. W. (2006). "An implicit plan overrides an explicit strategy during visuomotor adaptation," The Journal of Neuroscience 26, 3642-3645.

Mitsuya, T., MacDonald, E. N., Purcell, D. W., and Munhall, K. G. (2011). "A cross-language study of compensation in response to real-time formant perturbation," The Journal of the Acoustical Society of America 130, 2978-2986.

Munhall, K. G., MacDonald, E. N., Byrne, S. K., and Johnsrude, I. (2009). "Speakers alter vowel production in response to real-time formant perturbation even when instructed to resist compensation," The Journal of the Acoustical Society of America 125, 384-390.

Orfanidis, S. J. (1988). Optimum Signal Processing: An Introduction (McGraw-Hill, New York, NY).

Taylor, J. A. and Ivry, R. B. (2012). "The role of strategies in motor learning," Annals of the New York Academy of Sciences 1251, 1-12.