Proceedings of Meetings on Acoustics
http://acousticalsociety.org/

ICA 2013 Montreal
Montreal, Canada, 2-7 June 2013

Speech Communication
Session 2aSC: Linking Perception and Production (Poster Session)

2aSC. An electromagnetic articulography-based articulatory feedback approach to facilitate second language speech production learning

Atsuo Suemitsu*, Takayuki Ito and Mark Tiede

*Corresponding author's address: JAIST, Nomi, 91292, Ishikawa, Japan, sue@jaist.ac.jp

When acquiring a second language (L2), learners have difficulty achieving native-like production even if they receive instruction on how to position the speech articulators for correct production. A principal reason is that learners lack information on how to modify their articulation to produce correct L2 sounds. A visual feedback method using Electromagnetic Articulography (EMA) has previously been implemented for this application with some success [Levitt et al. (2010)]. However, because this approach provided tongue tip position only, it is unsuitable for vowels and many consonants. In this work we have developed a more general EMA-based articulatory feedback system that provides real-time visual feedback of multiple head movement-corrected sensor positions, together with target articulatory positions specific to each learner. We have used this system to improve the production of the unfamiliar vowel /ae/ for Japanese learners of American English. We predicted an appropriate speaker-specific /ae/ position for each Japanese learner using a model trained on previously collected kinematic data from 49 native speakers of American English, based on vowel positions for the overlapping /iy/, /aa/, and /uw/ vowels found in both languages. Results comparing formants pre- and post-feedback training will be presented to show the efficacy of the approach.

Published by the Acoustical Society of America through the American Institute of Physics. 2013 Acoustical Society of America [DOI: 1.11/1.4866]. Received 22 Jan 2013; published 2 Jun 2013.

INTRODUCTION

In recent years, developments in Computer-Assisted Language Technologies have offered new opportunities for facilitating second language (L2) learning [1]. However, learners continue to have difficulty achieving native-like production even when they receive instruction on how to position the speech articulators correctly. A principal reason is that learners lack appropriate information on how to modify their articulation to produce correct L2 sounds. In other words, it is difficult for them to know the state of their articulators and how to position the articulators for correct sounds. To overcome this problem, an Electromagnetic Articulography (EMA)-based visual feedback approach was proposed by Levitt and Katz [2]. They showed that kinematic feedback with EMA facilitated the acquisition and maintenance of the Japanese flap consonant. However, because their approach provided tongue tip position only, it is unsuitable for vowels and many consonants.

In this study, we describe a more general EMA-based articulatory feedback system that can provide real-time visual feedback of multiple sensor positions corrected for head movement, together with target articulatory positions specific to each learner. We examine the efficacy of the proposed approach by using this system to improve the production of the unfamiliar vowel /ae/ for Japanese learners of American English (AE).

EMA-BASED ARTICULATORY VISUAL FEEDBACK SYSTEM

An overview of the developed system is shown in Fig. 1. The system displays multiple sensor positions in real time on the midsagittal plane in the learner's coordinate system, by rotating the recorded data to the occlusal plane and correcting for head movement relative to reference sensors. In addition, the target articulatory positions estimated by a prediction model and the previously recorded palate shape are superimposed on the display. Figure 2 shows an example of the real-time visual feedback display (x-axis: posterior-anterior direction; y-axis: inferior-superior direction), where red indicates the target articulatory positions, cyan the actual sensor positions, and yellow the palate shape. The tongue surface contour is calculated by spline interpolation.

FIGURE 1: Overview of the proposed system.

The prediction model was constructed from the acoustic and kinematic data of 49 native AE speakers (24 males and 25 females) from the University of Wisconsin X-ray microbeam (XRMB) speech production corpus [3] and our EMA corpus. Specifically, in order to predict an appropriate speaker-specific /ae/ position for each Japanese learner, we developed multiple linear regression models with stepwise selection, using the F1 and F2 values and the x and y coordinates of the tongue tip (TT), tongue blade (TB), tongue dorsum (TD), lower incisor (LI), upper lip (UL), and lower lip (LL) for the overlapping vowels as predictors, based on vowel positions for the /aa/, /iy/, and /uw/ vowels found in both languages.
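As a rough illustration of how such a prediction model can be set up, the sketch below fits one linear model per target coordinate with ordinary least squares; it omits the stepwise predictor selection used in the paper, and all array names (X_train, Y_train, x_learner) are hypothetical placeholders rather than data structures from the actual system.

```python
# Minimal sketch (not the authors' code) of predicting a speaker-specific
# /ae/ articulatory target from data for the shared vowels /aa/, /iy/, /uw/.
import numpy as np

def fit_linear_model(X_train, Y_train):
    """Fit one ordinary least-squares model per target coordinate.

    X_train: (n_speakers, n_predictors) F1/F2 and sensor x/y values for the
             shared vowels, one row per native speaker.
    Y_train: (n_speakers, n_targets) /ae/ sensor coordinates for the same
             speakers (e.g. TT, TB, TD, LI, UL, LL x/y).
    Returns a weight matrix that includes an intercept row.
    """
    ones = np.ones((X_train.shape[0], 1))
    X_aug = np.hstack([ones, X_train])               # add intercept column
    W, *_ = np.linalg.lstsq(X_aug, Y_train, rcond=None)
    return W                                         # (n_predictors + 1, n_targets)

def predict_ae_target(W, x_learner):
    """Predict the learner-specific /ae/ target from their shared-vowel data."""
    x_aug = np.concatenate([[1.0], x_learner])       # prepend intercept term
    return x_aug @ W                                 # predicted /ae/ coordinates
```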

FIGURE 2: Example of the real-time visual feedback display.

METHODS

The subjects were 3 male (A, B, C) and 2 female (D, E) native speakers of Japanese, aged between 22 and 35, with no self-reported hearing deficits or speech disorders. Six sensors were placed on the articulator positions TT, TB, TD, LI, UL, and LL for the articulatory visual feedback, and 4 additional sensors (the upper incisor, the bridge of the nose, and the left and right mastoid processes behind the ears) were used as references to correct for head movement (Fig. 3). Articulatory movement and speech data were recorded simultaneously at sampling rates of 200 Hz and 16 kHz, respectively, using 3D EMA (Carstens AG500). In the real-time articulatory visual feedback process, all position data were processed at 200 Hz. For alignment across subjects, the occlusal plane was estimated using a biteplate with 3 additional sensors.

FIGURE 3: Sensor coil locations (TT: tongue tip, TB: tongue blade, TD: tongue dorsum, LI: lower incisor, UL: upper lip, LL: lower lip; reference sensors: upper incisor, nose bridge, right and left mastoids).
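The head-movement correction mentioned above, which re-expresses the articulator sensor positions in a head-fixed frame using the four reference sensors, can be sketched as follows. This is an illustrative reconstruction based on a standard rigid-body (Kabsch) alignment, not the authors' implementation, and the function and variable names are assumptions.

```python
# Illustrative head-movement correction: estimate the rigid transform that
# maps the current reference-sensor positions back to their calibration
# positions, then apply the same transform to the articulator sensors.
import numpy as np

def rigid_transform(P, Q):
    """Kabsch algorithm: find R, t such that R @ P_i + t ~= Q_i.

    P, Q: (n_points, 3) current and calibration reference-sensor positions.
    """
    Pc, Qc = P.mean(axis=0), Q.mean(axis=0)
    H = (P - Pc).T @ (Q - Qc)                   # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = Qc - R @ Pc
    return R, t

def correct_frame(articulators, refs_now, refs_calib):
    """Map articulator sensor positions into the head-fixed coordinate system."""
    R, t = rigid_transform(refs_now, refs_calib)
    return articulators @ R.T + t               # apply R x + t to each sensor row
```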

The experimental procedure consisted of four phases: preparation, pre-test, training, and post-test. In the preparation phase, articulatory positions and speech data were collected for prediction of the /ae/ position while the subject produced the isolated Japanese vowels /aa/, /iy/, and /uw/. The palatal shape and occlusal plane were also recorded. In the pre-test and post-test phases, the subject was asked to produce a sustained vowel /ae/ and 3 CVC words ("back", "sad", and "had") in response to the speech sound of an AE speaker of the same gender as the subject, selected from the XRMB corpus. Each stimulus was presented 5 times in random order. Only in the post-test phase were the subjects instructed to reproduce the articulatory movement learned in the training phase.

During the training phase, the real-time articulatory visual feedback and the articulatory target for /ae/ were presented on a display. The subject was first asked to try to fit his or her tongue contour, UL, LL, and LI to the articulatory target, without speech production, for about 5 minutes while watching the display screen as in Fig. 2. Then the subject was asked to match the articulators with the target as closely as possible on the display and to produce the vowel /ae/, following the speech sound of the same speaker as in the test phases. This task was repeated 20 times.

In order to assess the effect of articulatory training, the first (F1) and second (F2) formants of the vowel /ae/ were extracted from the acoustic recordings of the pre-test and post-test phases using a 16th-order LPC analysis over a ms window with a ms overlap, and averaged over 5 frames centered at the middle frame of a spectrally stable part of the /ae/ segment for each utterance. Taking the human hearing system and the anisotropy of the F1-F2 space into account, all formant values were converted to the Equivalent Rectangular Bandwidth (ERB) scale defined by

E = 21.4 log10(4.37 f + 1),    (1)

where E is the number of ERBs and f is frequency in kHz [4].

RESULTS

Figure 4 shows the distribution of the produced /ae/ sounds in the F1-F2 space for each subject before and after training, in which a purple ellipse represents the 95% confidence limits of the /ae/ distribution for same-gender AE speakers obtained from the XRMB and EMA corpora, and the vertical and horizontal lines indicate the medians of the F1 and F2 values of the native /ae/ distribution, respectively. The figure shows a tendency for the produced /ae/ sounds to be distributed closer to the center of the native /ae/ distribution after articulatory training. To statistically evaluate differences between pre-test and post-test, the median of the native /ae/ distribution was defined as a reference sound, and the Euclidean distances between the produced and reference sounds were calculated in the F1-F2, F1, and F2 spaces, respectively. Figure 5 compares the average distances of the pre-test and post-test for all subjects, where error bars represent the standard error of the mean (n = 20). A one-way within-subjects repeated measures ANOVA revealed a significant difference between pre-test and post-test in the F1-F2 (F(1,4) = 7.964, p < .05) and F2 (F(1,4) = .65, p < .05) spaces. This result shows that articulatory training using our system facilitates improvements in speech production learning of the non-native vowel /ae/ by Japanese learners. The reason for no improvement in F1 may be that there is little acoustic difference between the Japanese vowel /aa/ and the non-native vowel /ae/ in this formant.

FIGURE 4: Scatterplots of utterances for all subjects in the F1-F2 space (ERB scale).

FIGURE 5: Distribution of the Euclidean distance between the produced and reference sounds at the pre-test and post-test phases in (a) F1-F2, (b) F1, and (c) F2 spaces.
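For concreteness, the sketch below applies Eq. (1) and the Euclidean distance measure used for Figure 5. The formant values in the usage example are made up, and the function names are illustrative rather than taken from the analysis scripts.

```python
# Convert F1/F2 (Hz) to the ERB scale and measure the distance between a
# produced /ae/ token and the native reference (median of the native distribution).
import numpy as np

def hz_to_erb(f_hz):
    """ERB number per Eq. (1); the formula expects frequency in kHz."""
    return 21.4 * np.log10(4.37 * (f_hz / 1000.0) + 1.0)

def f1f2_distance(produced_hz, reference_hz):
    """Euclidean distance between two (F1, F2) pairs on the ERB scale."""
    produced = hz_to_erb(np.asarray(produced_hz, dtype=float))
    reference = hz_to_erb(np.asarray(reference_hz, dtype=float))
    return float(np.linalg.norm(produced - reference))

# Example with made-up formant values (Hz):
print(f1f2_distance((780.0, 1850.0), (690.0, 1950.0)))
```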
CONCLUDING REMARKS

We have developed a more general EMA-based articulatory visual feedback system and demonstrated that, in speech production learning of the non-native vowel /ae/, the system can help Japanese learners improve their production. This result suggests that the system may also facilitate L2 speech production learning in other language contexts, provided there are some overlapping vowels between the native language and the L2.

Moreover, this system might be applicable to the rehabilitation of patients after oral surgery, if speech therapists are able to provide support in terms of articulatory knowledge. To further validate the efficacy of the system, it is necessary to investigate the relationships between the achieved articulatory positions and the target positions estimated by the prediction model. From a communication point of view, perceptual evaluation by native AE speakers will also be explored. The application of the proposed approach to other vowels and consonants, as well as to other languages, is left for future work.

ACKNOWLEDGMENTS

This work was partly supported by the Institutional Program for Young Researcher Overseas Visits from the Japan Society for the Promotion of Science (JSPS).

REFERENCES

[1] M. Eskenazi, "An overview of spoken language technology for education", Speech Comm. 51, 832-844 (2009).

[2] J. Levitt and W. Katz, "The effect of EMA-based augmented visual feedback on the English speakers' acquisition of the Japanese flap: a perceptual study", in Proceedings of Interspeech 2010, 1862-1865 (International Speech Communication Association) (2010).

[3] J. Westbury, X-ray microbeam speech production database user's handbook (University of Wisconsin) (1994).

[4] B. Glasberg and B. Moore, "Derivation of auditory filter shapes from notched-noise data", Hear. Res. 47, 103-138 (1990).