

Cross Cultural Studies on Audiovisual Speech Processing: The McGurk Effects Observed in Consonant and Vowel Perception

Sabrina Rahmawati, School of Electrical Engineering and Informatics, Bandung Institute of Technology, Bandung, Indonesia (sabrina_rahmawati@yahoo.com)
Michitaka Ohgishi, College of Science and Engineering, Kanazawa University, Kanazawa, Japan (ohgishi@kenroku.kanazawa-u.ac.jp)

Abstract: Twenty-eight university students participated in an experiment on speech perception. The typical McGurk effects were observed in consonant, vowel and word perception. In addition to the consonant condition, which had been examined in previous studies, cross-cultural differences were also found in vowel perception. When the temporal discrepancies between auditory and visual information were large, the McGurk effects decreased.

Keywords: speech perception; the McGurk effect; telecommunication

I. INTRODUCTION

This study differs from previous ones in that it uses two speakers from two different countries and participants of three nationalities.

A. Visual Communication

As information technology has developed, we can easily communicate with each other through the internet and cellular phones, and recently it has become possible to see the other person's face while talking over internet communication media. Human faces convey not only information about a person's identity and facial expression but also information about speech. Before telecommunication media were developed, we could not use visual information such as facial expressions and mouth movements when talking with people in other places; as the media have developed, the importance of visual information has been increasing.

B. The McGurk Effect

The McGurk effect [1] is a perceptual illusion that demonstrates an interaction between auditory and visual information in speech perception. Because of this illusion, listening to a speech sound while watching lip movements that correspond to a different utterance leads to a dramatic alteration of what is actually heard. The typical McGurk effect concerns the consonant condition: when the audio /ba/ is presented while the visual /ga/ is shown, subjects are likely to report that they heard /da/. Cross-cultural studies on the McGurk effect [2, 3] have shown that speech perception involves the integration of auditory and visual information; the McGurk effect demonstrates this by showing that mouth movements influence the perception of the acoustic speech signal. In face-to-face conversation, speech perception is affected by various bimodal integration effects, and this integration of auditory and visual information affects phonetic identification.

C. Temporal Order (SOA)

When we communicate through the internet, time-lag discrepancies between visual and auditory information sometimes occur. Such discrepancies can keep information from being fully delivered, so some essential information may be lost in reception. This is why the present study employs the SOA paradigm. The temporal order judgement (TOJ) task is one of the classic paradigms used to investigate temporal perception [4]. In a typical TOJ experiment, a pair of stimuli is presented at varying stimulus onset asynchronies (SOAs), and participants are required to judge which stimulus was presented first (or second).
Multisensory TOJ was employed in this study because there are few studies on temporal order in speech perception. Robust and accurate estimates of people's sensitivity to audiovisual synchrony are important for real-world applications such as the design of hearing aids, guidelines for satellite telecommunications broadcasting, and the development of new virtual-conferencing technologies. Another reason for employing the TOJ paradigm was to make participants concentrate on the speaker's mouth movements.

D. Purpose

The main purpose of the current study is to clarify the nature of telecommunication from the viewpoint of the nature of speakers' information. To study this, cognitive psychological experiments on the McGurk effects were conducted. A further purpose was to see whether the McGurk effect also occurs under the Vowel and Word conditions, in addition to the Consonant condition of the typical McGurk effect, and whether cross-cultural differences occur in the McGurk effect.

II. METHOD

The experiments were conducted individually.

A. Speakers

The speakers were two male students: a native Japanese master's student, 25 years old (index ), and a native American undergraduate student, 20 years old (index ).

Picture 1. American speaker.
Picture 2. Japanese speaker.

B. Participants

A total of 28 participants from the Kanazawa University community were recruited: 6 of American nationality, 11 of Japanese nationality and 11 of Indonesian nationality. Most of the Indonesian participants had paper-based TOEFL scores around 550 and had been studying Japanese for 6 months. Most of the American participants had been studying Japanese for 2 years or more, in their home country and in Japan. Most of the Japanese students had never taken such an English proficiency test; some who had taken one had TOEIC scores around 475. The Indonesian students were master's students, the American students were undergraduates, and the Japanese students were either master's or undergraduate students. The subjects took part voluntarily, without being paid. Participants' ages varied between 18 and 35 years, with an average of 23.

C. Material

Three conditions were used in this experiment: Vowel, Consonant and Word. In each condition three stimuli were used: /ba/, /bo/ and /bu/ for the Vowel condition; /ba/, /da/ and /ga/ for the Consonant condition; and /but/, /duck/ and /gut/ for the Word condition. Each stimulus was pronounced by both the American and the Japanese speaker, and each speaker's face and voice were digitally recorded while he pronounced the syllables. For each speaker and condition, the three stimuli were dubbed with each other and combined with SOAs of -600 ms, -300 ms, +000 ms, +300 ms and +600 ms, giving 270 video combinations in total (2 speakers x 3 conditions x 3 visual stimuli x 3 audio stimuli x 5 SOAs). A negative SOA means that the auditory information precedes the visual information; a positive SOA means the reverse. For example, a +600 ms SOA means the visual information precedes the auditory information by 600 ms. In the final video clips, each stimulus occurred in a 3-sec trial in which the video channel included 1 sec of black frames presented before the tone burst, followed by 60 frames of the speaker's face. Each stimulus consisted of still images and a video clip of the speaker pronouncing a syllable. For stimuli with +600 ms SOA, there were 12 frames of still image, the clip of the speaker pronouncing the syllable, and then another still image up to the total of 60 frames. For stimuli with +300 ms SOA, there were 21 frames of still image, the clip, and another still image. For stimuli with +000 ms SOA, the still image was presented for 30 frames, followed by the syllable clip and another still image. For the -300 ms and -600 ms SOAs, different numbers of frames were presented before the video clip.
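These frame counts are consistent with a 30 fps video track, on which each 300 ms step of SOA corresponds to 9 frames of leading still image (30 frames at +000 ms, 21 at +300 ms, 12 at +600 ms). The following is a minimal Python sketch of this stimulus bookkeeping, not the authors' code: the 30 fps figure is inferred from the frame counts above, and all names are illustrative.

```python
from itertools import product

FPS = 30                     # inferred: 300 ms SOA steps shift the clip by 9 frames
FACE_FRAMES = 60             # frames of the speaker's face per trial (Material section)
SPEAKERS = ["american", "japanese"]
CONDITIONS = {
    "vowel":     ["ba", "bo", "bu"],
    "consonant": ["ba", "da", "ga"],
    "word":      ["but", "duck", "gut"],
}
SOAS_MS = [-600, -300, 0, 300, 600]  # positive: visual precedes auditory

def leading_still_frames(soa_ms: int) -> int:
    """Still frames before the articulating clip, for non-negative SOAs.

    The paper reports 30 leading frames at +000 ms, so at 30 fps:
    30 - soa_ms * FPS / 1000, i.e. 21 frames at +300 ms and 12 at +600 ms.
    """
    assert soa_ms >= 0, "negative-SOA padding is not specified in the paper"
    return 30 - soa_ms * FPS // 1000

# Enumerate the full design: 2 x 3 x (3 x 3) x 5 = 270 video combinations.
trials = [
    (speaker, condition, visual, audio, soa)
    for speaker in SPEAKERS
    for condition, syllables in CONDITIONS.items()
    for visual, audio in product(syllables, repeat=2)
    for soa in SOAS_MS
]
assert len(trials) == 270

for soa in (0, 300, 600):
    print(f"SOA +{soa:>3} ms -> {leading_still_frames(soa)} leading still frames")
```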
D. Apparatus

A digital video camera (Sony TRV70K) was used to record the speakers' faces, and a desktop PC (Dell Dimension 9100)

and a 17-inch CRT monitor (Sony Trinitron) were used to present the stimuli.

E. Procedure

There were 270 trials in total, divided into three sections by condition. Participants were first given an instruction sheet to read. Before the main trials, 10 practice trials with vowel stimuli were presented, and oral instructions were given during the first practice trial. After the practice trials, participants were asked whether they had become accustomed to the task; if not, the practice trials were to be repeated. Once the participant confirmed that they were accustomed to the task, the first condition was presented. The order of presentation was 90 trials of the Vowel condition, followed by 90 trials of the Consonant condition and 90 trials of the Word condition. In each section a break was inserted every 20 trials; when participants were ready to continue after a break, they pressed the space key.

The participant was seated in front of the monitor. After a stimulus had been presented, the participant first decided whether the audio or the visual information appeared first, and then decided what sound was heard. Questions and answer options appeared on the monitor, in Japanese for Japanese participants and in English for American and Indonesian participants. To indicate that the auditory information appeared first, the participant pressed key 1 on the right side of the keyboard; to indicate that the visual information appeared first, key 2. After either button was pressed, a question appeared asking which sound had been heard, together with the answer options. In the Vowel condition the options were /ba/, /bo/ and /bu/: key 1 was pressed if the participant thought /ba/ had been heard, key 2 for /bo/ and key 3 for /bu/. In the Consonant condition, key 1 was for /ba/, key 2 for /da/ and key 3 for /ga/; in the Word condition, key 1 for /but/, key 2 for /duck/ and key 3 for /gut/.

III. RESULT AND DISCUSSION

Vowel Condition

First, a 4-factor analysis of variance (ANOVA), speaker x visual information x auditory information x SOA, was conducted. A mistake occurred in this experiment: a wrong video clip was presented three times. A stimulus with audio information /ba/, visual information /ba/ and SOA +000 was presented when the stimulus with audio information /da/, visual information /ba/ and SOA -600 should have been presented. This fault caused three participants to pick /ba/ instead of /da/. These mistakes were ignored because they were not significant (three mistakes out of 28 trials in total).

The main effects of speaker, visual information and auditory information were significant (F[1, 25] = 25.175, p < .001; F[2, 50] = 3.479, p < .05; F[2, 50] = 31.980, p < .001), while the main effect of SOA was not. The visual information x auditory information interaction was significant (F[4, 100] = 5.236, p < .001). Further analysis showed a significant difference between the three visual stimuli under the presentation of auditory information /bo/, as shown in Figure 1.

Figure 1. The mean number of accurate responses (maximum 1) in the Vowel condition, for visual information x auditory information.

This result suggests that the McGurk effects occurred under vowel presentation, as expected.
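Analyses of this kind can be reproduced from a long-format response table, for instance with AnovaRM from statsmodels. The sketch below is an assumption-laden illustration, not the authors' code: the CSV file and all column names are hypothetical.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format table: one row per participant x design cell,
# with mean accuracy for that cell as the dependent variable.
df = pd.read_csv("vowel_responses.csv")
# assumed columns: participant, speaker, visual, auditory, soa, accuracy

# 4-factor within-subject ANOVA: speaker x visual x auditory x SOA.
# AnovaRM requires a fully balanced design (every participant in every cell).
result = AnovaRM(
    data=df,
    depvar="accuracy",
    subject="participant",
    within=["speaker", "visual", "auditory", "soa"],
).fit()
print(result)
```

Note that participants' nationality is a between-subjects factor, so the nationality analyses reported later would need a mixed-design ANOVA rather than AnovaRM, which handles only within-subject factors.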
Japanese pronunciation of /bo/ is intermediate between the Japanese pronunciations of /ba/ and /bu/, so we predicted that a McGurk effect might also appear when the audio information /ba/ was combined with the visual information /bu/: with this combination, subjects might report hearing /bo/. However, the result for this combination turned out not to be significant.
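The figures in this section report cell means of accurate responses. From a trial-level version of the hypothetical table used above, those cell means are a short aggregation away; this sketch reports proportions correct, which carry the same information as the paper's count-based scales (maximum 1 or 5, depending on how cells are aggregated).

```python
import pandas as pd

df = pd.read_csv("vowel_trials.csv")  # hypothetical trial-level file
# assumed columns: participant, nationality, visual, auditory, soa, response

# A response is scored as accurate when the reported syllable
# matches the auditory stimulus.
df["accurate"] = (df["response"] == df["auditory"]).astype(float)

# Cell means behind Figure 1: visual x auditory, collapsed over the rest.
fig1 = df.groupby(["visual", "auditory"])["accurate"].mean().unstack("visual")

# Cell means behind Figure 3: participants' nationality x auditory stimulus.
fig3 = df.groupby(["nationality", "auditory"])["accurate"].mean().unstack("auditory")

print(fig1, fig3, sep="\n\n")
```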

Figure 2. The mean number of accurate responses (maximum 5) in the Vowel condition, for speaker's nationality x participant's nationality.

Since the main effect of SOA was not significant, a four-factor ANOVA, participants' nationality x speakers' nationality x visual information x auditory information, was conducted with the SOA factor eliminated. The interaction of participants' nationality x speakers' nationality was significant (F[2, 23] = 6.294, p < .01), showing that Japanese participants were better at discriminating vowels than the other two groups of participants; this interaction can be seen in Figure 2. The interaction of participants' nationality x auditory information was also significant (F[4, 46] = 3.526, p < .05), showing a significant difference between the three groups of participants under the presentation of the auditory stimulus /bo/.

Figure 3. The mean number of accurate responses (maximum 5) in the Vowel condition, for participants' nationality x auditory information.

Figure 3 shows the relationship between participants' nationality and auditory information: the x-axis shows the auditory information, while the colors distinguish the participants' nationalities. From Figure 3 we can see that Indonesian participants made more mistakes in deciding which sound they had heard than Japanese and American participants. This probably occurred because most of the Indonesian participants had been learning Japanese for only 3-4 months and were not yet used to Japanese pronunciation, whereas the Japanese participants were naturally used to their native pronunciation and most of the American participants had been learning Japanese for 2 years or more.

Figure 3 also shows that Indonesian participants made more mistakes while listening to the auditory information /bo/ than while listening to /bu/ or /ba/. This was probably because most Indonesian participants were not used to the /bo/ pronunciations of the American and Japanese speakers: there is only one way to pronounce /bo/ in Indonesian, and it differs from the Japanese pronunciation of /bo/. For Indonesian participants, the Japanese speaker's pronunciation of /bo/ is somewhat similar to the Indonesian pronunciation of /bu/. In addition, as explained before, the Japanese pronunciation of /bo/ is intermediate between the Japanese /ba/ and /bu/. The American participants were better at judging these syllables because there are several kinds of /bo/ and /bu/ pronunciation in English, so they were used to the pronunciations employed in this experiment.

Consonant Condition

Figure 4. The mean number of correct responses (maximum 1) in the Consonant condition, for visual information x auditory information x SOA.

A 4-factor ANOVA, speaker's nationality x visual information x auditory information x SOA, was conducted for the Consonant condition. The only significant main effect was that of the auditory factor (F[2, 54] = 4.500, p < .05).

An interaction was also significant (F[16, 432] = 1.696, p < .05); it shows that the typical McGurk effect occurred in this condition (visual information /ga/, audio information /ba/, SOA +000; see Figure 4).

The results of the Consonant condition differ from those of the Vowel condition: the factor of participants' nationality did not affect the results of the Consonant condition much, while it had a significant effect in the Vowel condition. This is because vowel pronunciation differs from language to language, whereas consonant pronunciations do not differ from each other as much.

Word Condition

The ANOVA showed a significant interaction between visual and auditory information (F[4, 108] = 3.293, p < .05). This result indicates a tendency for the word "but" to be heard as "duck" when the visual information was "gut", suggesting that the McGurk effect can also be observed in the Word condition.

Figure 5. The mean number of accurate responses (maximum 1) in the Word condition, for visual information x auditory information.

IV. CONCLUSION

The present study confirmed that McGurk effects are observed in the perception of vowels and words as well as consonants. Cross-cultural differences were observed only in vowel perception. When the temporal discrepancies between auditory and visual information are large, the McGurk effects decrease. These findings contribute to the design of telecommunication systems.

V. REFERENCES

[1] H. McGurk and J. MacDonald, "Hearing lips and seeing voices," Nature, vol. 264, 1976, pp. 746-748.
[2] K. Sekiyama, "Cultural and linguistic factors in audiovisual speech processing: The McGurk effect in Chinese subjects," Perception & Psychophysics, vol. 59, 1997, pp. 73-80.
[3] H. Traunmüller, "Factors affecting visual influence on heard vowel roundedness: Web experiments with Swedes and Turks," Proceedings of FONETIK 2009, 2009.
[4] M. Zampini, D. I. Shore, and C. Spence, "Audiovisual temporal order judgments," Experimental Brain Research, vol. 152, 2003, pp. 198-210.