Structural representation of pronunciation and its use in pronunciation training

N. Minematsu*, S. Asakawa*, K. Hirose*, and T. Makino**
*The University of Tokyo, Japan  **Chuo University, Japan

1 Introduction

No two students are the same, and pronunciation teaching should begin only after the teacher knows exactly where each individual student stands in his or her development. In a one-to-fifty situation such as a classroom, however, it is difficult for a teacher to grasp every student's pronunciation precisely. Computers provide tools for visualizing pronunciation acoustically, but using them requires a good knowledge of acoustics, and an acoustic representation inevitably shows many things irrelevant to proficiency, such as speaker individuality, gender, age, and microphone differences. The first author has already proposed a unique method of representing speech acoustics in which the dimensions corresponding to these non-linguistic factors are largely removed, Minematsu (2004a). This representation can be regarded as a physical implementation of structural phonology, in which only the interrelations among speech sounds are considered. In this paper, non-native pronunciations are described with the new method and the development of a student's pronunciation is traced. Automatic assessment of pronunciation is also investigated experimentally.

2 Structural representation of speech acoustics

The spectrogram is a noisy representation in that it inevitably shows things completely irrelevant to pronunciation proficiency. The first author proposed a method of representing speech acoustics in which the static non-linguistic factors can hardly be seen, Minematsu (2004a). Since a full explanation of the method requires a good knowledge of mathematics, only a short introduction is given here. The method was inspired by Jakobson's phonological structure shown in Figure 1, in which French vowels and semi-vowels are represented structurally and the structure is claimed to be invariant across speakers. In acoustic phonetics, the vowel structure is often represented as an F1-F2 formant chart, and it is well known that this representation clearly shows gender and age differences. With the proposed method, these non-linguistic factors can be effectively suppressed.

What is the geometrical definition of a structure? A triangle is determined by fixing the lengths of its three sides. An n-point structure, in turn, is determined by fixing the lengths of all the segments connecting its points, including the diagonals; in other words, an n-point structure is fully represented by the distance matrix among the n points. If cepstrum parameters are used to represent the spectral envelope and a speech sound is represented as a point in cepstrum space, then an acoustic structure of n speech sounds is an n-point structure in that space. The acoustic structure becomes invariant if the non-linguistic factors cannot change the distance between any two points. Mathematically speaking, however, the structure must vary with the non-linguistic factors. How can these inevitably variant structures be made invariant? An invariant structure can be obtained by applying the theorem of the invariant structure proposed by the first author, Minematsu et al (2005). Here, every speech sound is represented as a cepstrum distribution rather than a point, and the distance between two distributions is calculated by distorting the space so that the structure becomes invariant. This space distortion easily makes otherwise variant structures invariant.
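As an illustration of the distance-matrix view of a structure, the Python sketch below models each speech sound as a diagonal-covariance Gaussian over cepstral frames and fills an n-by-n distance matrix. This is a minimal sketch, not the authors' implementation: the Bhattacharyya distance is used purely as a stand-in for the distribution distance mentioned above, and all function and variable names are hypothetical.

    # Sketch: an "acoustic structure" as a distance matrix over phone distributions.
    # Assumptions: each phone is modelled as a diagonal-covariance Gaussian over
    # cepstral features, and the Bhattacharyya distance serves as an illustrative
    # distribution distance (not necessarily the measure used in the paper).
    import numpy as np

    def bhattacharyya_gaussian(mu1, var1, mu2, var2):
        """Bhattacharyya distance between two diagonal-covariance Gaussians."""
        var_avg = 0.5 * (var1 + var2)
        term_mean = 0.125 * np.sum((mu1 - mu2) ** 2 / var_avg)
        term_cov = 0.5 * (np.sum(np.log(var_avg))
                          - 0.5 * (np.sum(np.log(var1)) + np.sum(np.log(var2))))
        return term_mean + term_cov

    def structure_matrix(phones, frames_per_phone):
        """Distance matrix among phone distributions, i.e. the 'structure'.

        frames_per_phone maps each phone label to an array of cepstral frames
        (shape: n_frames x n_ceps) extracted from that phone's segments.
        """
        stats = {p: (frames_per_phone[p].mean(axis=0),
                     frames_per_phone[p].var(axis=0) + 1e-6) for p in phones}
        n = len(phones)
        d = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                mu_i, var_i = stats[phones[i]]
                mu_j, var_j = stats[phones[j]]
                d[i, j] = d[j, i] = bhattacharyya_gaussian(mu_i, var_i, mu_j, var_j)
        return d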

3 Structural representation of the non-native pronunciation

Using the proposed method, individual students are described as distorted speech structures of the target language. Since the method extracts an acoustic structure only as a distance matrix, visualization of the structure in terms of absolute properties, e.g. formants, is impossible. Tree diagrams were therefore adopted to visualize the matrix. Figure 2 shows two trees, one built from utterances of an American speaker and the other from utterances of a Japanese student learning American English, both reading the same 60 sentences. The Japanese tree clearly shows the well-known Japanese habits of English pronunciation: confusions of /r/ and /l/, /s/ and /θ/, /z/ and /ð/, /i/ and /ɪ/, /v/ and /b/, etc. are observed, mid and low vowels are located very close to each other, and schwa is found to be very close to these vowels.

Figure 1. Jakobson's phonological structure of French vowels and semi-vowels

Since the proposed method extracts only the distance matrix of the speech sounds, nothing is known about the physical properties of the individual sounds, such as formant frequencies. This analysis strategy is the opposite of that of acoustic phonetics, where the physical properties of individual speech sounds are measured intensively. The authors consider that the conventional strategy yields only a noisy representation of speech, the spectrogram, and that an alternative, stable and reliable method has to be devised, especially for educational use. The candidate answer offered in this paper is a physical implementation of structural phonology. A comparison between the new method and the conventional one with respect to reliability was made in Minematsu (2004b).

Figure 2. Tree diagrams of American English (left) and Japanese English (right)
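Tree diagrams like those in Figure 2 can be drawn from a distance matrix with ordinary hierarchical clustering. The sketch below uses SciPy's average-linkage clustering as an assumed visualization scheme; the paper does not state which clustering criterion was used for its trees, so this is an illustration only.

    # Sketch: visualizing a structure (distance matrix) as a tree diagram.
    # Assumption: average-linkage (UPGMA) hierarchical clustering; the paper does
    # not specify the clustering scheme used for its tree diagrams.
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage
    from scipy.spatial.distance import squareform

    def plot_structure_tree(dist_matrix, labels, title):
        condensed = squareform(dist_matrix, checks=False)  # square -> condensed form
        tree = linkage(condensed, method='average')
        dendrogram(tree, labels=labels)
        plt.title(title)
        plt.ylabel('inter-phone distance')
        plt.tight_layout()
        plt.show()

    # e.g. (hypothetical matrices built with structure_matrix above)
    # plot_structure_tree(d_native, phone_labels, 'American English structure')
    # plot_structure_tree(d_learner, phone_labels, 'Japanese English structure')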

4 Tracing the development of the pronunciation

Different students show different distortions in their pronunciation structures, and within a single student the structure changes readily through training. In this section, various Japanese English pronunciations are simulated and each of them is represented structurally. An adult male Japanese speaker, who had been an amateur actor on English-language stages, spoke /bVt/ words, where V is a monophthong of American English (/ɪ i ɛ æ ʌ ɑ ʊ ɔ u ɚ ə/) or a Japanese vowel (/a i u e o/). For American English, only one utterance was recorded per vowel; for Japanese, five utterances were recorded per vowel. By mixing the two types of pronunciation, a variety of English pronunciations were simulated. Figure 3 shows the vowel charts of American English and Japanese. Japanese learners often substitute Japanese vowels when they speak English, and Table 1 shows typical examples of this vowel substitution. Japanese /a/ has very strong power and is substituted for five of the American English vowels.

Figure 3. Vowel charts of American English and Japanese
Table 1. Vowel substitution table

In the current analysis, pronunciation states were defined as pronunciations with certain vowel substitutions, and the following states were considered:

S1: All the American English vowels are replaced with Japanese vowels.
S2: /æ ʌ ɑ ɚ ə/ are corrected.
S3: /i ɪ/ are additionally corrected.
S4: /ʊ u/ are additionally corrected.
S5: /ɛ/ is additionally corrected.
S6: /ɔ/ is additionally corrected, so that all the vowels are pronounced correctly.

When several English vowels, e.g. those in /bʌt/ and /bæt/, were replaced with the same Japanese vowel, different utterances of that vowel were used, two utterances of /bat/ in this case. The S1 tree, the intentionally Japanized pronunciation, shows a clear separation of the vowels into the five Japanese vowels, while the S6 tree, the good pronunciation used on English stages, accords with American English phonetics. These two trees correspond very well to the two vowel charts. Gradual changes are found from S1 to S6 (Figure 4). For example, the correction of /æ ʌ ɑ ɚ ə/ destroys the Japanese vowel system embedded in S1. The transition from S2 to S3 separates /i/ and /ɪ/, and that from S3 to S4 enlarges the separation of /u/ and /ʊ/. In S5, /ɛ/ and /æ/ get closer, while S5 and S6 show almost no difference. These results show that structuralization with a single example per vowel can describe pronunciation effectively and efficiently, and that a student's development can be logged from a small number of utterances. Although this analysis was done with a single speaker, Minematsu (2005) showed that structuralization effectively removes the dimensions of speaker difference, so the method can be applied to other speakers as it is. It is also interesting that structural acoustic models can recognize speech automatically with no direct use of the acoustic substance of speech sounds, Murakami et al (2005).

Figure 4. Visualization of various non-native pronunciations (S1-S6) as tree diagrams
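The gradual changes from S1 to S6 could also be tracked numerically by comparing each state's structure with a native reference structure. The sketch below uses a simple root-mean-square difference between corresponding distance-matrix elements; this particular measure is an illustrative choice, not one taken from the paper, and the variable names are hypothetical.

    # Sketch: quantifying how far a learner's structure is from a native reference
    # by comparing corresponding elements of the two distance matrices. The RMS
    # difference used here is an illustrative choice, not the paper's measure.
    import numpy as np

    def structure_difference(d_learner, d_native):
        """Root-mean-square difference over the upper triangles of two structures."""
        iu = np.triu_indices_from(d_native, k=1)
        diff = d_learner[iu] - d_native[iu]
        return float(np.sqrt(np.mean(diff ** 2)))

    # e.g. tracking the simulated states (state_matrices is a hypothetical dict
    # mapping 'S1' .. 'S6' to distance matrices built with structure_matrix):
    # for name, d_state in sorted(state_matrices.items()):
    #     print(name, structure_difference(d_state, d_native))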

5 Assessment of the pronunciation based on the size of the vowel structure

The structural representation of pronunciation can capture not only its segmental aspect but also its prosodic aspect. Schwa is the most fundamental vowel in that it is located at the center of the vowel chart and is produced with the least articulatory effort, and unstressed vowels are often said to move toward schwa. This predicts that the vowel structure becomes larger when the vowels are stressed and smaller when they are unstressed; the size of the structure may thus be interpreted as the magnitude of articulatory effort. The prediction was examined experimentally using 709 sentences read by a female American speaker. Only /ɪ i ɛ æ ʌ ɑ ʊ u ɚ/ were considered, because the other two vowels showed a strong bias between their stressed and unstressed occurrences. Figure 5 shows a tree diagram of the stressed vowels and one of the unstressed vowels, where /æ1/ and /æ0/ denote stressed and unstressed /æ/ respectively. The height of a tree corresponds to the radius of the structure. The stressed tree is 1.4 times higher than the unstressed tree, which verifies the prediction experimentally.

Figure 5. Tree diagrams of the unstressed and stressed vowels

The size of the vowel structure was then used to assess non-native pronunciations. Sixty sentences were read by 19 Japanese students (10 males and 9 females), and the two kinds of vowel structure, stressed and unstressed, were extracted from each student. The same sentences were read by 4 Americans, whose vowel structures were extracted in the same way. Figure 6 shows the correlation between the ratio of the structure sizes (stressed to unstressed) and the proficiency scores rated by 5 American teachers. A very high correlation is found, which supports the validity of using the size of the vowel structure for automatic assessment of pronunciation. The average ratio for the 4 Americans was 1.17.

Figure 6. Correlation between ratios of structure sizes and human scores
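A minimal sketch of this assessment procedure is given below, under two stated assumptions: the "size" of a structure is approximated by the mean pairwise distance (the paper itself refers to the height, i.e. radius, of the tree), and per-student stressed/unstressed matrices and teacher scores are already available under hypothetical names.

    # Sketch: assessing pronunciation from the ratio of stressed- to unstressed-
    # vowel structure sizes and correlating it with human proficiency scores.
    # Assumption: structure "size" is approximated by the mean pairwise distance;
    # the paper itself uses the height (radius) of the tree.
    import numpy as np
    from scipy.stats import pearsonr

    def structure_size(dist_matrix):
        iu = np.triu_indices_from(dist_matrix, k=1)
        return float(np.mean(dist_matrix[iu]))

    def stress_ratio(d_stressed, d_unstressed):
        return structure_size(d_stressed) / structure_size(d_unstressed)

    # Hypothetical inputs: per-student distance matrices and teacher ratings.
    # ratios = [stress_ratio(d_str[s], d_unstr[s]) for s in students]
    # scores = [teacher_score[s] for s in students]
    # r, p = pearsonr(ratios, scores)
    # print(f'correlation with human scores: r = {r:.2f} (p = {p:.3g})')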

6 Conclusions

This paper introduced the structural representation of speech sounds into pronunciation training. Although some experimental results were shown, large portions of the technical discussion were intentionally omitted. Readers interested in these issues should contact the first author (mine@gavo.t.u-tokyo.ac.jp).

7 References

N. Minematsu (2004a) Yet another acoustic representation of speech sounds, Proc. ICASSP, pp. 585-588.
N. Minematsu (2004b) Pronunciation assessment based upon the phonological distortions observed in language learners' utterances, Proc. ICSLP, pp. 1669-1672.
N. Minematsu (2005) Mathematical evidence of the acoustic universal structure, Proc. ICASSP, pp. 889-892.
N. Minematsu, T. Nishimura, K. Nishinari, and K. Sakuraba (2005) Theorem of the invariant structure and its derivation of speech Gestalt; English version submitted to Interspeech 2005, Japanese version published as IEICE technical report SP2005-12, pp. 1-8.
T. Murakami, K. Maruyama, N. Minematsu, and K. Hirose (2005) Japanese vowel recognition based on structural representation of speech, submitted to Interspeech 2005.