Corpus and Statistical Analysis of F0 Variation for Vietnamese Dialect Identification

pp. 205-210, http://dx.doi.org/10.14257/astl.2015.111.40
ISSN: 2287-1233 ASTL, Copyright 2015 SERSC

Pham Ngoc Hung 1, Trinh Van Loan 1,2, Nguyen Hong Quang 2
1 Faculty of Information Technology, Hungyen University of Technology and Education, Hungyen, Vietnam
2 School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi, Vietnam
pnhung@utehy.edu.vn, {loantv, quangnh}@soict.hust.edu.vn

Abstract. The performance of speech recognition systems improves when the corpus is organized by specialized domain and applied consistently to recognition in specific situations. Vietnamese has many dialects. Building a corpus of Vietnamese dialects is the first step towards a dialect identification system, which in turn can improve the performance of Vietnamese speech recognition in general. This paper presents a method of building a corpus for Vietnamese dialect identification. The Vietnamese corpus VDSPEC is built with topic-based recording and tonal balance. The corpus has a total duration of 33.79 hours and covers 6 topics. The basic characteristics and preliminary evaluations of the corpus are also described. A statistical analysis of F0 variation shows that the pronunciation modality of Vietnamese tones differs between Hue and Hanoi voices. These differences can serve as important features for identifying these dialects.

Keywords: Vietnamese, corpus, Vietnamese dialect, statistical analysis, fundamental frequency, topic-based recording, tone balance

1 Introduction

To carry out research on speech recognition in general, and on dialect identification in particular, we need a good-quality corpus that meets the research requirements. For Vietnamese, some corpora already exist, such as VNSPEECHCORPUS [1], the VOV (Voice of Vietnam) corpus [2] and VNBN (United Broadcast News corpus) [3]. A corpus can be constructed in several ways. One is to take available audio from radio and television, classify it, extract the audio signals that match the requirements, and then review and edit the corresponding text [2], [3]. The alternative is to set up a recording environment and select speakers according to a recording scenario prepared in advance. For dialect recognition, especially for Vietnamese, the corpus should reflect the characteristics of the Vietnamese language. The available corpora mentioned above do not simultaneously satisfy these requirements. Therefore, the Vietnamese corpus VDSPEC (Vietnamese Dialect Speech Corpus) was built to meet the requirements of both speech recognition and Vietnamese dialect recognition.

A dialect is a form of a language spoken in a particular region of a country. Dialects may differ in vocabulary, grammar and pronunciation modality. For Vietnamese, research on dialects has concentrated mainly on the linguistic approach [4]. In our research, we focus only on the pronunciation modality of Hanoi and Hue voices, and dialect identification is based on signal processing; hence the corpus does not reflect the differences in dialect words and grammar between these regions.

Vietnamese is a tonal language. Tones play a very important role in Vietnamese because they contribute to the meaning of a word. The pronunciation modality of Vietnamese tones differs from dialect to dialect. Therefore, the analysis of this pronunciation modality has important implications for the identification and synthesis of Vietnamese dialects.

Section 2 of this paper presents the method for building the Vietnamese corpus, in which different topics are recorded with tonal balance taken into account for some Vietnamese dialects. Section 3 describes the corpus in detail, together with the statistical analysis of F0 variation for the dialects in this corpus. Finally, Section 4 gives conclusions and future development.

2 Method for building Vietnamese corpus

Dialectal corpora already exist for some languages, such as English [5], Chinese [6], Arabic [9] and Thai [11]. For English, FRED is a large dialect corpus that covers 8 dialects, with 2.45 million words of text and about 300 hours of speech. FRED contains data from 420 different speakers, whose ages range from six to 102 years. The material included in FRED was recorded over 30 years. The corpus permits the investigation of non-standard morphosyntax as well as the analysis of phonetic or phonological details. For Chinese, there are eight major dialectal regions. The authors of [6] built a corpus for the Wu dialect, one of the eight major Chinese dialects, providing information at four levels: phonetic level, lexicon level, language level and acoustic decoder level.

Our corpus is built mainly as a first step in research on Vietnamese dialect identification, so its target is more modest and it meets the basic criteria. The corpus is built to cover a relatively wide range of topics; the text content ensures tonal balance; speakers are balanced by gender; speakers are selected so that they have a local accent and a steady voice; and the recording environment has low noise. A corpus can be recorded in two ways: as spontaneous speech or as read speech. To keep the recorded content under our control, we chose read speech.

The Vietnamese corpus is built in two stages. Stage 1 comprises the compilation, collection and classification of documents by topic, followed by adjustments to ensure tone balance in the prepared text. In Stage 2, recording is performed with specialized equipment in a selected environment. These stages are described in detail below.
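As a rough illustration of how the Stage 1 tone-balance check could be automated, the sketch below counts the six tones in a piece of Vietnamese text. It is only a sketch under stated assumptions: the input is plain Unicode text in standard orthography, syllables are separated by whitespace, and the names `tone_of` and `tone_counts` are hypothetical and not taken from the authors' software. The tone is read from the diacritic after Unicode decomposition; a syllable without a tone mark is counted as level tone.

```python
import unicodedata
from collections import Counter

# Combining marks that encode the five marked Vietnamese tones (NFD form).
# Vowel-quality marks (circumflex, breve, horn) are deliberately not listed.
TONE_MARKS = {
    "\u0300": "low-falling",  # grave accent
    "\u0301": "rising",       # acute accent
    "\u0309": "asking",       # hook above
    "\u0303": "broken",       # tilde
    "\u0323": "heavy",        # dot below
}
TONES = ["level", "low-falling", "rising", "asking", "broken", "heavy"]


def tone_of(syllable):
    """Return the tone name of one orthographic syllable (hypothetical helper)."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "level"  # no tone mark: level tone


def tone_counts(text):
    """Count how many syllables of each tone occur in the text."""
    counts = Counter({tone: 0 for tone in TONES})
    # Compose first so that isalpha() keeps accented letters intact.
    for token in unicodedata.normalize("NFC", text).split():
        syllable = "".join(c for c in token if c.isalpha())
        if syllable:
            counts[tone_of(syllable)] += 1
    return counts


if __name__ == "__main__":
    counts = tone_counts("ma mà má mả mã mạ")
    for tone in TONES:
        print(tone, counts[tone])
```

A text is balanced in the sense used here when the six counts are roughly equal; sentences can then be added or removed until the counts converge.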

The topics are selected from electronic documents. The words in these topics are counted to ensure tone balance. Tone balance means that the six tones occur with about the same frequency (about 717 words for each tone). This procedure is carried out automatically with software support, or manually. The topics include life sciences, business, law, cars, motorcycles; the texts are collected from the electronic newspaper VnExpress. Sentences containing 4333 syllables in total have been collected, classified and selected.

The selection of speakers has a significant impact on the quality of the recorded voices. Speakers are chosen so that they speak with the local accent. The average age of the speakers is 21 years. At this age, voice quality is steady and shows the full features of the local voice. Recording is also held in different sessions to cover the natural variability of the human voice. Audio is recorded as standard uncompressed PCM, with a sampling frequency of 16 kHz, 16 bits per sample and one channel (mono).

3 Results

The corpus consists of 50 male voices and 50 female voices. It covers two main Vietnamese dialects: 50 speakers of the northern dialect and 50 speakers of the middle dialect. For each dialect, the number of male voices equals the number of female voices. In our case, the northern dialect is the Hanoi voice and the middle dialect is the Hue voice. For each topic, every speaker reads 25 sentences. The total number of recorded sentences is the number of speakers multiplied by the number of sentences read by each speaker. The corpus size is 3.62 GB and the total duration is 33.79 hours.

Fig. 1. Variation of 6 tones for female voices. (a) Hanoi, (b) Hue

Fig. 2. Variation of 6 tones for male voices. (a) Hanoi, (b) Hue

Praat [8] was used to estimate the fundamental frequency variation of Vietnamese tones in VDSPEC. Four representative voices were selected: one male and one female for each of the two dialects. The actual durations of the tones usually differ. To make the differences between contours more evident, these durations have been normalized to the same interval of 0.5 seconds. The results are shown in Figures 1 and 2.
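The duration normalization can be illustrated with a short sketch. This is not the authors' procedure, only one plausible reading of it: the time axis of each F0 contour is rescaled linearly to 0.5 seconds and the contour is resampled at evenly spaced points by linear interpolation. The function name `normalize_contour` and the parameter `n_points` are hypothetical, and the input is assumed to be a list of voiced (time, F0) samples with strictly increasing times, such as those exported from a pitch tracker.

```python
def normalize_contour(times, f0, target_duration=0.5, n_points=11):
    """Rescale a contour to target_duration seconds and resample it evenly.

    times: strictly increasing sample times in seconds (assumption).
    f0:    F0 values in Hz at those times (voiced frames only).
    """
    if len(times) != len(f0) or len(times) < 2:
        raise ValueError("need at least two (time, F0) samples")
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    t0, t1 = times[0], times[-1]
    if t1 <= t0:
        raise ValueError("times must be increasing")

    # Map the original time axis linearly onto [0, target_duration].
    scaled = [(t - t0) / (t1 - t0) * target_duration for t in times]

    step = target_duration / (n_points - 1)
    out_t, out_f = [], []
    j = 0  # index of the segment [scaled[j], scaled[j + 1]] containing t
    for i in range(n_points):
        t = min(i * step, target_duration)
        while j < len(scaled) - 2 and scaled[j + 1] < t:
            j += 1
        span = scaled[j + 1] - scaled[j]
        w = 0.0 if span == 0 else (t - scaled[j]) / span
        out_t.append(t)
        out_f.append(f0[j] + w * (f0[j + 1] - f0[j]))
    return out_t, out_f


if __name__ == "__main__":
    t, f = normalize_contour([0.0, 0.1, 0.2], [200.0, 220.0, 180.0], n_points=5)
    for ti, fi in zip(t, f):
        print(f"{ti:.3f} s  {fi:.1f} Hz")
```

After this step, contours of tones with different original durations share the same time axis, so they can be overlaid or averaged point by point.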

For the level tone, F0 variation is rather small, at around the mid level, for both dialects. For the Hanoi voice, the rising tone starts at mid level and then rises, whereas for the Hue voice the difference between the starting and ending F0 values is smaller than for the Hanoi voice. For the low-falling tone, F0 starts low-mid and falls monotonically. For the heavy tone, F0 starts mid or low-mid and falls rapidly at the end for the Hanoi voice. For the asking tone (falling-rising tone), F0 goes down and tends to go up at the end for the Hanoi voice. For the broken tone, F0 falls and may be broken before going up for the Hanoi voice. In general, F0 for the tones of Hue voices tends to fall monotonically, like the low-falling or heavy tones of Hanoi voices.

Fig. 3. F0 variation for asking tone

Fig. 4. F0 variation for broken tone

Fig. 5. F0 variation for heavy tone

Fig. 6. F0 variation for level tone

Fig. 7. F0 variation for low-falling tone

Fig. 8. F0 variation for rising tone

The variation of F0 values across all speakers (50 males and 50 females) is also evaluated and is depicted by boxplots in Figures 3 to 8. These figures show F0 variation for Hanoi male voices (Hn-M), Hanoi female voices (Hn-F), Hue male voices (Hue-M) and Hue female voices (Hue-F). For each dialect, there are 25 female voices and 25 male voices. Figure 3 shows that the range of F0 variation for the asking tone is smaller for Hue voices than for Hanoi voices, whereas for the level tone this range is larger for Hue voices than for Hanoi voices (Figure 6). For the broken and rising tones, F0 of Hue voices tends to go lower than that of Hanoi voices, as in Figures 4 and 8. In contrast, for the heavy and low-falling tones, F0 of Hue voices tends to go higher than that of Hanoi voices, as can be seen in Figures 5 and 7. Generally speaking, the direction and the range of F0 variation for Hue tones tend to be opposite to those of Hanoi tones. This conclusion is consistent with how listeners actually perceive the difference in pronunciation modality between the tones of the Hue voice and those of the Hanoi voice.

To determine the signal-to-noise ratio of VDSPEC, the background noise is assumed to be additive to the speech signal. This assumption is consistent with the actual conditions in the recording studio. The signal-to-noise ratio is therefore determined as follows. During silence, when there is no voice and only background noise, the noise power is calculated as

$$P_N = \frac{1}{N}\sum_{n=0}^{N-1} b^2(n) \qquad (1)$$

where $P_N$ is the short-time power of the background noise, $N$ is the window length and $b(n)$ is the background noise. With a sampling frequency of 16000 Hz, $N$ is set to 256.

Based on the additive-noise assumption, the spectral subtraction method is applied to obtain the clean speech signal. The power of the clean speech signal is calculated as

$$P_X = \frac{1}{N}\sum_{n=0}^{N-1} x^2(n) \qquad (2)$$

where $P_X$ is the short-time power of the clean speech signal $x(n)$. Finally, the signal-to-noise ratio in dB is

$$\mathrm{SNR} = 10\log_{10}\frac{P_X}{P_N} \qquad (3)$$
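The SNR estimate can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: instead of full spectral subtraction, it subtracts the average noise power from the average power of the noisy speech, which rests on the same additive-noise assumption when speech and noise are uncorrelated. The function names and the choice of non-overlapping frames are assumptions; the window length of 256 samples follows the text.

```python
import math


def short_time_power(frame):
    """Mean of the squared samples of one frame: (1/N) * sum(s[n]^2)."""
    return sum(s * s for s in frame) / len(frame)


def split_frames(signal, n=256):
    """Split a signal into non-overlapping frames of n samples (tail dropped)."""
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]


def mean_power(signal, n=256):
    """Average short-time power over all complete frames of the signal."""
    frames = split_frames(signal, n)
    if not frames:
        raise ValueError("signal is shorter than one frame")
    return sum(short_time_power(f) for f in frames) / len(frames)


def snr_db(noisy_speech, silence, n=256):
    """Estimate SNR in dB from a noisy speech segment and a silence segment.

    Assumption: noise is additive and uncorrelated with speech, so the
    clean-speech power is approximated by P_noisy - P_noise. This stands in
    for the spectral subtraction step and is not the authors' exact method.
    """
    p_noise = mean_power(silence, n)
    p_clean = mean_power(noisy_speech, n) - p_noise
    if p_noise <= 0 or p_clean <= 0:
        raise ValueError("powers must be positive to compute an SNR")
    return 10.0 * math.log10(p_clean / p_noise)


if __name__ == "__main__":
    # Synthetic check: noise power 1e-4, clean speech power 1e-2.
    silence = [0.01, -0.01] * 256
    amp = math.sqrt(0.0101)
    noisy = [amp, -amp] * 512
    print(f"{snr_db(noisy, silence):.2f} dB")
```

In practice the silence segment would be taken from pauses in the same recording, so that the noise estimate reflects the conditions of that session.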

Using the method described above, the signal-to-noise ratio of the corpus VDSPEC was determined; its average value is approximately 35 dB. This value is appropriate for dialect identification and speech recognition systems.

4 Conclusions and development

This paper presents the methods and results of building a new Vietnamese corpus that takes tonal balance into account, for speech recognition and Vietnamese dialect identification. The statistical analysis of fundamental frequency variation shows that the pronunciation modality of tones differs between Hue and Hanoi voices. These differences can be used as important features, in combination with other features, for identifying the dialects. Our corpus will serve not only research on dialect identification but also Vietnamese speech synthesis. The corpus can be made more complete by adding different voices and other Vietnamese dialects in the near future.

References

1. V.B. Le, D.D. Tran, E. Castelli, L. Besacier, and J-F. Serignat: Spoken and Written Language Resources for Vietnamese. In: LREC 2004, Lisbon, Portugal, May 26-28, 2004, vol. II, pp. 599-602
2. T.T. Vu, D.T. Nguyen, M.C. Luong, and J-P. Hosom: Vietnamese Large Vocabulary Continuous Speech Recognition. In: INTERSPEECH 2005, Lisbon, Portugal, September 2005
3. Q. Vu, K. Demuynck, D. Van Compernolle: Vietnamese Automatic Speech Recognition: the FlaVoR Approach. In: ISCSLP 2006, Kent Ridge, Singapore (2006)
4. Hoàng Thị Châu: Phương ngữ học tiếng Việt. NXB Đại học Quốc gia Hà Nội (2009)
5. Bernd Kortmann: A Comparative Grammar of British English Dialects. Walter de Gruyter (2005)
6. Jing Li et al.: A Dialectal Chinese Speech Recognition Framework. Journal of Computer Science and Technology, Vol. 21, No. 1, pp. 106-115, Jan. (2006)
7. Theatre Supplies and Services, http://adena.co.nz/theatre/products/sound/microphoneswired/shure/sm-series/shure-sm48.htm
8. www.praat.org
9. Fadi Biadsy, Julia Hirschberg: Using Prosody and Phonotactics in Arabic Dialect Identification. Interspeech, Vol. 1, pp. 208-211 (2009)
10. Jean-Luc Rouas: Automatic prosodic variations modelling for language and dialect discrimination. IEEE Transactions on Audio, Speech and Language Processing, Vol. 15, No. 6, pp. 1904-1911 (2007)
11. Sittichok Aunkaew, Montri Karnjanadecha, Chai Wutiwiwatchai: Development of a Corpus for Southern Thai Dialect Speech Recognition: Design and Text Preparation. The 10th International Symposium on Natural Language Processing, October 28-30, 2013, Phuket, Thailand