MICRO-PROSODIC CONTROL IN CANTONESE TEXT-TO-SPEECH SYNTHESIS


Tan Lee (1), Helen M. Meng (2), W. Lau (1), W.K. Lo (1) and P.C. Ching (1)
(1) Department of Electronic Engineering
(2) Department of Systems Engineering & Engineering Management
The Chinese University of Hong Kong, Shatin, Hong Kong
tanlee@ee.cuhk.edu.hk
http://dsp.ee.cuhk.edu.hk/speech

ABSTRACT

This paper describes a pioneering study on prosodic control for Cantonese text-to-speech synthesis. We attempt to establish a set of segment-level duration rules and context-dependent F0 profiles and apply them to a syllable-based concatenative speech synthesizer that uses TD-PSOLA as the prosodic modification technique. The prosodic features are extracted by statistical characterization of a large amount of speech data. A subjective listening test shows that the micro-prosodic control results in a marginal but consistent improvement in perceptual naturalness.

Keywords: TTS, Cantonese, micro-prosody

1 INTRODUCTION

Cantonese is a major Chinese dialect spoken by over 6 million people in Southern China and Hong Kong. As the demand for human-computer speech interfaces rises within Chinese-speaking communities, Cantonese spoken language technologies have attracted increasing attention in recent years. We have developed one of the few existing Cantonese text-to-speech systems, as previously reported in [1]. This system adopted the syllable-based concatenative synthesis approach using the TD-PSOLA technique. As shown in many other studies, the TD-PSOLA method can produce acoustic signals with fairly high voice quality [2],[3]. It is well suited to monosyllabic and tonal languages like Mandarin and Cantonese because of its great flexibility in F0 and time-scale modification [4].

Prosodic control is of critical importance for attaining high naturalness of synthetic speech. In this paper, we address the problem of controlling micro-prosodic parameters for Cantonese TTS.
By micro-prosody, we refer mainly to the segment-level temporal structure and F0 variation. The temporal structure includes the duration of sub-syllable segments as well as the pause length between adjacent syllables. The syllable-wide F0 profile is seen as the primary control of lexical tone. Based on statistical derivation from a large speech database, a set of prosodic rules is established to improve the perceived naturalness of synthetic speech.

2 PROSODIC STRUCTURES OF CANTONESE

A spoken Cantonese sentence is a sequence of syllables. Each syllable essentially corresponds to a Chinese character, which may have a lexical or grammatical function. The syllable is also considered the fundamental pronunciation unit of Cantonese. Traditionally, a Cantonese syllable is divided into an INITIAL (I) and a FINAL (F). The INITIAL is basically a consonant onset and the FINAL is typically a vowel nucleus followed by an optional consonant coda. Table 1 gives the list of INITIALs and FINALs, while Table 2 lists all phonologically valid syllable structures in Cantonese.

Table 1: Cantonese INITIALs and FINALs
22 INITIALs: unaspirated plosives (UP), aspirated plosives (AP), approximants (G), nasals (N), fricatives (F), affricates (AF)
53 FINALs: nasal (N), long vowel (LV), diphthong (D), long vowel + stop (LV-S), short vowel + stop (SV-S), long vowel + nasal (LV-N), short vowel + nasal (SV-N)

Table 2: Different syllable structures in Cantonese
Syllable structure    Number of existing syllables
D                     6
LV                    3
LV-S                  4
LV-N                  5
SV-S                  4
SV-N                  4
N                     2
C-D                   134
C-LV                  82
C-LV-S                117
C-LV-N                133
C-SV-S                79
C-SV-N                91

Cantonese is well known for having nine tones, as depicted in Figure 1. They are numbered from 1 to 9. Tones 1 to 6 are referred to as non-entering tones and tones 7 to 9 as entering tones. The primary acoustic feature of Cantonese lexical tones is the syllable-wide F0 profile. Also, entering tones, which are associated exclusively with syllables with a stop coda (i.e.
/p/, /t/ or /k/), are much shorter than the non-entering tones. Figure 2 shows the acoustic waveform of a Cantonese utterance, aligned with the time-varying F0 and short-time energy (RMS). The utterance consists of two digit strings separated by a major break in the middle. It is observed that the syllable nucleus (vowel) can be roughly estimated

from the peaks in the energy plot. Also, each syllable is made up of an optional unvoiced segment and a voiced segment. If the coda is a stop, the syllable duration tends to be short and a closure follows. Just as in English, sentence-final lengthening is noticeable in Cantonese.

Figure 1: The nine Cantonese lexical tones.

It is also obvious that the F0 profile is heavily affected by tonal context. For example, digit 2 (tone 6) occurs four times (labeled as cases A to D) in the utterance, and the observed F0 patterns differ greatly among the cases. In case A, F0 keeps rising from a low level, because its left context is the lower level tone. In cases B and D, where the left context is the upper rising tone, a declining F0 pattern can be observed. Lastly, in case C, the slight declination of F0 is caused by its right context, which is the lower rising tone. In addition, there is a long-term, slow declination of F0 across the whole utterance.

3 THE BASELINE TTS SYSTEM

3.1 The Use of TD-PSOLA

As described in [1], the baseline system produces synthetic speech by concatenating pre-recorded syllables that have been modified using the TD-PSOLA technique to match the prescribed duration or F0 targets. Only the voiced segment of a syllable is subject to PSOLA modification, while the unvoiced segment is concatenated as it is.

3.2 Syllable Inventory

Prosodic modification by TD-PSOLA inevitably distorts the original signal. To keep the audible distortion low, the degree of modification should be as small as possible. Therefore tonal syllables have been chosen as the basic templates for synthesis. We use the CUSYL database, which is designed specially for syllable-based synthesis [5]. It has a large coverage of about 1,8 Cantonese tonal syllables, which include many colloquial and alternative pronunciations. All syllables were recorded from a female native Cantonese speaker.
3.3 Prosodic Control

Fixed syllable duration was assumed in the baseline system. The voiced segments of all syllables with non-entering tones were assumed to be 18 msec in length, regardless of their difference in syllabic structure. For all syllables with entering tones (i.e. with coda /p/, /t/ or /k/), a duration of 9 msec was assigned. For each of the nine lexical tones, a fixed F0 profile was used regardless of any contextual effect. The baseline system allowed adjustment of duration and F0 at utterance level. That is, the speaking rate and F0 dynamic range can be varied by linearly and uniformly scaling the nominal syllable duration and F0 profile.

4 DURATION AND PAUSE CONTROL

The duration of a Cantonese syllable depends very much on its phonetic content. For example, the voiced segment of a C-LV-N syllable is longer than that of a C-LV or C-D syllable. In this work, we try to obtain:
- the nominal duration of the voiced and unvoiced segments in each Cantonese base syllable;
- the nominal length of the inter-syllable pause between each pair of syllable coda and onset.

4.1 Speech Database

We use part of CUSENT, a newly developed Cantonese speech database, for duration measurement. The speech data includes a total of 13,8 continuous sentences from 46 different speakers. The sentence length ranges from 4 3 syllables and the average is 1 syllables.

4.2 Segmental Duration

Syllable-level time alignment is carried out using the HMM forced alignment method. The length of the inter-syllable pause is also available from this time alignment. Afterwards, voiced/unvoiced detection is performed using the get_f0 program in the ESPS waves+ software package [6]. The get_f0 program essentially implements a robust algorithm for pitch tracking (RAPT) based on the normalized cross-correlation function [7]. In this way, the durations of the voiced and unvoiced segments are derived.
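
The derivation in Section 4.2 can be illustrated with a minimal sketch. It assumes that the pitch tracker yields one voiced/unvoiced decision per analysis frame at a fixed frame shift, and that forced alignment supplies the start and end time of each syllable. The function name, the argument layout and the 10 msec frame shift used in the usage note are illustrative assumptions, not details taken from the system described here.

```python
from typing import List, Tuple


def segment_durations(voiced: List[bool], frame_shift: float,
                      start: float, end: float) -> Tuple[float, float]:
    """Return (unvoiced_sec, voiced_sec) for one syllable.

    voiced      -- per-frame voicing decisions for the whole utterance
                   (assumed output of a pitch tracker)
    frame_shift -- frame shift in seconds (assumed constant)
    start, end  -- syllable boundaries in seconds (assumed to come
                   from forced alignment)
    """
    # Map the syllable boundaries onto frame indices.
    first = int(round(start / frame_shift))
    last = min(int(round(end / frame_shift)), len(voiced))
    frames = voiced[first:last]

    # Count frames of each kind and convert the counts to seconds.
    n_voiced = sum(frames)
    n_unvoiced = len(frames) - n_voiced
    return n_unvoiced * frame_shift, n_voiced * frame_shift
```

For instance, with a 10 msec frame shift, a syllable covering 3 unvoiced frames followed by 12 voiced frames would yield roughly 0.03 s of unvoiced and 0.12 s of voiced duration. Counting frames ignores where the voiced frames lie inside the syllable, which is adequate only if each syllable has a single unvoiced onset followed by a single voiced stretch.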
4.3 Speaking Rate Normalization

Speaking rate normalization is performed to reduce undesirable variation of segmental duration from utterance to utterance. For each syllable S in an utterance, its local rate of speaking is evaluated as [8]

$$\mathrm{SROS}_S = \frac{\mathrm{DUR}_S}{\mu_{\mathrm{DUR}_S}}$$

where $\mu_{\mathrm{DUR}_S}$ is the mean duration for all occurrences of S and $\mathrm{DUR}_S$ denotes the duration of S in this particular utterance. Then the utterance-level rate of speaking is estimated as the average over all syllables, i.e.

$$\mathrm{UROS} = \operatorname{average}_S \left[ \mathrm{SROS}_S \right]$$

Both the absolute segmental duration and the inter-syllable pause length are normalized using the UROS.

4.4 Nominal Duration and Pause Length

For each of the 664 base syllables in CUSYL, the nominal durations of its voiced and unvoiced segments are estimated as described above. The results are shown in Figures 3 and 4. For easy visualization, syllables with similar phonetic structure are grouped together. Indeed, segmental duration varies greatly from one syllable to another. As shown in Figure 3, the duration of the voiced segment in a (C)-LV-S syllable is much shorter than that in a (C)-LV or (C)-D syllable. The duration difference

between syllables with long vowel and short vowel as nuclei is also quite noticeable. Figure 5 shows the nominal pause length for different coda-onset combinations. As expected, a short pause needs to be inserted whenever there is a closure between the syllables. This pause may be up to 9 msec if the coda is a stop and the following onset is an unaspirated plosive.

5 CONTEXT-DEPENDENT F0 PROFILE

In this work, we focus on how the F0 profile of a Cantonese syllable may be affected by its left tonal context. The speech materials used for analysis are obtained from a female native Cantonese speaker and make up a total of 4, polysyllabic words. F0 extraction is performed using the get_f0 program in the ESPS software package, with the syllable boundaries given by HMM forced alignment. All of the F0 patterns are linearly re-sampled to have the same length of 24. There are 10 possible kinds of left tonal context for each syllable, i.e. tones 1 to 9 and utterance-beginning. An averaged F0 profile is calculated for each context. As an illustrative example, the context-independent and context-dependent F0 profiles for tone 6 are plotted in Figure 6. Generally speaking, tone 6 features a slowly declining F0 pattern. At the utterance-beginning position, the whole F0 profile tends to shift upwards. It also seems that F0 keeps good continuity even across syllable boundaries. As shown in Figure 6, a relatively high F0 is observed when the left context is tone 1 or tone 2, both of which conclude with a high F0 level.

6 PERCEPTUAL TEST

6.1 Design of the Test

Subjects are required to listen to pairs of utterances and to grade the utterances on a scale from 1 to 5 (1 being the worst and 5 the best). In each pair, one utterance is generated by the baseline system and the other is the result of one of the following prosodic controls:
1) Duration and pause only;
2) Context-dependent F0 only;
3) Both duration and F0.
The reference is arbitrarily placed in the first or the second position.
A total of 3 sentences have been selected as the synthesis materials. Therefore, each subject has to listen to 9 pairs of synthetic utterances (which are randomly ordered) and give 18 grades. Fifteen subjects participated in the test.

6.2 Results Analysis

For each trial in the listening test, a pair of grades is obtained. Let G_p be the grade for the utterance with prosodic control and G_b be the grade for the reference utterance. Then the difference G_p - G_b is a good indication of the relative improvement (or degradation) resulting from the prosodic control. In Figure 7, the histograms of G_p - G_b are plotted separately for the 3 types of prosodic control. It can be observed that there is a marginal but consistent improvement after applying any of the prosodic modifications. It is also observed that the effect of duration modification is more prominent than that of F0 modification.

7 DISCUSSION & CONCLUSION

The improvement attained is indeed marginal, but this is expected for several reasons. Firstly, the overall perceptual naturalness of synthetic speech is affected by many factors, which include minor or major breaks at word, phrase or sentence level, stress, intonation, etc. It is possible that, in fluent speech, the macro-prosodic factors overwhelm the contribution of the segment-level duration and F0 adjustment. Secondly, our duration rules are derived from speech data which are all read newspaper sentences. Such sentences usually carry much more than micro-prosodic effects. Thirdly, the HMM forced alignment method is known to be error-prone. This may affect, to a certain extent, the accuracy of the estimated nominal durations. For more reliable prosodic rules, manually labelled speech materials are most desirable. Fourthly, we only consider the left tonal context at this stage. This is certainly inadequate, as evidenced by case C in Figure 2.
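
The analysis in Section 6.2 reduces each trial to the difference G_p - G_b and tallies the differences per type of prosodic control. A minimal sketch of that tally follows. The trial tuple layout, the control labels and the function name are assumptions made for illustration, and any grades passed to it in an example are invented rather than taken from the listening test.

```python
from collections import Counter
from typing import Dict, List, Tuple

# One trial: (type of prosodic control, G_p, G_b). This layout is an
# assumption for illustration, not the format used in the actual test.
Trial = Tuple[str, int, int]


def difference_histograms(trials: List[Trial]) -> Dict[str, Counter]:
    """Tally G_p - G_b separately for each type of prosodic control."""
    histograms: Dict[str, Counter] = {}
    for control, g_p, g_b in trials:
        # A positive difference means the modified utterance was
        # graded higher than the baseline reference.
        histograms.setdefault(control, Counter())[g_p - g_b] += 1
    return histograms


def mean_difference(histogram: Counter) -> float:
    """Average grade difference over all trials in one histogram."""
    total = sum(histogram.values())
    return sum(diff * count for diff, count in histogram.items()) / total
```

A histogram concentrated slightly to the right of zero, with a small positive mean difference, is the pattern that corresponds to a marginal but consistent improvement.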
Nevertheless, it is our belief that segment-level duration and F0 control is the first essential step towards natural speech synthesis. In the near future, we will proceed to investigate long-term prosodic phenomena and properly incorporate them for the betterment of Cantonese TTS technology.

8 REFERENCES

[1] Min Chu and P.C. Ching, A Cantonese Synthesizer Based on TD-PSOLA Method, in Proceedings of ISMIP-97, pp. 262-7, Taipei.
[2] E. Moulines et al., A Real-Time French Text-to-Speech System Generating High-Quality Synthetic Speech, in Proceedings of ICASSP-9, Vol.1, pp.39-12.
[3] D. Bigorgne et al., Multi-lingual PSOLA Text-to-Speech System, in Proceedings of ICASSP-93, Vol.2, pp.187-9.
[4] Min Chu and Shinan Lu, A Text-to-Speech System with High Intelligibility and High for Chinese, Chinese Journal of Acoustics, Vol.15, No.1, pp.81-9, 1996.
[5] W.K. Lo, Tan Lee and P.C. Ching, Development of Cantonese Spoken Language Corpora for Speech Applications, in Proceedings of ISCSLP-98, pp.12-7, Singapore.
[6] ESPS Programs Version 5., Entropic Research Laboratory, Inc.
[7] D. Talkin (1995), A Robust Algorithm for Pitch Tracking (RAPT), in Speech Coding and Synthesis (W.B. Kleijn and K.K. Paliwal eds.), pp.495-518, Elsevier Science B.V., Amsterdam.
[8] Tan Lee, R. Carlson and B. Granstrom, Context-Dependent Duration Modeling for Continuous Speech Recognition, in Proceedings of ICSLP-98, Vol.7, pp. 2955-8, Sydney.

Figure 2: Prosodic structure in Cantonese speech: an example (cases A to D are marked on the digit strings).
Figure 3: Duration of voiced segment in Cantonese syllables with different phonetic structures. LV: Long Vowel; SV: Short Vowel; D: Diphthong; S: Stop; N: Nasal.
Figure 4: Duration of unvoiced segment in Cantonese syllables with different phonetic structures. LV: Long Vowel; SV: Short Vowel; D: Diphthong; S: Stop; N: Nasal.
Figure 5: Inter-syllable pause length for different coda-onset combinations.
Figure 6: F0 profile of a syllable under different tonal contexts (frequency in Hz against time; curves: context-independent, sentence-initial, and left-context tones 1 to 9).
Figure 7: Results of the listening test: histograms of G_p - G_b for different types of prosodic modification (count against grade difference; panels: tonal only, durational only, durational and tonal).