The Pause Duration Prediction for Mandarin Text-to-Speech System

Jian Yu, Jianhua Tao
National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences
{jyu, jhtao}@nlpr.ia.ac.cn

Abstract: This paper analyzes in detail how pause duration at different prosodic boundaries is affected by various contextual factors in natural speech. To quantify these effects, we calculate the mean pause duration at each type of prosodic boundary. The contextual factors investigated include linguistic features, such as boundary type, the tones of the syllables on both sides of the boundary, and initial and final types, and acoustic features, such as the pitch gap across the pause. Experiments and discussion reveal the influence of these factors on pause duration. Based on these findings, we build a pause duration prediction model for a Mandarin speech synthesis system. A listening test showed that the model generates high-quality prosody output.

I. INTRODUCTION

In a text-to-speech (TTS) system, predicting prosodic information is very important for producing natural-sounding speech. The pause duration model, one of the important parts of the prosody model, is essential to prosodic quality. Synthesized speech sounds unnatural, and sometimes even unacceptable, if a constant pause duration is used for each prosodic boundary type. In fact, pause duration is related to various linguistic features and to other prosodic features.

Although pause modeling has received less attention than pitch and duration modeling, some work has been done in recent years. The rule-based method [1] is typical: linguistic expertise is used to infer pause generation rules from observations on a large corpus. This approach is simple and convenient, but collecting a large number of detailed rules is quite time-consuming.
Moreover, the results, which affect prosody generation, were not very good. Later, trained models such as artificial neural networks (ANN) were applied to pause duration prediction [2]. They gave better results than traditional rule-based methods, but an ANN requires a very large training corpus, and the results were still limited in most cases. It is also hard to learn how pause duration relates to other features from ANN outputs alone. Similar work has been done by others [6].

Unlike previous work, this paper analyzes in detail the relationship between pause duration and various contextual features. The results help us understand the nature of pause generation and predict pause duration more precisely. A decision tree is then used to collect the rules for pause duration prediction automatically from a limited training corpus. The model has been integrated into the CASIA TTS system and has been shown to generate high-quality prosody output.

This paper is organized as follows. Section II introduces the speech corpus on which our research is based. Section III describes the experiments on both text information and prosodic information; the results are listed in detail and analyzed from the viewpoints of phonetics and phonology. Section IV introduces the CART-based pause duration prediction model and the experiments on its output. Section V gives the final discussion and conclusions.

II. SPEECH CORPUS

The corpus used in our work contains 5,000 sentences (about 80,000 syllables) recorded by a professional female speaker. It is carefully designed to cover all Mandarin syllables, all tone combinations, and as many contextual variations as possible. The corpus was manually labeled with prosodic boundaries, word segmentation, POS tags, pitch tags, and acoustic syllable boundaries.
The prosodic boundaries are classified into four layers:

* B0: syllable.
* B1: prosodic word, a group of syllables that are uttered closely together.
* B2: prosodic phrase, a group of prosodic words with a perceptible rhythmic break at the end.
* B3: sentence, the whole utterance.

Sentence boundaries always contain a long silence, which is outside the scope of this paper. The statistical distribution of pauses at the other boundaries is listed in Table 1, which gives the mean pause duration, the standard deviation, and the probability that a pause occurs.

TABLE 1
PAUSE DURATION UNDER DIFFERENT BOUNDARY CATEGORIES

Boundary   Mean     Deviation   Probability
B0         38 ms    19 ms       47.3%
B1         51 ms    25 ms       63.6%
B2         122 ms   68 ms       97.2%
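
The spread in Table 1 is easier to compare across boundary types when each standard deviation is expressed as a fraction of its mean (the coefficient of variation). The short Python sketch below does this with the values copied from Table 1. The statistic and the names `table1` and `coefficient_of_variation` are illustrative choices and do not come from the paper.

```python
# Mean and standard deviation of pause duration (ms), copied from Table 1.
table1 = {
    "B0": (38.0, 19.0),
    "B1": (51.0, 25.0),
    "B2": (122.0, 68.0),
}


def coefficient_of_variation(mean_ms, std_ms):
    """Standard deviation expressed as a fraction of the mean."""
    return std_ms / mean_ms


# Illustrative summary only: one ratio per boundary category.
cv = {b: coefficient_of_variation(m, s) for b, (m, s) in table1.items()}

for boundary, value in cv.items():
    print(f"{boundary}: deviation is {value:.0%} of the mean")
```

At every boundary type the deviation is about half of the mean, so the boundary type alone leaves a large part of the variation unexplained.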

As Table 1 shows, longer pauses are normally associated with higher prosodic boundaries. Most previous TTS systems exploit this tendency, but only through a few simple rules. However, a higher prosodic boundary does not always produce a longer pause, because pause duration still varies widely within each boundary category. At B2, for example, the standard deviation is 68 ms while the mean is only 122 ms. More features are therefore needed to predict pause duration precisely.

III. EXPERIMENTS AND DISCUSSION

In the past, researchers predicting prosodic information usually relied only on the results of text analysis and neglected the prosodic information itself [4] [5]. The factors investigated in this paper therefore include not only text information, such as boundary type, initial category, and final category, but also prosodic information, such as the pitch gap across the pause.

A. Influence of inner-syllabic features

Since initial and final categories strongly influence syllable duration, we may suspect that the initial category of the following syllable and the final category of the previous syllable also influence pause duration. Fig. 1 and Fig. 2 show the statistics from our corpus. They reveal three notable points:

(1) At non-phrase boundaries (syllable and word boundaries), the following syllable's initial category has a great influence on pause duration. Fig. 1 shows that the pause before stops and affricates is much longer than that before fricatives, nasals, and zero initials. This may be caused by the initials' different manners of articulation. For example, part of the vocal tract is closed before a stop is released, which produces a pause.
(2) Compared with the initial category, the influence of the previous syllable's final category at non-phrase boundaries is weaker, but the pause after nasal finals is still slightly shorter.

(3) At phrase boundaries, neither the previous syllable's final category nor the following syllable's initial category influences pause duration. One reason might be that the pause at a phrase boundary is already long enough for the articulators to complete any action in the vocal tract, such as the closure before a stop, so the influence of the initial and final categories cannot be observed.

B. Influence of tone combination

The tones of the previous and following syllables may also influence pause duration. Table 2 and Table 3 list the mean pause durations at the different boundary types for each tone of the previous syllable and of the following syllable. The neutral tone is excluded from these tables because its occurrences are sparse and unbalanced. For example, neutral tones hardly ever occur on the first syllable of a phrase, so a mean pause duration for that environment would be meaningless.

At phrase boundaries, the differences between the means can be neglected given the long pause duration there; that is, tone identity has little influence on pause duration at phrase boundaries. At non-phrase boundaries the situation is different. Table 2 shows that the pause after a tone 3 syllable is much longer than in the other cases, and Table 3 shows that the pause may be lengthened when the next syllable carries tone 1 or tone 4.

Fig. 1. The influence of the following syllable's initial category on pause duration

Fig. 2. The influence of the previous syllable's final category on pause duration

TABLE 2
THE INFLUENCE OF THE PREVIOUS SYLLABLE'S TONE ON PAUSE DURATION

Tone of previous syllable   Syllable   Word      Phrase
Tone 1                      14.5 ms    28.6 ms   119.4 ms
Tone 2                      16.0 ms    27.9 ms   107.8 ms
Tone 3                      30.4 ms    41.3 ms   120.3 ms
Tone 4                      16.2 ms    31.1 ms   118.5 ms

TABLE 3
THE INFLUENCE OF THE FOLLOWING SYLLABLE'S TONE ON PAUSE DURATION

Tone of following syllable   Syllable   Word      Phrase
Tone 1                       22.4 ms    36.4 ms   112.1 ms
Tone 2                       13.6 ms    28.1 ms   110.9 ms
Tone 3                       16.0 ms    27.6 ms   117.2 ms
Tone 4                       18.3 ms    35.9 ms   120.8 ms

C. Correlation with more context features

Prosodic structure is another important factor that may strongly influence pause duration. Here prosodic structure means position in word, position in phrase, and position in sentence.

Fig. 3 shows the influence of position in sentence on pause duration at the different boundaries. At all boundaries, pause duration becomes longer as the position approaches the end of the sentence. One interpretation is that speakers tire as they pronounce the syllables of a sentence one by one, so pauses toward the end of the sentence become longer to let the speaker recover.

This effect does not appear in Fig. 4, which shows the influence of position in phrase. At syllable and word boundaries there is no clear relationship between pause duration and position in phrase. At phrase boundaries, where the position in phrase equals the phrase length, the pause lengthens as the phrase length increases.

Fig. 5 shows the influence of position in word. At syllable boundaries, pause duration varies irregularly with position in word. At word boundaries, however, where the position in word equals the word length, pause duration increases roughly linearly with word length. This resembles the growth of pause duration with phrase length at phrase boundaries shown in Fig. 4. One reason for both effects could be that after pronouncing a large number of syllables in succession, a speaker needs a comparatively long pause to relax.

Fig. 3. The influence of position in sentence on pause duration under different boundaries

Fig. 4. The influence of position in phrase on pause duration under different boundaries

Fig. 5. The influence of position in word on pause duration under different boundaries

D. Influence of pitch gap between two syllables

As an important part of prosodic information, pause is closely connected with other prosodic information, such as the pitch gap across the pause. Conventional prosody models, however, neglect the prosodic information itself. In this section we therefore examine the relationship between pause duration and pitch gap in different prosodic environments, as a reference for building a well-performing prosody model.

Pause duration and the pitch gap across the pause are known to be connected. From the corpus statistics, we plot several curves showing how pause duration changes as a function of pitch gap. Fig. 6 shows a clear relationship at word and syllable boundaries: pause duration is near its minimum when the pitch gap approaches zero, and it grows as the absolute value of the pitch gap increases. This pattern is not obvious at phrase boundaries, where the curve has several peaks. We therefore study the phrase-boundary case more closely.

Because of the complexity of pitch contours in Mandarin, pitch may rise or decline after a pause. The correlation between pause duration and pitch rise and that between pause duration and pitch decline are calculated separately, as shown in Table 4. The correlation with pitch decline is very weak, only 0.03 in magnitude, so when pitch declines after a pause there is no clear relationship between pause duration and pitch decline. Accordingly, the phrase-boundary curve in Fig. 6 resembles white noise where the pitch gap is negative. The correlation with pitch rise is 0.19, which indicates some relationship between the two variables. However, where the pitch gap is positive, the phrase-boundary curve in Fig. 6 has more than one peak, so the relationship is not simple.

Given the complexity of the pitch contours of the different tones, we study this relationship separately for each tone combination. Previous research shows a one-to-one correspondence between pitch targets and tones [3], so the four lexical tones can be represented by four basic targets: high (tone 1), rise (tone 2), low (tone 3), and fall (tone 4). Rise can be seen as low-to-high, and fall as high-to-low. The correlation between pause duration and pitch rise at phrase boundaries is then calculated for each class. In Table 5, LL, LH, HH, and HL denote the tone combinations: the first letter is the previous syllable's ending pitch and the second is the following syllable's starting pitch. For example, LH means that the previous syllable ends low (tone 3 or tone 4) and the following syllable starts high (tone 1 or tone 4).

Table 5 shows that when the previous syllable ends low, that is, when its tone is tone 3 or tone 4, pause duration and pitch rise are closely related, with correlations of 0.42 (LL) and 0.44 (LH). When the previous syllable ends high, the relationship is very weak. Fig. 7 plots pause duration for the different tone combinations. When the previous syllable ends low, as in Fig. 7(a) and Fig. 7(b), the relationship between pause duration and pitch rise is almost linear. When it ends high, as in Fig. 7(c) and Fig. 7(d), there is no obvious relationship.

Fig. 6. The relationship between pause duration and pitch gap under different boundaries

Fig. 7. The relationship between pause duration and pitch rise in different tone combinations under phrase boundary: (a) LL, (b) LH, (c) HL, (d) HH

TABLE 4
THE CORRELATION BETWEEN PAUSE DURATION AND PITCH GAP UNDER PHRASE BOUNDARY

Correlation between pause duration and pitch rise       0.19
Correlation between pause duration and pitch decline    -0.03

TABLE 5
THE CORRELATION OF PAUSE DURATION AND PITCH RISE IN DIFFERENT TONE COMBINATIONS UNDER PHRASE BOUNDARY

                                       LL     LH     HH     HL
Previous syllable's ending pitch       150    156    255    260
Next syllable's starting pitch         190    294    315    208
Correlation                            0.42   0.44   0.10   0.09

IV. CART-BASED PAUSE DURATION PREDICTION

Our final goal is not only to analyze the influence of the various factors on pause duration but to predict pause duration precisely in our TTS system. The classification and regression tree (CART) is an effective method for this prediction problem. Using the knowledge of how the various factors influence pause duration, a precise CART-based pause model can be constructed.

Note that we do not predict pause duration directly. Instead, we predict its logarithm, so the goal of CART is to minimize the mean squared error of the logarithm of pause duration. This improves the perceived quality of pause prediction. For example, when the actual pause duration is 200 ms, a prediction error of 20 ms is hardly noticeable to the listener. But when the actual pause duration is 20 ms, the same 20 ms error is very disturbing. Using the logarithm of pause duration as the prediction target resolves this problem to some extent.
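
The effect of the logarithmic target can be checked with a short calculation. The Python sketch below is illustrative only: the function names are assumptions rather than anything from the paper, and the natural logarithm is assumed because the paper does not state a base. It evaluates the 20 ms overshoot from the example above on the log scale, and shows that a leaf fitted by least squares on log durations maps back to the geometric mean of its training samples.

```python
import math


def log_error(actual_ms, predicted_ms):
    """Absolute prediction error measured on the log-duration scale."""
    return abs(math.log(predicted_ms) - math.log(actual_ms))


def leaf_prediction_ms(durations_ms):
    """Prediction of a least-squares leaf trained on log durations.

    The mean of the logs minimizes squared error in the log domain;
    mapping it back with exp() gives the geometric mean in milliseconds.
    """
    mean_log = sum(math.log(d) for d in durations_ms) / len(durations_ms)
    return math.exp(mean_log)


# The same 20 ms overshoot on a long pause and on a short pause.
long_pause_error = log_error(200.0, 220.0)
short_pause_error = log_error(20.0, 40.0)

print(f"200 ms pause, +20 ms: log error = {long_pause_error:.3f}")
print(f" 20 ms pause, +20 ms: log error = {short_pause_error:.3f}")
print(f"leaf over [20, 80] ms predicts {leaf_prediction_ms([20.0, 80.0]):.1f} ms")
```

On the log scale the overshoot on the 20 ms pause is roughly seven times larger than the same overshoot on the 200 ms pause, so a tree trained on this target is penalized more for errors on short pauses.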

TABLE 6
THE PREDICTING FEATURES OF TWO TREES

Tree     Features
Tree A   Boundary type, initial and final, tone, prosodic structure
Tree B   Boundary type, initial and final, tone, prosodic structure, and pitch gap

TABLE 7
THE PREDICTING RESULTS OF TWO TREES

         Prediction precision   Correlation
         Train     Test         Train   Test
Tree A   30.8%     31.9%        0.80    0.78
Tree B   23.5%     24.6%        0.88    0.86

We construct two trees, Tree A and Tree B. Tree A uses only text information to predict pause duration, while Tree B uses prosodic information in addition to text information. Comparing the two trees shows the value of prosodic information in predicting pause duration. Table 6 and Table 7 list the predicting features and the results of the two trees, respectively. Adding the pitch gap as a feature substantially improves prediction: on the test set, the precision figure changes from 31.9% to 24.6% and the correlation rises from 0.78 to 0.86. This result also supports our analysis in Section III.

To apply Tree B in a TTS system, the pause duration model must be placed after the pitch model, which generates the pitch contour, so that the pitch gap between two syllables is available. This method has been used in our TTS system, which generates high-quality prosody output.

V. CONCLUSIONS

This paper systematically studies various factors that influence pause duration in Mandarin. The factors include not only information from text analysis but also the prosodic information itself, such as the pitch gap across the pause. The experiments reveal the relationship between pause duration and these factors:

(1) At non-phrase boundaries, the text information that influences pause duration includes the initial category of the following syllable, the final category of the previous syllable, the tones of the previous and following syllables, and the position in sentence.
Moreover, word length also has some influence on pause duration at word boundaries.

(2) At phrase boundaries, only a little text information influences pause duration, such as position in sentence and phrase length. It is therefore necessary to use prosodic information to predict pause duration precisely.

(3) Pause duration is also closely connected with the pitch gap across the pause. At non-phrase boundaries, pause duration grows with the absolute value of the pitch gap. At phrase boundaries, the relationship is more complex. When pitch declines after a pause, the correlation between pause duration and pitch decline is very weak, only 0.03 in magnitude, so there is no clear relationship between the two variables. When pitch rises after a pause, the correlation between pause duration and pitch rise is 0.19, which indicates some connection. The experiments show that this relationship varies with the tone environment. When the previous syllable's ending pitch is low, that is, when its tone is tone 3 or tone 4, the relationship between pause duration and pitch rise is almost linear. When the previous syllable's ending pitch is high, that is, when its tone is tone 1 or tone 2, there is no obvious relationship.

Revealing these relationships is not our final goal. In Section IV we use the results to construct a more precise pause model for a text-to-speech system. We build two CART-based pause models, one using only text information and the other also using prosodic information. The comparison between the two models shows that prosodic information is very useful in predicting pause duration.

REFERENCES

[1] Lin-Shan Lee, Chiu-Yu Tseng, and Ming Ouh-Young, "The Synthesis Rules in a Chinese Text-to-Speech System," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 37, no. 9, pp. 269-285, 1989.
[2] Sin-Horng Chen, Shaw-Hwa Hwang, and Chun-Yu Tsai, "A First Study on Neural Net Based Generation of Prosodic and Spectral Information for Mandarin Text-to-Speech," ICASSP'92, San Francisco, March 1992.
[3] Yi Xu and Q. Emily Wang, "Pitch Targets and Their Realization: Evidence from Mandarin Chinese," Speech Communication, vol. 33, pp. 319-337, 2001.
[4] Min Chu and Yongqiang Feng, "Study in Factors Influencing Durations of Syllables in Mandarin," EuroSpeech 2001, Scandinavia.
[5] Sun Lu, Yu Hu, and RenHua Wang, "Polynomial Regression Model for Duration Prediction in Mandarin," ICSLP 2004, Korea.
[6] Elena Zvonik and Fred Cummins, "The Effect of Surrounding Phrase Lengths on Pause Duration," EuroSpeech 2003, Geneva.