CREATING AN INDIVIDUAL SPEECH RHYTHM: A DATA DRIVEN APPROACH


ISCA Archive

Oliver Jokisch, Diane Hirschfeld, Matthias Eichner, Rüdiger Hoffmann
Technical Acoustics Laboratory, Dresden University of Technology, D-01062 Dresden, Germany
Email: jokisch@eakss1.et.tu-dresden.de

ABSTRACT

Generating a near-natural speech rhythm can contribute greatly to user acceptance of TTS systems. Besides the common aspects of rhythm control (correctness of the segmental durations, robust function, etc.), rhythmic flexibility for several applications and individual speaking styles are desired. This article describes a data-driven concept that aims at generating an individual speech rhythm for the Dresden TTS system for German (DreSS). An additional prosodic-phonetic database has been extracted from the source speakers of the existing diphone inventories (acoustic synthesis). This database is used for adjusting rule-based and statistical models for duration control, and also for training an alternative neural network model (ANN). Several combinations of the models have been tested. From the current point of view, the effect of the specific model used is smaller than expected, but an appropriate design of the prosodic database seems to support the necessary variety of the rhythmic parameters. A limited individual modeling of the speech rhythm is possible. However, the global evaluation of the introduced approach includes some contradictions; more extensive tests are required.

1. INTRODUCTION

The control of speech rhythm has an essential influence on the quality of synthetic speech. Besides the common aspects of duration control, such as the correct modeling of segmental durations and robust function, today's speech applications (text readers, dialogue systems, etc.) demand greater rhythmic flexibility and the realization of individual speaking styles.
With higher segmental speech quality, also in the Dresden TTS system [1] for German, faults in the non-acoustic processing stages, especially in the prosodic parts, are no longer masked. So far, the TTS system contains a rule-based, phoneme-level-oriented duration control according to Klatt [2]. The redesign of the duration control is based on the following theses:

- Global and local rhythm: The durations must be generated on several levels, assuming that the duration levels are not correlated.
- Availability of large databases and automatic analysis: A prosodically oriented database is designed with respect to the synthesis target (reimplementing the individual rhythm of the inventory speaker, flexible speaking styles, etc.).
- Rule-based methods and data-driven approaches can be combined.
- Depending on the specific TTS application, the duration control shall enable, e.g., a safe speech output with high intelligibility or, alternatively, a very exciting rhythm, which may also contain mistakes.

Following these theses, the prosodic model was extended by a syllabic level and a phrase level. On each hierarchical level, different control concepts (rule-based, ANN) can be used. Combinations of fundamentally different concepts, or hybrid types, are becoming more and more important, since the commercial development of synthesis systems requires both reliable solutions using knowledge-based components and flexible systems, e.g. through data-driven extensions. Without a deeper understanding of internal human information processing, such a strategy of "limited training" seems to provide better solutions than purely data-driven concepts. For example, Corrigan et al. [3] suggested a hybrid rule-based/neural network approach to generate segment durations and pointed out its improved performance over a pure neural network system. The new multi-level approach in the Dresden TTS system, including the alternative models, is described in [4].
Since each level can be processed or trained separately, the database must be structured. On the other hand, there is a demand for consistent databases over the complete TTS process, from text pre-processing to acoustic synthesis (one speaker, similar conditions of labelling and extraction). The contradiction between database structuring and uniform data causes some practical problems. For example, the common phonological syllable definition, Onset-Nucleus-Coda ("ONC syllable"), is well suited for rule-based algorithms, but leads to faults during the semiautomatic labelling (boundary position or pause detection), with some effect on the sensitive neural network algorithm. The alternative syllable definition starting with a nucleus ("NCO syllable") enables higher neural network performance, but corresponds less well to the phrase and phonemic levels (see also section 2). The current study concentrates on the database design and aims to create an "individual voice" for the TTS system.

2. DATABASE

The database is designed according to the mentioned uniform data set, which is necessary throughout the whole TTS process, but also with respect to a possible coexistence of different, already established concepts of duration control (e.g. the syllable-oriented Campbell model [5] versus the phoneme-oriented Klatt model).

2.1. Database Design

In order to adjust the rule-based and the statistical algorithms, and also to train the neural networks, new speech data from both original speakers (male and female) of the diphone inventory have been recorded. The data of our male speaker (native speaker of German, f0 = 100 Hz) can be subdivided into two parts:

1. The text corpus (344 sentences, 10780 segments) was selected to show natural prosodic effects and a speech rhythm typical of a text-reading application. It combines two short stories and a longer passage of coherent text from a story tale.
2. The sentence corpus (443 sentences, 11353 segments) serves the purpose of inventory extraction and contains all phoneme combinations of the German language. On the other hand, the demand for a natural and fluent speaking style requires the units to be embedded into a sentence context. Both demands are met by the recorded sentences, which are similar to the German PhonDat 1 corpus [6].

Data preparation. The natural speech signal was labelled using information from different linguistic description levels. Much attention was paid to providing labels on the basis of objective features. The labels should be reusable for other purposes (training of automatic labellers, inventory generation, statistical studies, etc.).

Phone labels. The SAMPA inventory for German was extended by symbols, e.g. for pauses, noise and segments to be excluded from further processing. Plosives were subdivided into two segments: the pause and the burst including the aspiration phase. Vowels were labelled on the basis of formant features [7].

Prosodic labels. Accents (phrase accent, word accent) and phrase boundaries were labelled on the basis of smoothed z-score traces and pitch contours.

Syntactic labels. Finally, labels for syllable, word and clause boundaries were provided manually.

Syllable types. Two alternative definitions of the syllable were used for pragmatic reasons. The NCO pseudo-syllable is enclosed by two vowels.
The syllable starts at a vowel and ends before the next vowel, which keeps the syllabification process simple. Word boundaries are not included in the hierarchy built up by NCO syllables; the next higher level is the phrase or clause. At the beginning of a phrase there are rudimentary syllables without a vowel, which are excluded from further processing.

The second type, the ONC syllable, is oriented towards phonological/acoustic criteria. Word boundaries and prefix or suffix boundaries coincide with the syllable boundaries. The position of syllable boundaries in consonant clusters takes the acoustic segmentation of the speech signal into account (within plosive stops, after voiceless fricatives). To prevent open syllables containing short vowels, single intervocalic consonants are assigned to both neighbouring syllables.

2.2. Data Analysis

For the analysis of phrase, syllabic and phonemic durations, all prosodic and syntactic label files were projected onto the phone labels. All relevant information was extracted automatically. For the raw duration distribution of each phoneme, the mean and standard deviation were calculated. To compensate for the skew of the distributions, the logarithm of the raw duration was used.

Besides the phonemic database (Pho-DB), a syllabic database (Sy-DB) was constructed for both syllable types (NCO, ONC). It contains the following information: index of the syllable in the word, index of the word in the clause, index of the clause in the speech file, filename, phoneme string, duration of the syllable, nucleus type (long vowel, short vowel, diphthong, reduced vowel or syllabic consonant), accent type, function word, phrase and word position (initial, medial, final), number of phonemes and relative position of the nucleus.

The phrase database (Phr-DB) was constructed to contain prosodic phrases. It contains the following information: index of the clause in the speech file, filename, phrase duration, phrase type and the number of syllables in the phrase.
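The per-phoneme statistics described in section 2.2 can be sketched as follows. This is only an illustration under stated assumptions: the function names, the use of the natural logarithm and of the population standard deviation, and the sample durations are all hypothetical and not taken from the paper.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def log_duration_stats(segments):
    """Return {phoneme: (mean, std)} of the log durations.

    segments: iterable of (phoneme_label, duration_ms) pairs.
    Assumption: natural logarithm and population standard deviation.
    """
    logs = defaultdict(list)
    for phoneme, duration_ms in segments:
        logs[phoneme].append(math.log(duration_ms))
    return {p: (mean(v), pstdev(v)) for p, v in logs.items()}


def zscore(phoneme, duration_ms, stats):
    """Z-score of one observed duration in the log domain."""
    mu, sigma = stats[phoneme]
    if sigma == 0.0:
        return 0.0  # only identical durations were observed for this phoneme
    return (math.log(duration_ms) - mu) / sigma


# Hypothetical labelled segments (SAMPA-like labels, durations in ms).
segments = [("a:", 120.0), ("a:", 180.0), ("t", 60.0), ("t", 90.0), ("t", 75.0)]
stats = log_duration_stats(segments)
z_long_a = zscore("a:", 180.0, stats)
```

Working in the log domain means that the mean corresponds to the geometric mean of the raw durations, which is less affected by the long right tail of the duration distributions.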
The following phrase types are examined in the Phr-DB: clause begin - word accent, clause begin - phrase accent, word accent - phrase accent, word accent - word accent, phrase accent - clause end.

3. DISCUSSION

3.1. Methods (Overview)

On the mentioned levels (phrase, syllable, phoneme), the prosody module contains several procedures for generating the segmental durations, which can be used alternatively or in combination. The rule-based, ANN and statistically motivated approaches are described in [4]. The new database (described in section 2) is used for designing and adjusting the rules and for training the ANN procedures. The database is subdivided into a validation set, a test set and a training set (27..290 sentences). Sections 3.2 and 3.3 discuss the following examples from the syllabic level and their influence on the phonemic level:

- Syllabic and phrase level: ANN; phonemic level: statistical model (Campbell)
- Syllabic and phrase level: rule-based (multi-level rule, MLR, see [4]); phonemic level: Campbell
- Phonemic level: rule-based (similar to Klatt)

For comparison, the observed original syllables, the estimation results of the untrained syllable ANN, and tests with a constant mean syllable duration (d_syl = mean(d_syl) = 217.0 ms) and with a constant mean phone duration (d_pho = mean(d_pho) = 67.5 ms) are also examined. Unless noted otherwise, all further numerical results refer to the ONC syllable type.
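The statistical (Campbell) model on the phonemic level distributes a given syllable duration over the phonemes of the syllable. The paper does not spell out the formula, so the following is a minimal sketch that assumes the elasticity formulation usually associated with [5]: every phoneme i gets the duration exp(mu_i + k * sigma_i), with one shared z-score k chosen so that the durations sum to the syllable duration. The function name, the bisection solver, its search bounds and the sample statistics are hypothetical.

```python
import math


def elastic_durations(syllable_ms, phoneme_stats, lo=-10.0, hi=10.0, tol=1e-9):
    """Split syllable_ms over phonemes using one shared z-score k.

    phoneme_stats: list of (mu, sigma) pairs, the mean and standard
    deviation of the log duration of each phoneme in the syllable.
    Returns (k, list_of_phoneme_durations_ms).
    """
    def total(k):
        return sum(math.exp(mu + k * sigma) for mu, sigma in phoneme_stats)

    if not total(lo) <= syllable_ms <= total(hi):
        raise ValueError("syllable duration outside the reachable range")

    # total(k) does not decrease with k (all sigma >= 0), so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if total(mid) < syllable_ms:
            lo = mid
        else:
            hi = mid
    k = (lo + hi) / 2.0
    return k, [math.exp(mu + k * sigma) for mu, sigma in phoneme_stats]


# Hypothetical statistics for a three-phoneme syllable (log of ms).
example_stats = [(math.log(60.0), 0.3), (math.log(120.0), 0.4), (math.log(50.0), 0.2)]
k, durations = elastic_durations(217.0, example_stats)
```

In this sketch the position of a phoneme within the syllable plays no role; only its own duration statistics determine how far it is stretched or compressed.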

Figure 1: Utterance: "Da kam endlich ein kleiner Mann mit grauem Haar und drängte sich ziemlich rücksichtslos nach vorn." Correlation between phoneme durations (z-score, step function) and ANN-generated f0 contour. Upper panel input: observed original syllabic duration frames. Lower panel input: ANN-generated syllabic frames. Examples from the NeuRosy tool [8].

3.2. Adjustment and Training

Using the training set, the syllable ANN is trained until the root mean square error (RMSE) reaches a minimum on the test set. The ANN evaluation is based on the (independent) validation set. Table 1 shows the mean deviations from the observed syllable durations and the RMSE for the corresponding sets. Both parameters are given as a percentage of the mean (observed) syllable duration.

Data set                            Mean deviation / mean(d_syl)   RMSE(d_syl) / mean(d_syl)
Training (290 sent.)                34.6 %                         52.9 %
Test (27 sent.)                     32.3 %                         51.6 %
Validation (27 sent.)               33.3 %                         52.5 %
Validation, d_syl = mean(d_syl)     52.0 %                         84.3 %
Validation, untrained               73.5 %                         93.2 %

Table 1: Results from the ANN training (correctness of the syllable durations), ONC syllables. (Validation set with d_syl = mean(d_syl): reference only.)

The duration deviations (33..35 %) result partly from the variation of the original syllables (217.0 ms +/- 92.3 %) and partly from faults in the semiautomatic labelling (ONC syllable). The estimation of NCO syllables, for example, produces deviations of only 24..26 %. However, the resulting duration distribution on the phonemic level is more complex. For visualizing the ANN results on the phonemic level, the experimental prosodic tool NeuRosy [8] is available, which includes several modules for duration control and intonation control. Figure 1 shows the correlation between the phoneme durations (step function) and the ANN-generated continuous f0 contour along the phoneme sequence for one utterance.
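The two error measures reported in Table 1 can be sketched as follows. The sketch assumes that "mean deviation" means the mean absolute difference between generated and observed durations, which the paper does not state explicitly; the function name and the sample durations are hypothetical.

```python
import math


def relative_errors(observed_ms, generated_ms):
    """Return (mean deviation, RMSE), both in percent of the mean observed duration."""
    if not observed_ms or len(observed_ms) != len(generated_ms):
        raise ValueError("need two non-empty sequences of equal length")
    n = len(observed_ms)
    mean_observed = sum(observed_ms) / n
    diffs = [g - o for o, g in zip(observed_ms, generated_ms)]
    mean_deviation = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    return 100.0 * mean_deviation / mean_observed, 100.0 * rmse / mean_observed


# Hypothetical syllable durations in ms.
observed = [100.0, 200.0, 300.0]
generated = [110.0, 180.0, 330.0]
dev_pct, rmse_pct = relative_errors(observed, generated)

# Reference case as in Table 1: every syllable gets the mean observed duration.
constant = [sum(observed) / len(observed)] * len(observed)
ref_dev_pct, ref_rmse_pct = relative_errors(observed, constant)
```

Normalizing by the mean observed duration makes the syllable-level and phone-level results comparable even though their absolute durations differ.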
Despite the syllable deviations (Table 1), the visual and perceptual differences between the upper example (based on the observed syllables) and the lower example (ANN-generated syllable durations) are small. For example, the stronger accent (z-score bar on "ha6") in the centre of the lower diagram is audible. The multi-level rule (MLR) produces similar results on the syllabic level.

3.3. Performance on the Phonemic Level

Table 2 shows the mean relative deviations between generated and observed phone durations and the RMSE normalized by the mean (observed) phone duration, with reference to the validation set, for several models on the syllabic level and also for the Klatt model.

Model / reference                d_pho,rel   RMSE(d_pho) / mean(d_pho)
Original syllables               30.6 %      41.7 %
Multi-level rule                 32.7 %      43.1 %
Klatt                            38.4 %      50.6 %
ANN                              40.6 %      52.0 %
d_pho = mean(d_pho) = const      52.7 %      93.6 %

Table 2: Effect of different syllabic models on the phonemic level (correctness of phone durations), validation set. For all models (except Klatt), the ONC syllable type and an elastic distribution of phoneme durations according to Campbell have been used (d_pho = mean(d_pho): reference only).

Corresponding to Table 2, Figure 2 presents the distribution of the duration deviations from the observed phone durations. Although the MLR, in contrast to the cautious Klatt model, models the rhythmic variance well (including some outliers), its number of correctly generated phone durations is higher. Figure 3 compares the results of the Klatt model with phone durations obtained on the basis of a constant (mean) syllable duration of 217.0 ms and shows the similarities. The mean relative deviation of the phone durations is 30.6 % even for the original syllable durations. Probably, the chosen elastic phoneme distribution (without considering the phoneme positions) is not appropriate. In contrast to the MLR, the Klatt model and the ANN generate higher deviations (38..40 %), which almost reach the range obtained under the assumption of constant phone durations (52.7 %). However, the perceptual evaluation shows that the original syllables, MLR and ANN are nearly on the same level, whereas the Klatt model receives a lower score. Of course, the assumption of constant phone durations does not produce a suitable sound.

[Plot: Number of phones (%) over Deviation (%); curves: Original syllable, Multi-level rule, Klatt model]
Figure 2: Effect of the syllabic duration control on the phonemic level (example 1): deviations from the observed phone durations.

[Plot: Number of phones (%) over Deviation (%); curves: Syllable duration 217 ms, Klatt model]
Figure 3: Effect of the syllabic duration control on the phonemic level (example 2): deviations from the observed phone durations.

3.4. Evaluation

Comparing the statistical parameters of the different models with first informal perceptual tests (only 14 sentences, 10 listeners), a non-uniform picture appears: MLR and ANN show no significant differences, whereas Klatt clearly differs. There is only a low average preference for MLR- or ANN-generated durations. A few listeners clearly prefer the ANN results, and they also recognize the original speaker's style!

The use of a speaker-specific database improves the performance of all procedures mentioned. From the current point of view, the differences between rule-based and ANN procedures are less significant than expected. The modeling on the phonemic level requires more effort, and the database needs to be enlarged. Afterwards, an extensive perceptual experiment is planned.

4. CONCLUSION

Taking into account an individual database that matches the acoustic synthesis supports rule-based, statistical and ANN approaches. The preferred method(s) can be selected according to pragmatic considerations of the synthesis application. The introduced (non-optimized) methods differ in the effort required for rule adjustment, training data collection and so on. Nevertheless, the generated target durations are quite similar on each level.

For further examinations concerning rhythm or duration phenomena, the authors suggest a stronger separation into already existing or new terms, e.g.:

- Global (phrase?) rhythm
- Local (syllable?) rhythm
- Segment durations
- Conscious rhythm control versus time structure caused by articulatory effects

or similar categories, in contrast to an overall strategy for controlling the durations. For these categories, additional objective description parameters should be defined, e.g. a rhythmic parameter "syllable z-score" with regard to the z-score of phoneme durations.

5. REFERENCES

1. Hirschfeld, D., Maas, H. D.: Improving the functionality of a text-to-speech system by adding morphological knowledge. Proc. 20th Annual German Conference on Artificial Intelligence (KI96), Dresden, 103-106, 1996.
2. Klatt, D. H.: Review of text-to-speech conversion for English. J. Acoust. Soc. Am., 82: 737-793, 1987.
3. Corrigan, G., Massey, N., Karaali, O.: Generating segment durations in a text-to-speech system: A hybrid rule-based/neural network approach. Proc. Eurospeech'97, Vol. 5, 2675-2678, Rhodes, 1997.
4. Jokisch, O., Hirschfeld, D., Eichner, M., Hoffmann, R.: Multi-level rhythm control for speech synthesis using hybrid data driven and rule-based approaches. Proc. ICSLP 98, Sydney, 1998 (to be published).
5. Campbell, W. N., Isard, S. D.: Segment durations in a syllable frame. J. of Phonetics, 19: 37-47, 1991.
6. PhonDat 1, BAS corpora on CD-ROM, Institute of Phonetics and Speech Communication, Munich.
7. Hirschfeld, D.: Variabilitaet und Stabilitaet segmentaler Merkmale unter dem Aspekt der konkatenativen Sprachsynthese: Vokale (in German). Proc. 7th Conference on Electronic Speech Signal Processing (ESSV), Berlin, 94-101, 1996.
8. Jokisch, O., Pescheck, M.: Neuronale Prosodiegenerierung - Einfluss der Trainingsdaten (in German). Proc. 24th Annual German Conference on Acoustics (DAGA), Zurich, 1998 (to be published).