WORD AND SYLLABLE CONCATENATION IN TEXT-TO-SPEECH SYNTHESIS

Eric Lewis (1) and Mark Tatham (2)

(1) Department of Computer Science, Merchant Venturers Building, Woodland Road, University of Bristol, Bristol, BS8 1UB, UK. email: Eric.Lewis@bristol.ac.uk
(2) Department of Language and Linguistics, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, UK. email: Mark.Tatham@essex.ac.uk

ABSTRACT

MeteoSPRUCE is a database of 2000 words relating to weather forecasting. While such a database is clearly not large enough to be definitive, its usability can be greatly extended by excising syllables from polysyllabic words in its inventory and recombining them to form new words [1], [2], [3]. The authors believe that it provides sufficient data to begin drawing conclusions about how syllables should be modified for concatenation in contexts other than those in which they were recorded. A classification scheme for syllables, based on the class of their initial and final segments, has been defined and used to determine a set of rules for modifying syllables so that, when they are concatenated, the joins are not perceptually noticeable.

1. INTRODUCTION

In recent years concatenated waveform speech synthesis systems have gained in popularity because of their more natural-sounding output. To be truly general purpose, such systems must have an exhaustive inventory of stored waveforms for rearrangement and concatenation as needed. The size of the waveform units for concatenation differs between systems, but the authors have argued [4] that for natural-sounding speech the syllable is probably the preferred unit. MeteoSPRUCE is a limited-domain, syllable- and word-based system whose inventory consists of recordings of 2000 monosyllabic and polysyllabic words. Words for which no recording exists in the inventory are constructed by extracting syllables from words that are in the inventory and recombining them as appropriate.
In this paper we describe how such syllables have to be modified for concatenation in contexts other than those from which they were excised.

2. SYLLABLE IDENTIFICATION

In a previous paper [5] the authors showed how to characterise syllables on three representational levels, viz.

- phonological: a phonological syllable is a representation of listener perception. It is therefore abstract and cognitively based, including only the characteristics necessary for perception.
- phonetic: a phonetic syllable is a stretch of actual waveform, including all coarticulations, which triggers a listener's perception of a phonological syllable.
- synthetic: a synthetic syllable is a normalised syllable model based on a phonetic syllable and is used in forming new words.

Although this suggests that the syllable boundaries in the lexicon should be marked phonologically, we have decided that wherever possible morphemic boundaries should be used, provided the separated morpheme is syllabic. Therefore "windy" is expressed as wind-ey rather than win-dey, but the word "winds" is expressed as windz rather than wind-z. If there is a syllable boundary which is not a morpheme boundary, then segmentation occurs on the basis of the phonology. For example, "afternoon" is arf-ta-nuun: the first division is phonological while the second is morphemic. The reason for marking up the syllables in this way is that, since we will be using the syllables for making new words, it is more likely that the new words will be built up on a morphemic basis than on a phonological basis.

In SPRUCE a syllable waveform in the inventory can exist as a phonetic syllable, for example the recording of the monosyllabic word "rain". However, in order to produce the word "raining" the phonetic syllable is transformed by a normalisation procedure into a synthetic syllable, one which the speaker may not normally produce but which triggers the same perceptual response in the listener as the phonetic syllable.

3. SYLLABLE TYPES AND CONTEXTS

All syllables can be written in the form $C_{0,3} + V + C_{0,4}$, where $C_{0,n}$ indicates 0 to n consonants and V indicates a vowel. Although not all combinations of consonants are allowed, the possible number of syllables is still large. However, by classifying syllables in terms of their initial and final segments it becomes necessary to consider the concatenation of syllables only in terms of their class, and the number of these is fairly small. The relevant classes are

- vowels
- diphthongs
- liquids
- nasals
- voiced fricatives
- voiceless fricatives
- voiced plosives
- voiceless plosives

A syllable inventory is created consisting of all syllables in the lexicon, where each entry also includes information about the syllable stress, the classes of the initial and final phonemes, and the classes of the phonemes which immediately precede and follow the syllable. Since the same syllable can occur in different contexts in the lexicon, the syllable inventory contains multiple entries for each syllable.

4. RE-COMBINING RULES

Ideally, a syllable required for concatenation is extracted from the inventory such that its context in the sentence under construction is the same as that in which it was recorded. With a sufficiently large inventory of recordings it is highly likely that such a situation will be the norm rather than the exception. In the current situation the inventory of recordings consists of nearly 2000 mono- and polysyllabic words relating to weather forecasting, so the need for re-combining rules is very much higher. In either case a strategy is required for recombining syllables in contexts other than those from which they were extracted. There are several quite different types of concatenation, which can be categorised as follows:

- the syllable for insertion is a monosyllable.
- the syllable for insertion has the correct context and comes from a polysyllabic word.
- the syllable for insertion has the wrong context and comes from a polysyllabic word.

In each of the above cases it is also possible that the stress of the inserted syllable is not the required stress. And, of course, in many cases one half of the syllable context will be correct but the other half wrong. A monosyllable almost always has the wrong context, inasmuch as it has been recorded within a standard sentence frame which always has the same words preceding and following it, and which is specially designed to minimise coarticulatory and rhythm effects.

A sentence is built up linearly from the start by successively concatenating words and/or syllables. Therefore, the concatenation phase consists essentially of applying rules for joining two syllables, depending on the final class of the first syllable and the initial class of the second syllable. We shall henceforth refer to these two syllables as syllable-f and syllable-i respectively. The position of the syllable within the word is not important except when it is at the start or end of a phrase. We have, therefore, the situation of concatenating two syllables where each syllable is one of the following:

- a monosyllable - a syllable which is also a word
- an onset-syllable - that is, one extracted from the start of a polysyllabic word
- a medial-syllable - that is, one extracted from within a polysyllabic word
- a coda-syllable - that is, one extracted from the end of a polysyllabic word.

The problem in replacing a syllable by one from a different context is that

- the coarticulation of the adjacent segments can be wrong. For example, the plosive /b/ at the beginning of the monosyllable "broke" has its normally devoiced stop period replaced by a voiced stop in the polysyllabic word "unbroken", see Fig. 1.
- the timing of the syllable can be wrong. All the stressed monosyllabic words are recorded to occupy one rhythm unit, or foot, [3] so when inserted in a polysyllabic word room must be made for the accompanying unstressed syllables. In the above example the length of "broke" is 34 ms compared with 53 ms for the length of "unbroken".
- the amplitude of the syllable can be wrong. This would be the case when trying to use a stressed syllable as a substitute for its unstressed form, or vice versa. Compare the two waveforms in Fig. 2 for "north".

Fig. 1. Waveforms of "broke" and "unbroken".
Fig. 2. Waveforms for "north" and "northeast".

The MeteoSPRUCE database contains very few examples of syllables ending or beginning with diphthongs, so syllables belonging to this class are not considered in any of the following rules.

3-period rule

One rule, which is applied frequently to syllables whose onsets or codas are periodic, is that of cutting three periods from the start or end of such syllables. It can apply to monosyllables at either end, as well as to onset-syllables at the start and to coda-syllables at the end. It is not applied if the length of the vowel is less than 12 periods. This rule will henceforth be referred to as the 3-period rule.

Using the classification defined in section 3, rules for word and syllable concatenation are given in Table 1. In addition to these rules it is also necessary to consider the situation of being forced to use a stressed syllable in an unstressed position and vice versa. In the former case the authors believe that the best strategy is to reduce the length of the vowel and the overall amplitude of the syllable. Experimentation is still taking place as to how this may best be implemented. It is also important to remember that these rules have been derived for a particular speaker's voice and that the specified quantification may not be appropriate for other voices and recording rates.

5. CONCLUSIONS

Rules for word and syllable concatenation have been derived from a 2000-word database, using a classification system based on the classes of initial and final syllable segments. The size of this database necessarily limits the combinations of syllables that occur and that can be used for deriving and testing the appropriate concatenation procedures. However, the authors have developed a strategy that they believe can be extended to process much larger databases and provide the basis for an unrestricted text-to-speech synthesis system.
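As an illustration only, the 3-period rule of section 4 and the Vowel/Vowel row of Table 1 might be sketched as follows. The paper does not say how waveforms or period marks are stored, so the `Syllable` container, its field names, and the convention that `pitch_marks` holds the sample index of each period boundary are all assumptions made for this sketch; only the constants 3 and 12 and the position restrictions come from the text.

```python
from dataclasses import dataclass, replace
from typing import List

# Syllable positions named in section 4.
MONO, ONSET, MEDIAL, CODA = (
    "monosyllable", "onset-syllable", "medial-syllable", "coda-syllable")

PERIODS_TO_CUT = 3       # from the text: three periods are cut
MIN_VOWEL_PERIODS = 12   # from the text: no cut if the vowel is shorter


@dataclass(frozen=True)
class Syllable:
    """Hypothetical inventory entry; all field names are assumptions."""
    samples: List[float]     # waveform samples
    pitch_marks: List[int]   # ascending sample indices of period boundaries
    vowel_periods: int       # vowel length counted in pitch periods
    position: str            # MONO, ONSET, MEDIAL or CODA
    initial_class: str       # class of the first segment, e.g. "vowel"
    final_class: str         # class of the last segment, e.g. "vowel"


def three_period_rule(syl: Syllable, end: str) -> Syllable:
    """Cut three periods from the 'start' or 'end' of a syllable.

    The caller is responsible for checking that the affected onset or
    coda is periodic. The syllable is returned unchanged when the rule
    does not apply. vowel_periods is left as recorded.
    """
    if end not in ("start", "end"):
        raise ValueError("end must be 'start' or 'end'")
    # Monosyllables at either end, onset-syllables at the start,
    # coda-syllables at the end.
    allowed = (MONO, ONSET) if end == "start" else (MONO, CODA)
    if syl.position not in allowed:
        return syl
    if syl.vowel_periods < MIN_VOWEL_PERIODS:
        return syl
    if len(syl.pitch_marks) <= PERIODS_TO_CUT:
        return syl  # not enough marked periods to cut (assumed guard)
    if end == "start":
        cut = syl.pitch_marks[PERIODS_TO_CUT]
        return replace(
            syl,
            samples=syl.samples[cut:],
            pitch_marks=[m - cut for m in syl.pitch_marks[PERIODS_TO_CUT:]],
        )
    cut = syl.pitch_marks[-(PERIODS_TO_CUT + 1)]
    return replace(
        syl,
        samples=syl.samples[:cut],
        pitch_marks=syl.pitch_marks[:-PERIODS_TO_CUT],
    )


def join_vowel_vowel(syl_f: Syllable, syl_i: Syllable) -> List[float]:
    """Vowel/Vowel row of Table 1: trim both syllables, then abut them."""
    if syl_f.final_class != "vowel" or syl_i.initial_class != "vowel":
        raise ValueError("this sketch covers only the Vowel/Vowel row")
    trimmed_f = three_period_rule(syl_f, "end")
    trimmed_i = three_period_rule(syl_i, "start")
    return trimmed_f.samples + trimmed_i.samples


if __name__ == "__main__":
    # Synthetic data: 200 samples with a period boundary every 10 samples.
    demo = Syllable([float(n) for n in range(200)], list(range(0, 201, 10)),
                    15, MONO, "vowel", "vowel")
    print(len(join_vowel_vowel(demo, demo)))
```

Returning the syllable unchanged when the rule does not apply keeps the join function free of special cases; whether the real system smooths the join after trimming is not stated in the paper and is not modelled here.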

Table 1. Rules for concatenating syllables.

Syllable-F | Syllable-I | Rule
Vowel | Vowel | Apply the 3-period rule to syllable-f and syllable-i.

Apply the 3-period rule to syllable-f.

Vowel
Apply the 3-period rule to syllable-i.

Voiceless Voiced
If syllable-i is a monosyllable or onset-syllable, cut that part of the signal prior to the release of the plosive. Syllables whose initial context is classified as voiced plosive cannot be used in any other context because there is no release for the plosive.

Vowel
Apply the 3-period rule to syllable-i.

Voiced fricative
Remove the silence before the release at the start of syllable-i.

Voiced plosive Voiceless fricative Voiceless plosive

Vowel
If syllable-f is a monosyllable or coda-syllable ending with segment /v/, then detect the start and end of the voicing for that segment. If these positions are A and B, respectively, then delete the remainder of the signal after B and double the length of AB. Apply the 3-period rule to syllable-i.

If syllable-f is a monosyllable or coda-syllable ending with segments /s/, /sh/, /z/ or /zh/, then apply the 3-period rule to syllable-i. For syllable-f ending in segments /f/, /th/ or /dh/ there is insufficient data in the inventory to draw any conclusions.

Not enough examples to make any strong recommendation, but a promising interim rule is to trim both fricatives back by 25%.

Vowel Voiced plosive Voiceless plosive

If syllable-f is a monosyllable or coda-syllable, then determine the length of the signal between the last marked period and the release of the plosive. Preserve 50 ms of this signal following the marked period and delete the remainder.

Syllable-i's which have been extracted with a pre-context of a voiced plosive must not be used in any other context.

Remove the release of the plosive of syllable-f and preserve 100 ms of the signal after the last marked period.

6. REFERENCES

[1] Boeffard, O., Miclet, I. and White, S. (1992) Automatic generation of optimized unit dictionaries for text-to-speech synthesis. Proceedings of the International Conference on Spoken Language Processing, Banff, pp 1211-1214.
[2] Campbell, N. and Black, A. (1995) Prosody and the selection of source units for concatenative synthesis. In J. van Santen, R. Sproat, J. Olive and J. Hirschberg (eds) Progress in Speech Synthesis, Springer Verlag, New York.
[3] Hunt, A.J. and Black, A. (1996) Unit selection in a concatenative speech synthesis system using a large speech database. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Atlanta.
[4] Lewis, E. and Tatham, M. (1991) SPRUCE - a new text-to-speech synthesis system. Proceedings of Eurospeech '91, ESCA, Genova.
[5] Tatham, M. and Lewis, E. (1999) Syllable reconstruction in concatenated waveform speech synthesis. In J. Ohala (ed.) Proceedings of the International Congress of Phonetic Sciences, San Francisco.