
A Greek TTS Based on Non-Uniform Unit Concatenation and the Utilization of the Festival Architecture

Zervas P., Potamitis I., Fakotakis N., Kokkinakis G.
Wire Communications Lab, Department of Electrical & Computer Engineering, University of Patras, 26500, Rion, Patras, Greece
Email: {pzervas,potamitis}@wcl.ee.upatras.gr

Abstract. In this article we describe the first Text-To-Speech (TTS) system for the Greek language based on the Festival architecture. We discuss practical implementation details, focusing on the preparation of the diphone database and on the phoneme-duration prediction module, implemented with the CART technique. Two male databases and one female database were used for three different speech synthesis engines: residual LPC synthesis, TD-PSOLA, and the MBROLA technique.

1 Introduction

Waveform speech synthesis techniques can be divided into three categories: general-purpose concatenative synthesis, corpus-based synthesis, and phrase splicing. General-purpose concatenative synthesis translates incoming text into phoneme labels, stress and emphasis tags, and phrase-break tags. This information is used to compute a target prosodic pattern (i.e., phoneme durations and a pitch contour). Finally, signal-processing methods retrieve acoustic units (fragments of speech corresponding to short phoneme sequences such as diphones) from a stored inventory, modify the units so that they match the target prosody, and glue and smooth (concatenate) them together to form an output utterance. Corpus-based synthesis is similar to general-purpose concatenative synthesis, except that the inventory consists of a large corpus of labelled speech and that, instead of modifying the stored speech to match the target prosody, the corpus is searched for phoneme sequences whose prosodic patterns already match it.
Last but not least, in the phrase-splicing technique the units are stored prompts, sentence frames, and stored items that fill the slots of these frames, which are glued together. General-purpose concatenative synthesis can handle any input sentence but generally produces mediocre quality, owing to the spectral mismatch at the connection points. Corpus-based synthesis, on the other hand, can produce very high quality, but only if its speech corpus contains the right phoneme sequences with the right prosody for a given input sentence. Phrase-splicing methods produce natural speech, but can only produce the pre-stored phrases or combinations of sentence frames and slot items; if the slot items are not carefully matched to the sentence frames in terms of prosody, naturalness is degraded.

The proposed work is supported by the GEMINI (IST-2001-32343) EC project.
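The general-purpose concatenative pipeline described above (retrieve stored units, stretch them to the target prosody, concatenate) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two-unit inventory, the waveform fragments, and the crude resampling that stands in for real prosody modification are all toy assumptions.

```python
# Minimal sketch of the general-purpose concatenative pipeline:
# unit retrieval -> prosody modification -> concatenation.
# The tiny inventory and duration targets are illustrative assumptions.

import numpy as np

# Hypothetical stored inventory: unit name -> waveform fragment.
inventory = {
    "sil-a": np.zeros(80),
    "a-sil": np.ones(80),
}

def synthesize(units, durations_ms, rate=8000):
    """Retrieve each unit and stretch it to the target duration
    (crude index resampling stands in for real prosody modification)."""
    out = []
    for name, dur in zip(units, durations_ms):
        frag = inventory[name]
        n = int(rate * dur / 1000)  # target length in samples
        idx = np.linspace(0, len(frag) - 1, n).astype(int)
        out.append(frag[idx])
    return np.concatenate(out)

wave = synthesize(["sil-a", "a-sil"], [20, 10])
print(len(wave))  # 160 + 80 = 240 samples
```

A real engine would replace the resampling step with a signal-processing method such as TD-PSOLA, which modifies duration and pitch pitch-synchronously rather than by raw index stretching.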

2 System Architecture

This paper describes the construction of a Greek TTS based on the general-purpose concatenative synthesis architecture. In particular, three different engines have been considered: the residual LPC synthesizer, TD-PSOLA, and the MBROLA synthesizer.

Fig. 1. Text-To-Speech system architecture

Festival is a general multilingual speech synthesis system developed at the Centre for Speech Technology Research (CSTR), Edinburgh, Scotland [1, 2]. It consists of a general framework for building speech synthesis systems. It enables the construction of an operational TTS through a number of APIs: from the shell level, through a Scheme command interpreter, as a C++ library, and via an Emacs interface. The architecture of Festival is diphone-based, utilizing the residual-excited LPC synthesis technique. In this method, feature parameters for fundamental small units of speech, such as syllables, phonemes, or one-pitch-period segments, are stored and connected by rules. In our system (Fig. 1), we used a database consisting of diphones.

MBROLA is a speech synthesizer based on the concatenation of diphones coded as 16-bit linear Pulse Code Modulation signals. It takes a list of phonemes as input, together with prosodic information (the duration of each phoneme and a piecewise-linear description of pitch), and produces 16-bit linearly coded speech samples at the sampling frequency of the diphone database used. MBROLA is not a Text-To-Speech (TTS) synthesizer, since it does not accept raw text as input [3].

3 Greek TTS Implementation

Hereafter, we describe the creation of two diphone databases, a male and a female one, required for the residual LPC synthesizer provided by the Festival toolbox and for our TD-PSOLA implementation. Diphones are speech segments beginning in the middle of the stable state of a phone and ending in the middle of the stable state of the

following one. Diphones are selected as basic speech segments because they minimize concatenation problems: they include most of the transitions and co-articulations between phones, while requiring an affordable amount of memory, since their number remains relatively small (as opposed to other synthesis units such as half-syllables or triphones).

3.1 Diphone Database Building

The selection and the recording of the corpus greatly affect the overall quality of the synthesized speech. A male speaker was asked to read a 900-word phonetically balanced text corpus in a well-articulated and natural manner; thus we ensured that the diphones would be available in a neutral prosodic context. This speech database was used for the creation of the male-voice concatenation database (Fig. 2). In addition to diphones and, occasionally, triphones, we also recorded all the vowels and consonants of the language. As a result, our database consisted of 398 diphones, 24 triphones, and 22 single phones covering the vowels and consonants. The number of selected units and their partitioning into triphones and diphones were chosen according to the MBROLA requirements.

Fig. 2. Greek corpus segmentation procedure

The female voice was created from a 679-word speech database. Contrary to the male voice, for which we used natural carrier words, in this case we used nonsense carrier words to collect all possible diphones and some triphones, following [6]. The uttered words were constructed so that each extracted diphone or triphone occurred under the best possible conditions for concatenation. The final voice contained 565 diphones and 114 triphones, covering the whole Greek language.

Fig. 3. a) Automatic placement of pitch-marks. b) Correction of the automatic placement of pitch-marks.
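Corpus design of the kind described above amounts to a coverage problem: every diphone needed by the inventory must occur at least once in the recorded carrier words. A minimal sketch of such a coverage check follows; the phoneme transcriptions and the target inventory are toy assumptions, not the paper's actual Greek data.

```python
# Sketch: checking which diphones a recording corpus covers.
# The toy corpus and target inventory are assumptions, not the paper's data.

def diphones_of(phoneme_seq):
    """Diphones of one word, padded with silence at both ends."""
    padded = ["sil"] + phoneme_seq + ["sil"]
    return {f"{a}-{b}" for a, b in zip(padded, padded[1:])}

# Hypothetical phonetic transcriptions of recorded carrier words.
corpus = [["k", "a", "t", "a"], ["p", "o", "t", "e"]]

covered = set()
for word in corpus:
    covered |= diphones_of(word)

# Diphones still needed for a (toy) target inventory.
needed = {"k-a", "a-t", "t-o", "p-i"}
missing = needed - covered
print(sorted(missing))  # ['p-i', 't-o']
```

With nonsense carrier words, as used for the female voice, the word list can be generated directly from the `missing` set instead of hoping natural text covers it.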

Both voices were recorded in a studio with professional actors. LPC residual synthesis requires the LPC coefficients (perceptual experiments indicated that 16 coefficients were adequate), the residual term of the various speech segments, and pitch-marks. An epoch-extraction technique was employed to derive the pitch periods of the signal (Fig. 3a). Subsequently, we manually corrected errors in the pitch-mark selection (Fig. 3b). For the voiced parts of the speech, the pitch-marks were placed pitch-synchronously: we first traced the periods of the signal and then placed the pitch-marks at the maximum point of each period. For the unvoiced parts of the signal, they were placed at a constant rate. As regards the MBROLA synthesizer, we made use of the Gr2 Greek database [4], which has been encoded at TCTS Labs [5].

4 Duration Module

The prediction of phoneme duration in a specific phonemic and prosodic context is a crucial factor for the performance of TTS systems. For our system we used tree-based modelling, and in particular the CART technique. A 500-word speech database was constructed to study the duration model of the Modern Greek language. This database covers all the Greek phonemes and their most frequent contextual combinations. It contains words of various syllabic structures in various locations. The 500 words were spoken in an isolated manner by eight Greek native adult speakers (four male and four female). The speech database was then labelled manually. The complete database contains a total of about 35,000 entries. Each entry consists of a phoneme label, its duration, its normalized duration, its context, and the length of the word it belongs to. In order to apply tree-based clustering, we calculated the mean and standard deviation of duration for the entries.
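The per-phoneme statistics computed above feed the z-score scheme used with the CART model: durations are normalized per phoneme for training, the tree predicts z-scores, and predictions are mapped back to milliseconds. A minimal sketch with toy per-phoneme statistics (not the paper's measurements):

```python
# Sketch of z-score duration modelling: normalize durations per phoneme,
# let the tree predict z-scores, then map back with
# duration = mean + z-score * standard_deviation.
# The measured durations below are toy values, not the paper's data.

import statistics

# Hypothetical measured durations (ms) per phoneme label.
measured = {"a": [95, 105, 100], "s": [118, 122, 120]}

stats = {ph: (statistics.mean(d), statistics.stdev(d))
         for ph, d in measured.items()}

def to_zscore(phoneme, duration):
    mean, std = stats[phoneme]
    return (duration - mean) / std

def from_zscore(phoneme, z):
    mean, std = stats[phoneme]
    return mean + z * std

# A predicted z-score of +1 for "a" maps back to one standard
# deviation above its mean duration.
print(from_zscore("a", 1.0))  # 105.0
```

Predicting z-scores rather than raw durations lets a single tree share structure across phonemes whose absolute durations differ widely.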
Tree-based modelling is a nonparametric statistical clustering technique that successively divides the feature space so as to minimize the prediction error. The CART technique, a special case of tree-based modelling, can produce decision trees or regression trees, depending on the type of the dependent variable. An advantage of the CART technique is the ease of interpreting the resulting decision and regression trees. The tree predicts z-scores (the number of standard deviations from the mean) rather than durations directly. After prediction, the segmental durations are calculated by the formula:

Duration = mean + (z-score * standard deviation)

5 F0 Generation Module

For the purpose of accurately regenerating the intonation patterns, the basic idea was to capture the characteristics of the F0 contour by determining all its turning points (maxima and minima) in association with discrete textual phenomena, along with information about the location of emphasis. For this reason, the syllables of the input text were labelled in terms of a set of discrete features (Table 1), and a set of

rules which assign a target F0 level (BASE, MID, TOP, or FOCUS) to every syllable was extracted. The textual information used for labelling the syllables was selected on the basis of its unambiguous extraction directly from the input text, except for the information concerning the location of emphasis, which is provided manually. The intonation rules have the form: (a, b, c, ...) = F0 level.

Table 1. Discrete syllable features used for the association of the turning points with the input text.
- Stressed/unstressed syllable
- Ultimate/penultimate/antepenultimate syllable
- Distance in syllables from the previous stressed syllable
- Distance in syllables from the next stressed syllable
- Distance in syllables from the phrase boundary
- Emphatic/non-emphatic syllable, according to the segmentation of the sentence into pre-focal, focal, and post-focal parts

The rules do not produce absolute F0 values for every syllable, but rather the syllable's corresponding pitch value according to the calculation of the four declination lines with respect to the sentence's duration and the location in time of the emphatic items. For the generation of the appropriate F0 contour, the input to the intonation algorithm is the text string enriched with emphatic markings, which reflect the speaker's intonational focus. First, the declination lines are determined according to the sentence duration and the location of the emphasis. Then, the input text is processed and each syllable is assigned a unique vector representing its attributes according to Table 1. Finally, every syllable is assigned an F0 level according to the rules, and the final contour is constructed by linear interpolation between the successive levels. As far as the patterns used in the analysis are concerned, the resulting pitch contour is a fairly accurate reproduction of the original, as can be seen in Fig. 4.
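The procedure just described (assign each syllable a discrete level, map levels to pitch targets, interpolate linearly between successive targets) can be sketched as follows. The four level names come from the text, but the Hz values, the toy rule, and the syllable features are illustrative assumptions; the paper's extracted rule set and declination-line computation are not reproduced here.

```python
# Sketch of the rule-based F0 generation: each syllable gets a discrete
# level (BASE, MID, TOP, FOCUS), levels map to pitch targets, and the
# contour is built by linear interpolation between successive targets.
# The Hz values and the toy rule are assumptions, not the paper's rules.

import numpy as np

LEVELS_HZ = {"BASE": 100, "MID": 130, "TOP": 160, "FOCUS": 190}

def assign_level(stressed, emphatic):
    # Toy stand-in for the extracted rule set.
    if emphatic:
        return "FOCUS"
    return "TOP" if stressed else "BASE"

# Hypothetical syllable features: (stressed, emphatic).
syllables = [(False, False), (True, False), (False, True), (False, False)]
levels = [assign_level(s, e) for s, e in syllables]
print(levels)  # ['BASE', 'TOP', 'FOCUS', 'BASE']

# Linear interpolation between successive syllable targets,
# 10 pitch samples per syllable.
targets = np.array([LEVELS_HZ[lv] for lv in levels], dtype=float)
contour = np.interp(np.linspace(0, len(targets) - 1, 10 * len(targets)),
                    np.arange(len(targets)), targets)
```

In the paper's system the Hz value of each level would additionally follow the declination lines, so identical levels drift downward over the course of the sentence.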

Fig. 4. Pitch contours of the sentence "My father will come at noon from his work", with emphasis on "at noon": (a) original contour, (b) F0 contour produced by the prosody module.

6 Conclusions

The work described here comprised the creation of a Greek diphone database for the residual LPC synthesizer of the Festival architecture and the application of durations derived from the CART technique. Sample files that demonstrate the high quality of the synthesis results, and a Java-based web TTS under construction, can be found at http://slt.wcl.ee.upatras.gr/zervas/index.asp. Further work focuses on prosody modelling, and specifically on an intonation module utilizing the Bayesian-network approach.

References

1. Black A., Taylor P., "The Festival Speech Synthesis System", Technical Report HCRC/TR-83, University of Edinburgh, Scotland (1997), available at http://www.cstr.ed.ac.uk/projects/festival.html
2. Black A., Taylor P., "The Festival Speech Synthesis System", Carnegie Mellon University, Pittsburgh, PA, available at http://www.cstr.ed.ac.uk/projects/festival
3. Dutoit T., "An Introduction to Text-to-Speech Synthesis", Kluwer (1997)
4. http://www.di.uoa.gr/speech/synthesis/demosthenes
5. http://tcts.fpms.ac.be/synthesis/
6. Isard S., Miller D., "Diphone Synthesis Techniques", Proceedings of the IEE International Conference on Speech Input/Output (1986), 77-82
7. Moulines E., Charpentier F., "Pitch-synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones", Speech Communication, 9(5/6):453-467 (1990)
8. Galanis D., Darsinos V., Kokkinakis G., "Modeling of Intonation Bearing Emphasis for TTS-Synthesis of Greek Dialogues", ICSLP, vol. 3, USA (1996)

9. Stylianou Y., Dutoit T., Schroeter J., "Diphone Concatenation Using a Harmonic plus Noise Model of Speech", Proceedings of the Eurospeech Conference (1997)
10. Sgarbas K., Fakotakis N., Kokkinakis G., "A PC-KIMMO-Based Bi-Directional Grapheme/Phoneme Converter for Modern Greek", Literary and Linguistic Computing, 13(2):65-75 (1998)
11. Yiourgalis N., Kokkinakis G., "A TTS System for the Greek Language Based on Concatenation of Formant Coded Segments", Speech Communication (1996)
12. Haan P., Oostdijk M. (eds.), "Prosody in NIROS with FONPARS and ALFEIOS", Proceedings 18, pp. 107-119, University of Nijmegen, Department of Language and Speech
13. Styger T., Keller E., "Formant Synthesis", in: E. Keller (ed.), Fundamentals of Speech Synthesis and Speech Recognition: Basic Concepts, State of the Art, and Future Challenges (1994), 109-128