L17: Speech synthesis (front-end)

Outline: text-to-speech synthesis, text processing, phonetic analysis, prosodic analysis, prosodic modeling.

[This lecture is based on Schroeter (2008) and van Santen et al. (2008), both in Benesty et al. (Eds.), and Holmes (2001), ch. 7]

Introduction: text-to-speech synthesis

The goal of text-to-speech (TTS) synthesis is to convert an arbitrary input text into intelligible and natural-sounding speech. TTS is not a cut-and-paste approach that strings together isolated words. Instead, TTS employs linguistic analysis to infer the correct pronunciation and prosody (i.e., NLP) and acoustic representations of speech to generate waveforms (i.e., DSP). These two areas delineate the two main components of a TTS system: the front-end, the part of the system closer to the text input, and the back-end, the part of the system closer to the speech output. [Schroeter, 2008, in Benesty et al. (Eds.)]

TTS front-end (the NLP component)

The front-end serves two major functions: (1) converting raw text, which may include numbers, abbreviations, etc., into the equivalent of written-out words, and (2) assigning phonetic transcriptions to each word and marking the text into prosodic units such as phrases, clauses and sentences. Thus, the front-end provides a symbolic linguistic representation of the text in terms of phonetic transcription and prosody information.

TTS back-end (the DSP component)

Often referred to as the synthesizer, the back-end converts the symbolic linguistic representation into sounds. A number of synthesis techniques exist, including formant synthesis, articulatory synthesis, concatenative synthesis and HMM-based synthesis. (http://en.wikipedia.org/wiki/Speech_synthesis)

Components of a front-end

- Text processing: responsible for determining all knowledge about the text that is not specifically phonetic or prosodic.
- Phonetic analysis: transcribes lexical orthographic symbols into phonemic representations, possibly with diacritic information such as stress placement.
- Prosodic analysis: determines the proper intonation, speaking rate and amplitude for each phoneme in the transcription.

Proper treatment of these topics would require a separate course; here we just provide a brief overview of the different steps involved in transforming text inputs into a representation that is suitable for synthesis. A minimal pipeline sketch is shown below.
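To make the division of labor concrete, here is a hypothetical structural sketch of a front-end in Python. The function names, the dictionary-based intermediate representation, and the placeholder bodies are illustrative assumptions, not part of any particular system described in this lecture.

```python
# Hypothetical TTS front-end skeleton; all processing here is a placeholder.
def text_processing(raw_text: str) -> list[str]:
    """Document structure detection + text normalization -> written-out words."""
    return [w.lower() for w in raw_text.split()]          # placeholder normalization

def phonetic_analysis(words: list[str]) -> list[dict]:
    """Morphology, homograph disambiguation, G2P -> phonemes per word."""
    return [{"word": w, "phonemes": list(w)} for w in words]   # placeholder G2P

def prosodic_analysis(tokens: list[dict]) -> list[dict]:
    """Attach duration and F0 targets to each phoneme (placeholder values)."""
    for tok in tokens:
        tok["duration_ms"] = [80] * len(tok["phonemes"])
        tok["f0_hz"] = [120] * len(tok["phonemes"])
    return tokens

def front_end(raw_text: str) -> list[dict]:
    """Symbolic linguistic representation handed to the back-end synthesizer."""
    return prosodic_analysis(phonetic_analysis(text_processing(raw_text)))

print(front_end("Hello world")[0])
```

The point of the sketch is only the staging: text processing feeds phonetic analysis, which feeds prosodic analysis, and the back-end consumes the resulting symbolic representation.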

[Figure: tasks and processing in a TTS front-end; Schroeter, 2008, in Benesty et al. (Eds.)]

Text processing

Purpose: text processing is responsible for determining all knowledge about the text that is not specifically phonetic or prosodic. In its simplest form, it does little more than convert non-orthographic items (e.g., numbers) into words. More ambitious systems attempt to analyze white space and punctuation to determine document structure.

Tasks:
- Document structure detection. Depending on the text source, this may include filtering out headers (e.g., in email messages). The task is simplified if the document follows the Standard Generalized Markup Language (SGML), an international standard for representing electronic text.
- Text normalization. Handles abbreviations, acronyms, dates, etc. to match how an educated human speaker would read the text. Examples: "St." can be read as "street" or as "saint", "Dr." as "drive" or "doctor"; IBM or MIT are spelled out letter by letter, but NASDAQ or NATO are not. A toy normalizer is sketched below.
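As a toy illustration of text normalization (my own sketch, not any particular system's rules), the code below expands a few abbreviations and digits. A real normalizer would need context to choose, for example, between "saint" and "street" for "St.", and would read years as "nineteen eighty-four" rather than digit by digit.

```python
# Toy text normalizer: tiny, context-free expansion tables (illustrative only).
ABBREVIATIONS = {"Dr.": "doctor", "St.": "saint", "etc.": "et cetera"}
LETTER_BY_LETTER = {"IBM", "MIT"}          # spelled out
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize_token(token: str) -> str:
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    if token in LETTER_BY_LETTER:
        return " ".join(token)                               # "IBM" -> "I B M"
    if token.isdigit():
        return " ".join(DIGITS[int(d)] for d in token)       # "42" -> "four two"
    return token

def normalize(text: str) -> str:
    return " ".join(normalize_token(t) for t in text.split())

print(normalize("Dr. Smith joined IBM in 1984"))
# -> "doctor Smith joined I B M in one nine eight four"
```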

Text markup interpretation

Markup can be used to control how the TTS engine renders its output. Examples: using an address mode for reading a street address, or rendering sentences with various emotions (e.g., angry, sad, happy, neutral). This is easier if the text follows the Speech Synthesis Markup Language (SSML).

Linguistic analysis (a.k.a. syntactic and semantic parsing)

This may include tasks such as determining part-of-speech (POS) tags, word sense, emphasis, appropriate speaking style, and speech acts (e.g., greetings, apologies). Example: in order to accentuate the sentence "They can can cans", it is essential to know that the first "can" is a function word, whereas the second and third are a verb and a noun, respectively. Most TTS systems forego fully parsing the input text in order to reduce computational complexity, and also because text input oftentimes consists of isolated sentences or fragments. The sketch below illustrates POS-driven accentuation on this example.
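A minimal sketch (my own illustration, not from the lecture's reference systems) of why POS tags matter for accentuation: content words typically receive pitch accents, while function words are deaccented. The tag set and the hand-supplied tags are assumptions for the example.

```python
# Toy POS-driven accentuation: content words get a pitch accent, function words do not.
# Tags are supplied by hand here; a real front-end would run a POS tagger.
FUNCTION_TAGS = {"MD", "DT", "IN", "PRP", "CC", "TO"}   # modal, determiner, preposition, ...

def assign_accents(tagged_words):
    """Return (word, tag, accented?) triples."""
    return [(w, t, t not in FUNCTION_TAGS) for w, t in tagged_words]

sentence = [("They", "PRP"), ("can", "MD"), ("can", "VB"), ("cans", "NNS")]
for word, tag, accented in assign_accents(sentence):
    print(f"{word:5s} {tag:4s} {'ACCENT' if accented else 'deaccent'}")
# They/PRP deaccent, can/MD deaccent, can/VB ACCENT, cans/NNS ACCENT
```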

Phonetic analysis

Purpose: phonetic analysis focuses on the phone level within each word, tagging each phone with information about what sound to produce and how to produce it.

Tasks:
- Morphological analysis. Analyzes the component morphemes of a word (e.g., prefixes, suffixes, stem words); for example, the word "antidisestablishmentarianism" has six morphs. It decomposes inflected, derived and compound words into their elementary graphemic units (their morphs). Rules can be devised to correctly decompose the majority of words (about 95% of those in a typical text) into their constituent morphs. Why morphological analysis? A high proportion of English words can be combined with prefixes and/or suffixes to form other words, and the pronunciation of a derived word is closely related to that of its root. A rough affix-stripping sketch is given below.
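The sketch below is a crude, hypothetical affix stripper, just to make the idea concrete; the prefix/suffix tables and the one-word stem lexicon are invented, and real morphological analyzers use far richer rule sets and lexica.

```python
# Crude affix stripping (illustrative only): peel known prefixes/suffixes
# off a word until only a stem in the mini-lexicon remains.
PREFIXES = ["anti", "dis", "un", "re"]
SUFFIXES = ["ism", "arian", "ment", "ing", "ed", "s"]
STEMS = {"establish"}

def decompose(word: str) -> list[str]:
    prefixes, suffixes, rest = [], [], word
    stripped = True
    while stripped and rest not in STEMS:
        stripped = False
        for p in PREFIXES:
            if rest.startswith(p):
                prefixes.append(p)
                rest = rest[len(p):]
                stripped = True
                break
        for s in SUFFIXES:
            if rest.endswith(s):
                suffixes.insert(0, s)
                rest = rest[:-len(s)]
                stripped = True
                break
    return prefixes + [rest] + suffixes

print(decompose("antidisestablishmentarianism"))
# -> ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']   (six morphs)
```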

- Homograph disambiguation. Disambiguates words with different senses to determine their pronunciation. Examples: "object" (verb/noun), "resume" (verb/noun), "contrast" (verb/noun), "read" (present/past).
- Grapheme-to-phoneme (G2P) conversion. Generates a phonemic transcription of a word given its spelling. Two approaches are commonly used: letter-to-sound (LTS) rules and lookup dictionaries (lexica). LTS rules are best suited for languages with a relatively simple relation between orthography and phonology (e.g., Spanish, Finnish). Languages like English, however, generally require a lexicon to achieve highly accurate pronunciations; the lexicon should at least include words whose pronunciation cannot be predicted from general LTS rules, and words not included in the lexicon are then transcribed through LTS rules. LTS rules may be learned by means of classification and regression trees. A lexicon-with-LTS-fallback sketch follows below.
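To illustrate the lexicon-plus-fallback strategy, here is a toy sketch with a made-up two-word exception lexicon and grossly simplified one-letter-per-phoneme rules; it is not a real G2P system and the phoneme symbols are only loosely ARPAbet-like.

```python
# Toy G2P: look the word up in a small exception lexicon first,
# then fall back to naive letter-to-sound rules (illustrative only).
LEXICON = {                      # words whose pronunciation the toy LTS rules would miss
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
}
LTS_RULES = {                    # grossly simplified single-letter rules
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "k": "K", "l": "L", "m": "M",
    "n": "N", "o": "AA", "p": "P", "r": "R", "s": "S", "t": "T",
    "u": "AH", "v": "V", "w": "W", "y": "Y", "z": "Z",
}

def g2p(word: str) -> list[str]:
    word = word.lower()
    if word in LEXICON:                      # lexicon lookup
        return LEXICON[word]
    return [LTS_RULES[ch] for ch in word if ch in LTS_RULES]   # LTS fallback

print(g2p("two"), g2p("bat"))    # ['T', 'UW'] ['B', 'AE', 'T']
```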

Prosodic analysis

Purpose: prosodic analysis determines the progression of intonation, speaking rate and loudness across an utterance. This information is ultimately represented at the phoneme level as amplitude, duration, and pitch (F0).

Roles of prosody in language:
- In tonal languages, pitch is used to distinguish lexical items.
- Prosody helps structure an utterance in terms of phrases, and indicates relationships between phrases in utterances.
- Prosody helps focus attention on certain words, to highlight a contrast (contrastive stress), emphasize their importance, or enhance the intelligibility of words that may be unpredictable from their context.

Loudness/intensity

Loudness is mainly determined by phone identity (e.g., voiceless fricatives are weak, most vowels are strong). However, it also varies with stress (e.g., stressed syllables are normally a little louder). It is fairly easy to include rules to simulate these effects, but the effect of loudness is not critical in synthesized speech (compared to pitch and duration) and most TTS systems ignore it.

Duration

Duration is the second most important prosodic element; it helps signal:
- Stress: stressed phones become longer than normal.
- Phrasing: phones get noticeably longer just before a phrase break.
- Rhythm.

Properties:
- Intrinsic durations vary considerably between phones, e.g., "bit" vs. "beet".
- Duration is affected by speaking rate, with steady sounds (vowels, fricatives) varying more than transient sounds (stops).
- Duration depends on neighboring phones: e.g., vowels before voiced consonants ("feed") are longer than before unvoiced consonants ("feet").
- Other rules: if a word is emphasized, its most prominent syllable is normally lengthened; at the end of a phrase, syllables tend to be longer than in other positions.

Pitch

Pitch is the most important prosodic element. As with duration, some general rules are known:
- F0 contours typically show maxima close to stressed syllables.
- There is generally a global downward trend of the F0 contour over the duration of a phrase.
- The trend is reversed for the final syllable in yes/no questions or in non-terminal phrases, but accelerates further downward in terminal phrases.

Pitch is a controversial topic with many different schools of thought:
- British school: evolved from old-style prescriptive linguistics, concerned with teaching correct intonation to non-native speakers.
- Autosegmental-metrical school: seeks to provide a theory of intonation that works cross-linguistically.
- Fujisaki model: aims to follow known biological production mechanisms.
- Tilt model: built purely for engineering purposes.

Prosodic models

History of prosodic models:
- Rule-based approaches. Developed during the period of formant synthesizers; these models employ a set of rules derived from experiments or the literature. Examples: for duration, Klatt's model, used in the MITalk system; for intonation, Pierrehumbert's model, which is the basis for ToBI.
- Statistical approaches. Developed during the period of diphone synthesizers. Examples: for duration, the sums-of-products model of van Santen; for intonation, the Tilt model of Taylor.
- Use-as-is approaches. Developed with unit-selection systems. The approach is to use large corpora of natural speech to train prosodic models and to serve as a source of units for synthesis; instead of having one token per diphone, the corpus contains several tokens with different phonetic and prosodic context characteristics.

Klatt's duration model

The model assumes that:
- Each phonetic segment has an inherent duration.
- Each rule tries to effect a percentage increase or decrease in the phone's duration.
- Segments cannot be compressed beyond a certain minimum.

Dur = (InhDur − MinDur) × Perc / 100 + MinDur

where Perc is determined according to 10 different rules that take into consideration the phonetic environment, emphasis, stress level, etc. Each rule produces a separate Perc, and these are combined multiplicatively. However, the model does not account for interactions between rules. A small sketch of this computation is given below.

Other duration models: CART-based models (used in Festival), neural-network-based models (Campbell), and the sums-of-products model (van Santen).
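A minimal sketch of this computation; the inherent/minimum durations and the rule percentages below are made up purely for illustration, not taken from Klatt's actual tables.

```python
# Klatt-style duration computation (illustrative values, not Klatt's rule tables).
def klatt_duration(inh_dur_ms: float, min_dur_ms: float, rule_percs: list[float]) -> float:
    """Dur = (InhDur - MinDur) * Perc/100 + MinDur, with the rule percentages
    combined multiplicatively into a single Perc."""
    perc = 100.0
    for p in rule_percs:
        perc *= p / 100.0            # each applicable rule scales the current percentage
    return (inh_dur_ms - min_dur_ms) * perc / 100.0 + min_dur_ms

# Hypothetical vowel: inherent 160 ms, minimum 60 ms,
# lengthened to 140% for phrase-final position, shortened to 85% for fast speaking rate.
print(klatt_duration(160.0, 60.0, [140.0, 85.0]))    # -> 179.0 (ms)
```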

Pierrehumbert's intonation model

The model considers intonation to be a sequence of high (H) and low (L) tones. The H and L tones are the building blocks for three larger tone units:
- Pitch accents, used to mark prominence. These can be single tones (H*, L*) or pairs of tones (L+H*, L*+H, H*+L, H+L*), where the asterisk (*) denotes alignment with the stressed syllable.
- Phrase accents, which link the last pitch accent to the phrase boundary; denoted by (L-, H-).
- Boundary tones, which mark the boundaries of intonational phrases; represented by (%H, %L, H%, L%), where the position of the % denotes alignment of the boundary tone with the onset or offset of the intonational phrase.

Pierrehumbert's theory of intonation led to the ToBI (Tones and Break Indices) prosody annotation standard. ToBI is just a labeling system and does not provide F0 contours; several methods have been developed to convert ToBI labels into actual F0 contours. A toy label-to-target sketch follows below.
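As a toy illustration only (my own, highly simplified; the offsets, baseline and declination step are invented, and real ToBI-to-F0 methods are far more elaborate), the sketch below maps a short ToBI label sequence onto rough F0 targets over a declining baseline.

```python
# Toy ToBI-label-to-F0-target mapping (illustrative only; not a real conversion method).
BASE_F0, DECLINATION = 120.0, 10.0       # Hz at phrase start, Hz drop per label

# Rough target offsets relative to the local baseline (made-up values).
TONE_OFFSET = {"H*": +40.0, "L*": -20.0, "L+H*": +50.0,
               "L-": -10.0, "H-": +15.0, "L%": -30.0, "H%": +35.0}

def tobi_to_targets(labels: list[str]) -> list[tuple[str, float]]:
    targets, baseline = [], BASE_F0
    for label in labels:
        targets.append((label, baseline + TONE_OFFSET.get(label, 0.0)))
        baseline -= DECLINATION          # global downward trend across the phrase
    return targets

# A declarative tune with two accented words: H* ... H* L- L%
print(tobi_to_targets(["H*", "H*", "L-", "L%"]))
# [('H*', 160.0), ('H*', 150.0), ('L-', 90.0), ('L%', 60.0)]
```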

[Figure: example ToBI annotation, from http://www.linguistics.ucla.edu/people/jun/ktobi/k-tobi.html]

Tilt model

The Tilt model was developed explicitly as a practical engineering model of intonation. It considers intonation to be a sequence of four types of events: pitch accents, boundary tones, connections, and silences. Pitch accents and boundary tones are modeled by piece-wise combinations of parameterized quadratic functions (rising or falling), while connections are modeled by straight-line interpolations. The amplitude and duration of these functions are summarized by three parameters:

tilt_amp = (|A_rise| − |A_fall|) / (|A_rise| + |A_fall|)
tilt_dur = (D_rise − D_fall) / (D_rise + D_fall)
tilt = (tilt_amp + tilt_dur) / 2

[Figure: accent shapes for tilt values +1, +0.5, 0.0, −0.5, −1]

A small sketch of these computations is given below.
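A minimal sketch of the tilt parameter computation; the rise/fall amplitudes and durations in the example are invented for illustration.

```python
# Tilt parameters from the rise/fall amplitudes (Hz) and durations (s) of one event.
def tilt_parameters(a_rise: float, a_fall: float, d_rise: float, d_fall: float):
    tilt_amp = (abs(a_rise) - abs(a_fall)) / (abs(a_rise) + abs(a_fall))
    tilt_dur = (d_rise - d_fall) / (d_rise + d_fall)
    tilt = (tilt_amp + tilt_dur) / 2.0
    return tilt_amp, tilt_dur, tilt

# Hypothetical accent: 30 Hz rise over 0.12 s followed by a 10 Hz fall over 0.08 s.
print(tilt_parameters(30.0, -10.0, 0.12, 0.08))
# tilt_amp = 0.5, tilt_dur ~= 0.2, tilt ~= 0.35 (predominantly rising)
```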

Fujisaki's intonation model

The model considers the log F0 contour to be the addition of two components:
- A phrase command, which characterizes the overall trend of the intonation; modeled by pulses placed at intonational phrase boundaries.
- An accent command, which highlights local excursions (e.g., for stressed syllables); modeled by step functions placed around accent groups. [Holmes, 2001]

A rough sketch of the resulting F0 contour generation is given below.
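The sketch below follows the commonly cited formulation of the Fujisaki model, in which the phrase component is the impulse response of a second-order filter and the accent component is a clipped step response; the filter constants, baseline F0, and command values are typical textbook-style assumptions, not fitted to any data.

```python
# Rough sketch of Fujisaki-style F0 generation (illustrative parameter values).
import math

ALPHA, BETA, GAMMA = 2.0, 20.0, 0.9     # phrase/accent filter constants, accent ceiling
F_BASE = 100.0                           # baseline F0 in Hz

def phrase_response(t: float) -> float:
    """Impulse response of the phrase-control (second-order) filter."""
    return ALPHA**2 * t * math.exp(-ALPHA * t) if t >= 0 else 0.0

def accent_response(t: float) -> float:
    """Step response of the accent-control filter, limited by the ceiling GAMMA."""
    return min(1.0 - (1.0 + BETA * t) * math.exp(-BETA * t), GAMMA) if t >= 0 else 0.0

def log_f0(t: float, phrase_cmds, accent_cmds) -> float:
    """phrase_cmds: (magnitude Ap, onset T0); accent_cmds: (magnitude Aa, onset T1, offset T2)."""
    value = math.log(F_BASE)
    value += sum(ap * phrase_response(t - t0) for ap, t0 in phrase_cmds)
    value += sum(aa * (accent_response(t - t1) - accent_response(t - t2))
                 for aa, t1, t2 in accent_cmds)
    return value

# One phrase command at t=0 s and one accent command spanning 0.4-0.7 s (made-up values).
for t in (0.1, 0.5, 1.0):
    print(f"t={t:.1f}s  F0={math.exp(log_f0(t, [(0.5, 0.0)], [(0.4, 0.4, 0.7)])):.1f} Hz")
```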