Modern TTS systems. CS 294-5: Statistical Natural Language Processing. Types of Modern Synthesis. TTS Architecture. Text Normalization


CS 294-5: Statistical Natural Language Processing
Speech Synthesis
Lecture 22: 12/4/05
Slides directly from Dan Jurafsky, indirectly many others

Modern TTS systems
  1960s: first full TTS, Umeda et al. (1968)
  1970s: Joe Olive 1977, concatenation of linear-prediction diphones; Speak and Spell
  1980s: 1979 MIT MITalk (Allen, Hunnicutt, Klatt)
  1990s-present: diphone synthesis, unit selection synthesis

Types of Modern Synthesis
  Articulatory synthesis: model movements of articulators and acoustics of the vocal tract
  Formant synthesis: start with acoustics, create rules/filters to create each formant
  Concatenative synthesis: use databases of stored speech to assemble new utterances

TTS Demos (Mostly Unit-Selection)
  Comparisons: http://www.tmaa.com/tts/companies.htm
  ATT: http://www.naturalvoices.att.com/demos/
  Rhetorical (= Scansoft): http://www.rhetorical.com/cgi-bin/demo.cgi
  Festival: http://www-2.cs.cmu.edu/~awb/festival_demos/index.html
  IBM: http://www-306.ibm.com/software/pervasive/tech/demos/tts.shtml

Text from Richard Sproat slides

TTS Architecture
  Raw text in
  Text analysis
    Text normalization
    Part-of-speech tagging
    Homonym disambiguation
  Phonetic analysis
    Dictionary lookup
    Grapheme-to-phoneme (LTS)
  Prosodic analysis
    Boundary placement
    Pitch accent assignment
    Duration computation
  Waveform synthesis
  Speech out

Text Normalization
  Analysis of raw text into pronounceable words
  Sample problems:
    He stole $ million from the bank
    It's 13 St. Andrews St.
    The home page is http://www.cnn.com
    yes, see you the following tues, that's 11/12/01
  Steps:
    Identify tokens in text
    Chunk tokens into reasonably sized sections
    Map tokens to words
    Identify types for words
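The "map tokens to words" step from the Text Normalization slide can be sketched in a few lines. This is only an illustration under stated assumptions: the abbreviation table, the rule that "St." means "saint" before a capitalized word and "street" otherwise, and the 0..99 number range are all hypothetical simplifications, not something the slides specify.

```python
# Hypothetical, tiny expansion table; a real system would need a far larger one.
ABBREVIATIONS = {"tues": "tuesday"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def expand_st(tokens, i):
    """Assumed heuristic: 'St.' is 'saint' before a capitalized word, else 'street'."""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return "saint" if nxt[:1].isupper() else "street"


def normalize(text):
    """Split on whitespace, then map each token to one pronounceable word."""
    tokens = text.split()
    words = []
    for i, tok in enumerate(tokens):
        bare = tok.strip(",.?!")
        if tok == "St.":
            words.append(expand_st(tokens, i))
        elif bare.isdigit() and int(bare) < 100:
            words.append(number_to_words(int(bare)))
        elif bare.lower() in ABBREVIATIONS:
            words.append(ABBREVIATIONS[bare.lower()])
        elif bare:
            # Anything else (including dates and URLs) passes through untouched.
            words.append(bare.lower())
    return words
```

Calling `normalize("It's 13 St. Andrews St.")` resolves the two readings of "St." by position alone. Dates such as 11/12/01 and URLs are deliberately left unexpanded here, since they need type identification first.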

Words to Phones
  Two methods:
    Dictionary-based
    Rule-based (letter-to-sound = LTS)
  Early systems: all LTS
  MITalk was radical in having a huge 10K-word dictionary
  Now systems use a combination:
    Big dictionary
    Special code for handling names
    Machine-learned LTS system for other unknown words
  CMU dictionary: 127K words
    http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Letter-to-Sound Rules
  Festival LTS rules:
    ( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS )
  Examples:
    ( # [ c h ] C = k )
    ( # [ c h ] = ch )
  Rules apply in order:
    "christmas" is pronounced with [k]
    But a word with "ch" followed by a non-consonant is pronounced [ch], e.g., "choice"
  More modern approach: learn HMMs / CRFs

Prosody
  Prosody: getting from words + phones to boundaries, accent, F0, duration
  Prosodic phrasing:
    Need to break utterances into phrases
    Punctuation is useful, not sufficient
  Accents:
    Prediction of accents: which syllables should be accented
    Realization of F0 contour: given accents/tones, generate F0 contour
  Duration:
    Predicting duration of each phone

Three aspects of prosody
  Prominence: some syllables/words are more prominent than others
  Structure/boundaries: sentences have prosodic structure
    Some words group naturally together
    Others have a noticeable break or disjuncture between them
  Tune: the intonational melody of an utterance
  From Ladd (1996)

Prominence: Pitch Accents
  A: What types of foods are a good source of vitamins?
  B1: Legumes are a good source of VITAMINS.
  B2: LEGUMES are a good source of vitamins.
  Prominent syllables are:
    Louder
    Longer
    Have higher F0 and/or sharper changes in F0 (higher F0 velocity)

Graphic representation of F0
  [Figure: F0 (in Hertz) against time for "legumes are a good source of VITAMINS"]
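The ordered application of the two Festival-style "ch" rules from the Letter-to-Sound Rules slide can be sketched as follows. The rule tuples mirror the slide's (LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS) layout, but the matcher itself, the vowel set, and the pass-through of unmatched letters are assumptions made for illustration. This is not Festival's actual implementation.

```python
VOWELS = set("aeiou")  # assumed vowel letters; everything else alphabetic is "C"

# Each rule: (left context, letters to rewrite, right context, output phones).
# "#" = word boundary, "C" = any consonant, "" = no constraint.
RULES = [
    ("#", "ch", "C", ["k"]),
    ("#", "ch", "", ["ch"]),
]


def is_consonant(ch):
    return ch.isalpha() and ch not in VOWELS


def context_matches(symbol, word, pos):
    """Check one context symbol against the character at position pos."""
    if symbol == "":
        return True
    if symbol == "#":
        return pos < 0 or pos >= len(word)
    if symbol == "C":
        return 0 <= pos < len(word) and is_consonant(word[pos])
    return 0 <= pos < len(word) and word[pos] == symbol


def letters_to_phones(word):
    """Scan left to right; at each position the first matching rule wins."""
    word = word.lower()
    phones, i = [], 0
    while i < len(word):
        for left, items, right, out in RULES:
            if (word.startswith(items, i)
                    and context_matches(left, word, i - 1)
                    and context_matches(right, word, i + len(items))):
                phones.extend(out)
                i += len(items)
                break
        else:
            # No rule fired: emit the letter itself as a placeholder phone.
            phones.append(word[i])
            i += 1
    return phones
```

Because the rules are tried in order, "christmas" hits the first rule (consonant after "ch") and gets [k], while "choice" falls through to the second rule and gets [ch].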

The ripples
  [Figure: F0 contour of "legumes are a good source of VITAMINS", with [s], [s], [t] marked]
  F0 is not defined for consonants without vocal fold vibration.

The ripples
  [Figure: F0 contour of "legumes are a good source of VITAMINS", with [g], [z], [g], [v] marked]
  ... and F0 can be perturbed by consonants with an extreme constriction in the vocal tract.

Abstraction of the F0 contour
  [Figure: F0 contour of "legumes are a good source of VITAMINS"]
  Our perception of the intonation contour abstracts away from these perturbations.

The waves and the swells
  [Figure: F0 contour of "legumes are a good source of VITAMINS"; wave = accent, swell = phrase]

Stress vs. Accent
  Stress is a structural property of a word: it marks a potential (arbitrary) location for an accent to occur, if there is one.
  Accent is a property of a word in context: it is a way to mark intonational prominence in order to highlight important words in the discourse.
  [Figure: grids for "vi ta mins" and "Ca li for nia" with rows for syllables, full vowels, stressed syllable, and (accented syllable)]

Which Word is Accented?
  It depends on the context. For example, the new information in the answer to a question is often accented, while the old information usually is not.
  Q1: What types of foods are a good source of vitamins?
  A1: LEGUMES are a good source of vitamins.
  Q2: Are legumes a source of vitamins?
  A2: Legumes are a GOOD source of vitamins.
  Q3: I've heard that legumes are healthy, but what are they a good source of?
  A3: Legumes are a good source of VITAMINS.

Same tune, different alignment
  [Figure: F0 contour of "LEGUMES are a good source of vitamins"]
  The main rise-fall accent (= "I assert this") shifts locations.

Same tune, different alignment
  [Figure: F0 contour of "Legumes are a GOOD source of vitamins"]
  The main rise-fall accent (= "I assert this") shifts locations.

Same tune, different alignment
  [Figure: F0 contour of "legumes are a good source of VITAMINS"]
  The main rise-fall accent (= "I assert this") shifts locations.

Broad focus
  "Tell me something about the world."
  [Figure: F0 contour of "legumes are a good source of vitamins"]
  In the absence of narrow focus, English tends to mark the first and last content words with perceptually prominent accents.

Yes-No question tune
  [Figure: F0 contour of "are LEGUMES a good source of vitamins"]
  Rise from the main accent to the end of the sentence.

Yes-No question tune
  [Figure: F0 contour of "are legumes a GOOD source of vitamins"]
  Rise from the main accent to the end of the sentence.

Yes-No question tune
  [Figure: F0 contour of "are legumes a good source of VITAMINS"]
  Rise from the main accent to the end of the sentence.

WH-questions
  [I know that many natural foods are healthy, but...]
  [Figure: F0 contour of "WHAT are a good source of vitamins"]
  WH-questions typically have falling contours, like statements.

Broad focus
  "Tell me something about the world."
  [Figure: F0 contour of "legumes are a good source of vitamins"]

Rising statements
  "Tell me something I didn't already know."
  [Figure: F0 contour of "legumes are a good source of vitamins"]
  [... does this statement qualify?]
  High-rising statements can signal that the speaker is seeking approval.

Surprise-redundancy tune
  [How many times do I have to tell you...]
  [Figure: F0 contour of "legumes are a good source of vitamins"]
  Low beginning followed by a gradual rise to a high at the end.

Contradiction tune
  "I've heard that linguini is a good source of vitamins."
  [Figure: F0 contour of "linguini isn't a good source of vitamins"]
  [... how could you think that?]
  Sharp fall at the beginning, flat and low, then rising at the end.

A single intonation phrase
  [Figure: F0 contour of "legumes are a good source of vitamins"]
  Broad focus statement consisting of one intonation phrase (that is, one intonation tune spans the whole unit).

Multiple phrases
  [Figure: F0 contour of "legumes are a good source of vitamins"]
  Utterances can be chunked up into smaller phrases in order to signal the importance of information in each unit.

Phrasing can disambiguate
  Global ambiguity:
    The old men and women stayed home.
    The old men % and women % stayed home.
    Sally saw % the man with the binoculars.
    Sally saw the man % with the binoculars.
    John doesn't drink because he's unhappy.
    John doesn't drink % because he's unhappy.

Phrasing can disambiguate
  Temporary ambiguity:
    When Madonna sings the song...
    When Madonna sings % the song is a hit.
    When Madonna sings the song % it's a hit.
  [from Speer & Kjelgaard (1992)]

Phrasing can disambiguate
  [Figure: F0 contour of "I met Mary and Elena's mother at the mall yesterday", labeled "Mary & Elena's mother", "mall"]
  One intonation phrase with relatively flat overall pitch range.

Phrasing can disambiguate
  [Figure: F0 contour of "I met Mary and Elena's mother at the mall yesterday", labeled "Mary", "Elena's mother", "mall"]
  Separate phrases, with expanded pitch movements.

ToBI: Tones and Break Indices
  Pitch accent tones:
    H*      peak accent
    L*      low accent
    L+H*    rising peak accent (contrastive)
    L*+H    scooped accent
    H+!H*   downstepped high
  Boundary tones:
    L-L%    final low (Am. Eng. declarative contour)
    L-H%    continuation rise
    H-H%    yes-no question
  Break indices:
    0   clitics
    1   word boundaries
    2   short pause
    3   intermediate intonation phrase
    4   full intonation phrase / final boundary

Examples of the ToBI system
  I don't eat beef.
    L* L* L*L-L%
  Marianna made the marmalade.
    H* L-L%
    L* H-H%
  "I" means insert.
    H* H* H*L-L%
    1 H*L- H*L-L% 3
  Slide from Lavoie and Podesva

Intonation in TTS
  1) Accent: decide which words are accented, which syllable has accent, what sort of accent
  2) Boundaries: decide where intonational boundaries are
  3) Duration: specify length of each segment
  4) F0: generate F0 contour from these

Factors in accent prediction
  Contrast:
    Legumes are a poor source of VITAMINS
    No, legumes are a GOOD source of vitamins
    I think JOHN and MARY should go
    No, I think JOHN AND MARY should go

But it's more than just contrast
  List intonation:
    I went and saw ANNA, LENNY, MARY, and NORA.
  Part of speech:
    Content words are usually accented
    Function words are rarely accented
  Word order:
    Preposed items are accented more frequently
    TODAY we will BEGIN to LOOK at FROG anatomy.
    We will BEGIN to LOOK at FROG anatomy today.
  Information status: new versus old information
    Old information is deaccented
    There are LAWYERS, and there are GOOD lawyers
    EACH NATION DEFINES its OWN national INTEREST.

Complex NP Structure
  Sproat, R. 1994. English noun-phrase accent prediction for text-to-speech. Computer Speech and Language 8:79-94.
  Proper names, stress on right-most word:
    New York CITY; Paris, FRANCE
  Adjective-noun combinations, stress on noun:
    Large HOUSE, red PEN, new NOTEBOOK
  Noun-noun compounds, stress on left noun:
    HOTdog (food) versus HOT DOG (overheated animal)
    WHITE house (place) versus WHITE HOUSE (made of stucco)
  Examples:
    Madison AVENUE, PARK street, MEDICAL building, APPLE cake, cherry PIE
  Some rules:
    Furniture + Room -> RIGHT (e.g., kitchen TABLE)
    Proper-name + Street -> LEFT (e.g., PARK street)
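Two of the accent-prediction factors above, part of speech and information status, are simple enough to sketch as rules. The function-word list below is a hypothetical stand-in for a real part-of-speech tagger, and treating "old information" as "any word already seen" is an assumption made for illustration. It reproduces the LAWYERS example from the slide but little else.

```python
# Hypothetical closed-class word list standing in for a POS tagger.
FUNCTION_WORDS = {"a", "an", "the", "are", "is", "of", "to", "and", "there",
                  "we", "will", "at", "its"}


def predict_accents(words, given=()):
    """Return the words with accented ones upper-cased.

    A word is accented if it is a content word and has not been mentioned
    before, either earlier in `words` or in the optional `given` context.
    """
    seen = {w.lower() for w in given}
    marked = []
    for w in words:
        key = w.lower()
        accented = key not in FUNCTION_WORDS and key not in seen
        marked.append(w.upper() if accented else w.lower())
        seen.add(key)
    return marked
```

For example, `predict_accents("there are lawyers and there are good lawyers".split())` leaves the second "lawyers" unaccented because it repeats an earlier word. Contrast, list intonation, and compound stress would need further rules or, as the later slides suggest, a supervised classifier.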

State-of-the-Art
  Supervised systems
  Hand-labeled accented data
  Feature driven
  More features:
    POS
    POS of previous word
    POS of next word
    Stress of current, previous, next syllable
    Unigram probability of word
    Bigram probability of word
    Position of word in sentence

Duration
  Simplest: fixed size for all phones ( ms)
  Next simplest: average duration for that phone (from training data).
  Samples from SWBD in ms:
    aa 118    b  68
    ax  59    d  68
    ay 138    dh 44
    eh  87    f  90
    ih  77    g  66
  Next next simplest: add in phrase-final and initial lengthening plus stress

Duration
  Klatt duration rules: modify duration based on:
    Position in clause
    Syllable position in word
    Syllable type
    Lexical stress
    Left + right context phone
    Prepausal lengthening
  Supervised systems now used

F0 generation by regression
  Supervised learning again
  Predict value of F0 at 3 places in each syllable
  Predictor features:
    Accent of current word, next word, previous
    Boundaries
    Syllable type, phonetic information
    Stress information
  Need training sets with pitch accents labeled

Waveform Synthesis
  Given:
    String of phones
    Prosody:
      Desired F0 for entire utterance
      Duration for each phone
      Stress value for each phone, possibly accent value
  Generate:
    Waveforms

Concatenative Synthesis
  All current commercial systems.
  Diphone synthesis:
    Units are diphones; middle of one phone to middle of next.
    Why? Middle of phone is steady state.
    Record 1 speaker saying each diphone
  Unit selection synthesis:
    Larger units
    Record 10 hours or more, so have multiple copies of each unit
    Use search to find best sequence of units
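The "average duration plus lengthening" idea from the Duration slides can be sketched as a table lookup followed by multiplicative adjustments. The per-phone means are copied from the SWBD samples above. The fallback value and both multipliers are assumed placeholders, not figures from the slides or from the Klatt rules.

```python
# Mean phone durations in ms, from the SWBD samples listed on the slide.
MEAN_DURATION_MS = {"aa": 118, "ay": 138, "eh": 87, "ih": 77,
                    "b": 68, "d": 68, "dh": 44, "f": 90, "g": 66}

DEFAULT_MS = 80.0          # assumed fallback for phones missing from the table
STRESS_FACTOR = 1.2        # assumed multiplier for stressed phones
PHRASE_FINAL_FACTOR = 1.4  # assumed multiplier for phrase-final lengthening


def phone_duration(phone, stressed=False, phrase_final=False):
    """Return a duration in ms: table mean, scaled by context multipliers."""
    ms = float(MEAN_DURATION_MS.get(phone, DEFAULT_MS))
    if stressed:
        ms *= STRESS_FACTOR
    if phrase_final:
        ms *= PHRASE_FINAL_FACTOR
    return ms
```

A supervised system would replace the hand-set multipliers with values learned from labeled data, using the same kinds of context features as inputs.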

Diphone TTS architecture
  Training:
    Choose units (kinds of diphones)
    Record diphones
    Label diphones (decide where break is)
  Synthesizing an utterance:
    Grab relevant diphones from database
    Use signal processing to change the prosody (F0, energy, duration) of selected sequence of diphones

Collecting diphones
  Record diphones in correct contexts:
    "l" sounds different in onset than coda
    "t" is flapped sometimes, etc.
  Need quiet recording room, etc.
  Need to label them very, very exactly

Recording conditions
  Ideal:
    Anechoic chamber
    Studio-quality recording
    EGG signal
  More likely:
    Quiet room
    Cheap microphone/sound blaster
    No EGG
    Head-mounted microphone
  What we can do:
    Repeatable conditions
    Careful setting on audio levels

Diphone Boundaries, Ends

Diphones
  Mid-phone is more stable than edge
  Need O(phone^2) number of units
    Some combinations don't exist (hopefully)
    May include stress, consonant clusters
  Lots of phonetic knowledge in design
  Database relatively small (by today's standards)
    Around 8 MB for English (16 kHz, 16 bit)

Diphone Synthesis
  Augmentations:
    Stress
    Onset/coda
    Demi-syllables
  Problems:
    Signal processing still necessary for modifying durations
    Source data is still not natural
    Units are just not large enough; can't handle word-specific effects, etc.

Unit Selection Synthesis
  Generalization of the diphone intuition
  Larger units:
    From diphones to sentences
  Many, many copies of each unit:
    10 hours of speech instead of 0 diphones (a few minutes of speech)
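Turning a phone string into the diphone sequence that the synthesizer looks up is a short operation, sketched below. The silence symbol "pau" and the "a-b" naming of diphones are assumptions made for illustration; a given voice would define its own inventory and naming.

```python
def to_diphones(phones, silence="pau"):
    """Pad with silence on both sides, then pair each phone with its successor."""
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def max_inventory_size(n_phones):
    """Upper bound on distinct diphones: every ordered pair of phones."""
    return n_phones ** 2
```

With a hypothetical inventory of 40 phones, `max_inventory_size(40)` gives 1600 ordered pairs, which is the O(phone^2) growth mentioned on the Diphones slide; the real count is lower because some combinations never occur.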

Why Unit Selection Synthesis
  Natural data solves problems with diphones
    Diphone databases are carefully designed, but:
      Speaker makes errors
      Speaker doesn't speak intended dialect
      Require database design to be right
    If it's automatic:
      Labeled with what the speaker actually said
      Coarticulation, schwas, flaps are natural
  There's no data like more data
    Lots of copies of each unit mean you can choose just the right one for the context
    Larger units mean you can capture wider effects

Unit Selection Intuition
  Given a big database
  Find the unit in the database that is the best to synthesize some target segment
  What does "best" mean?
    Target cost: closest match to the target description, in terms of
      Phonetic context
      F0, stress, phrase position
    Join cost: best join with neighboring units
      Matching formants + other spectral characteristics
      Matching energy
      Matching F0

Targets and Target Costs
  A measure of how well a particular unit in the database matches the internal representation produced by the prior stages
  Features, costs, and weights
  Examples:
    /ih-t/ from stressed syllable, phrase internal, high F0, content word
    /n-t/ from unstressed syllable, phrase final, low F0, content word
    /dh-ax/ from unstressed syllable, phrase initial, high F0, from function word "the"
  Slide from Paul Taylor

Target Costs
  Comprised of k subcosts:
    Stress
    Phrase position
    F0
    Phone duration
    Lexical identity
  Target cost for a unit:
    $$C^t(t_i, u_i) = \sum_{k=1}^{p} w^t_k \, C^t_k(t_i, u_i)$$
  Slide from Paul Taylor

How to set target cost weights
  Clever Hunt and Black (1996) idea:
    Hold out some utterances from the database
    Now synthesize one of these utterances
    Compute all the phonetic, prosodic, duration features
    Now for a given unit in the output
      For each possible unit that we COULD have used in its place
      We can compute its acoustic distance from the TRUE ACTUAL HUMAN utterance
    This acoustic distance can tell us how to weight the phonetic/prosodic/duration features

Join (Concatenation) Cost
  Measure of smoothness of join
  Measured between two database units (target is irrelevant)
  Features, costs, and weights
  Comprised of k subcosts:
    Spectral features
    F0
    Energy
  Join cost:
    $$C^j(u_{i-1}, u_i) = \sum_{k=1}^{p} w^j_k \, C^j_k(u_{i-1}, u_i)$$
  Slide from Paul Taylor
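Both the target cost and the join cost above are weighted sums of subcosts, so one helper covers the two formulas. In the sketch below, the feature names, the 0/1 mismatch penalties, and the division of the F0 difference by 100 Hz are all assumptions chosen for illustration. The slides name the features but not how each subcost is measured.

```python
def weighted_cost(subcosts, weights):
    """Sum of w_k * C_k over all subcosts, as in the target and join cost formulas."""
    if len(subcosts) != len(weights):
        raise ValueError("need exactly one weight per subcost")
    return sum(w * c for w, c in zip(weights, subcosts))


def target_subcosts(target, unit):
    """Assumed subcosts: 0/1 for categorical mismatches, scaled distance for F0."""
    return [
        0.0 if target["stress"] == unit["stress"] else 1.0,
        0.0 if target["phrase_position"] == unit["phrase_position"] else 1.0,
        abs(target["f0"] - unit["f0"]) / 100.0,  # assumed normalisation by 100 Hz
    ]
```

A usage example: with a target `{"stress": True, "phrase_position": "final", "f0": 120.0}`, a candidate unit `{"stress": False, "phrase_position": "final", "f0": 170.0}`, and weights `[2.0, 1.0, 1.0]`, the target cost is 2.0 for the stress mismatch plus 0.5 for the F0 distance. The Hunt and Black procedure described above is one way to choose those weights from data rather than by hand.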

Join costs
  The join cost can be used for more than just part of search
  Can use the join cost for optimal coupling (Conkie 1996), i.e., finding the best place to join the two units
    Vary edges within a small amount to find best place for join
    This allows different joins with different units
    Thus labeling of database (or diphones) need not be so accurate

Total Costs
  Hunt and Black 1996
  We now have weights (per phone type) for features set between target and database units
  Find best path of units through database that minimizes:
    $$C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^{\text{target}}(t_i, u_i) + \sum_{i=2}^{n} C^{\text{join}}(u_{i-1}, u_i)$$
    $$\hat{u}_1^n = \operatorname*{argmin}_{u_1, \ldots, u_n} C(t_1^n, u_1^n)$$
  Standard problem solvable with Viterbi search with beam width constraint for pruning
  Slide from Paul Taylor

Unit Selection Search

Improvements
  Taylor and Black 1999: Phonological Structure Matching
  Label whole database as trees:
    Words/phrases, syllables, phones
  For target utterance:
    Label it as tree
    Top-down, find subtrees that cover target
    Recurse if no subtree found
  Produces list of target subtrees:
    Explicitly longer units than other techniques
  Selects on:
    Phonetic/metrical structure
    Only indirectly on prosody
    No acoustic cost

Database creation (1)
  Good speaker
    Professional speakers are always better:
      Consistent style and articulation
      Although these databases are carefully labeled
  Ideally (according to AT&T experiments):
    Record 20 professional speakers (small amounts of data)
    Build simple synthesis examples
    Get many (?) people to listen and score them
    Take best voices
  Correlates for human preferences:
    High power in unvoiced speech
    High power in higher frequencies
    Larger pitch range
  Text from Paul Taylor and Richard Sproat

Database creation (2)
  Good recording conditions
  Good script
    Application dependent helps
      Good word coverage
      News data synthesizes as news data
      News data is bad for dialog
    Good phonetic coverage, especially with respect to context
    Low ambiguity
    Easy to read
  Annotate at phone level, with stress, word information, phrase breaks
  Text from Paul Taylor and Richard Sproat
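The minimization on the Total Costs slide is a shortest-path problem over a lattice of candidate units, which a Viterbi pass solves exactly. The sketch below is a minimal version under stated assumptions: `candidates[i]` is a non-empty list of database units for target `i`, the two cost functions are supplied by the caller, and no beam pruning is applied, although the slide notes that a real system would prune.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return (best unit sequence, total cost) minimizing target + join costs."""
    # best[i][j] = (cheapest cost of any path ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            # Cheapest way to arrive at u from any unit in the previous column.
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)

    # Trace back from the cheapest unit in the final column.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total
```

The running time is proportional to the number of targets times the square of the candidates per target, which is why heavy pruning of the candidate lists matters in practice.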

Creating database
  Unlike diphones, prosodic variation is a good thing
  Accurate annotation is crucial
  Pitch annotation needs to be very, very accurate
  Phone alignments can be done automatically, as described for diphones

Practical System Issues
  Size of typical system (Rhetorical rVoice): ~M
  Speed:
    For each diphone, average of 0 units to choose from, so:
      0 target costs
      00 join costs
      Each join cost, say 30x30 floating-point calculations
    10-15 diphones per second
    10 billion floating-point calculations per second
  But commercial systems must run ~ faster than real time
  Heavy pruning essential: 0 units -> 25 units
  Slide from Paul Taylor

Unit Selection Summary
  Advantages:
    Quality is far superior to diphones
    Natural prosody selection sounds better
  Disadvantages:
    Quality can be very bad in places
      HCI problem: mix of very good and very bad is quite annoying
    Synthesis is computationally expensive
    Can't synthesize everything you want:
      Diphone technique can move emphasis
      Unit selection gives good (but possibly incorrect) result

Joining Units (+F0 + duration)
  Both diphone and unit selection synthesis need to join the units
  For diphone synthesis, need to modify F0 and duration
  For unit selection, in principle also need to modify F0 and duration of selected units
  But in practice, if the unit-selection database is big enough (commercial systems), often avoid prosodic modifications altogether, as selected targets may already be close to desired prosody
  Alan Black

Joining Units
  Dumb: just join
  Better: at zero crossings
  TD-PSOLA: time-domain pitch-synchronous overlap-and-add
    Join at pitch periods (with windowing)
  Alan Black

Prosodic Modification
  Modifying pitch and duration independently
  Changing sample rate modifies both: chipmunk speech
  Duration: duplicate/remove parts of the signal
  Pitch: resample to change pitch
  Text from Alan Black
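The "better: at zero crossings" option from the Joining Units slide can be sketched on plain lists of samples. The choice below, cutting the left unit at its last upward zero crossing and the right unit at its first upward zero crossing, is one assumed way to do it. The slides do not say which crossings to use, and no windowing or pitch-synchronous alignment is attempted.

```python
def last_upward_crossing(samples):
    """Index just after the last negative-to-non-negative sign change."""
    for i in range(len(samples) - 2, -1, -1):
        if samples[i] < 0 <= samples[i + 1]:
            return i + 1
    return len(samples)  # no crossing found: keep the whole unit


def first_upward_crossing(samples):
    """Index just after the first negative-to-non-negative sign change."""
    for i in range(len(samples) - 1):
        if samples[i] < 0 <= samples[i + 1]:
            return i + 1
    return 0  # no crossing found: keep the whole unit


def join_at_zero_crossings(left, right):
    """Trim both units to an upward zero crossing, then concatenate."""
    head = list(left[:last_upward_crossing(left)])
    tail = list(right[first_upward_crossing(right):])
    return head + tail
```

After trimming, the left unit ends on a negative sample and the right unit starts on a non-negative one, so the waveform crosses zero at the seam instead of jumping between arbitrary amplitudes.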

Speech as Short Term signals
  Alan Black

Duration modification
  Duplicate/remove short-term signals
  Alan Black

Pitch Modification
  Move short-term signals closer together/further apart

Overlap-and-add (OLA)
  Huang, Acero and Hon

TD-PSOLA
  Thierry Dutoit

TD-PSOLA
  Time-Domain Pitch Synchronous Overlap and Add
  Patented by France Telecom (CNET)
  Very efficient
    No FFT (or inverse FFT) required
  Can modify Hz up to two times or by half
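The "duplicate/remove short-term signals" idea for duration modification can be sketched by treating each pitch period as one short-term signal. This is a deliberate simplification and not TD-PSOLA itself: it assumes the pitch marks are already known, and it copies whole periods without the windowing and overlap-add that the real method uses to smooth the seams.

```python
def change_duration(samples, pitch_marks, factor):
    """Stretch (factor > 1) or shrink (factor < 1) by repeating or dropping periods.

    pitch_marks are sample indices that delimit consecutive pitch periods;
    at least two marks are assumed.
    """
    periods = [samples[a:b] for a, b in zip(pitch_marks, pitch_marks[1:])]
    n_out = max(1, round(len(periods) * factor))
    out = []
    for j in range(n_out):
        # Map each output period back to an input period, clamped to the last one.
        src = min(len(periods) - 1, int(j / factor))
        out.extend(periods[src])
    return out
```

Because whole periods are repeated or skipped, the spacing between pitch marks, and hence the pitch, stays the same while the duration changes. Changing pitch instead would mean moving the periods closer together or further apart before overlap-adding them.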

Evaluation of TTS
  Intelligibility tests:
    Diagnostic Rhyme Test (DRT)
      Listeners choose between two words differing by a single phonetic feature
        Voicing, nasality, sustention, sibilation
      96 rhyming pairs
        Veal/feel, meat/beat, vee/bee, zee/thee, etc.
      Subject hears "veal", chooses either "veal" or "feel"
      Subject also hears "feel", chooses either "veal" or "feel"
      % of right answers is intelligibility score
  Overall quality tests:
    Have listeners rate speech on a scale from 1 (bad) to 5 (excellent)
  Preference tests (prefer A, prefer B)
  Huang, Acero, Hon
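Both scores above reduce to simple arithmetic over listener responses, sketched below. The input formats, a list of (word played, word chosen) pairs and a list of 1-to-5 ratings, are assumptions made for illustration. The averaged rating is what is commonly called a mean opinion score, a name the slide itself does not use.

```python
def drt_score(trials):
    """Percent of trials where the listener chose the word that was played."""
    correct = sum(1 for played, chosen in trials if played == chosen)
    return 100.0 * correct / len(trials)


def mean_opinion_score(ratings):
    """Average of listener ratings on the 1 (bad) to 5 (excellent) scale."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)
```

For instance, a listener who hears "feel" but picks "veal" on one of four trials yields a DRT score of 75.0.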