Speech Synthesis. Tokyo Institute of Technology, Department of Computer Science

Speech Synthesis. Sadaoki Furui, Tokyo Institute of Technology, Department of Computer Science, furui@cs.titech.ac.jp

Principal elements of text-to-speech conversion system (figure). Processing stages, from text input to speech output:
1. Semantic preprocessing and phrase parsing: expand abbreviations, numbers, etc.; assign phrase structure and stress based on grammatical heuristics.
2. Text-to-phoneme conversion: pronounce words based on rules or dictionary look-up (uses the pronouncing dictionary and rules).
3. Timing and intonation: assign pitch and duration.
4. Segmental concatenation: concatenate parts of speech (uses the acoustic segments dictionary).
5. Synthesizer: synthesize waveform based on concatenated parameters.
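The first two stages of the diagram (expanding abbreviations and numbers, then pronouncing words by dictionary look-up with a rule fallback) can be sketched roughly as follows. This is only a toy illustration: the tables ABBREVIATIONS, DIGITS, LEXICON and LETTER_RULES, the phone symbols, and the function names are all hypothetical and are not taken from the slides.

```python
import re

# Hypothetical toy resources; a real system would use far larger tables.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
LEXICON = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "two": ["T", "UW"],
}
# Crude letter-to-sound fallback: one phone per letter (an assumption).
LETTER_RULES = {"c": "K", "a": "AE", "t": "T", "s": "S"}


def normalize(text):
    """Expand abbreviations and digit strings into plain lowercase words."""
    words = []
    for tok in text.lower().split():
        if tok in ABBREVIATIONS:
            words.append(ABBREVIATIONS[tok])
        elif tok.isdigit():
            # Read numbers digit by digit (simplest possible rule).
            words.extend(DIGITS[int(d)] for d in tok)
        else:
            # Strip punctuation, keep letters and apostrophes.
            words.append(re.sub(r"[^a-z']", "", tok))
    return [w for w in words if w]


def to_phonemes(word):
    """Dictionary look-up first, letter-by-letter rules otherwise."""
    if word in LEXICON:
        return LEXICON[word]
    return [LETTER_RULES.get(ch, ch.upper()) for ch in word]


if __name__ == "__main__":
    for w in normalize("Dr. Smith has 42 cats"):
        print(w, to_phonemes(w))
```

The look-up-then-rules order mirrors the "rules or dictionary look-up" wording of the figure; the later stages (timing, intonation, concatenation) are not covered by this sketch.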

Mechanical speech synthesizer by von Kempelen (figure). Labeled parts: bellows, auxiliary bellows, compressed air chamber, reed, reed cutoff, leather resonator, nostril, S whistle, S lever, SH whistle, SH lever, and the opening where the speech sounds come out.

The sound production mechanism of Kempelen's speaking machine.

Von Kempelen's speaking machine, as it can be seen in the Deutsches Museum in Munich, and seen from above, with the cover of the box
Figure labels (translated from Spanish): nostrils, main bellows, mouth, auxiliary bellows, spring.

Voder synthesizer (1939)

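Voder synthesizer (figure). Signal path: a random noise source (unvoiced source, corresponding to a constriction) or a relaxation oscillator (voiced source, corresponding to the vocal cords) feeds the resonance control (corresponding to the vocal tract), followed by an amplifier and a loudspeaker (corresponding to radiation). Operator controls: filter-control keys 1 to 10, energy switch (wrist bar), stop keys (t-d, p-b, k-g), quiet key, and pitch-control pedal.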

Contribution of each formant to the amplitude spectrum (figure). Axes: frequency [kHz], 0 to 4; amplitude [dB], -50 to 40. Curves: F1, F2, F3, F4.
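The way individual formants contribute to the overall amplitude spectrum can be illustrated with a small sketch, assuming a cascade formant model in which each formant is a second-order digital resonator, so that the per-formant gains add in dB. The formant frequencies and bandwidths in FORMANTS and the 10 kHz sampling rate are assumed values chosen for illustration, not values read from the figure.

```python
import cmath
import math


def resonator_gain_db(freq_hz, formant_hz, bandwidth_hz, sample_rate=10000.0):
    """Gain in dB of one second-order resonator, normalized to 0 dB at 0 Hz.

    Difference equation: y[n] = a*x[n] + b*y[n-1] + c*y[n-2].
    """
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * formant_hz * t))
    a = 1.0 - b - c  # forces unity gain at 0 Hz
    z_inv = cmath.exp(-2j * math.pi * freq_hz * t)
    h = a / (1.0 - b * z_inv - c * z_inv * z_inv)
    return 20.0 * math.log10(abs(h))


# Assumed (frequency, bandwidth) pairs in Hz for F1..F4.
FORMANTS = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0), (3500.0, 200.0)]


def cascade_gain_db(freq_hz):
    """Total gain of the cascade: the per-formant contributions add in dB."""
    return sum(resonator_gain_db(freq_hz, f, bw) for f, bw in FORMANTS)


if __name__ == "__main__":
    for f in range(0, 4001, 500):
        parts = [resonator_gain_db(f, fk, bw) for fk, bw in FORMANTS]
        print(f, [round(p, 1) for p in parts], round(cascade_gain_db(f), 1))
```

Each resonator peaks near its own formant frequency and is flat (0 dB) at low frequencies, which is why the separate curves can simply be summed to obtain the full spectrum envelope.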

The Voder as demonstrated by Mrs. Harper at the Franklin Institute (photo). Labeled items: operating controls, microphone, testing equipment and clock, wrist bar, pitch-control pedal.

The Voder being demonstrated at the New York World's Fair

History of speech synthesis (item numbers as given on the slide)
1  The VODER of Homer Dudley, 1939
11  The DAVO articulatory synthesizer developed by George Rosen at M.I.T., 1958
6  Copying a natural sentence using the second generation of Gunnar Fant's OVE cascade formant synthesizer, 1962
13  Linear-prediction analysis and resynthesis of speech at a low bit rate in the Texas Instruments Speak 'n Spell toy, Richard Wiggins, 1980
30  The M.I.T. MITalk system by Jonathan Allen, Sheri Hunnicutt, and Dennis Klatt, 1979
33  The Klattalk system by Dennis Klatt of M.I.T., which formed the basis for Digital Equipment Corporation's DECtalk commercial system, 1983
35  Several of the DECtalk voices
36  DECtalk speaking at about 300 words/minute

Speak 'n Spell toy

Flow diagram showing CHATR's corpus processing (figure).
Database preparation: new speaker database (text and speech), labeling the speech data, parameter estimation, index creation, speaker database. Database-specific prosodic knowledge is learned on top of a pre-existing language and prosody knowledge base.
Synthesis time: input text, predicting prosody, predicted values (F0, power, duration, etc.), unit selection, waveform concatenation, synthesized speech.
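The unit selection step in the diagram is commonly described as a search for the sequence of database units that best matches the predicted values while joining smoothly. A minimal sketch of such a search by dynamic programming is given below. It is not CHATR's actual algorithm: the unit representation (name, F0 at start, F0 at end), the two cost functions, and all names are assumptions made for illustration.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target, minimizing target plus join costs.

    targets:    list of target specifications (here: desired F0 values)
    candidates: candidates[i] is the list of database units for targets[i]
    Returns (best unit sequence, total cost).
    """
    # best[i][j] = (cumulative cost, index of best predecessor)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Hypothetical costs: units are (name, f0_start_hz, f0_end_hz).
def f0_target_cost(target_f0, unit):
    return abs((unit[1] + unit[2]) / 2.0 - target_f0)


def f0_join_cost(prev_unit, unit):
    return float(abs(prev_unit[2] - unit[1]))


if __name__ == "__main__":
    targets = [120, 130]
    candidates = [
        [("a1", 118, 122), ("a2", 100, 104)],
        [("b1", 128, 132), ("b2", 121, 139)],
    ]
    units, cost = select_units(targets, candidates, f0_target_cost, f0_join_cost)
    print([u[0] for u in units], cost)
```

In this toy example both candidates for the second target match the predicted F0 equally well, so the join cost decides which one is chosen. A real system would combine several predicted values (F0, power, duration) and spectral continuity in the costs.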

HMM-based speech synthesis system (figure).
Training part: speech signals from the speech database go through excitation parameter extraction and spectral parameter extraction; the resulting excitation and spectral parameters, together with the labels, are used for training of HMMs, giving context-dependent HMMs.
Synthesis part: text, text analysis, label, parameter generation from HMM. The generated excitation parameters drive excitation generation, and the generated spectral parameters control the synthesis filter, which outputs the synthesized speech.
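The last two boxes of the synthesis part (excitation generation followed by a synthesis filter) can be sketched as a simple source-filter model. The sketch assumes that the excitation parameter is one F0 value per frame (0 meaning unvoiced), uses a pulse train for voiced frames and white noise for unvoiced frames, and substitutes a plain all-pole filter for whatever synthesis filter the actual system uses. Frame length, sampling rate, noise level, and function names are illustrative assumptions.

```python
import random


def generate_excitation(f0_per_frame, frame_len=80, sample_rate=16000, seed=0):
    """Pulse train for voiced frames (f0 > 0), white noise for unvoiced ones."""
    rng = random.Random(seed)
    out = []
    next_pulse = 0.0  # sample index at which the next pitch pulse is due
    n = 0
    for f0 in f0_per_frame:
        for _ in range(frame_len):
            if f0 > 0:
                if n >= next_pulse:
                    out.append(1.0)
                    next_pulse = n + sample_rate / f0  # one pitch period later
                else:
                    out.append(0.0)
            else:
                out.append(rng.gauss(0.0, 0.1))
                next_pulse = n + 1  # start with a pulse when voicing resumes
            n += 1
    return out


def all_pole_filter(x, a):
    """All-pole filter: y[n] = x[n] - sum_k a[k] * y[n-1-k]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y


if __name__ == "__main__":
    # Two unvoiced frames followed by four voiced frames at 100 Hz.
    excitation = generate_excitation([0.0, 0.0, 100.0, 100.0, 100.0, 100.0])
    # Assumed filter coefficients: one stable resonance (poles inside unit circle).
    speech = all_pole_filter(excitation, [-1.6, 0.81])
    print(len(speech), max(abs(s) for s in speech))
```

In a full system the filter coefficients would change from frame to frame according to the generated spectral parameters; they are held fixed here to keep the sketch short.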

Block diagram of a prosody generation system (figure): the parsed text and phone string, together with the speech style, are processed by pause insertion and prosodic phrasing, duration, F0 contour, and volume modules, producing an enriched prosodic representation. Different prosodic representations are obtained depending on the speaking style we use.

Pitch generation decomposed into symbolic and phonetic prosody (figure). The parsed text and phone string are first mapped to symbolic prosody (pauses, prosodic phrases, accent, tone, tune). Together with the prosody attributes (pitch range, prominence, declination) and the speaking style, this drives F0 contour generation, which outputs the F0 contour.
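As a rough illustration of how symbolic accents and the listed attributes (pitch range, prominence, declination) might be turned into an F0 contour, the toy model below superimposes accent bumps on a linearly declining baseline. This is an assumed, simplified model written for illustration, not the method behind the figure; the function name, the raised-cosine accent shape, and all default values are hypothetical.

```python
import math


def f0_contour(n_frames, accents, base_hz=110.0, pitch_range_hz=60.0,
               declination_hz=20.0, accent_width=10):
    """Toy F0 generator.

    accents: list of (frame_index, prominence) with prominence in 0..1.
    The baseline falls linearly by declination_hz over the utterance, and
    each accent adds a raised-cosine bump of height prominence * pitch_range_hz.
    """
    contour = []
    for t in range(n_frames):
        frac = t / (n_frames - 1) if n_frames > 1 else 0.0
        f0 = base_hz - declination_hz * frac  # declination
        for center, prominence in accents:
            d = abs(t - center)
            if d < accent_width:
                bump = 0.5 * (1.0 + math.cos(math.pi * d / accent_width))
                f0 += prominence * pitch_range_hz * bump
        contour.append(f0)
    return contour


if __name__ == "__main__":
    # Two accents, the second one less prominent.
    contour = f0_contour(101, [(20, 1.0), (70, 0.5)])
    print([round(v, 1) for v in contour[::10]])
```

Changing pitch_range_hz or declination_hz is one simple way to imitate the effect of a different speaking style on the same symbolic input.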

The four Chinese tones (figure): F0 plotted against time t for the 1st, 2nd, 3rd, and 4th tones.

ToBI pitch accent tones (table: ToBI tone, description; the graph column of the original is not reproduced)
H*: Peak accent. A tone target on an accented syllable which is in the upper part of the speaker's pitch range.
L*: Low accent. A tone target on an accented syllable which is in the lowest part of the speaker's pitch range.
L*+H: Scooped accent. A low tone target on an accented syllable which is immediately followed by a relatively sharp rise to a peak in the upper part of the speaker's pitch range.
L*+!H: Scooped downstep accent. A low tone target on an accented syllable which is immediately followed by a relatively flat rise to a downstep peak.
L+H*: Rising peak accent. A high peak target on an accented syllable which is immediately preceded by a relatively sharp rise from a valley in the lowest part of the speaker's pitch range.
!H*: Downstep high tone. A clear step down onto an accented syllable from a high pitch which itself cannot be accounted for by an H phrasal tone ending the preceding phrase or by a preceding H pitch accent in the same phrase.

ToBI transcription of "Marianna made the marmalade", with an H* accent on "Marianna" and "marmalade", and a final L-L% marking the characteristic sentence-final pitch drop. Note the use of 1 for the weak inter-word breaks and 4 for the sentence-final break (after Beckman).
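The transcription described in this caption can be written down as a small data structure with a tone tier and a break-index tier. The class and function names below are hypothetical; only the labels themselves (H*, L-L%, break indices 1 and 4) come from the caption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToBIWord:
    word: str
    accent: Optional[str] = None    # pitch accent on the word, e.g. "H*"
    boundary: Optional[str] = None  # phrase accent and boundary tone, e.g. "L-L%"
    break_index: int = 1            # strength of the break after the word


UTTERANCE = [
    ToBIWord("Marianna", accent="H*"),
    ToBIWord("made"),
    ToBIWord("the"),
    ToBIWord("marmalade", accent="H*", boundary="L-L%", break_index=4),
]


def tone_tier(words: List[ToBIWord]) -> str:
    """All tone labels in order of occurrence."""
    labels = []
    for w in words:
        labels.extend(label for label in (w.accent, w.boundary) if label)
    return " ".join(labels)


def break_tier(words: List[ToBIWord]) -> str:
    """One break index per word."""
    return " ".join(str(w.break_index) for w in words)


if __name__ == "__main__":
    print(" ".join(w.word for w in UTTERANCE))
    print(tone_tier(UTTERANCE))
    print(break_tier(UTTERANCE))
```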