Speech Synthesis. Tokyo Institute of Technology, Department of Computer Science

1 Speech Synthesis. Sadaoki Furui, Tokyo Institute of Technology, Department of Computer Science

2 Principal elements of a text-to-speech conversion system
  Text input
  Semantic preprocessing and phrase parsing: expand abbreviations, numbers, etc.; assign phrase structure and stress based on grammatical heuristics.
  Text-to-phoneme conversion (uses the pronouncing dictionary and rules): pronounce words based on rules or dictionary look-up.
  Timing and intonation: assign pitch and duration.
  Segmental concatenation (uses the acoustic segments dictionary): concatenate parts of speech.
  Synthesizer: synthesize the waveform based on the concatenated parameters.
  Speech output
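The front-end stages of the diagram above can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions, not the system shown on the slide: the abbreviation table, the tiny pronouncing dictionary (ARPAbet-style symbols), the letter-by-letter fallback rule, and the function names `normalize`, `to_phonemes`, and `assign_prosody` are all hypothetical.

```python
import re

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
PRONOUNCING_DICT = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "two": ["T", "UW"],
}


def normalize(text):
    """Preprocessing: expand abbreviations and spell out digits one by one."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"[0-9]+", token):
            words.extend(DIGITS[int(d)] for d in token)
        else:
            # Strip punctuation; keep letters and apostrophes.
            words.append(re.sub(r"[^a-z']", "", token))
    return [w for w in words if w]


def to_phonemes(words):
    """Text-to-phoneme: dictionary look-up, else a naive per-letter rule."""
    phones = []
    for word in words:
        fallback = [c.upper() for c in word if c.isalpha()]
        phones.extend(PRONOUNCING_DICT.get(word, fallback))
    return phones


def assign_prosody(phones, base_f0=120.0, dur_ms=80):
    """Timing and intonation: fixed duration and a linearly falling pitch."""
    n = len(phones)
    return [(p, dur_ms, base_f0 - 20.0 * i / max(n - 1, 1))
            for i, p in enumerate(phones)]
```

Calling `assign_prosody(to_phonemes(normalize("Dr. Smith has 2 cats")))` yields one `(phoneme, duration, pitch)` triple per symbol, which is the kind of parameter stream the concatenation and synthesizer stages would consume.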

3 Mechanical speech synthesizer by von Kempelen. Figure labels: bellows, auxiliary bellows, compressed air chamber, reed, reed cutoff, leather resonator ("speech sounds come out here"), nostril, S whistle, S lever, SH whistle, SH lever.

4 The sound production mechanism of Kempelen's speaking machine.

5 Von Kempelen's speaking machine, as it can be seen in the Deutsches Museum in Munich, and seen from above, with the cover of the box. Figure labels (translated from Spanish): nostrils, main bellows, mouth, auxiliary bellows, spring.

6 Voder synthesizer (1939)

7 Voder synthesizer. Schematic labels: random noise source, relaxation oscillator, constriction (unvoiced source), vocal cords (voiced source), vocal tract, resonance control, radiation, amplifier, loudspeaker, filter-control keys, energy switch (wrist bar), stops (t-d, p-b, k-g), quiet, pitch-control pedal.

8 Contribution of each formant to the amplitude spectrum. Axes: amplitude [dB] versus frequency [kHz]; curves are labeled by formant (F1, F2, F3).
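One common way to picture such a plot is to model each formant as a second-order resonance and add the individual curves in dB, as in a cascade formant model. The sketch below assumes that model; the formant frequencies and bandwidths in `FORMANTS` are illustrative values, not numbers taken from the slide, and the function names are hypothetical.

```python
import math


def formant_gain_db(f, center, bandwidth):
    """Gain in dB at frequency f (Hz) of one two-pole resonance.

    The pole pair sits at -pi*bandwidth +/- j*2*pi*center, and the
    response is normalized so that the gain at 0 Hz is 0 dB.
    """
    half_bw_sq = (bandwidth / 2.0) ** 2
    numerator = center ** 2 + half_bw_sq
    denominator = math.sqrt(((f - center) ** 2 + half_bw_sq)
                            * ((f + center) ** 2 + half_bw_sq))
    return 20.0 * math.log10(numerator / denominator)


def spectrum_db(f, formants):
    """Cascade of resonances: the per-formant contributions add in dB."""
    return sum(formant_gain_db(f, c, b) for c, b in formants)


# Assumed (center frequency, bandwidth) pairs in Hz, for illustration only.
FORMANTS = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)]

if __name__ == "__main__":
    for freq in range(0, 3001, 250):
        parts = [formant_gain_db(freq, c, b) for c, b in FORMANTS]
        print(freq, [round(p, 1) for p in parts],
              round(spectrum_db(freq, FORMANTS), 1))
```

Each printed row lists the separate contribution of every formant followed by their sum, which peaks near the assumed formant frequencies.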

9 The Voder as demonstrated by Mrs. Harper at the Franklin Institute. Figure labels: operating controls, microphone, testing equipment and clock, wrist bar, pitch-control pedal.

10 The Voder being demonstrated at the New York World's Fair

11 History of speech synthesis
  1 The VODER of Homer Dudley
  The DAVO articulatory synthesizer, developed in 1958 by George Rosen at M.I.T.
  6 Copying a natural sentence using the second generation of Gunnar Fant's OVE cascade formant synthesizer
  13 Linear-prediction analysis and resynthesis of speech at a low bit rate in the Texas Instruments Speak-'n-Spell toy, Richard Wiggins
  30 The M.I.T. MITalk system by Jonathan Allen, Sheri Hunnicutt, and Dennis Klatt
  33 The Klattalk system by Dennis Klatt of M.I.T., which formed the basis for Digital Equipment Corporation's DECtalk commercial system
  Several of the DECtalk voices
  36 DECtalk speaking at about 300 words/minute

12 Speak-'n-Spell toy

13 Flow diagram showing CHATR's corpus processing
  Corpus processing: new speaker database (text, speech); labeling the speech data; parameter estimation; learning database-specific prosodic knowledge (drawing on the pre-existing language and prosody knowledge base); index creation; speaker database.
  Synthesis time: input text; predicting prosody; predicted values (F0, power, duration, etc.); unit selection; waveform concatenation; synthesized speech.
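The slide names the unit selection step without describing it. A common formulation, assumed here rather than taken from the slide, scores every candidate unit with a target cost (distance from the predicted values) and a join cost (mismatch with the previous unit), then finds the cheapest sequence by dynamic programming. In this minimal sketch each unit is reduced to a single F0 value, and the function names are hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return the lowest-cost sequence of units, one per target.

    targets:    predicted values, one per position
    candidates: candidates[i] is the non-empty list of database units
                available for position i
    """
    # best[i][j] = (cost of the best path ending in candidates[i][j],
    #               index of the chosen predecessor in candidates[i - 1])
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            path_costs = [best[i - 1][k][0] + join_cost(prev, unit)
                          for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(path_costs)), key=path_costs.__getitem__)
            row.append((path_costs[k_best] + target_cost(targets[i], unit),
                        k_best))
        best.append(row)

    # Backtrack from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]


def f0_target_cost(target_f0, unit_f0):
    """Assumed target cost: absolute F0 error against the prediction."""
    return abs(target_f0 - unit_f0)


def f0_join_cost(prev_f0, unit_f0):
    """Assumed join cost: F0 jump between consecutive units."""
    return abs(prev_f0 - unit_f0)
```

For example, `select_units([100.0, 120.0], [[90.0, 100.0], [100.0, 150.0]], f0_target_cost, f0_join_cost)` compares all four possible paths and keeps the one with the smallest combined cost. A real system would use richer unit descriptions (spectrum, power, duration, phonetic context) and weighted costs.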

14 HMM-based speech synthesis system
  Training part: speech database; speech signal; excitation parameter extraction and spectral parameter extraction; excitation and spectral parameters, together with labels; training of HMMs; context-dependent HMMs.
  Synthesis part: text; text analysis; label; parameter generation from the HMMs; excitation parameter to excitation generation; spectral parameter to synthesis filter; synthesized speech.

15 Block diagram of a prosody generation system; different prosodic representations are obtained depending on the speaking style we use. Blocks: parsed text and phone string; pause insertion and prosodic phrasing; speech style; duration; F0 contour; volume; enriched prosodic representation.

16 Pitch generation decomposed into symbolic and phonetic prosody. Blocks: parsed text and phone string; symbolic prosody (pauses, prosodic phrases, accent, tone, tune); prosody attributes (pitch range, prominence, declination); speaking style; F0 contour generation; F0 contour.

17 The four Chinese tones. Axes: F0 versus time t; one curve each for the 1st, 2nd, 3rd, and 4th tones.
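The shapes of the four tones can be approximated from their conventional five-level (Chao) descriptions: 55 for the first tone, 35 for the second, 214 for the third, and 51 for the fourth. The sketch below turns those level sequences into F0 contours by piecewise-linear interpolation. The linear mapping from the five-level scale to Hz, the 100 to 200 Hz pitch range, and the function name are assumptions made for illustration.

```python
# Five-level (Chao) targets for the four Mandarin tones.
TONE_LEVELS = {1: [5, 5], 2: [3, 5], 3: [2, 1, 4], 4: [5, 1]}


def tone_contour(tone, n_points=5, f0_min=100.0, f0_max=200.0):
    """Return n_points F0 values (Hz) for one syllable; n_points >= 2.

    Level 1 maps to f0_min and level 5 to f0_max (assumed linear scale).
    """
    levels = TONE_LEVELS[tone]
    contour = []
    for i in range(n_points):
        # Position along the level sequence, from 0 to len(levels) - 1.
        pos = i / (n_points - 1) * (len(levels) - 1)
        k = min(int(pos), len(levels) - 2)
        frac = pos - k
        level = levels[k] + frac * (levels[k + 1] - levels[k])
        contour.append(f0_min + (level - 1) / 4.0 * (f0_max - f0_min))
    return contour


if __name__ == "__main__":
    for tone in sorted(TONE_LEVELS):
        print(tone, tone_contour(tone))
```

The first tone comes out flat at the top of the range, the second rises, the third dips before rising, and the fourth falls from the top of the range to the bottom.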

18 ToBI pitch accent tones (the Graph column of the original table is not reproduced)
  H* (peak accent): a tone target on an accented syllable which is in the upper part of the speaker's pitch range.
  L* (low accent): a tone target on an accented syllable which is in the lowest part of the speaker's pitch range.
  L*+H (scooped accent): a low tone target on an accented syllable which is immediately followed by a relatively sharp rise to a peak in the upper part of the speaker's pitch range.
  L*+!H (scooped downstep accent): a low tone target on an accented syllable which is immediately followed by a relatively flat rise to a downstep peak.
  L+H* (rising peak accent): a high peak target on an accented syllable which is immediately preceded by a relatively sharp rise from a valley in the lowest part of the speaker's pitch range.
  !H* (downstep high tone): a clear step down onto an accented syllable from a high pitch which itself cannot be accounted for by an H phrasal tone ending the preceding phrase or by a preceding H pitch accent in the same phrase.

19 "Marianna made the marmalade", with an H* accent on "Marianna" and "marmalade", and a final L-L% marking the characteristic sentence-final pitch drop. Note the use of 1 for the weak inter-word breaks, and 4 for the sentence-final break (after Beckman).
