Speech Synthesis. Tokyo Institute of Technology, Department of Computer Science


Speech Synthesis. Sadaoki Furui, Tokyo Institute of Technology, Department of Computer Science, furui@cs.titech.ac.jp

Principal elements of text-to-speech conversion system:
Text input
Semantic preprocessing and phrase parsing: expand abbreviations, numbers, etc.; assign phrase structure and stress based on grammatical heuristics.
Text-to-phoneme conversion: pronounce words based on rules or dictionary look-up (uses the pronouncing dictionary and rules).
Timing and intonation: assign pitch and duration.
Segmental concatenation: concatenate parts of speech (uses the acoustic segments dictionary).
Synthesizer: synthesize waveform based on concatenated parameters.
Speech output
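The first two stages above (expanding abbreviations and numbers, then pronouncing words by dictionary look-up with a rule fallback) can be sketched as follows. This is a minimal illustration, not the system described in the slides: the abbreviation table, the tiny lexicon, the one-letter-one-phoneme rules, and the function names `normalize` and `to_phonemes` are all hypothetical.

```python
import re

# Hypothetical toy tables; a real front end uses large lexicons and rule sets.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
LEXICON = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "four": ["F", "AO", "R"],
    "two": ["T", "UW"],
}
# Naive fallback rules (an assumption): one phoneme per letter.
LETTER_RULES = dict(zip(
    "abcdefghijklmnopqrstuvwxyz",
    "AE B K D EH F G HH IH JH K L M N AA P K R S T AH V W K Y Z".split(),
))


def normalize(text):
    """Expand abbreviations and digit strings into plain lowercase words."""
    words = []
    for raw in text.lower().split():
        if raw in ABBREVIATIONS:
            words.append(ABBREVIATIONS[raw])
        elif raw.isdigit():
            # Read numbers digit by digit to keep the sketch short.
            words.extend(DIGITS[int(d)] for d in raw)
        else:
            word = re.sub(r"[^a-z']", "", raw)
            if word:
                words.append(word)
    return words


def to_phonemes(word):
    """Dictionary look-up first, letter-to-sound rules as the fallback."""
    if word in LEXICON:
        return list(LEXICON[word])
    return [LETTER_RULES[ch] for ch in word if ch in LETTER_RULES]


if __name__ == "__main__":
    for w in normalize("Dr. Smith has 42 cats"):
        print(w, to_phonemes(w))
```

The look-up-then-rules order mirrors the "rules or dictionary look-up" wording of the slide; the later stages (timing, intonation, concatenation) are not covered here.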

Mechanical speech synthesizer by von Kempelen. Labeled parts: bellows, auxiliary bellows, compressed air chamber, reed, reed cutoff, leather resonator (speech sounds come out here), nostril, S whistle, S lever, SH whistle, SH lever.

The sound production mechanism of Kempelen's speaking machine.

Von Kempelen's speaking machine, as it can be seen in the Deutsches Museum in Munich, and seen from above, with the cover of the box. Labeled parts: nasal cavities, main bellows, mouth, auxiliary bellows, spring.

Voder synthesizer (1939)

Voder synthesizer. Block labels: random noise source (unvoiced source: constriction), relaxation oscillator (voiced source: vocal cords), resonance control (vocal tract), amplifier and loudspeaker (radiation). Controls: filter-control keys 1 to 10, energy switch (wrist bar), stop keys (t-d, p-b, k-g), pitch-control pedal, quiet key.

Contribution of each formant to the amplitude spectrum (amplitude [dB] versus frequency [kHz], 0 to 4 kHz, with separate curves for F1, F2, F3, and F4).
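One way to see how each formant contributes to the overall amplitude spectrum is to model every formant as a second-order digital resonator and connect the resonators in cascade; the total gain in dB is then the sum of the individual gains. The sketch below uses the standard two-pole resonator recurrence y[n] = A x[n] + B y[n-1] + C y[n-2]. The formant frequencies, bandwidths, and the 10 kHz sampling rate are assumed values chosen for illustration, not figures taken from the slides.

```python
import cmath
import math

# Assumed (frequency, bandwidth) pairs in Hz for F1..F4, and sampling rate.
FORMANTS = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0), (3500.0, 200.0)]
FS_HZ = 10000.0


def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients A, B, C of y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 (0 dB) at 0 Hz
    return a, b, c


def resonator_gain_db(f_hz, freq_hz, bw_hz, fs_hz):
    """Gain in dB of one resonator, evaluated at frequency f_hz."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    z1 = cmath.exp(-2j * math.pi * f_hz / fs_hz)  # z^-1 on the unit circle
    return 20.0 * math.log10(abs(a / (1.0 - b * z1 - c * z1 * z1)))


def cascade_gain_db(f_hz, formants, fs_hz):
    """Per-formant gains in dB and their sum (the cascade's total gain)."""
    parts = [resonator_gain_db(f_hz, f, bw, fs_hz) for f, bw in formants]
    return parts, sum(parts)


if __name__ == "__main__":
    for f in range(0, 4001, 500):
        parts, total = cascade_gain_db(float(f), FORMANTS, FS_HZ)
        print(f, [round(p, 1) for p in parts], round(total, 1))
```

Because the gains multiply in a cascade, they add on a dB scale, which is why the figure can show one curve per formant.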

The Voder as demonstrated by Mrs. Harper at the Franklin Institute. Labeled parts: operating controls, microphone, testing equipment and clock, wrist bar, pitch-control pedal.

The Voder being demonstrated at the New York World's Fair.

History of speech synthesis (numbers are the example numbers shown on the slide):
1 The VODER of Homer Dudley, 1939
11 The DAVO articulatory synthesizer developed by George Rosen at M.I.T., 1958
6 Copying a natural sentence using the second generation of Gunnar Fant's OVE cascade formant synthesizer, 1962
13 Linear-prediction analysis and resynthesis of speech at a low bit rate in the Texas Instruments Speak-n-Spell toy, Richard Wiggins, 1980
30 The M.I.T. MITalk system by Jonathan Allen, Sheri Hunnicutt, and Dennis Klatt, 1979
33 The Klattalk system by Dennis Klatt of M.I.T., which formed the basis for Digital Equipment Corporation's DECtalk commercial system, 1983
35 Several of the DECtalk voices
36 DECtalk speaking at about 300 words/minute

Speak-n-Spell toy

Flow diagram showing CHATR's corpus processing.
Corpus processing: a new speaker database (text and speech) and a pre-existing language and prosody knowledge base are the inputs. Steps: labeling the speech data, parameter estimation, learning database-specific prosodic knowledge, index creation. The result is the speaker database.
At synthesis time: input text, predicting prosody (predicted values: f0, pwr, dur, etc.), unit selection from the speaker database, waveform concatenation, synthesized speech.
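The unit selection step can be illustrated with a small dynamic-programming search that picks one database unit per target so that the sum of target costs (mismatch with the predicted f0 and duration) and join costs (mismatch between neighbouring units) is minimal. This is a generic sketch under stated assumptions, not CHATR's actual procedure: the `(f0_hz, dur_ms)` unit representation, both cost functions, and the name `select_units` are hypothetical.

```python
def target_cost(target, unit):
    """Assumed cost: distance between predicted and stored (f0_hz, dur_ms)."""
    return abs(target[0] - unit[0]) + abs(target[1] - unit[1])


def join_cost(prev_unit, unit):
    """Assumed cost: f0 jump between two consecutive units."""
    return abs(prev_unit[0] - unit[0])


def select_units(targets, candidates):
    """Return (best unit sequence, total cost) via a Viterbi-style search.

    targets[i] is the predicted (f0_hz, dur_ms) for position i;
    candidates[i] is the list of database units available for position i.
    """
    # cost[i][j]: cheapest total cost of any path ending in candidates[i][j]
    cost = [[target_cost(targets[0], u) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(targets)):
        row_cost, row_back = [], []
        for unit in candidates[i]:
            options = [cost[i - 1][k] + join_cost(prev, unit)
                       for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[k_best] + target_cost(targets[i], unit))
            row_back.append(k_best)
        cost.append(row_cost)
        back.append(row_back)

    # Trace back from the cheapest final unit.
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    total = cost[-1][j]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


if __name__ == "__main__":
    targets = [(120, 80), (130, 100)]
    candidates = [[(118, 85), (140, 80)], [(131, 100), (119, 95)]]
    print(select_units(targets, candidates))
```

The selected units would then be passed to waveform concatenation; in practice the costs are weighted combinations of many more features.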

HMM-based speech synthesis system.
Training part: the speech signal from the speech database goes through excitation parameter extraction and spectral parameter extraction; the resulting excitation and spectral parameters, together with the labels, are used for training of HMMs, which yields context-dependent HMMs.
Synthesis part: text, text analysis, label, parameter generation from HMM. The excitation parameter drives excitation generation, the spectral parameter controls the synthesis filter, and the output is synthesized speech.

Block diagram of a prosody generation system; different prosodic representations are obtained depending on the speaking style we use. Blocks: parsed text and phone string, pause insertion and prosodic phrasing, speech style, duration, F0 contour, volume, enriched prosodic representation.

Pitch generation decomposed in symbolic and phonetic prosody. Input: parsed text and phone string. Symbolic prosody: pauses, prosodic phrases, accent, tone, tune. Prosody attributes: pitch range, prominence, declination, speaking style. F0 contour generation then produces the F0 contour.
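To make the step from symbolic prosody to an F0 contour concrete, the sketch below adds accent-shaped bumps to a linearly declining baseline. It is only one simple way such a generator could work: the Gaussian accent shape, the default values, the rule that H* raises and L* lowers F0, and the name `f0_contour` are assumptions, not the method shown in the slides.

```python
import math

# Assumed direction of each accent type relative to the baseline.
TONE_SIGN = {"H*": 1.0, "L*": -1.0}


def f0_contour(duration_s, accents, base_hz=120.0, range_hz=60.0,
               declination_hz_per_s=10.0, frame_s=0.01, accent_width_s=0.1):
    """Return F0 values in Hz, one per frame.

    accents is a list of (time_s, tone, prominence) tuples, where tone is
    "H*" or "L*" and prominence scales the accent within the pitch range.
    """
    n_frames = int(round(duration_s / frame_s)) + 1
    contour = []
    for i in range(n_frames):
        t = i * frame_s
        # Declination: the baseline falls steadily over the utterance.
        f0 = base_hz - declination_hz_per_s * t
        for time_s, tone, prominence in accents:
            bump = math.exp(-0.5 * ((t - time_s) / accent_width_s) ** 2)
            f0 += TONE_SIGN[tone] * prominence * range_hz * bump
        contour.append(f0)
    return contour


if __name__ == "__main__":
    contour = f0_contour(1.0, [(0.2, "H*", 1.0), (0.8, "H*", 0.6)])
    print([round(v, 1) for v in contour[::10]])
```

Pitch range, prominence, and declination appear here as explicit parameters, matching the prosody attributes listed above; speaking style could be modelled by changing their defaults.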

The four Chinese tones (F0 versus time t for the 1st, 2nd, 3rd, and 4th tones).

ToBI pitch accent tones (columns: ToBI tone, description, graph; the graphs are not reproduced here)
H*: Peak accent. A tone target on an accented syllable which is in the upper part of the speaker's pitch range.
L*: Low accent. A tone target on an accented syllable which is in the lowest part of the speaker's pitch range.
L*+H: Scooped accent. A low tone target on an accented syllable which is immediately followed by a relatively sharp rise to a peak in the upper part of the speaker's pitch range.
L*+!H: Scooped downstep accent. A low tone target on an accented syllable which is immediately followed by a relatively flat rise to a downstep peak.
L+H*: Rising peak accent. A high peak target on an accented syllable which is immediately preceded by a relatively sharp rise from a valley in the lowest part of the speaker's pitch range.
!H*: Downstep high tone. A clear step down onto an accented syllable from a high pitch which itself cannot be accounted for by an H phrasal tone ending the preceding phrase or by a preceding H pitch accent in the same phrase.

Marianna made the marmalade, with an H* accent on Marianna and marmalade, and final L-L% marking the characteristic sentence-final pitch drop. Note the use of 1 for the weak inter-word breaks, and 4 for the sentence-final break (after Beckman)
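The annotation in this example can be written down as a small data structure with one record per word, holding the pitch accent (if any), the break index after the word, and the boundary tone (if any). The `Word` class and the two tier functions are hypothetical names used only for illustration; they are not part of any ToBI tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    accent: Optional[str]            # pitch accent such as "H*", or None
    break_index: int                 # strength of the break after the word
    boundary: Optional[str] = None   # phrase accent and boundary tone


# The example utterance as described in the caption above.
UTTERANCE = [
    Word("Marianna", "H*", 1),
    Word("made", None, 1),
    Word("the", None, 1),
    Word("marmalade", "H*", 4, "L-L%"),
]


def tone_tier(words: List[Word]) -> List[str]:
    """Collect accent and boundary labels in order of occurrence."""
    labels = []
    for w in words:
        if w.accent:
            labels.append(w.accent)
        if w.boundary:
            labels.append(w.boundary)
    return labels


def break_tier(words: List[Word]) -> List[int]:
    """Collect the break index that follows each word."""
    return [w.break_index for w in words]


if __name__ == "__main__":
    print(tone_tier(UTTERANCE))
    print(break_tier(UTTERANCE))
```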