Speech Synthesis: Then and Now

Julia Hirschberg
CS 4706
2/7/2011

Today
- Then: Early speech synthesizers
- Now: Overview of Modern TTS Systems
- Think about: how do we evaluate a synthesizer?

The First Speaking Machine
- Wolfgang von Kempelen, Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine, 1791 (still in the Deutsches Museum, and playable)
- First to produce whole words and phrases, in many languages

Joseph Faber's Euphonia, 1846

- Constructed 1835 w/pedal and keyboard control
- Whispered and ordinary speech
- Model of tongue, pharyngeal cavity with manipulable shape
- Singing too: God Save the Queen
- Riesz's 1937 synthesizer with almost natural vocal tract shape
- Forerunners of Modern Articulatory Synthesis: George Rosen's DAVO synthesizer (1958) at MIT

- World's Fair in NY, 1939
- Requires much training to play
- Purpose: coding/compression
  - Reduce bandwidth needed to transmit speech, so many phone calls can be sent over a single line

Answers:
- These days a chicken leg is a rare dish.
- It's easy to tell the depth of a well.
- Four hours of steady work faced us.
Automatic synthesis from spectrogram, but can also use hand-painted spectrograms as input
Purpose: understand perceptual effect of spectral details

Formant/Resonance/Acoustic Synthesis
- Parametric or resonance synthesis
  - Specify minimal parameters, e.g. f0 and first 3 formants
  - Pass electronic source signal thru filter
    - Harmonic tone for voiced sounds
    - Aperiodic noise for unvoiced
  - Filter simulates the different resonances of the vocal tract
- E.g.
  - Walter Lawrence's Parametric Artificial Talker (1953) for vowels and consonants
  - Gunnar Fant's Orator Verbis Electris (1953) for vowels
- Formant synthesis download (M$demo)
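The source-filter idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a reconstruction of any historical synthesizer: the function names, the two-pole resonator form, and the formant values for an /a/-like vowel are all assumptions chosen for the example.

```python
import math
import random


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter `signal` through one two-pole resonator (a single formant)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    c = -r * r
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def source(n_samples, sample_rate, f0_hz=None, seed=0):
    """Impulse train at f0_hz (voiced), or white noise if f0_hz is None (unvoiced)."""
    if f0_hz is None:
        rng = random.Random(seed)
        return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    period = max(1, round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def synthesize(formants, duration_s=0.3, sample_rate=16000, f0_hz=120.0):
    """Pass the source through a cascade of formant resonators, then normalize."""
    n_samples = round(duration_s * sample_rate)
    signal = source(n_samples, sample_rate, f0_hz)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]


# Assumed, approximate (frequency, bandwidth) pairs for an /a/-like vowel.
A_FORMANTS = [(700, 80), (1200, 90), (2600, 120)]
vowel = synthesize(A_FORMANTS)                # periodic source: voiced
whisper = synthesize(A_FORMANTS, f0_hz=None)  # noise source: unvoiced
```

Only f0 and three formants are specified, which matches the "minimal parameters" point: the same filter yields a voiced or a whispered vowel depending on the source alone. The samples could be written to a file with the standard `wave` module for listening.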

Synthesis by Computer
- Beginnings ~1960; dominant from 1970

Concatenative Synthesis
- Most common type today
- First practical application in 1936: British Phone company's Talking Clock
  - Optical storage for words, part-words, phrases
  - Concatenated to tell time
  - E.g. And a similar example from Radio Free Vestibule (1994)
- Bell Labs TTS (1977) (1985)

Variants of Concatenative Synthesis
- Inventory units
  - Diphone synthesis (e.g. Festival)
  - Microsegment synthesis
  - Unit Selection: large, variable units
- Issues
  - How well do units fit together?
  - What is the perceived acoustic quality of the concatenated units?
  - Is post-processing on the output possible, to improve quality?
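The question "how well do units fit together?" is often expressed as a join cost that a search minimizes over candidate units. The sketch below is a toy under stated assumptions: the `Unit` fields, the pitch-mismatch join cost, and the unit names are hypothetical, and a real unit-selection system would also include a target cost and spectral measures.

```python
from typing import NamedTuple


class Unit(NamedTuple):
    name: str        # hypothetical label of one recorded diphone instance
    start_f0: float  # pitch (Hz) at the unit's left edge
    end_f0: float    # pitch (Hz) at the unit's right edge


def join_cost(left, right):
    """Assumed cost: pitch mismatch at the concatenation point."""
    return abs(left.end_f0 - right.start_f0)


def select_units(candidates):
    """Pick one unit per position so that the summed join cost is minimal.

    `candidates[i]` lists the recorded instances available for position i.
    Dynamic programming keeps, for every candidate, the cheapest path so far.
    """
    # best[i][j] = (cost of cheapest path ending in candidates[i][j], backpointer)
    best = [[(0.0, None) for _ in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            cost, k = min(
                (best[i - 1][k][0] + join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost, k))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


candidates = [
    [Unit("k-ae#1", 100, 110), Unit("k-ae#2", 120, 130)],
    [Unit("ae-t#1", 128, 120), Unit("ae-t#2", 90, 85)],
]
path, total = select_units(candidates)
```

Here the second `k-ae` instance joins the first `ae-t` instance with only a 2 Hz pitch jump, so that pair is selected even though each unit on its own is no better than the alternatives.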

Overview: Synthesizer I/O
- Front end: From input to control parameters
  - Acoustic/phonetic representations, naturally occurring text, constrained mark-up language, semantic/conceptual representations
- Back end: From control parameters to waveform
  - Articulatory, formant/acoustic, concatenative (diphone, unit-selection/corpus, HMM) synthesis

TTS Production Levels

Knowledge                      Task
World Knowledge                Text Normalization
Syntax, semantics, lexicon     Pronunciation, intonation assignment
Phonetics/phonology            Duration, f0
Acoustics/signal processing    Waveform production

Text Normalization Issues
- Reading is what W. hates most.
- Reading is what Wilde hated most.
- The NAACP just elected a new president.
- In 1996 she sold 2010 shares and deposited $42 in her 401(k).
- The duck dove supply.
Homographs, numbers, abbreviations
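The number example above shows why normalization needs context: "1996" reads as a year, "2010" as a quantity, and "$42" as a currency amount. The following is a rough sketch under loud assumptions: the "year after in" heuristic, the function names, and the four-digit limit are all invented for illustration, and "401(k)" is deliberately left alone because it would need a lexicon entry.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Cardinal reading for 0 <= n < 10000."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return ONES[thousands] + " thousand" + (" " + number_to_words(rest) if rest else "")


def year_to_words(n):
    """Year reading for 1100-1999, e.g. 1996 -> 'nineteen ninety-six'."""
    high, low = divmod(n, 100)
    if low == 0:
        return number_to_words(high) + " hundred"
    if low < 10:
        return number_to_words(high) + " oh " + ONES[low]
    return number_to_words(high) + " " + number_to_words(low)


def normalize(text):
    # Currency: "$42" -> "forty-two dollars" (singular "dollar" is not handled).
    text = re.sub(r"\$(\d{1,4})\b",
                  lambda m: number_to_words(int(m.group(1))) + " dollars", text)
    # Assumed heuristic: a four-digit number 1100-1999 right after "in" is a year.
    text = re.sub(r"\b([Ii]n) (1[1-9]\d\d)\b",
                  lambda m: m.group(1) + " " + year_to_words(int(m.group(2))), text)
    # Remaining stand-alone numbers are read as cardinals; tokens such as
    # "401(k)" are skipped because a "(" or letter follows the digits.
    text = re.sub(r"(?<![\w$])\d{1,4}(?![\w(])",
                  lambda m: number_to_words(int(m.group(0))), text)
    return text


result = normalize("In 1996 she sold 2010 shares and deposited $42 in her 401(k).")
```

The heuristic is easy to break ("in 1500 cases"), which is the slide's point: hand rules cover the common readings, while homographs and abbreviations need more knowledge than surface patterns provide.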

Pronunciation Issues
- Rules for disambiguation in context: bass
- Lexicon: comb, tomb, Punxsutawney Phil
- Letter-to-Sound Rules
  - Hand built
  - Learned from data (pronunciation dictionary)
  - Hard to get good accuracy and coverage: many exceptions
- Dictionary of pronunciations
  - More accurate
  - New (Out-of-Vocabulary) words a problem
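One common way to combine these resources is to try context rules for homographs first, then the dictionary, and fall back to letter-to-sound rules for out-of-vocabulary words. The sketch below assumes ARPAbet-style phone labels, a made-up cue-word rule for "bass", and a deliberately naive one-letter-one-sound table; none of it comes from a real system.

```python
# Tiny illustrative lexicon (ARPAbet-style phones, stress omitted).
LEXICON = {
    "comb": ["K", "OW", "M"],
    "tomb": ["T", "UW", "M"],
}

# Homographs: (cue words that suggest this reading, pronunciation).
# The cue sets are invented for the example.
HOMOGRAPHS = {
    "bass": [
        ({"fish", "lake", "caught"}, ["B", "AE", "S"]),
        ({"guitar", "band", "music"}, ["B", "EY", "S"]),
    ],
}

# Naive one-letter-one-sound table, used only as a fallback.
LETTER_TO_SOUND = {
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"], "f": ["F"],
    "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"], "k": ["K"], "l": ["L"],
    "m": ["M"], "n": ["N"], "o": ["AA"], "p": ["P"], "q": ["K"], "r": ["R"],
    "s": ["S"], "t": ["T"], "u": ["AH"], "v": ["V"], "w": ["W"],
    "x": ["K", "S"], "y": ["Y"], "z": ["Z"],
}


def letter_to_sound(word):
    """Map each letter to a phone; characters not in the table are skipped."""
    phones = []
    for letter in word.lower():
        phones.extend(LETTER_TO_SOUND.get(letter, []))
    return phones


def pronounce(word, context=()):
    """Homograph rules first, then the lexicon, then letter-to-sound."""
    key = word.lower()
    if key in HOMOGRAPHS:
        context_words = {w.lower() for w in context}
        for cues, phones in HOMOGRAPHS[key]:
            if cues & context_words:
                return phones
        return HOMOGRAPHS[key][0][1]  # no cue found: default to first reading
    if key in LEXICON:
        return LEXICON[key]
    return letter_to_sound(key)


fish = pronounce("bass", context=["he", "caught", "a", "bass"])
music = pronounce("bass", context=["she", "plays", "bass", "guitar"])
```

Running `letter_to_sound("comb")` gives `K AA M B`, with the wrong vowel and a spurious final consonant, which illustrates why words like "comb" and "tomb" belong in the lexicon rather than being left to rules.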

Intonation Assignment Issues: Phrasing
- Traditional: hand-built rules
  - Use punctuation: 234-5682, New York, NY
  - Context/function word: no breaks after function word: He went to dinner. He came to and went to dinner.
  - Syntax: She favors the nuts and bolts approach. She went home and Dave stayed.
- Current: machine learning on large labeled corpus
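The two simplest hand-built rules above can be written down directly. This is only a sketch: the function-word list is a small assumed sample, and both function names are hypothetical.

```python
# Small assumed sample of function words; a real list would be much longer.
FUNCTION_WORDS = {"a", "an", "the", "to", "of", "in", "and", "he", "she"}
BREAK_PUNCTUATION = ".,;:?!"


def punctuation_phrases(text):
    """Rule 1: end a phrase after every token that ends in punctuation."""
    phrases, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in BREAK_PUNCTUATION:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


def break_allowed_after(word):
    """Rule 2: never place a phrase break right after a function word."""
    return word.strip(BREAK_PUNCTUATION).lower() not in FUNCTION_WORDS


address = punctuation_phrases("234-5682, New York, NY")
blocked = not break_allowed_after("to")
```

The second rule treats every "to" alike, so it also blocks the break in "He came to | and went to dinner", where "to" belongs to the verb and a break after it sounds natural. That appears to be the point of the slide's second example sentence, and it is one reason such rules gave way to models trained on labeled corpora.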

Intonation Assignment Issues: Accent
- Hand-built rules
  - Function/content distinction: He went out the back door / He threw out the trash
  - Complex nominals: Main Street / Park Avenue; city hall parking lot (stress shift)
- Statistical procedures trained on large corpora
  - Need lots of data
  - Why learn what you already know?
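A function/content rule for accent is equally short to state. In this sketch the closed-class word list is an assumed sample and `assign_accents` is a hypothetical name: content words are marked accented and function words unaccented.

```python
# Assumed sample of closed-class (function) words.
CLOSED_CLASS = {"a", "an", "the", "he", "she", "it", "to", "of", "in", "out", "and"}


def assign_accents(sentence):
    """Return (token, accented) pairs: content words accented, function words not."""
    labeled = []
    for token in sentence.split():
        word = token.strip(".,;:?!").lower()
        labeled.append((token, word not in CLOSED_CLASS))
    return labeled


door = assign_accents("He went out the back door")
trash = assign_accents("He threw out the trash")
```

Both sentences get an unaccented "out", although in "threw out the trash" it is a verb particle that speakers would plausibly accent. A word list alone cannot separate the two uses, which is the weakness the paired example points at.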

Intonation Assignment Issues: Contours
- Simple rules
  - "." = declarative contour
  - "?" = yes-no-question contour unless wh-word present at/near front of sentence
    - Well then, how did he do it? And what do you know?
- Pretty monotonous in long stretches of speech
- Problem: no one knows how to assign other contours from text
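The punctuation rule can be sketched as a small function. Two things here are assumptions rather than part of the slide: "near the front" is taken to mean the first three words, and wh-questions are given the declarative contour, which the rule implies but does not state.

```python
WH_WORDS = {"who", "whom", "whose", "what", "when", "where", "why", "which", "how"}


def assign_contour(sentence, window=3):
    """Choose a contour label from final punctuation and an early wh-word.

    `window` (assumed value) is how many leading words count as "near the front".
    """
    sentence = sentence.strip()
    if not sentence.endswith("?"):
        return "declarative"
    words = [w.strip(".,;:?!").lower() for w in sentence.split()]
    if any(w in WH_WORDS for w in words[:window]):
        # Assumption: wh-questions fall back to the declarative contour.
        return "declarative"
    return "yes-no-question"


examples = {
    "He did it.": assign_contour("He did it."),
    "Did he do it?": assign_contour("Did he do it?"),
    "Well then, how did he do it?": assign_contour("Well then, how did he do it?"),
    "And what do you know?": assign_contour("And what do you know?"),
}
```

With only two labels to choose from, every statement in a long passage receives the same contour, which is where the monotony noted above comes from.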

Phonological Specification Issues
- Task is to produce a phonological representation from phones and intonational assignment
  - Align phones and f0 contour
  - Specify durations and intensity
- Select/create appropriate acoustic realization from this specification:
  - Acoustic transformation
  - Concatenation: diphone, unit selection
  - HMM

Not Quite There
- Festival concatenative
- Acuvoice concatenative
- HMM synthesis (Rob Donovan)
- Rhetorical unit selection (acquired by Nuance)
- AT&T Labs Naturally Speaking

Next Class
- Project Phase I assigned: building a TTS System
- Introduction to Festival TTS