Speech Synthesis: Overview

Speech Synthesis: Overview (11752)

Overview

- Speech synthesis history: from knowledge-based to data-driven
  - Formant to diphone
  - Diphone to unit selection
  - Unit selection to statistical parametric
- Optimizing the problem
  - The right measures, the right algorithms
  - The right databases, the right things to synthesize
- Some hard problems
- Evaluation

Physical Models

- Blowing air through tubes: von Kempelen's synthesizer (1791)
- Synthesis by physical models: Homer Dudley's Voder (1939)

More Computation, More Data

- Formant synthesis (60s-80s): waveform construction from components
- Diphone synthesis (80s-90s): waveform by concatenation of a small number of instances of speech
- Unit selection (90s-00s): waveform by concatenation of a very large number of instances of speech
- Statistical parametric synthesis (00s-): waveform construction from parametric models

Waveform Generation

- Formant synthesis
- Random word/phrase concatenation
- Phone concatenation
- Diphone concatenation
- Subword unit selection
- Cluster-based unit selection
- Statistical parametric synthesis

Building a Research Field

- Tools: allow others to easily join the field
- Common data sets: be able to concentrate on techniques; have common comparisons
- Evaluation: realistically compare techniques
- Have users: someone has to care about your results
- Don't become stifled: ensure there are new tasks and directions

Festival Speech Synthesis System
http://festvox.org/festival

- General system for multilingual TTS
- C/C++ code with Scheme scripting language
- General replaceable modules: lexicons, LTS, duration, intonation, phrasing, POS tagging, tokenizing, diphone/unit selection
- General tools: intonation analysis (F0, Tilt), signal processing, CART building, ngrams, SCFG, WFST, OLS
- No fixed theories: new languages without new C++ code
- Multiplatform (Unix, Windows, OS X)
- Full sources in distribution; free software

CMU FestVox Project
http://festvox.org

"I want it to speak like me!" Festival is an engine; how do you make voices?

- Building Synthetic Voices: tools, scripts, documentation
- Discussion and examples for building voices
- Example voice databases
- Step-by-step walkthroughs of processes
- Support for English and other languages
- Support for different waveform techniques: diphone, unit selection, limited domain, HMM
- Other support: lexicon, prosody, text analyzers

The CMU Flite Project
http://cmuflite.org

"But I want it to run on my phone!"

- FLITE: a fast, small, portable runtime synthesizer
- C-based (no loaded files); basic FestVox voices compiled into C/data
- Thread safe
- Suitable for embedded devices: iPAQ, Linux, WinCE, PalmOS, Symbian
- Scalable quality/size/speed trade-offs, e.g. frequency-based lexicon pruning
- Sizes: 2.4 MB footprint (code + data + runtime RAM), < 0.025 s time-to-speak

Common Data Sets

Data-driven techniques need data.

- Diphone databases: CSTR and CMU US English diphone sets (kal and ked)
- CMU ARCTIC databases:
  - 1200 phonetically balanced utterances (about 1 hour)
  - 7 different speakers (2 male, 2 female, 3 accented)
  - EGG, phonetically labeled
  - Utterances chosen from out-of-copyright text, easy to say, freely distributable
- Tools to build your own in your own language

Blizzard Challenge

Realistic evaluation under the same conditions [Black and Tokuda]:

- Participants build a voice from a common dataset
- Synthesize test sentences
- Large set of listening experiments
- Since 2005, now in its 9th year
- 15-20 groups (academia, research labs, and commercial companies)

How to Test Synthesis

Blizzard tests:

- Do you like it? (MOS scores)
- Can you understand it? SUS sentences: "The unsure steaks overcame the zippy rudder"

Can't this be done automatically? Not yet (at least not reliably enough), but we now have lots of data for training techniques. Why does it still sound like a robot? We need better (appropriate) testing.
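As a concrete illustration of MOS scoring, the sketch below aggregates 1-5 listener ratings into a mean with a normal-approximation confidence interval; the ratings themselves are invented.

```python
from statistics import mean, stdev

def mos(ratings):
    """Aggregate 1-5 listener ratings into a Mean Opinion Score
    plus a rough 95% confidence half-width (normal approximation)."""
    m = mean(ratings)
    half_width = 1.96 * stdev(ratings) / len(ratings) ** 0.5
    return m, half_width

# Hypothetical ratings for one system's synthesized sentences.
score, ci = mos([4, 3, 5, 4, 4, 3, 4, 5, 3, 4])
print(f"MOS = {score:.2f} +/- {ci:.2f}")  # MOS = 3.90 +/- 0.46
```

In real Blizzard-style evaluations the interesting comparisons are pairwise between systems, with many listeners and per-listener normalization, but the per-system summary statistic is essentially this.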

Speech Synthesis Techniques

- Unit selection
- Statistical parametric synthesis
- Automated voice building
- Database design
- Language portability
- Voice conversion

Unit Selection

Target cost and join cost [Hunt and Black 96]:

- Target cost: distance from the desired unit to the actual unit in the database, based on phonetic, prosodic, and metrical context
- Join cost: how well the selected units join
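Selecting the unit sequence that minimizes summed target and join costs is a dynamic-programming (Viterbi) search. A minimal sketch, with units represented as plain numbers and made-up absolute-difference costs standing in for real phonetic/prosodic distances:

```python
def select_units(targets, candidates, target_cost, join_cost):
    """targets: list of target specs; candidates[i]: units for position i.
    Returns the minimum-cost unit sequence (Viterbi search)."""
    n = len(targets)
    # best[i][j] = (cumulative cost, backpointer) for candidate j at position i
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, n):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            cost, back = min(
                (best[i - 1][j][0] + join_cost(p, c) + tc, j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the cheapest path.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# Toy example: pick one candidate per position, trading closeness to the
# target against smoothness between consecutive picks.
targets = [1.0, 2.0, 3.0]
candidates = [[0.9, 1.5], [1.8, 2.6], [3.2, 2.1]]
path = select_units(targets, candidates,
                    target_cost=lambda t, c: abs(t - c),
                    join_cost=lambda a, b: abs(b - a))
print(path)  # [1.5, 1.8, 2.1]
```

Real systems use the same search shape, but candidates are database speech units and the costs are weighted feature distances.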

Clustering Units

Cluster units [Donovan et al 96, Black et al 97].

Unit Selection Issues

- Cost metrics: finding the best weights, best techniques, etc.
- Database design: best database coverage
- Automatic labeling accuracy: finding errors/confidence
- Limited domain: target the database to a particular application (talking clocks, targeted domain synthesis)

Unit Selection vs Parametric

- Unit selection (the standard method): select appropriate subword units from large databases of natural speech
- Parametric synthesis [NITECH: Tokuda et al]: HMM-generation-based synthesis
  - Cluster units to form models
  - Generate from the models
  - Take the average of units

Old vs New

Unit selection:
- Large, carefully labelled database
- Quality good when good examples are available, but sometimes bad
- No control of prosody

Parametric synthesis:
- Smaller, less carefully labelled database
- Quality consistent
- Resynthesis requires a vocoder (can sound buzzy)
- Can (must) control prosody
- Model size much smaller than a unit-selection database

Parametric Synthesis

- Probabilistic models
- Simplification
- Generative model: predict acoustic frames from text

SPSS: ASR vs SPSS

Similar techniques, but not the same:

- Model training techniques: alignment and cluster features; MLLR (adaptation from multi-speaker models)
- Model improvement techniques: minimum generation error, label optimization
- Parameterization techniques: MFCC, LSP, STRAIGHT, HSM
- Excitation modeling techniques

SPSS Goals

Requires an optimal parameterization that:
- Is derivable from speech
- Can generate high-quality speech
- Is predictable from text

Candidates:
- Spectral, F0, excitation
- Formants, nasality, aspiration
- Articulatory features

SPSS Systems

- HTS (NITECH): based on HTK; predicts HMM states; by default uses MCEP and the MLSA filter; supported in Festival
- Clustergen (CMU): no use of HTK; predicts frames; by default uses MCEP and the MLSA filter; more tightly coupled with Festival

Building Synthetic Voices

The standard voice requires:
- A phone set
- Pronunciations: lexicon / letter-to-sound rules
- A phonetically and prosodically balanced corpus, spoken by a good speaker
- Text analysis: number and symbol expansion, etc.
- Prosodic modeling: phrasing, intonation, duration, etc.
- Waveform generation: diphones, unit selection, parametric synthesis
- Something else that is hard: no vowels (Arabic), no word segmentation, number declensions
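The lexicon / letter-to-sound split can be sketched as a lookup with a rule-based fallback for out-of-vocabulary words. Everything here is illustrative: the entries, the one-letter-one-phone rules, and the phone labels are invented, and real LTS rules (e.g. Festival's CART-based ones) condition on letter context.

```python
# Toy pronunciation lookup: consult a hand-built lexicon first, fall
# back to naive letter-to-sound rules for unknown words.
LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "hello": ["hh", "ax", "l", "ow"],
}
LTS_RULES = {"c": "k", "a": "ae", "t": "t", "s": "s", "p": "p", "o": "aa"}

def pronounce(word):
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    # Fallback: map each letter independently (real LTS uses context).
    return [LTS_RULES.get(ch, ch) for ch in word]

print(pronounce("speech"))  # lexicon hit: ['s', 'p', 'iy', 'ch']
print(pronounce("cat"))     # LTS fallback: ['k', 'ae', 't']
```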

Designing a Good Corpus

From a large set of text, select nice utterances:
- 5 to 15 words, easy to say
- All words in the lexicon, no homographs
- Convert text to phoneme strings, possibly with lexical stress, onset/coda, tone, etc.
- Select utterances that maximize di/triphone coverage
- Looking for around 1000 utterances
- Can seed initial data with domain data

CMU ARCTIC databases: 7 single-speaker English databases, 1200 phonetically balanced utterances each.
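Maximizing di/triphone coverage is typically done greedily: repeatedly pick the utterance that adds the most unseen units. A toy sketch, with invented utterances and single-letter "phones":

```python
def diphones(phones):
    """Set of adjacent phone pairs in one utterance."""
    return {(a, b) for a, b in zip(phones, phones[1:])}

def greedy_select(utterances, n_wanted):
    """utterances: list of (text, phone-list) pairs.
    Greedily pick utterances that add the most uncovered diphones."""
    covered, selected = set(), []
    pool = list(utterances)
    while pool and len(selected) < n_wanted:
        text, phones = max(pool, key=lambda u: len(diphones(u[1]) - covered))
        if not (diphones(phones) - covered):
            break  # nothing new left to cover
        covered |= diphones(phones)
        selected.append(text)
        pool.remove((text, phones))
    return selected

utts = [("ab", ["a", "b"]), ("abc", ["a", "b", "c"]), ("cd", ["c", "d"])]
print(greedy_select(utts, 3))  # ['abc', 'cd'] -- "ab" adds nothing new
```

This greedy set-cover approximation is why a few hundred well-chosen sentences can cover most diphones of a language.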

Hard Synthesis Problems

- Text normalization
- Intonation modeling and intonation evaluation
- Style modeling: choosing the right style, evaluating the result

Text Normalization

Finding the words: tokenizing, homograph disambiguation, etc.

- "$1.25" vs "$1.25 million" vs "$1.25 song"
- Very large number of rare events
- Formalized systems exist: trained from data, optimized, and out of date
- Long-term: updated hacks and rule systems
- ML challenge: such a problem cannot be done by machine learning
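The "$1.25" vs "$1.25 million" ambiguity above can be made concrete with a toy rule: in isolation the amount is read as dollars and cents, but before a scale word it is read as a decimal. The number-to-words table is abridged to just the digits needed; a real normalizer needs full number coverage and many more patterns.

```python
import re

NUM = {"1": "one", "2": "two", "5": "five", "25": "twenty-five"}

def normalize_money(text):
    def repl(m):
        dollars, cents, scale = m.groups()
        if scale:  # "$1.25 million": read the number as a decimal
            frac = " ".join(NUM[d] for d in cents)
            return f"{NUM[dollars]} point {frac} {scale} dollars"
        unit = "dollar" if dollars == "1" else "dollars"
        return f"{NUM[dollars]} {unit} {NUM[cents]} cents"
    return re.sub(r"\$(\d+)\.(\d\d)(?: (million|billion))?", repl, text)

print(normalize_money("$1.25"))          # one dollar twenty-five cents
print(normalize_money("$1.25 million"))  # one point two five million dollars
```

The "very large number of rare events" point is exactly why such rules accumulate into long-lived hand-maintained systems.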

Intonation Modeling

Accents, phrases, and F0.

- Lots of statistical models available
- Lots of objective measures: RMSE, correlation
- No good subjective measures (listening tests)
- Natural intonation: good; naive intonation: bad; various cute models for intonation: meh
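The two objective measures named above, RMSE and correlation between predicted and reference F0 contours, are simple to compute; the contours below are made-up Hz values over voiced frames:

```python
from math import sqrt

def rmse(a, b):
    """Root-mean-square error between two equal-length contours."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def correlation(a, b):
    """Pearson correlation between two equal-length contours."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sqrt(sum((x - ma) ** 2 for x in a))
    vb = sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

# Hypothetical reference vs. predicted F0 (Hz) over five voiced frames.
ref  = [120.0, 125.0, 130.0, 128.0, 122.0]
pred = [118.0, 126.0, 133.0, 126.0, 121.0]
print(rmse(ref, pred), correlation(ref, pred))
```

The slide's complaint stands: a prediction can score well on both measures and still sound wrong, which is why listening tests remain necessary.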

Improving Understanding

- Take reading comprehension stories (for children's reading tests, or TOEFL)
- Synthesize with: natural intonation, naive models, various cute models
- Human listening tests: answer questions about the stories
- Best system: naive models

Style Modeling

- Classic emotion modeling: happy, sad, angry, and neutral; but no one needs that
- Style modeling: polite, command, empathic
- Style usage: when can it be used? How much should be used?

Dialog with Style

- Record human-human dialog
- Label dialog states: implicit confirmation, corrections, discourse markers
- Build a dialog-state-sensitive voice, using dialog state in features
- Must be closely integrated into the SDS: timing, dialog-state appropriate
- But how do you test it?

Voice Transformation

- Collect a small amount of data (50 utterances)
- Adapt an existing voice to the target voice

What makes a voice:
- Lexical choice
- Phonetic variation
- Prosody
- Spectral / vocal tract / articulatory movement
- Excitation mode

Use articulatory modeling for transformation (Toth).
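A drastically simplified stand-in for the spectral part of voice conversion: fit a single global linear map from paired, time-aligned source/target features and apply it to new source frames. The feature values are invented, and real systems (such as the GMM approach on the next slide) fit mixtures of such maps over full cepstral vectors rather than one scalar regression.

```python
def fit_linear_map(src, tgt):
    """Least-squares fit of y ~ a*x + b from paired frames."""
    n = len(src)
    mx, my = sum(src) / n, sum(tgt) / n
    a = sum((x - mx) * (y - my) for x, y in zip(src, tgt)) / \
        sum((x - mx) ** 2 for x in src)
    return a, my - a * mx

# Hypothetical values of one cepstral coefficient, time-aligned
# between the source speaker and the target speaker.
src = [1.0, 2.0, 3.0, 4.0]
tgt = [2.1, 4.0, 6.1, 8.0]
a, b = fit_linear_map(src, tgt)
converted = [a * x + b for x in src]  # "converted" source frames
```

Even this crude picture shows why 50 utterances can suffice: the mapping between two voices has far fewer parameters than a full voice model.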

Voice Transformation

Festvox GMM transformation suite (Toda), demonstrated with conversions among the ARCTIC speakers awb, bdl, jmk, and slt.

Applications

Speech output is only one component; it needs to integrate with larger applications:

- Spoken dialog systems
- Speech-to-speech translation systems
- Talking heads
- Conversational participants
- Information delivery

Conclusions

- Synthesis has improved, but there is still much to do
- Isolated sentences are clear, but conversational speech is still in the future
- Speech systems must adapt to their usage and their funding conditions
- But we can always fall back on our talents