This lecture. Some text-to-speech architectures. Some text-to-speech components. Text-to-speech: n. the conversion of electronic text into equivalent, audible speech waveforms.

Insight? "The computer can't tell you the emotional story. It can give you the exact mathematical design, but what's missing is the eyebrows." (Frank Zappa) [Image: Kismet]

Some history. In 1791, Wolfgang von Kempelen produced an acoustic-mechanical speech machine. This machine used bellows and models of the tongue and lips, enabling it to produce rudimentary vowels and some consonants. In the 1930s, the first electronic speech synthesizer, the VOCODER, was produced by Bell Labs.

Modern TTS architectures.
- Formant synthesis: an approach that synthesizes acoustics and formants based on rules and filters.
- Concatenative synthesis: the use of databases of stored speech to assemble new utterances.
- Articulatory synthesis: the modelling of the movements of the articulators and the acoustics of the vocal tract.

1. Formant synthesis. Historically popular (MITalk in 1979, DECtalk in 1983). Stores a small number of parameters, such as formant frequencies and bandwidths for vowels, lengths of sonorants in time, and the periodicity of the fundamental frequency.
Advantages: this method can be very intelligible, avoids the clipping artefacts between phonemes of other methods, and is computationally inexpensive.
Disadvantages: this method tends to produce unnatural, robotic-sounding speech.

2. Concatenative synthesis. Involves selecting short sections of recorded human speech and concatenating them together in time.
Advantages: this method produces very human-like, natural-sounding speech. It is used in almost all modern commercial systems.
Disadvantages: to be robust, this method requires a large (computationally expensive) database. Concatenating phones without appropriate blending can result in abrupt changes (clipping glitches).

3. Articulatory synthesis. Often involves the uniform tube model or some other biologically-inspired model of air propagation through the vocal tract.
Advantages: this method is computationally inexpensive, allows us to study speech production scientifically, and can account for particular articulatory constraints.
Disadvantages: the resulting speech is not entirely natural, and it can be difficult to modify these systems to imitate new synthetic speakers, or even complex articulations.

3. Articulatory synthesis: http://www.youtube.com/watch?v=bht96voreeo

3. Articulatory synthesis. Note: this is singing, not speech (in case it's not obvious).

3. Articulatory synthesis: https://dood.al/pinktrombone/

Components of TTS systems. Some components are common to all TTS systems, namely:
1. Text analysis: text normalization, homograph ("same spelling") disambiguation, grapheme-to-phoneme (letter-to-sound) conversion, and intonation (prosody).
2. Waveform generation: unit and diphone selection.
And now we define these terms.

Text analysis. How do we analyze the text the system is given to read?

Text analysis. First we need to normalize the text. This involves splitting the text into sentences and word tokens, and sometimes chunking tokens into reasonable sections.

Rules for sentence detection. You've seen heuristics for this in Assignment 1. You can also use ID3 or C4.5 to induce decision trees automatically. (A minimal heuristic splitter is sketched below.)
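To make those heuristics concrete, here is a minimal rule-based splitter (a sketch; the abbreviation list and names are illustrative, not the assignment's actual rules): it treats sentence-final punctuation followed by whitespace and a capital letter as a boundary, unless the preceding token is a known abbreviation.

    import re

    ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e"}

    def split_sentences(text):
        sentences, start = [], 0
        for match in re.finditer(r'[.!?]\s+(?=[A-Z])', text):
            words = text[start:match.start()].split()
            if words and words[-1].lower().rstrip('.') in ABBREVIATIONS:
                continue  # punctuation after an abbreviation: not a boundary
            sentences.append(text[start:match.end()].rstrip())
            start = match.end()
        if text[start:].strip():
            sentences.append(text[start:].strip())
        return sentences

    print(split_sentences("Dr. Smith arrived. He spoke."))
    # ['Dr. Smith arrived.', 'He spoke.']

A decision-tree learner like C4.5 effectively induces rules of this shape (is the next letter capitalized? is the previous token an abbreviation?) from labelled boundaries instead of by hand.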

Identifying the types of tokens. Pronunciation of a single word token can depend on its type or its usage. E.g., 1867 is "eighteen sixty seven" if it's a year, "one eight six seven" if it's in a phone number, and "one thousand eight hundred and sixty seven" if it's a quantifier. E.g., 25 is "twenty five" if it's an age, but "twenty fifth" if it's a day of the month. (A toy expander is sketched below.)
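As a toy sketch of type-dependent expansion (the function names and dispatch rules are illustrative, not from a real normalizer):

    DIGITS = "zero one two three four five six seven eight nine".split()
    TEENS = "ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split()
    TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

    def say_two_digits(dd):
        n = int(dd)
        if n < 10: return DIGITS[n]
        if n < 20: return TEENS[n - 10]
        return TENS[n // 10] + ("" if n % 10 == 0 else " " + DIGITS[n % 10])

    def expand_number(token, token_type):
        # Toy dispatch for a four-digit token like "1867"; ignores zero digits etc.
        if token_type == "phone":  # digit by digit
            return " ".join(DIGITS[int(d)] for d in token)
        if token_type == "year":   # pairs of digits
            return say_two_digits(token[:2]) + " " + say_two_digits(token[2:])
        return (DIGITS[int(token[0])] + " thousand " + DIGITS[int(token[1])] +
                " hundred and " + say_two_digits(token[2:]))  # quantifier

    print(expand_number("1867", "year"))   # eighteen sixty seven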

Homograph disambiguation. Homograph: n. a set of words that share the same spelling but have different meanings or pronunciations. E.g., "Close the door! The monsters are getting close!", "I object to that horrible object!", "I refuse to take that refuse!", "I'm content with the content." It's important to pronounce these homographs correctly, or the meaning will be lost.

Homograph disambiguation. Homographs can often be distinguished by their part-of-speech, e.g., live as a verb (/l ih v/) or an adjective (/l ay v/).

  Verb              Noun
  use /y uw z/      use /y uw s/
  house /h aw z/    house /h aw s/
  reCORD            REcord
  disCOUNT          DIScount
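One way to realize this is a pronunciation lexicon keyed on (word, POS) pairs, falling back to an ordinary dictionary otherwise; a sketch (the mini-lexicon and names are hypothetical, transcriptions as on the slide):

    HOMOGRAPHS = {
        ("use", "VB"): "y uw z",    ("use", "NN"): "y uw s",
        ("house", "VB"): "h aw z",  ("house", "NN"): "h aw s",
        ("live", "VB"): "l ih v",   ("live", "JJ"): "l ay v",
    }

    def pronounce(word, pos, fallback_dict):
        # A real system would get `pos` from a part-of-speech tagger.
        return HOMOGRAPHS.get((word, pos)) or fallback_dict.get(word)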

From words to phonemes. There are at least two methods to convert words to sequences of phonemes: dictionary lookup, and letter-to-sound (LTS) rules (if the word is not in the dictionary). Modern systems tend to use a combination of approaches, relying on large dictionaries and samples for common words, but using rules to guess/assemble unknown words, as sketched below.
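The combined flow is just lookup with a fallback; a minimal sketch (`lts_rules` is a hypothetical stand-in for a trained letter-to-sound model):

    def to_phonemes(word, lexicon, lts_rules):
        word = word.lower()
        if word in lexicon:        # common case: large pronunciation dictionary
            return lexicon[word]
        return lts_rules(word)     # OOV case: guess from the spelling

    # e.g., to_phonemes("cat", {"cat": ["k", "ae", "t"]}, lambda w: ["?"])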

Pronunciation dictionaries: CMU. The CMU dictionary has 127K words. Unfortunately:
- it only contains American pronunciations,
- it does not contain syllable boundaries (for timing),
- it does not contain parts-of-speech (it contains no knowledge of homographs),
- it does not distinguish case; e.g., US is transcribed as both /ah s/ and as /y uw eh s/.
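You can inspect the CMU dictionary yourself through NLTK (assuming nltk is installed; the corpus is downloaded on first use):

    import nltk
    nltk.download("cmudict", quiet=True)
    from nltk.corpus import cmudict

    prondict = cmudict.dict()   # maps lowercase words to lists of phoneme lists
    print(prondict["us"])       # note: nothing distinguishes the pronoun
                                # from the acronym, per the slide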

Other pronunciation dictionaries. The UNISYN dictionary has about 110K words, and includes syllabification, stress, and morphology. Other dictionaries, like CELEX, are sometimes used but are often too small, or too specific to one dialect.

Dictionaries are insufficient. Unknown (out-of-vocabulary, OOV) words increase with the square root of the number of words in a new, previously unseen text. Of 39,923 tokens in a test set of the Penn Treebank, 1775 tokens were OOV (4.4%; 943 unique types). Of these, 1360 were names and about 64 were typos. Commercial systems often use dictionaries, but back off to special name and acronym routines when necessary.

Names. About 20% of tokens in a typical newswire are names. Some are common and can be predicted (e.g., Drumpf, Putin). Others may become common only after a system is deployed. Given an unknown name, we can apply morphological rules (e.g., if we know Walter, we can infer Walters), or we can train statistical LTS systems on names.

Letter-to-sound rules. Early algorithms used handwritten rules, e.g.:
  (<WORDSTART> [ch] <CONSONANT>) = say /k/
  (<WORDSTART> [ch] <VOWEL>) = say /ch/
These correctly pronounce Christmas and choice, but mispronounce chord. English is notoriously full of exceptions, and these handwritten rules don't generalize to other languages. A modern approach is to learn LTS rules by automatic induction.
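Rendered directly as code, those two rules look like this (a sketch; the vowel set and function name are mine):

    import re

    VOWELS = "aeiouy"

    def ch_rule(word):
        # <WORDSTART> [ch] <CONSONANT> -> /k/ ; <WORDSTART> [ch] <VOWEL> -> /ch/
        m = re.match(r'ch(.)', word.lower())
        if not m:
            return None
        return "ch" if m.group(1) in VOWELS else "k"

    print(ch_rule("christmas"), ch_rule("choice"), ch_rule("chord"))
    # k ch ch   <- the last one is the mispronunciation noted above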

Induction of letter-to-sound rules. First, we must align letters and phonemes. If you have access to these alignments, you can learn the rules with maximum likelihood estimation, e.g., for checked:

  Letter (l):   c  h  e  c  k  e  d
  Phoneme (ph): ch _  eh _  k  _  t

  P(ph | l) = Count(ph, l) / Count(l)

If you don't have these alignments, they can be learned using expectation-maximization, as we saw with, e.g., statistical machine translation.
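Given alignments, the MLE is just relative-frequency counting; a sketch using the hand-made alignment above (the alignment itself is illustrative):

    from collections import Counter

    aligned = [("c", "ch"), ("h", "_"), ("e", "eh"), ("c", "_"),
               ("k", "k"), ("e", "_"), ("d", "t")]   # "checked"

    pair_counts = Counter(aligned)
    letter_counts = Counter(l for l, _ in aligned)

    def p_phoneme_given_letter(ph, l):
        return pair_counts[(l, ph)] / letter_counts[l]

    print(p_phoneme_given_letter("ch", "c"))   # 0.5: "c" aligns once to "ch", once to "_"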

Induction of letter-to-sound rules. Alignments can be improved by using hand-written rules that restrict the translation of letters to phonemes (e.g., C goes to /k, ch, s, sh/, or W goes to /w, v, f/). Some words have to be dealt with specifically, since their spelling is so different from their pronunciation. E.g., abbreviations: dept → /d ih p aa r t m ah n t/; wtf → /w aw dh ae t s f ah n iy/.

Prosody. Once you have a phoneme sequence, you may need to adjust other acoustic characteristics based on the semantic context. Prosodic phrasing: you need to mark phrase boundaries, and you need to emphasize certain syllables by modifying F0, loudness, or the duration of some phonemes.

Three aspects of prosody in TTS.
Prominence: some syllables or words are more prominent than others, especially content words.
Structure: sentences have inherent prosodic structure. Some words group naturally together; others require a noticeable disjunction.
Tune: to sound natural, one has to account for the intonational melody of an utterance.
(These are reasons to modify prosody, not the way prosody is modified.)

Deciding on word emphasis. Word emphasis depends on context; the new information in the answer to a question is often emphasized.
Q1: What types of foods are a good source of vitamins? A1: LEGUMES are a good source of vitamins.
Q2: Are legumes a bad source of vitamins? A2: Legumes are a GOOD source of vitamins.
Q3: What sorts of things do legumes give you, healthwise? A3: Legumes are a good source of VITAMINS.

Emphasis in noun phrases.
Proper names: the emphasis is often on the right-most word, e.g., New York CITY; Paris, FRANCE.
Noun-noun compounds: emphasis is often on the left noun, e.g., TABLE lamp; DISK drive.
Adjective-noun compounds: stress on the noun, e.g., large HOUSE; new CAR.
Counterexamples exist, but with some predictability: MEDICAL building; cherry PIE.

Waveform generation. How do we transform the analyzed text into sound?

Waveform synthesis. Given a string of phonemes and a desired prosody, we need to generate a waveform. The three architectures do this in unique ways. Formant synthesis produces waveforms by synthesizing the desired spectrograms directly. Concatenative synthesis combines pre-recorded samples of human speech. Articulatory synthesis produces waveforms with biologically-inspired models of the vocal tract.

Waveforms from formant synthesis. The Klatt synthesizer produces either a periodic pulse train (for sonorants like vowels) or noise (for fricatives) and passes these signals through filters, one for each formant. These filters were parameterized by desired frequencies and bandwidths. Don't worry about the details here.
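A bare-bones source-filter sketch in the Klatt spirit (the formant values and names are illustrative; a real Klatt synthesizer has dozens of parameters): an impulse train at F0 is passed through one second-order resonator per formant.

    import numpy as np
    from scipy.signal import lfilter

    fs, f0, dur = 16000, 120, 0.5
    n = int(fs * dur)
    source = np.zeros(n)
    source[::fs // f0] = 1.0    # periodic pulse source (use noise for fricatives)

    def resonator(x, freq, bw):
        # Two-pole resonator set by centre frequency and bandwidth.
        r = np.exp(-np.pi * bw / fs)
        theta = 2 * np.pi * freq / fs
        return lfilter([1 - r], [1, -2 * r * np.cos(theta), r * r], x)

    speech = source
    for freq, bw in [(700, 130), (1220, 70), (2600, 160)]:  # roughly /aa/-like F1-F3
        speech = resonator(speech, freq, bw)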

Aside: linear predictive coding. Formant synthesis is often performed by linear predictive coding (LPC), which is beyond the scope of this course. LPC is a very simple linear function which acts like a moving-average filter over a signal s, e.g.,

  ŝ[n] = Σ_{k=1..p} a_k · s[n − k]

i.e., each sample is predicted as a weighted sum of the previous p samples. LPC results in very smooth spectra, which can result in high intelligibility but low naturalness (real human spectra tend to be less smooth).
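For the curious, the autocorrelation method fits the a_k by least squares; a minimal sketch (assuming scipy; the helper names are mine):

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc(x, p=12):
        # Solve the Toeplitz normal equations for a_1..a_p.
        r = np.correlate(x, x, mode="full")[len(x) - 1:][:p + 1]
        return solve_toeplitz(r[:p], r[1:p + 1])

    def predict(x, a):
        # s^[n] as the weighted sum of the previous p samples.
        p = len(a)
        return np.array([np.dot(a, x[m - p:m][::-1]) for m in range(p, len(x))])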

Waveforms from concatenation. Diphone: n. the span from the middle of one phoneme to the middle of the next. Diphones are useful units because the middle of a phoneme is often in a steady state, and recording diphones allows us to capture the relevant acoustic transitions between phonemes. One speaker will record at least one version of each diphone, and in some cases whole (popular) words.

Waveforms from concatenation. Given a phoneme inventory of 50 phonemes, there are up to 50 × 50 = 2500 ordered pairs; since many never occur, we might expect a (reduced) diphone dictionary of 1000 to 2000 diphones (multiplicatively more if we need to record diphones with/without stress, etc.). When synthesizing an utterance, we extract the relevant sequences of diphones, concatenate them together, and often perform some acoustic post-processing on the boundaries, or on the overall prosody of the utterance.

Aside: TD-PSOLA. Time-domain pitch-synchronous overlap-and-add (TD-PSOLA) is a very efficient method for combining waveforms while preserving pitch. (A sketch follows.)
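A stripped-down sketch of the idea (assuming a constant pitch period; real TD-PSOLA tracks pitch marks epoch by epoch): cut two-period Hann-windowed grains at the analysis pitch marks, then overlap-add them at re-spaced synthesis marks.

    import numpy as np

    def psola_pitch_shift(x, period, factor):
        # factor > 1 raises pitch; assumes len(x) spans several periods.
        grain_len = 2 * period
        window = np.hanning(grain_len)
        out_period = max(1, int(round(period / factor)))
        n_grains = (len(x) - grain_len) // period
        out = np.zeros((n_grains - 1) * out_period + grain_len)
        for i in range(n_grains):
            grain = x[i * period: i * period + grain_len] * window
            out[i * out_period: i * out_period + grain_len] += grain
        return out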

Duration modification. Duration modification can be as simple as duplication or removal of short-term periodic sequences, as in the sketch below. Phase vocoding is better.
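The naive version in code (a sketch; no smoothing at the copied boundaries):

    import numpy as np

    def stretch_voiced(x, period, repeats=2):
        # Duplicate each pitch period `repeats` times to lengthen a voiced sound.
        cycles = [x[i:i + period] for i in range(0, len(x) - period + 1, period)]
        return np.concatenate([c for c in cycles for _ in range(repeats)])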

Pitch modification. Pitch modification can be as simple as squishing or stretching signals using decimation or interpolation, as sketched below.
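The naive version in code (a sketch): resampling shifts pitch but also changes duration by the same factor (the "chipmunk effect"), which is exactly the coupling that PSOLA-style methods avoid.

    from scipy.signal import resample

    def naive_pitch_shift(x, factor):
        # Played back at the original rate, pitch rises by `factor`
        # and duration shrinks by the same factor.
        return resample(x, int(round(len(x) / factor)))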

TTS from HMMs. Use a trained HMM and sample from it. [Figure: a tri-state phoneme model (e.g., /oi/) with emission distributions b0, b1, b2.] See Festival (http://www-2.cs.cmu.edu/~awb/festival_demos/index.html). Y.-J. Wu and K. Tokuda (2008) Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Proc. Interspeech, pp. 577-580.

TTS from NNs. RNNs can predict smoothly-changing acoustic features, but it can be difficult to learn high-dimensional acoustic features (e.g., MFCCs or raw spectra) directly. Solution? Learn better features using an autoencoder: train a neural network to recreate its own input audio signal x through a low-dimensional hidden layer h, and later use the resulting latent features h as the targets of a mapping from words.
Y. Fan, Y. Qian, F.-L. Xie, and F. Soong (2014) TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pp. 1964-1968.
H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak (2016) Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Proc. Interspeech.
S. Takaki and J. Yamagishi (2016) A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In Proc. ICASSP, pp. 5535-5539.
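A minimal sketch of the autoencoder idea in PyTorch (layer sizes are illustrative; this is not the exact architecture of the cited papers):

    import torch.nn as nn

    class SpectralAutoencoder(nn.Module):
        # Compress high-dimensional spectral frames into a small code h,
        # then reconstruct; h can later serve as the acoustic target of
        # a text-to-feature RNN.
        def __init__(self, n_bins=513, n_latent=64):
            super().__init__()
            self.encode = nn.Sequential(nn.Linear(n_bins, 256), nn.Tanh(),
                                        nn.Linear(256, n_latent))
            self.decode = nn.Sequential(nn.Linear(n_latent, 256), nn.Tanh(),
                                        nn.Linear(256, n_bins))

        def forward(self, x):
            h = self.encode(x)
            return self.decode(h), h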

TTS from NNs. If x is raw audio, then even with a modest window (e.g., 100 ms) the input can be a 1000+-dimensional dense vector, which can be too long for an RNN (or autoencoder). Solution? Exponentially increase the receptive field across layers (see the sketch below). A. Senior (2017) Generative Model-Based Text-to-Speech Synthesis.
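The arithmetic behind that solution: with kernel size k = 2 and dilations doubling per layer (1, 2, 4, ...), the receptive field is 1 + sum over layers of (k − 1) · 2^l, so it doubles with depth and ten layers already cover over a thousand samples.

    def receptive_field(n_layers, kernel=2):
        # WaveNet-style stack: dilation 2**l at layer l.
        return 1 + sum((kernel - 1) * 2 ** l for l in range(n_layers))

    print(receptive_field(10))   # 1024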

Evaluation of TTS. Intelligibility tests: e.g., the diagnostic rhyme test has human listeners identify which of two word choices, differing by a single phonetic feature (e.g., voicing, nasality), was synthesized; e.g., dense vs. tense, maze vs. mace. Mean opinion score: have listeners rate synthetic speech on a Likert-like scale (i.e., a goodness-badness scale). http://www.synsig.org/index.php/blizzard_challenge_2013_rules