Hidden Markov Model-based speech synthesis


Hidden Markov Model-based speech synthesis. Junichi Yamagishi, Korin Richmond, Simon King and many others, Centre for Speech Technology Research, University of Edinburgh, UK. www.cstr.ed.ac.uk

Note: I did not invent HMM-based speech synthesis! Core idea: Tokuda (Nagoya Institute of Technology, Japan). Developments: many other people. Speaker adaptation: Junichi Yamagishi (Edinburgh) and colleagues.

Background

Speech synthesis mini-tutorial: text-to-speech takes text as input and outputs a waveform that can be listened to. Two main components: the front end analyses the text and converts it to a linguistic specification; waveform generation converts the linguistic specification to speech.
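
As a minimal sketch of this two-stage architecture (the function names, toy lexicon and linguistic-specification structure are invented for illustration, not taken from any particular system):

```python
# Minimal sketch of the two-stage text-to-speech architecture.
# front_end and generate_waveform are hypothetical placeholders; a real
# front end does text normalisation, POS tagging, pronunciation and
# prosody prediction, and waveform generation is the topic of the rest
# of this talk (HMMs plus a vocoder).
def front_end(text):
    """Text -> linguistic specification (here: just a toy phone list)."""
    lexicon = {"the": ["dh", "ax"], "cat": ["k", "ae", "t"], "sat": ["s", "ae", "t"]}
    phones = ["sil"]
    for word in text.lower().split():
        phones += lexicon.get(word, ["spn"])   # spn = unknown / spoken noise
    return phones + ["sil"]

def generate_waveform(linguistic_spec):
    """Linguistic specification -> waveform (deliberately left as a stub)."""
    raise NotImplementedError("waveform generation is covered in the following slides")

print(front_end("the cat sat"))
# ['sil', 'dh', 'ax', 'k', 'ae', 't', 's', 'ae', 't', 'sil']
```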

From words to linguistic specification "the cat sat"

From words to linguistic specification "the cat sat" DET NN VB

From words to linguistic specification "the cat sat" DET NN VB ((the cat) sat)

From words to linguistic specification sil dh ax k ae t s ae t sil "the cat sat" DET NN VB ((the cat) sat)

From words to linguistic specification phrase initial pitch accent phrase final sil dh ax k ae t s ae t sil "the cat sat" DET NN VB ((the cat) sat)

From words to linguistic specification phrase initial pitch accent phrase final sil dh ax k ae t s ae t sil "the cat sat" DET NN VB ((the cat) sat) sil^dh-ax+k=ae, "phrase initial", "unstressed syllable",...

Full context models used in synthesis: aa^b-l+ax=s@1_3/a:1_1_3/b:0-0-3@2-1&3-3#2-2$2-3!1-... The label encodes both phonetic and prosodic context.

Example linguistic specification ("Author of the ..."):
pau^pau-pau+ao=th@x_x/a:0_0_0/b:x-x-x@x-x&x-x#x-x$...
pau^pau-ao+th=er@1_2/a:0_0_0/b:1-1-2@1-2&1-7#1-4$...
pau^ao-th+er=ah@2_1/a:0_0_0/b:1-1-2@1-2&1-7#1-4$...
ao^th-er+ah=v@1_1/a:1_1_2/b:0-0-1@2-1&2-6#1-4$...
th^er-ah+v=dh@1_2/a:0_0_1/b:1-0-2@1-1&3-5#1-3$...
er^ah-v+dh=ax@2_1/a:0_0_1/b:1-0-2@1-1&3-5#1-3$...
ah^v-dh+ax=d@1_2/a:1_0_2/b:0-0-2@1-1&4-4#2-3$...
v^dh-ax+d=ey@2_1/a:1_0_2/b:0-0-2@1-1&4-4#2-3$...
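
These labels are just strings with a fixed field structure. A minimal parsing sketch, assuming the HTS-style "LL^L-C+R=RR@..." convention visible in the examples above (only the quinphone part is extracted here; the remaining fields follow the same idea):

```python
import re

# Extract the quinphone context from an HTS-style full-context label.
# The exact label format varies between systems; this assumes the
# "LL^L-C+R=RR@..." convention used in the examples above.
QUINPHONE = re.compile(r"^(.+?)\^(.+?)-(.+?)\+(.+?)=(.+?)@")

def quinphone(label):
    """Return (left-left, left, centre, right, right-right) phones."""
    m = QUINPHONE.match(label)
    if m is None:
        raise ValueError("does not look like a full-context label: " + label)
    return m.groups()

print(quinphone("ao^th-er+ah=v@1_1/a:1_1_2/b:0-0-1@2-1&2-6#1-4$..."))
# ('ao', 'th', 'er', 'ah', 'v')
```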

From linguistic specification to speech. Two possible methods: concatenate small pieces of pre-recorded speech, or generate speech from a model.

HMM mini-tutorial: HMMs are models of sequences (speech signals, gene sequences, etc.).

HMMs: an HMM consists of a sequence model (a weighted finite-state network of states and transitions) and an observation model (a multivariate Gaussian distribution in each state). We can generate from the model, and we can also use it for pattern recognition (e.g., automatic speech recognition).

HMMs are generative models
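
To make the generative view concrete, here is a small sketch of sampling an observation sequence from a left-to-right HMM with one Gaussian per state; all transition probabilities, means and variances are invented toy values, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy left-to-right HMM with 3 emitting states and 1-D Gaussian outputs.
# All numbers here are illustrative, not trained values.
trans_self = np.array([0.6, 0.7, 0.5])   # P(stay in state i)
means      = np.array([1.0, 3.0, 2.0])   # Gaussian mean per state
stddevs    = np.array([0.3, 0.4, 0.3])   # Gaussian std dev per state

def generate(max_frames=100):
    """Walk left-to-right through the states, emitting one sample per frame."""
    state, frames = 0, []
    while state < len(means) and len(frames) < max_frames:
        frames.append(rng.normal(means[state], stddevs[state]))
        if rng.random() > trans_self[state]:   # leave the current state
            state += 1
    return np.array(frames)

print(generate())
```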

HMM-based speech synthesis mini-tutorial: HMMs are used to generate sequences of speech (in a parameterised form), and from the parameterised form we can generate a waveform. The parameterised form contains sufficient information to generate speech: the spectral envelope, the fundamental frequency (F0, sometimes called pitch), and aperiodic (noise-like) components (e.g. for sounds like "sh" and "f").

Trajectory HMMs: when using an HMM to generate speech parameters, because of the Markov assumption the most likely output is the sequence of the means of the Gaussians in the states visited; this is piecewise constant, and ignores important dynamic properties of speech. The Trajectory HMM algorithm (Tokuda and colleagues) solves this problem by correctly using statistics of the dynamic properties during the generation process.

Generation: generate the most likely observation sequence from the HMM, taking into account the statistics of not only the static coefficients but also the delta and delta-delta coefficients: the Maximum Likelihood Parameter Generation (MLPG) algorithm.
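
A minimal numerical sketch of the idea (toy values, not the HTS implementation): stack the per-frame means and variances of the static, delta and delta-delta coefficients, build the window matrix W that maps a static trajectory c to its [static; delta; delta-delta] expansion, and solve the weighted least-squares problem that maximises the likelihood, c = (W^T Sigma^-1 W)^-1 W^T Sigma^-1 mu.

```python
import numpy as np

def mlpg(means, variances):
    """Maximum-likelihood parameter generation for one 1-D stream.

    means, variances: arrays of shape (T, 3) holding per-frame statistics
    for [static, delta, delta-delta], e.g. read off the HMM state sequence.
    Returns the smooth static trajectory c of length T.
    Delta windows assumed: d[t] = 0.5*(c[t+1]-c[t-1]), dd[t] = c[t-1]-2c[t]+c[t+1].
    """
    T = means.shape[0]
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                      # static
        if 0 < t < T - 1:
            W[3 * t + 1, t - 1] = -0.5         # delta
            W[3 * t + 1, t + 1] = 0.5
            W[3 * t + 2, t - 1] = 1.0          # delta-delta
            W[3 * t + 2, t]     = -2.0
            W[3 * t + 2, t + 1] = 1.0
    mu = means.reshape(-1)
    prec = 1.0 / variances.reshape(-1)         # diagonal precision
    A = W.T @ (prec[:, None] * W)              # W' Sigma^-1 W
    b = W.T @ (prec * mu)                      # W' Sigma^-1 mu
    return np.linalg.solve(A, b)

# Toy example: piecewise-constant static means (1.0 then 3.0); zero-mean
# delta/delta-delta statistics pull the output towards a smooth transition.
means = np.zeros((10, 3)); means[:5, 0], means[5:, 0] = 1.0, 3.0
variances = np.ones((10, 3))
print(np.round(mlpg(means, variances), 2))
```

In practice the same computation is carried out for each dimension of the spectral and F0 streams, using the means and variances of the HMM states selected for each frame.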

Trajectory HMMs [series of figures: generated speech parameter trajectories plotted against time]

Constructing the HMM: the linguistic specification (from the front end) is a sequence of phonemes annotated with contextual information. There is one 5-state HMM for each phoneme, in every required context. To synthesise a given sentence: use the front end to predict the linguistic specification, concatenate the corresponding HMMs, and generate from the resulting HMM. Sparsity problem!
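
A minimal sketch of "concatenate the corresponding HMMs": look up per-state statistics for each full-context label, repeat them for the chosen state durations, and hand the resulting frame-level means and variances (plus deltas) to the parameter generation step sketched earlier. The model_lookup table, the two-state models and all numbers below are hypothetical.

```python
import numpy as np

# Minimal sketch of "concatenate the corresponding HMMs and generate":
# model_lookup is a hypothetical table mapping each full-context label to
# per-state statistics (mean, variance, duration in frames) for one stream.
# Real systems use 5 states per model and decision-tree lookup (see the
# clustering slide below); all numbers here are invented.
def frame_statistics(labels, model_lookup):
    means, variances = [], []
    for label in labels:
        for state_mean, state_var, n_frames in model_lookup[label]:
            means.extend([state_mean] * n_frames)       # one value per frame
            variances.extend([state_var] * n_frames)
    return np.array(means), np.array(variances)

# Toy lookup with 2 states per label.
model_lookup = {
    "sil^dh-ax+k=ae...": [(1.0, 0.2, 3), (1.5, 0.2, 2)],
    "dh^ax-k+ae=t...":   [(2.5, 0.3, 4), (2.0, 0.3, 3)],
}
m, v = frame_statistics(list(model_lookup), model_lookup)
print(m)   # frame-level means, ready for MLPG together with delta statistics
```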

Example linguistic specification (recall): the same full-context labels shown earlier, e.g. pau^pau-ao+th=er@1_2/a:0_0_0/b:1-1-2@1-2&1-7#1-4$...

HMM-based speech synthesis: differences from automatic speech recognition include the following. Synthesis uses a much richer model set, with a lot more context: speech recognition uses triphone models, whereas speech synthesis uses full-context models (full context = both phonetic and prosodic factors). The observation vector for the HMMs contains the parameters needed to generate speech, such as the spectral envelope + F0 + multi-band noise amplitudes.

Sparsity: in practically all speech or language applications, sparsity is a problem. The distribution of classes is usually long-tailed (Zipf-like), and we create even more sparsity by using context-dependent models; thus, most models have no training data at all. The common solution is to merge classes or contexts, i.e. use the same model for several classes or contexts; for HMMs, we call this parameter tying.

Decision-tree-based clustering of context-dependent HMMs [figure: binary tree of yes/no context questions; each node's description length is computed from its state occupancy probability, the feature dimension, and the covariance matrix for that node]
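
A minimal sketch of what a trained tree does at synthesis time: each node asks a yes/no question about the full-context label, and every leaf holds one tied set of Gaussian parameters shared by all contexts that reach it. The questions, regexes and leaf values here are invented for illustration; real trees are grown from data, for example using the description-length criterion above.

```python
import re

# Minimal sketch of decision-tree parameter tying: a full-context label is
# routed through yes/no questions to a leaf, and all labels reaching the
# same leaf share (tie) one set of Gaussian parameters.
class Node:
    def __init__(self, question=None, yes=None, no=None, leaf=None):
        self.question, self.yes, self.no, self.leaf = question, yes, no, leaf

    def lookup(self, label):
        if self.leaf is not None:              # reached a tied state
            return self.leaf
        branch = self.yes if re.search(self.question, label) else self.no
        return branch.lookup(label)

tree = Node(
    question=r"-(aa|ae|ah|ao|ax)\+",           # "is the centre phone aa/ae/ah/ao/ax?"
    yes=Node(leaf=("tied_state_1", 2.8, 0.3)), # (name, mean, variance)
    no=Node(
        question=r"\+pau=",                    # "is the right-hand phone a pause?"
        yes=Node(leaf=("tied_state_2", 1.1, 0.5)),
        no=Node(leaf=("tied_state_3", 2.0, 0.4)),
    ),
)
print(tree.lookup("th^er-ah+v=dh@1_2/..."))    # -> ('tied_state_1', 2.8, 0.3)
```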

Model parameter estimation from labelled data: actually, we only have word labels for the training data. We convert these to the full linguistic specification using the front end of our text-to-speech system (text processing, pronunciation, prosody); these labels will not exactly match the speech signal (we do a few tricks to try to make the match closer, but it's never perfect). We still only know the model sequence, with no information about the state alignment, so we use EM (we could call this semi-supervised learning).
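
The E-step of that EM procedure computes, for each frame, the probability of being in each state given only the model sequence. A small forward-backward sketch for a toy left-to-right HMM with 1-D Gaussian outputs (illustrative values only; real training then uses these occupancies to re-estimate the means, variances and transition probabilities):

```python
import numpy as np

def state_occupancies(obs, trans, means, stddevs):
    """E-step sketch: forward-backward occupancies gamma[t, i] =
    P(state i at frame t | all observations) for a small HMM with 1-D
    Gaussian outputs, assumed to start in state 0.
    """
    T, N = len(obs), len(means)
    # Per-frame Gaussian likelihoods b[t, i] = N(obs[t]; means[i], stddevs[i]^2)
    b = np.exp(-0.5 * ((obs[:, None] - means[None, :]) / stddevs) ** 2) \
        / (stddevs * np.sqrt(2.0 * np.pi))
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0, 0] = b[0, 0]                      # start in state 0
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                      # scaled forward pass
        alpha[t] = b[t] * (alpha[t - 1] @ trans)
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):             # scaled backward pass
        beta[t] = trans @ (b[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

trans = np.array([[0.6, 0.4, 0.0],             # toy left-to-right transitions
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
obs = np.array([1.1, 0.9, 1.2, 2.9, 3.1, 2.0, 2.1])
print(np.round(state_occupancies(obs, trans,
                                 np.array([1.0, 3.0, 2.0]),
                                 np.array([0.3, 0.3, 0.3])), 2))
```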

Model adaptation: training the models needs 1000+ sentences of data from one speaker. What if we have insufficient data for the target speaker? Adaptation: train the model on lots of data from other speakers, then adapt the trained model's parameters using a small amount of target-speaker data, estimating linear transforms to maximise the likelihood (MLLR), also in combination with MAP.
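
A sketch of the MLLR idea for means: an adapted mean is an affine transform of the average-voice mean, adapted = A*mean + b, usually written as W times the extended mean [1, mean]. Below, W is fitted by plain least squares to a few (average-voice mean, target-speaker mean) pairs; this is a crude stand-in for real MLLR, which maximises the likelihood of the adaptation data, weights each Gaussian by its state occupancy and model covariance, and typically shares one transform across many Gaussians (regression classes).

```python
import numpy as np

# MLLR-style mean adaptation sketch: adapted_mean = A @ mean + b,
# written as W @ [1, mean] with the extended matrix W = [b A].
def estimate_transform(avg_means, target_means):
    xi = np.hstack([np.ones((len(avg_means), 1)), avg_means])   # extended means
    W, *_ = np.linalg.lstsq(xi, target_means, rcond=None)       # crude LS stand-in
    return W.T                                                   # shape (D, D+1)

def adapt(mean, W):
    return W @ np.concatenate([[1.0], mean])

avg_means = np.array([[1.0, 2.0], [3.0, 1.0], [2.0, 2.5], [0.5, 0.0]])
target_means = avg_means * 1.1 + np.array([0.2, -0.1])           # toy "speaker shift"
W = estimate_transform(avg_means, target_means)
print(np.round(adapt(np.array([2.5, 1.5]), W), 2))               # adapted mean
```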

Training, adaptation, synthesis [figure, shown as a series of builds: speech and labels from several speakers (awb, clb, rms, ...) are used to train an average voice model; speech from the target speaker (bdl), with labels that can be obtained by recognition, is used to adapt the model, producing a set of transforms; at synthesis time, test-sentence labels are passed through the average voice model plus transforms to synthesise speech]

Evaluation: objective measures that compare synthetic speech with a natural example (e.g., spectral distortion) have their uses, but don't necessarily correlate with human perception. The main problem is that there is more than one correct answer in speech synthesis, and a single natural example does not capture this. So, we mainly rely on playing examples to listeners: opinion scores for quality and naturalness, typically on 5-point scales, and objective measures of intelligibility (type-in tests).
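
Intelligibility from type-in tests is scored as word error rate: the listener's transcript is aligned to the reference words, and WER = (substitutions + deletions + insertions) / number of reference words. A minimal sketch of that edit-distance computation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed by Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

# The sentence that was synthesised vs. what the listener typed:
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # 1/6
```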

Intelligibility (WER), English [figure: word error rate (%) per system for voices A and B, all listeners; roughly 200-250 responses per system]. Key: A = natural speech, B = Festival benchmark, C = HTS 2005 benchmark, V = HTS 2008 (aka "HTS 2007"). For voice A there is no significant difference between A, V and T; for voice B there is no significant difference between A, C, V and T. HTS is as intelligible as human speech.

Recent extensions

Articulatory-controllable HMM-based speech synthesis: articulator positions can be manipulated explicitly, giving the ability to synthesise new phonemes not seen in the training data; requires a parallel articulatory+acoustic corpus, which we have at CSTR.

Articulatory-controllable HMM-based speech synthesis [figure: synthesis examples with tongue height varied from -1.5 cm to +1.5 cm around the default setting]

Dirichlet process HMMs: a fixed number of states may not be optimal. Cross-validation, information criteria (AIC, BIC or MDL) or variational Bayes can be used to determine the number of states; or use a Dirichlet process (HDP-HMM or infinite HMM). [figure: duration distributions, 0-500 ms, for Japanese vowels (a e i o u), English vowels (aa ae ah ... uw) and Mandarin finals (a ai an ... vn)]

Summary: HMM-based speech synthesis has many opportunities for using machine learning: learning the model from data; parameters (alternatives to maximum likelihood, such as minimum generation error); model complexity (context clustering, number of mixture components, number of states, ...); semi-supervised and unsupervised learning (labels for the data are unreliable or missing); adapting the model, given limited new data; and generation algorithms.