A study of speaker adaptation for DNN-based speech synthesis


A study of speaker adaptation for DNN-based speech synthesis
Zhizheng Wu, Pawel Swietojanski, Christophe Veaux, Steve Renals, Simon King
The Centre for Speech Technology Research (CSTR), University of Edinburgh, United Kingdom

Background
A speaker-dependent TTS system requires several hours of studio recordings, which are expensive to collect.
Adaptation for speech synthesis: create a new voice from minimal data, for example one minute of speech.

Related work
Speaker adaptation for statistical parametric speech synthesis: MLLR, CMLLR, MAP, MAPLR, CSMAPLR, etc.
Voice conversion for unit-selection concatenative speech synthesis.
Yamagishi, Junichi, Takao Kobayashi, Yuji Nakano, Katsumi Ogata, and Juri Isogai. "Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm." IEEE Transactions on Audio, Speech, and Language Processing 17, no. 1 (2009): 66-83.
Kain, Alexander, and Michael W. Macon. "Spectral voice conversion for text-to-speech synthesis." In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1998, vol. 1, pp. 285-288.

DNN-based speech synthesis
Maps linguistic features to vocoder parameters using a deep neural network; outperforms HMM-based speech synthesis in terms of naturalness.
Heiga Zen, Andrew Senior, and Mike Schuster. "Statistical parametric speech synthesis using deep neural networks." In Proc. ICASSP, 2013.
Yao Qian, Yuchen Fan, Wenping Hu, and Frank K. Soong. "On the training aspects of deep neural network (DNN) for parametric TTS synthesis." In Proc. ICASSP, 2014.
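To make the mapping concrete, here is a minimal sketch of the kind of feedforward regression the slides describe: a stack of tanh hidden layers with a linear output, mapping one linguistic feature vector to one frame of vocoder parameters. The function name and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dnn_synthesize(x, hidden_layers, W_out, b_out):
    """Map a linguistic feature vector x to one frame of vocoder parameters.

    hidden_layers: list of (W, b) pairs; tanh hidden units and a linear
    output layer, matching the regression setup described in the slides.
    """
    h = x
    for W, b in hidden_layers:
        h = np.tanh(W.T @ h + b)   # hidden layer: affine transform + tanh
    return W_out.T @ h + b_out     # linear output layer: vocoder parameters
```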

Proposed adaptation framework for DNN-based speech synthesis
Speaker adaptation is performed at three different levels of the network:
- Input level: an i-vector and a gender code augment the linguistic features x.
- Model level: LHUC (learning hidden unit contributions) re-scales the hidden layers h1-h4.
- Output level: a feature mapping transforms the predicted vocoder parameters y into adapted parameters y'.
[Figure: the DNN with linguistic features x at the input, hidden layers h1-h4, and vocoder parameters y at the output, annotated with the three adaptation points.]

Adaptation framework: i-vector
i-vector extraction: s = m + Ti, with i ~ N(0, I), where
- m is the mean supervector of a speaker-independent universal background model (UBM),
- s is the mean supervector of the speaker-dependent GMM (adapted from the UBM),
- T is the total variability matrix, estimated on the background data,
- i is the speaker identity vector, also called the i-vector.
Dehak, Najim, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. "Front-end factor analysis for speaker verification." IEEE Transactions on Audio, Speech, and Language Processing 19, no. 4 (2011): 788-798.
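Under this linear-Gaussian model the i-vector has a closed-form posterior, so a point estimate can be written down directly. The sketch below is a heavily simplified illustration: it assumes a unit-covariance UBM and unit occupancy so that an observed supervector can be used as-is; a real extractor (e.g. ALIZE, used later in the experiments) works from zeroth- and first-order Baum-Welch statistics instead.

```python
import numpy as np

def ivector_point_estimate(s, m, T):
    """Toy MAP point estimate of the i-vector under s = m + T i, i ~ N(0, I).

    Simplifying assumptions: unit-covariance UBM, unit total occupancy,
    and a directly observed supervector s. With those, the posterior of i
    is Gaussian with precision (I + T^T T) and mean given below.
    """
    D = T.shape[1]                    # i-vector dimension (32 in the paper)
    lhs = np.eye(D) + T.T @ T         # posterior precision of i
    rhs = T.T @ (s - m)               # projected, centred supervector
    return np.linalg.solve(lhs, rhs)  # posterior mean = point estimate
```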

Adaptation framework: LHUC
Learning hidden unit contributions (LHUC): the activation of the l-th hidden layer for speaker m is re-scaled element-wise,
h_m^l = ξ(r_m^l) ⊙ ψ(W_l^⊤ h_m^{l-1}),
where h_m^l is the hidden-layer activation, W_l is the weight matrix of the l-th hidden layer, and ξ(·) is an element-wise function that constrains the range of the speaker-dependent parameters r_m^l. Setting ξ(r_m^l) = 1 recovers the normal, unadapted activation.
Swietojanski, Pawel, and Steve Renals. "Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models." In Proc. IEEE Spoken Language Technology Workshop (SLT), 2014.
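A minimal sketch of one LHUC layer follows. The slide leaves ξ unspecified beyond "element-wise, range-constraining"; the 2·sigmoid parameterisation used here is the common choice from Swietojanski & Renals (2014) and is an assumption, as are the tanh ψ and the variable names.

```python
import numpy as np

def lhuc_layer(h_prev, W, r):
    """One hidden layer with LHUC: h = xi(r) * psi(W^T h_prev).

    Only the per-speaker vector r (one scalar per hidden unit) is learned
    during adaptation; the speaker-independent weights W stay frozen.
    xi(r) = 2*sigmoid(r) keeps amplitudes in (0, 2), and xi(0) = 1
    recovers the unadapted activation.
    """
    psi = np.tanh(W.T @ h_prev)      # speaker-independent activation
    xi = 2.0 / (1.0 + np.exp(-r))    # range-constrained unit contributions
    return xi * psi                  # element-wise re-scaling per unit
```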

Adaptation framework: feature space adaptation
Feature transformation: transform the output of the DNN with a linear transformation, y' = A y, where A is a linear transformation matrix estimated on the adaptation data.
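The slide does not say how A is estimated; one natural illustration is a least-squares fit on time-aligned pairs of average-voice and target-speaker frames, as sketched below. Note that in the experiments the feature mapping is actually realised with joint-density GMM voice conversion (see the setup slide), which applies a mixture of such linear transforms; the plain least-squares fit here is a simplifying assumption.

```python
import numpy as np

def fit_output_transform(Y, Y_target):
    """Least-squares estimate of A in y' = A y from paired frames.

    Y, Y_target: (n_frames, dim) matrices of average-voice and
    target-speaker vocoder parameters, aligned frame by frame.
    Solves Y @ A.T ~ Y_target in the least-squares sense.
    """
    A_T, *_ = np.linalg.lstsq(Y, Y_target, rcond=None)
    return A_T.T

# Usage: y_adapted = A @ y for each output frame y of the average-voice DNN.
```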

Adaptation framework: combination of individual techniques
As each adaptation method is applied at a different level of the network, they can easily be combined: i-vector and gender code at the input, LHUC in the hidden layers, and the feature mapping at the output (see the sketch below).
[Figure: the same network diagram as before, with all three adaptation points active.]
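A sketch of the combined forward pass with all three levels active. Variable names, the bias-free hidden layers, and the 2·sigmoid LHUC parameterisation are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def adapted_forward(x_linguistic, i_vec, gender_code, layers, r_lhuc, W_out, A):
    """Combined adaptation: input augmentation + LHUC + output transform.

    layers: list of hidden-layer weight matrices (biases omitted for
    brevity); r_lhuc: one LHUC vector per hidden layer; A: linear
    feature transform applied to the output.
    """
    # Level 1: augment the linguistic input with speaker descriptors.
    h = np.concatenate([x_linguistic, i_vec, gender_code])
    # Level 2: LHUC re-scaling in every hidden layer.
    for W, r in zip(layers, r_lhuc):
        h = (2.0 / (1.0 + np.exp(-r))) * np.tanh(W.T @ h)
    # Linear output layer produces average-voice vocoder parameters.
    y = W_out.T @ h
    # Level 3: feature mapping of the output towards the target speaker.
    return A @ y
```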

Experimental setup
Corpus:
- Voice bank database: 96 speakers (41 male, 55 female), used to build the speaker-independent average DNN model.
- Sampling rate: 48 kHz; each speaker has around 300 utterances.
- Two target speakers (one male, one female): 10 utterances for adaptation, 70 for development, 72 for testing.
Vocoder parameters (extracted by STRAIGHT):
- 60-D mel-cepstral coefficients with delta and delta-delta
- 25-D band aperiodicities (BAP) with delta and delta-delta
- 1-D fundamental frequency (F0, linearly interpolated) with delta and delta-delta
- 1-D voiced/unvoiced binary feature
- (60 + 25 + 1) × 3 + 1 = 259 dimensions in total

Experimental setup (cont'd)
Neural network architecture:
- 6 hidden layers, 1536 units per layer.
- Hyperbolic tangent (tanh) activation for hidden layers; linear activation for the output layer.
Data normalisation (a sketch of both schemes follows):
- Vocoder parameters: per-speaker normalisation to zero mean and unit variance.
- Linguistic features: normalised to [0.01, 0.99] over the whole database.
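A minimal sketch of the two normalisation schemes, assuming features are stacked as (frames × dims) matrices; the epsilon terms are a small numerical-safety assumption.

```python
import numpy as np

def normalise_linguistic(X, lo=0.01, hi=0.99):
    """Min-max normalise linguistic features to [lo, hi] over the corpus."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return lo + (hi - lo) * (X - x_min) / (x_max - x_min + 1e-12)

def normalise_vocoder(Y_speaker):
    """Per-speaker z-score normalisation of vocoder parameters."""
    mu = Y_speaker.mean(axis=0)
    sigma = Y_speaker.std(axis=0) + 1e-12
    return (Y_speaker - mu) / sigma
```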

Experimental setup (cont'd)
Baseline HMM system:
- Built with the open-source HTS toolkit, using the best settings for our dataset.
- CSMAPLR adaptation algorithm.
Adaptation:
- i-vector: background model trained on the voice bank database; i-vector dimension 32; extracted with the ALIZE toolkit.
- LHUC: applied to all hidden layers.
- Feature transformation: joint-density Gaussian mixture model (JD-GMM) based voice conversion.

Subjective results: DNN adaptation methods, naturalness
MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) test with 30 listeners.
[Figure: MUSHRA naturalness scores (0-100) for i-vector, LHUC, FT, i-vector+LHUC, i-vector+FT, LHUC+FT, and i-vector+LHUC+FT.]
Only i-vector+LHUC+FT vs LHUC+FT and LHUC vs i-vector+LHUC are not significantly different.

Subjective results: DNN adaptation methods, similarity
Similarity test with 30 listeners.
[Figure: similarity scores (0-100) for the same seven adaptation conditions.]
Only i-vector+LHUC+FT vs LHUC+FT, FT vs i-vector+LHUC, and LHUC vs i-vector+FT are not significantly different.

Subjective results: DNN vs HMM
Preference test with 30 native English listeners.
[Figure: preference scores (%) comparing the DNN and HMM systems on naturalness and on similarity.]

Conclusions
- Adaptation for DNN-based synthesis can be applied at three different levels.
- DNN adaptation performs significantly better than HMM adaptation.
Future work:
- Speaker-adaptive training for the average DNN model.
- Joint optimisation of adaptation at the three levels.
All samples used in the listening tests are available at: http://dx.doi.org/10.7488/ds/259