Modified Post-filter to Recover Modulation Spectrum for HMM-based Speech Synthesis

Shinnosuke Takamichi, Tomoki Toda, Alan W. Black, and Satoshi Nakamura
Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), Japan
Language Technologies Institute, Carnegie Mellon University (CMU), USA
Email: shinnosuke-t@is.naist.jp

Abstract—This paper proposes a modified postfilter to recover the Modulation Spectrum (MS) in HMM-based speech synthesis. To alleviate the over-smoothing effect, which is one of the major problems in HMM-based speech synthesis, the MS-based postfilter has been proposed. It recovers the utterance-level MS of the generated speech trajectory, and we have reported its benefit to quality improvement. However, this postfilter is not applicable to various lengths of speech parameter trajectories, such as phrases or segments, which are shorter than an utterance. To address this problem, we propose two modified postfilters: (1) the time-invariant filter with a simplified conversion form, and (2) the segment-level postfilter, which is applicable to a short-term parameter sequence. Furthermore, we also propose (3) a postfilter to recover the phoneme-level MS of HMM-state duration. Experimental results show that the modified postfilters yield significant quality improvements in synthetic speech, as the conventional postfilter does.

Index Terms—HMM-based speech synthesis, modulation spectrum, postfilter, over-smoothing

I. INTRODUCTION

Parametric speech synthesis based on Hidden Markov Models (HMMs) [1] is an effective framework for generating diverse synthetic speech. In HMM-based speech synthesis, speech parameters (i.e., spectral and excitation features) and HMM-state durations are simultaneously modeled with context-dependent HMMs in a unified framework. This approach allows us not only to produce smooth speech parameter trajectories under a small footprint [2] but also to apply several techniques for flexibly controlling them [3], [4], [5] in various speech-based systems [6], [7].

One of the critical problems of HMM-based speech synthesis is that the trajectories generated from the trained HMMs are often over-smoothed. This phenomenon degrades perceptual quality, and the synthetic speech sounds muffled [8]. One approach to addressing this problem is to combine a unit selection framework [9], [10]; the other is to enhance specific features that are not well reproduced by the traditional HMMs due to the over-smoothing effect [11], [12]. The latter approach can produce high-quality speech while preserving the small footprint.

As one of the methods based on the latter approach, we have proposed the Modulation Spectrum (MS)-based postfilter [13]. The MS is known to be a perceptual cue [14], [15], and the proposed postfilter improves quality by recovering the utterance-level MS of the generated speech parameters. However, the postfiltering process needs to calculate the MS over a fixed length of the speech parameter trajectory, and therefore it is not applicable to various lengths of speech parameter trajectories, such as phrases or segments. This constraint causes some limitations; e.g., it prevents a recursive speech parameter generation algorithm [16] from being used for low-delay speech waveform generation.

In this paper, we propose two modified postfilters that can be used more widely by relaxing this constraint: (1) the time-invariant filter and (2) the segment-level postfilter. The time-invariant filter makes the filtering process independent of the length of the generated trajectories. The segment-level filter achieves a segment-by-segment filtering process to recover the MS of a shorter length of speech parameter trajectory than the conventional utterance-level filter. Furthermore, to further improve the naturalness of synthetic speech, (3) we propose a postfilter for HMM-state duration that recovers the MS of a phoneme-level duration sequence in a manner similar to the conventional postfilter. We evaluate the performance of the individual proposed methods separately to investigate their effect on the naturalness of synthetic speech.
II. PARAMETER GENERATION

In synthesis, HMMs corresponding to the input text are constructed from context-dependent HMMs built using natural speech parameters in training. After determining the HMM-state sequence q = [q_1, ..., q_T] that maximizes the duration likelihood, the parameter trajectory is generated to maximize the HMM likelihood under a constraint on the relationship between static and dynamic features, as follows:

    \hat{c} = \arg\max_{c} P(Wc \mid q, \lambda),    (1)

where c = [c_1^\top, ..., c_T^\top]^\top is a speech parameter vector sequence of T frames, c_t = [c_t(1), ..., c_t(d), ..., c_t(D)]^\top is a D-dimensional parameter vector at frame t, d is a dimension index, W is the weighting matrix for calculating the dynamic features [17], q_t is the HMM-state index at frame t, and \lambda is the HMM parameter set. To alleviate the over-smoothness of the generated parameters, the Global Variance (GV) [11] can also be considered in parameter generation.
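Equation (1) has the well-known closed-form solution \hat{c} = (W^\top \Sigma^{-1} W)^{-1} W^\top \Sigma^{-1} \mu when the state outputs are Gaussian, where \mu and \Sigma are the mean vector and covariance matrix concatenated along the determined state sequence. As a minimal illustration (not the authors' implementation), the sketch below solves this for a single feature dimension with diagonal covariances; the regression-window coefficients and helper names are our assumptions.

```python
import numpy as np

def delta_windows(T):
    """W (3T x T): stacked static, delta ([-0.5, 0, 0.5]) and
    delta-delta ([1, -2, 1]) regression rows for each frame."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                          # static
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        W[3 * t + 1, lo] += -0.5                   # delta
        W[3 * t + 1, hi] += 0.5
        W[3 * t + 2, lo] += 1.0                    # delta-delta
        W[3 * t + 2, t] += -2.0
        W[3 * t + 2, hi] += 1.0
    return W

def mlpg(mu, var):
    """ML trajectory c = (W' S^-1 W)^-1 W' S^-1 mu for one feature
    dimension; mu, var are the (3T,) static/delta/delta-delta means
    and diagonal variances stacked along the chosen state sequence."""
    T = mu.shape[0] // 3
    W = delta_windows(T)
    WtS = W.T / var                # W' S^-1 with diagonal covariance
    return np.linalg.solve(WtS @ W, WtS @ mu)
```

The GV-considered generation mentioned above replaces this closed form with an iterative update of the trajectory [11].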

III. CONVENTIONAL MS-BASED POSTFILTER [13]

A. MS-Based Post-Filtering Process

The MS s(c) is defined as a log-scaled power spectrum of the temporal sequence c, which is calculated as

    s(c) = [s_1^\top, ..., s_d^\top, ..., s_D^\top]^\top,    (2)
    s_d = [s_d(1), ..., s_d(f), ..., s_d(F_s)]^\top,    (3)

where s_d(f) is the f-th MS component of the d-th dimension of the parameter sequence [c_1(d), ..., c_T(d)], f is a modulation frequency index, and F_s is half the DFT length. In synthesis, the speech parameter sequence generated from the HMM is transformed to the modulation frequency domain. Then, its MS is converted as follows:

    \hat{s}_d(f) = (1 - k)\, s_d(f) + k \left[ \frac{\sigma_N}{\sigma_G} \left( s_d(f) - \mu_G \right) + \mu_N \right],    (4)

where \mu and \sigma are the mean and standard deviation of s_d(f), and N and G indicate the MS of the natural and the generated speech parameter sequences, respectively. The MS statistics are estimated in advance from natural and generated speech parameter sequences over the training data. The coefficient k is a parameter to control the degree of emphasis, and it is determined manually. Finally, the filtered speech parameter sequence is generated from the converted MS and its original phase.

B. Problems

In [13], the MS is calculated utterance by utterance, so the DFT length for the MS calculation needs to be set large enough to cover various lengths of utterances. This MS calculation causes some problems: if an utterance to be synthesized is longer than the previously determined DFT length, its MS cannot be calculated accurately; and the utterance-level filtering process is hard to apply to a low-latency speech synthesis framework [18], where frame-level or segment-level processing based on recursive parameter generation [16] is essential. Moreover, it has been reported that post-processing to enhance speech parameters, such as GV-based parameter generation, is effective not only for spectral and F0 parameters but also for HMM-state duration [19]. Although we have applied the MS-based postfilter only to the spectrum and F0, it is worthwhile to also apply it to the HMM-state duration and investigate its effectiveness.
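For concreteness, here is a minimal NumPy sketch of the utterance-level process of Eqs. (2)-(4) for one feature dimension. The per-bin statistics (\mu_N, \sigma_N, \mu_G, \sigma_G) are assumed to be precomputed arrays as described above; the 20·log10 amplitude scaling and all function names are our choices, not the paper's.

```python
import numpy as np

def modulation_spectrum(seq, dft_len=4096):
    """Eqs. (2)-(3): log-scaled power spectrum of one dimension's
    trajectory (length T <= dft_len), plus its phase."""
    spec = np.fft.rfft(seq, n=dft_len)
    return 20.0 * np.log10(np.abs(spec) + 1e-10), np.angle(spec)

def ms_postfilter(seq, mu_n, sigma_n, mu_g, sigma_g, k=0.85, dft_len=4096):
    """Eq. (4): pull the generated MS toward the natural statistics
    (per-bin arrays of length dft_len//2 + 1), keep the original
    phase, and return the filtered trajectory."""
    ms, phase = modulation_spectrum(seq, dft_len)
    ms_hat = (1.0 - k) * ms + k * (sigma_n / sigma_g * (ms - mu_g) + mu_n)
    amp = 10.0 ** (ms_hat / 20.0)             # back to linear amplitude
    return np.fft.irfft(amp * np.exp(1j * phase), n=dft_len)[:len(seq)]
```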
IV. PROPOSED MODIFICATION METHODS FOR MS-BASED POSTFILTER

To address the problems of the conventional MS-based postfilter, we propose two modification methods. Moreover, we also propose an MS-based postfilter for the HMM-state duration.

A. Method 1: Time-Invariant Post-Filter

A time-invariant postfilter is derived by assuming that \sigma_N is equal to \sigma_G in Eq. (4), as follows:

    \hat{s}_d(f) = (1 - k)\, s_d(f) + k \left( s_d(f) - \mu_G + \mu_N \right) = s_d(f) + k (\mu_N - \mu_G).    (5)

Because the second term on the right-hand side is independent of s_d(f), this conversion can be represented as filtering the generated speech parameter sequence with a time-invariant FIR filter.

B. Method 2: Segment-Level Post-Filter

A segment-level postfilter is derived by localizing the postfiltering process, as illustrated on the left-hand side of Figure 1. A part of the speech parameter sequence windowed by a triangular window of constant length is used as a segment to calculate the MS and its statistics. The window shift is set to half of the window length. The MS-based postfiltering is performed segment by segment in the same manner as the conventional filtering, and the filtered speech parameter sequence is generated by overlapping and adding the filtered segments. A Hanning window may also be used instead of the triangular window. Note that for the spectrum parameter, silence frames are removed when calculating the MS statistics to alleviate the over-fitting problem [20]. For F0, the continuous F0 pattern [21] is used [13]. The segment-level postfiltering is applicable to low-delay speech waveform generation. Moreover, it is possible to further implement context-dependent postfiltering.

C. Method 3: MS-Based Post-Filter for Duration

Although the state duration is not an actual parameter trajectory, it is affected by the over-smoothing effect due to a statistical averaging process, as are the spectrum and F0 parameters [22]. As illustrated in Figure 2, we can interestingly observe MS degradation over the modulation frequency of phoneme-level duration sequences. Therefore, quality improvements in synthetic speech are expected from recovering their MS. An overview of the proposed method is illustrated on the right side of Figure 1. First, the phoneme-level duration is calculated from the determined state-level duration. Then, a phoneme-level duration sequence over an utterance is constructed by excluding the silence parts, and its mean value is normalized as for the F0 parameters [13]. The resulting sequence is used to calculate the MS and is filtered in the same manner as the conventional postfiltering. After restoring the utterance-level mean, the phoneme-level duration is revised if it is smaller than the number of states of the phoneme HMM. Finally, the HMM-state duration is updated by maximizing the state duration likelihood while fixing the phoneme durations to the filtered values. Minimal sketches of all three methods follow.
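Below are hedged sketches of Methods 1 and 2, reusing the assumptions of the previous block. For Method 1, Eq. (5) adds the constant k(\mu_N - \mu_G) to the log-power MS, i.e., a fixed per-bin amplitude gain of 10^{k(\mu_N - \mu_G)/20}, so a linear-phase FIR approximation can be designed by frequency sampling (the 65-tap choice is arbitrary). For Method 2, Eq. (4) is applied to half-overlapping triangular-windowed segments, which are then overlap-added; the 25-sample window, 12-sample shift, and 64-point FFT anticipate the experimental conditions in Sec. V.

```python
import numpy as np
from scipy.signal import firwin2, get_window

def time_invariant_taps(mu_n, mu_g, k=0.85, numtaps=65):
    """Method 1: Eq. (5) adds k*(mu_N - mu_G) to the log-power MS,
    a fixed per-bin gain; design a matching linear-phase FIR filter
    by frequency sampling (modulation band mapped to 0..1)."""
    gain = 10.0 ** (k * (mu_n - mu_g) / 20.0)
    freqs = np.linspace(0.0, 1.0, len(gain))
    return firwin2(numtaps, freqs, gain)

def segment_level_postfilter(seq, stats, k=1.0, win_len=25, dft_len=64):
    """Method 2: apply Eq. (4) to half-overlapping triangular-windowed
    segments and overlap-add the filtered segments; stats holds the
    segment-level (mu_n, sigma_n, mu_g, sigma_g) per-bin arrays."""
    mu_n, s_n, mu_g, s_g = stats
    hop, win = win_len // 2, get_window('triang', win_len)
    out = np.zeros(len(seq) + win_len)
    for start in range(0, len(seq), hop):
        seg = seq[start:start + win_len]
        spec = np.fft.rfft(seg * win[:len(seg)], n=dft_len)
        ms = 20.0 * np.log10(np.abs(spec) + 1e-10)
        ms_hat = (1.0 - k) * ms + k * (s_n / s_g * (ms - mu_g) + mu_n)
        filt = np.fft.irfft(10.0 ** (ms_hat / 20.0)
                            * np.exp(1j * np.angle(spec)), n=dft_len)
        out[start:start + len(seg)] += filt[:len(seg)]
    return out[:len(seq)]
```

The taps from `time_invariant_taps` would be applied with, e.g., `np.convolve(seq, taps, mode='same')`; with a 12-sample hop under a 25-sample triangular window, the overlapped windows sum to roughly unity, matching the half-window shift described above.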

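A sketch of Method 3 under the same caveats: `ms_postfilter` is the routine from the Sec. III sketch, "mean normalization" is read here as subtracting the utterance-level mean, and the final redistribution of each phoneme duration over the HMM states (maximizing the state-duration likelihood) is omitted.

```python
import numpy as np

def duration_postfilter(phone_durs, n_states, stats, k=1.0, dft_len=64):
    """Method 3: recover the MS of a silence-excluded phoneme-level
    duration sequence. n_states lower-bounds each duration by the
    phoneme HMM's state count; stats = (mu_n, sigma_n, mu_g, sigma_g)
    for mean-normalized duration sequences."""
    durs = np.asarray(phone_durs, dtype=float)
    mean = durs.mean()
    mu_n, s_n, mu_g, s_g = stats
    filtered = ms_postfilter(durs - mean, mu_n, s_n, mu_g, s_g,
                             k=k, dft_len=dft_len)
    revised = np.rint(filtered + mean).astype(int)  # restore mean, round
    return np.maximum(revised, n_states)  # at least one frame per state
```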
[Fig. 1: An overview of the proposed methods (left: the segment-level postfilter; right: the postfilter for duration).]

V. EXPERIMENTAL EVALUATIONS

A. Experimental Conditions

We trained context-dependent phoneme Hidden Semi-Markov Models (HSMMs) [23] for a Japanese female speaker. The Nyquist frequency was set to 7 kHz. We used 450 sentences for training and 53 sentences for evaluation from the phonetically balanced 503 sentences included in the ATR Japanese speech database [24]. Speech signals were sampled at 16 kHz, and the frame shift was set to 5 ms. The 0th-through-24th mel-cepstral coefficients were extracted as the spectral parameter, and log-scaled F0 and 5 band-aperiodicity components [25], [26] were extracted as the excitation parameters. The STRAIGHT analysis-synthesis system [27] was employed for parameter extraction and waveform generation. The feature vector consisted of the spectral and excitation parameters and their delta and delta-delta features. Five-state left-to-right HSMMs were used.

B. Evaluation 1: Time-Invariant Post-Filter

To confirm the effect of the time-invariant filter, we conducted a subjective evaluation comparing the following speech samples:

HMM: original parameters generated by Eq. (1)
HMM+MS(ti): parameters filtered by the time-invariant filter
HMM+MS: parameters filtered by the conventional filter

Following [13], the emphasis coefficient and the DFT length were set to 0.85 and 4096, respectively. We applied the MS-based postfilter to both the spectrum and F0. We conducted a preference (AB) test on speech quality: every pair of the three types of synthetic speech was presented in random order, and 6 listeners were asked which sample sounded better in terms of speech quality.

[Fig. 2: Averaged MSs of phoneme-level duration sequences ("DUR": generated duration).]
[Fig. 3: Preference scores with 95% confidence intervals (the time-invariant postfilter).]

The preference result is shown in Figure 3. A significant quality improvement is yielded by applying the time-invariant postfilter to the generated speech parameters. Although the improved quality is not comparable to that yielded by the conventional postfilter, the time-invariant postfilter is applicable to various lengths of speech parameter sequences.

C. Evaluation 2: Segment-Level Post-Filter

The window length and the window shift were set to 125 ms (25 samples) [28] and 60 ms (12 samples), respectively, and a 64-point FFT was used. We compared the following speech samples:

HMM: original parameters generated by Eq. (1)
HMM+LMS: HMM parameters filtered by the segment-level filter
HMM+GV: parameters generated by Eq. (1) with the GV
HMM+GV+LMS: HMM+GV parameters filtered by the segment-level filter

[Fig. 4: HMM likelihoods for the filtered spectrum.]
[Fig. 5: HMM likelihoods for the filtered F0.]

1) Tuning the Emphasis Coefficient: We calculated the HMM likelihood, GV likelihood, and MS likelihood for both the filtered spectral parameters and the filtered F0 contours while varying the emphasis coefficient from 0 to 1. For comparison, the likelihood for natural speech parameters was calculated, labeled as "Natural." The results are shown in Figs. 4 to 9, and their tendencies are similar to those of the conventional postfilter reported in [13]. The postfiltering process degrades the HMM likelihoods, but they remain greater than those of the natural parameters. Almost all likelihoods tend to increase as the filter coefficient approaches 1. We observed a degradation of the MS likelihood for F0, but it is always greater than that of the natural parameters. From these results, we tuned the emphasis coefficient to 1.0 for both the spectrum and F0.
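To make the sweep concrete, here is a small sketch that filters held-out generated trajectories for each k on a grid and scores the result with a per-bin diagonal-Gaussian MS log-likelihood under the natural statistics; this scoring model is our assumption, stated only to illustrate the procedure (the paper also tracks the HMM and GV likelihoods), and `ms_postfilter` is the Sec. III sketch.

```python
import numpy as np

def ms_loglik(seq, mu_n, sigma_n, dft_len=4096):
    """Per-bin diagonal-Gaussian log-likelihood of seq's MS
    under the natural statistics."""
    ms = 20.0 * np.log10(np.abs(np.fft.rfft(seq, n=dft_len)) + 1e-10)
    return float(np.sum(-0.5 * (np.log(2.0 * np.pi * sigma_n ** 2)
                                + ((ms - mu_n) / sigma_n) ** 2)))

def sweep_emphasis(seqs, stats, ks=np.linspace(0.0, 1.0, 11)):
    """Mean MS log-likelihood of the filtered trajectories per k."""
    mu_n, s_n, mu_g, s_g = stats
    return {float(k): np.mean([ms_loglik(
                ms_postfilter(c, mu_n, s_n, mu_g, s_g, k=k), mu_n, s_n)
            for c in seqs]) for k in ks}
```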
2) Subjective Assessment of Speech Quality: An AB test using the above four methods was conducted with 7 listeners, in the same manner as in the previous section.

The postfiltering was applied to both the spectrum and F0. The preference score is shown in Figure 10. A significant quality gain is yielded by HMM+LMS compared to HMM, and it is comparable to that yielded by HMM+GV. Furthermore, an additional gain is yielded by HMM+GV+LMS compared to HMM+GV. This tendency is similar to that observed for the conventional postfilter, as reported in [13]. Note that the segment-level postfilter is applicable to various lengths of a speech parameter sequence, whereas the conventional one is not.

[Fig. 6: GV likelihoods for the filtered spectrum.]
[Fig. 7: GV likelihoods for the filtered F0.]
[Fig. 8: MS likelihoods for the filtered spectrum.]
[Fig. 9: MS likelihoods for the filtered F0.]
[Fig. 10: Preference scores with 95% confidence intervals (local MS-based postfilter).]
[Fig. 11: Preference scores with 95% confidence intervals (postfilter for duration).]

D. Evaluation 3: MS-Based Post-Filtering for Duration

We evaluated the effectiveness of the postfilter for duration. A 64-point FFT was used, and the spectrum and F0 were not filtered. The compared speech samples are:

DUR: original duration
DUR+MS: duration filtered by the proposed postfilter

[Fig. 12: Duration likelihoods for the filtered duration.]
[Fig. 13: MS likelihoods for the filtered duration.]

The duration likelihood and the MS likelihood are shown in Figure 12 and Figure 13, respectively. The MS likelihood increases as the filter coefficient approaches 1 while keeping the duration likelihood sufficiently high; therefore, the emphasis coefficient was set to 1.0 in the subjective evaluation. We can also see a discontinuous transition of the MS likelihood, which we attribute to rounding the filtered duration values to integers after filtering. The result of an AB test with 6 listeners is shown in Figure 11: the MS-based postfilter for duration tends to slightly improve speech quality.

VI. SUMMARY

This paper has proposed modified Modulation Spectrum (MS)-based postfilters for HMM-based speech synthesis. We have reported that the proposed postfilters avoid the limitation of the conventional one while preserving its quality gain. Furthermore, we have applied the MS-based postfilter to phoneme-level duration and confirmed its effectiveness for speech quality. We will investigate the benefits of the postfilter and the MS itself in various situations.

Acknowledgements: Part of this work was supported by JSPS KAKENHI Grant Number 26286 and Grant-in-Aid for JSPS Fellows Grant Number 26 354, and part of this work was executed under the JSPS Strategic Young Researcher Overseas Visits Program for Accelerating Brain Circulation.

REFERENCES

[1] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, "Speech synthesis based on hidden Markov models," Proceedings of the IEEE, vol. 101, no. 5, pp. 1234-1252, 2013.
[2] K. Oura, H. Zen, Y. Nankaku, A. Lee, and K. Tokuda, "Tying covariance matrices to reduce the footprint of HMM-based speech synthesis systems," in Proc. INTERSPEECH, pp. 759-762, Brighton, U.K., 2009.
[3] T. Yoshimura, T. Masuko, K. Tokuda, T. Kobayashi, and T. Kitamura, "Speaker interpolation for HMM-based speech synthesis system," J. Acoust. Soc. Jpn. (E), vol. 21, no. 4, pp. 199-206, 2000.
[4] J. Yamagishi and T. Kobayashi, "Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training," IEICE Trans. Inf. and Syst., vol. E90-D, no. 2, pp. 533-543, 2007.
[5] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi, "A style control technique for HMM-based expressive speech synthesis," IEICE Trans. Inf. and Syst., vol. E90-D, no. 9, pp. 1406-1413, 2007.
[6] K. Shirota, K. Nakamura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Integration of speaker and pitch adaptive training for HMM-based singing voice synthesis," in Proc. ICASSP, pp. 2578-2582, Florence, Italy, May 2014.
[7] J. Yamagishi, C. Veaux, S. King, and S. Renals, "Speech synthesis technologies for individuals with vocal disabilities: Voice banking and reconstruction," Acoust. Sci. Technol., vol. 33, no. 1, pp. 1-5, 2012.
[8] S. King and V. Karaiskos, "The Blizzard Challenge 2011," in Proc. Blizzard Challenge Workshop, Turin, Italy, Sept. 2011.
[9] Z. Ling, L. Qin, H. Lu, Y. Gao, L. Dai, R. Wang, Y. Jiang, Z. Zhao, J. Yang, J. Chen, and G. Hu, "The USTC and iFlytek speech synthesis systems for Blizzard Challenge 2007," in Proc. Blizzard Challenge Workshop, Bonn, Germany, Aug. 2007.
[10] S. Takamichi, T. Toda, Y. Shiga, S. Sakti, G. Neubig, and S. Nakamura, "Parameter generation methods with rich context models for high-quality and flexible text-to-speech synthesis," IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 2, pp. 239-250, May 2014.
[11] T. Toda and K. Tokuda, "A speech parameter generation algorithm considering global variance for HMM-based speech synthesis," IEICE Trans. Inf. and Syst., vol. E90-D, no. 5, pp. 816-824, 2007.
[12] T. Nose, V. Chunwijitra, and T. Kobayashi, "A parameter generation algorithm using local variance for HMM-based speech synthesis," IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 2, pp. 221-228, 2014.
[13] S. Takamichi, T. Toda, G. Neubig, S. Sakti, and S. Nakamura, "A postfilter to modify the modulation spectrum in HMM-based speech synthesis," in Proc. ICASSP, pp. 290-294, Florence, Italy, May 2014.
[14] R. Drullman, J. M. Festen, and R. Plomp, "Effect of reducing slow temporal modulations on speech reception," J. Acoust. Soc. Am., vol. 95, pp. 2670-2680, 1994.
[15] S. Thomas, S. Ganapathy, and H. Hermansky, "Phoneme recognition using spectral envelope and modulation frequency features," in Proc. ICASSP, pp. 4453-4456, Taipei, Taiwan, April 2009.
[16] K. Tokuda, T. Kobayashi, and S. Imai, "Speech parameter generation from HMM using dynamic features," in Proc. ICASSP, pp. 660-663, Detroit, USA, May 1995.
[17] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in Proc. ICASSP, pp. 1315-1318, Istanbul, Turkey, June 2000.
[18] T. Baumann and D. Schlangen, "INPRO_iSS: A component for just-in-time incremental speech synthesis," in Proc. ACL, pp. 3-8, Jul. 2012.
[19] S. Pan, Y. Nankaku, K. Tokuda, and J. Tao, "Global variance modeling on the log power spectrum of LSPs for HMM-based speech synthesis," in Proc. ICASSP, pp. 476-479, Prague, Czech Republic, 2011.
[20] H. Zen and A. Senior, "Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis," in Proc. ICASSP, pp. 3872-3876, Florence, Italy, May 2014.
[21] K. Yu and S. Young, "Continuous F0 modeling for HMM based statistical parametric speech synthesis," IEEE Trans. Audio, Speech, and Language Processing, vol. 19, no. 5, pp. 1071-1079, 2011.
[22] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in Proc. EUROSPEECH, pp. 2347-2350, Budapest, Hungary, Sept. 1999.
[23] H. Zen, K. Tokuda, T. Kobayashi, T. Masuko, and T. Kitamura, "A hidden semi-Markov model-based speech synthesis system," IEICE Trans. Inf. and Syst., vol. E90-D, no. 5, pp. 825-834, 2007.
[24] Y. Sagisaka, K. Takeda, M. Abe, S. Katagiri, T. Umeda, and H. Kuwahara, "A large-scale Japanese speech database," in Proc. ICSLP, pp. 1089-1092, Kobe, Japan, Nov. 1990.
[25] H. Kawahara, J. Estill, and O. Fujimura, "Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT," in Proc. MAVEBA, Firenze, Italy, Sept. 2001.
[26] Y. Ohtani, T. Toda, H. Saruwatari, and K. Shikano, "Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation," in Proc. INTERSPEECH, pp. 2266-2269, Pittsburgh, USA, Sept. 2006.
[27] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Commun., vol. 27, no. 3-4, pp. 187-207, 1999.
[28] V. Tyagi, I. McCowan, H. Misra, and H. Bourlard, "Mel-cepstrum modulation spectrum (MCMS) features for robust ASR," in Proc. ASRU, pp. 399-404, Nov. 2003.
3876, Florence, Italy, May 24 2 K Yu and S Young Continuous F modeling for HMM based statistical parametric speech synthesis IEEE Trans Audio, Speech and Language, Vol 9, No 5, pp 7 79, 2 22 T Yoshimura, K Tokuda, T Masuko, T Kobayashi, and T Kitamura Simultaneous modeling of spectrum, pitch and duration in HMMbased speech synthesis In Proc EUROSPEECH, pp 2347 235, Budapest, Hungary, Apr 999 23 H Zen, K Tokuda, T Kobayashi T Masuko, and T Kitamura Hidden semimarkov model based speech synthesis system IEICE Trans, Inf and Syst, E9D, No 5, pp 825 834, 27 24 Y Sagisaka, K Takeda, M Abe, S Katagiri, T Umeda, and H Kuawhara A largescale Japanese speech database In ICSLP9, pp 89 92, Kobe, Japan, Nov 99 25 H Kawahara, Jo Estill, and O Fujimura Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT In MAVEBA 2, pp 6, Firentze, Italy, Sept 2 26 Y Ohtani, T Toda, H Saruwatari, and K Shikano Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation In Proc INTERSPEECH, pp 2266 2269, Pittsburgh, USA, Sep 26 27 H Kawahara, I MasudaKatsuse, and A D Cheveigne Restructuring speech representations using a pitchadaptive timefrequency smoothing and an instantaneousfrequencybased F extraction: Possible role of a repetitive structure in sounds Speech Commun, Vol 27, No 3 4, pp 87 27, 999 28 V Tyagi, I McCowan, H Misra, and H Bourlard Melcepstrum modulation spectrum MCMS features for robust ASR Proc ASRU, pp 399 44, Nov 23 74