Overview of SHRC-Ginkgo speech synthesis system for Blizzard Challenge 2013

Yansuo Yu, Fengyu Zhu, Xiangang Li, Yi Liu, Jun Zou, Yuning Yang, Guili Yang, Ziye Fan, Xihong Wu

Speech and Hearing Research Center, Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, 100871, China
{yuys}@cis.pku.edu.cn

Abstract

This paper introduces the SHRC-Ginkgo speech synthesis system for Blizzard Challenge 2013. A unit selection based approach is adopted to develop our speech synthesis system using an audiobook speech corpus. Aiming at roughly labeled corpora with several hundred hours of speech, our system adopts lightly-supervised acoustic model training from speech recognition to select clean speech data with accurate text. Moreover, rich syntactic contexts instead of the prosodic structure are utilized to refine the traditional acoustic models. Through automatic syntactic parsing, this also helps to label corpora of several tens or even hundreds of hours automatically, thus avoiding manual prosodic annotation, which is time-consuming and expensive. In order to solve the problems of memory space expansion and running time burden in acoustic model training on large-scale corpora, a fast training method, which can ensure the accuracy of the acoustic models, is realized. Subjective evaluation results show that our system performs well in almost all evaluation tests, especially in the case of large-scale corpora.

Index Terms: speech synthesis, speech data selection, syntactic parsing, unit selection

1. Introduction

We have been investigating many aspects of speech synthesis technology for years, especially in Mandarin. We took part in the Mandarin tasks of the Blizzard Challenge in 2009, and this is our second entry to the Blizzard Challenge. This year's challenge involves several under-researched topics, such as suboptimal recordings, several hundred hours of the same speaker's speech without fine labeling, and novels with different styles in both dialogue and aside. To address this situation, several new techniques, including lightly-supervised acoustic model training for speech data selection, speech labeling based on automatic syntactic analysis, and a fast model training approach with low resources, are developed to construct our unit selection based speech synthesis system.

The paper is organized as follows. Section 2 introduces the English tasks in Blizzard 2013. An overview of the system is given in Section 3. The evaluation results are described in Section 4. Finally, conclusions are drawn in Section 5.

2. The English Tasks in Blizzard 2013

In Blizzard Challenge 2013, the English evaluation consists of two tasks:

EH1 - build a voice from the provided unsegmented audio; text is not provided, so it must be obtained by participants from the web and aligned with the audio.

EH2 - build a voice from the provided segmented audio; the accompanying aligned text may be used, or text may be obtained from the web.

For both EH1 and EH2, the audiobook data is kindly provided by The Voice Factory, from a single female speaker. For the EH1 task, this year's challenge provides approximately 300 hours of chapter-sized mp3 files. For the EH2 task, approximately 19 hours of uncompressed wav files are prepared and further labeled by Lessac Technologies, Inc.; this task remains the same as in previous challenges. The following sections describe the whole process of constructing the speech synthesis system for both EH1 and EH2.
3. Overview of the System

Figure 1: Flowchart of the SHRC-Ginkgo speech synthesis system (blocks: database, data selection, analysis, syntactic parsing, feature extraction, training of HMMs, HMMs, synthesis).

An overview of the text-to-speech (TTS) system, which consists of both training and synthesis parts, is shown in Figure 1. At the training stage, clean speech with accurate text is first chosen from roughly labeled corpora with several hundred hours of speech by means of speech recognition and text alignment. Afterwards, the acoustic features, including the spectral envelope and F0, are extracted from the chosen speech, and the corresponding text is labeled with both phone-related and syntax-related tags through text analysis and syntactic parsing respectively. Based on these acoustic features and the context-dependent labels, the corresponding HMMs are estimated in the maximum likelihood (ML) sense [1].

At the synthesis stage, the context-dependent label sequence of the text to be synthesized is first predicted by front-end text analysis and syntactic parsing. Then this label sequence is used to choose the optimal waveform segment sequence from the speech corpus under statistical criteria, such as maximum likelihood [2], minimum Kullback-Leibler divergence (KLD) [3], or a combination of both criteria [4]. Finally, all consecutive waveform segments in the optimal sequence are concatenated to produce the synthesized speech. The following subsections introduce the whole system in detail.

3.1. Data Preparation

3.1.1. Data Selection

Detailed analysis of the EH1 speech data shows that this corpus has the following characteristics: (1) all data come from a number of novels or stories read by the same person; (2) there are no accurate transcriptions along with the speech, and the texts downloaded from the Internet cannot be guaranteed to be consistent with the speech content; (3) the average gain varies from one speech segment to another due to different recording environments; (4) because the reader employs various timbres and rhythms to portray the characteristics of the fictional roles, such as age, gender and mood, speech segments from the dialogue and the aside differ widely in both acoustic and prosodic aspects. Hence it is necessary to first choose clean speech data from the raw corpus in order to construct the speech synthesis system (here, clean speech data mainly means speech with both relatively high quality and a quite accurate transcription).

Referring to [5, 6], the basic process of speech data selection is designed as follows. The whole process mainly adopts speech recognition based on a transcription-related language model (LM); this LM has an effect similar to including the raw transcription in the recognizer's training set. On this basis, if the recognition result is not identical to the raw transcription, the transcription is likely to contain errors such as insertions, deletions or substitutions. The word error rate (WER) of every sentence is then calculated through text alignment, and the corresponding speech duration at each WER level is obtained, as listed in Table 1. Finally, all the aside sentences whose WERs are zero are chosen as the final training set for EH1.

Table 1: WER levels from text alignment and the corresponding speech durations.

Word error rate (%)    Duration (hours)
0.0                    214.02
0.1                    262.39
5.0                    291.95
10.0                   292.26
20.0                   292.38
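The WER-based filtering step can be sketched as follows. This is a minimal illustration assuming per-sentence recognizer hypotheses, raw transcriptions and an aside/dialogue flag are already available; the data format and helper names are hypothetical, not the system's actual code.

```python
# Hedged sketch: keep only aside sentences whose recognizer output exactly
# matches the raw transcription (WER = 0), as done for the EH1 training set.

def word_error_rate(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)
    divided by the reference length."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[-1][-1] / max(len(ref_words), 1)

def select_clean_sentences(sentences):
    """sentences: iterable of (raw_transcription, asr_hypothesis, is_aside) tuples
    (a hypothetical format)."""
    return [s for s in sentences
            if s[2] and word_error_rate(s[0].split(), s[1].split()) == 0.0]

# toy usage
sentences = [("the cat sat", "the cat sat", True),
             ("he said hello", "he said hallo", True)]
print(select_clean_sentences(sentences))   # keeps only the first sentence
```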
3.1.2. Syntactic Parsing

In general, prosodic context derived from the prosodic structure is often used as the labels for the context-dependent HMMs. To some extent, this can capture the suprasegmental characteristics of the prosodic parameters, but richer linguistic information cannot be fully recovered from this simple four-layer prosodic structure. Therefore we attempt to introduce more linguistic context from the syntactic tree to represent the context-dependent HMMs. In this paper, both the internal grammatical structure of the sentence and the internal collocation relations among the words [7] in the syntactic tree are adopted to refine the traditional acoustic models. Two categories of syntactic features are considered: grammatical types and position relations for phrases at different levels. Grammatical types mainly involve the types of the father phrase, grandfather phrase and higher-level phrases for the previous, current and next words. Position relations include the relative and absolute positions of the current word within its father phrase, grandfather phrase and higher-level phrases.

Finally, the conventional prosodic context is replaced by rich syntactic context in the process of modeling the acoustic parameters for both EH1 and EH2. Note that our syntactic parsers are trained using the Berkeley parser [8], which achieves high performance across many languages.
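To make the two feature categories concrete, the following minimal sketch walks a constituency parse (such as one produced by the Berkeley parser) and gathers, for each word, the labels of its father and grandfather phrases together with its absolute and relative position inside the father phrase. The tuple-based tree representation and the feature names are illustrative assumptions, not the exact label set used in our system.

```python
# Hedged sketch: deriving simple syntax-related context features from a
# constituency parse. A node is (label, children); a leaf is (POS_tag, word).

def collect(node, parent_label="NONE", grandparent_label="NONE",
            idx_in_parent=0, n_siblings=1, out=None):
    if out is None:
        out = []
    label, children = node
    if isinstance(children, str):                       # leaf: (POS tag, word)
        out.append({
            "word": children,
            "pos_tag": label,
            "parent_type": parent_label,                # father phrase type
            "grandparent_type": grandparent_label,      # grandfather phrase type
            "abs_pos_in_parent": idx_in_parent,         # absolute position in father phrase
            "rel_pos_in_parent": idx_in_parent / max(n_siblings - 1, 1),  # relative position
        })
    else:
        for i, child in enumerate(children):
            collect(child, label, parent_label, i, len(children), out)
    return out

# toy parse of "the dogs bark"
tree = ("S", [("NP", [("DT", "the"), ("NNS", "dogs")]), ("VP", [("VBP", "bark")])])
for feats in collect(tree):
    print(feats)
```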
3.2. Model Training

At the training stage, spectrum (39th-order mel-cepstral coefficients and their dynamic features) and excitation (F0 and its dynamic features) parameters are first extracted from the speech database using STRAIGHT [9] and modeled by the corresponding context-dependent HMMs. These parameters are separated into different streams, in which the mel-cepstral coefficients are modeled by continuous HMMs while the F0 observations are modeled by MSD-HMMs [10]. In addition, a single Gaussian distribution is adopted to model the distribution of the state durations. Finally, the context-dependent HMMs for each stream are constructed using decision-tree-based context clustering with the minimum description length (MDL) criterion.

When the amount of speech grows to one hundred or even several hundred hours, the conventional acoustic model training process [11] becomes inappropriate because of memory space expansion and running time burden. First, the dramatically increased number of full-context HMMs leads to memory space expansion. Second, the running time burden mainly comes from both Baum-Welch re-estimation and decision-tree-based model clustering for every stream. In this paper, a fast training method, which can ensure the accuracy of the acoustic models, is realized by optimizing the conventional model training process.
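The stream layout described above can be illustrated with a small hedged sketch that appends delta and delta-delta features and splits the observations into a continuous spectral stream and an MSD-style F0 stream (NaN marking unvoiced frames). The regression-window coefficients and dimensionalities are common conventions assumed here for illustration, not the exact configuration used in our system.

```python
import numpy as np

def append_dynamics(static):
    """static: (T, D) per-frame features; returns (T, 3*D) with delta and delta-delta
    computed with the common windows [-0.5, 0, 0.5] and [1, -2, 1] (an assumption)."""
    padded = np.pad(static, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (padded[2:] - padded[:-2])
    delta2 = padded[2:] - 2.0 * padded[1:-1] + padded[:-2]
    return np.hstack([static, delta, delta2])

def build_streams(mcep, f0):
    """mcep: (T, 40) mel-cepstrum (0th..39th order, an assumed layout);
    f0: (T,) with 0 marking unvoiced frames."""
    spectral_stream = append_dynamics(mcep)                  # continuous HMM stream
    logf0 = np.where(f0 > 0, np.log(np.maximum(f0, 1e-10)), np.nan)
    f0_stream = append_dynamics(logf0[:, None])              # MSD-style stream: NaN = discrete (unvoiced) space
    return spectral_stream, f0_stream

# toy usage with random data
T = 100
spec, exc = build_streams(np.random.randn(T, 40), np.abs(np.random.randn(T)) * 100)
print(spec.shape, exc.shape)   # (100, 120) (100, 3)
```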
3.3. Synthesis

A unit selection based approach, similar to [3, 4], is employed to construct our speech synthesis system for EH1 and EH2. For a whole sentence containing N phones, a selection criterion combining the unit likelihood with a distance criterion is adopted as in Equation (1). The unit likelihood mainly involves the probability of the acoustic observation o_n (spectrum and F0), including static and dynamic features, and the phone duration d_n for the n-th phone. Thus the optimal instance sequence u* can be determined using Equation (1):

u^{*} = \arg\max_{u} \sum_{n=1}^{N} \left[ LL(u_n, \lambda_n) - D(\tilde{\lambda}_n, \lambda_n) \right]    (1)

LL(u_n, \lambda_n) = w_o \log P(o_n \mid \lambda_n, Q_n) + w_d \log P(d_n \mid \lambda_n^{dur})    (2)

D(\tilde{\lambda}_n, \lambda_n) = \sum_{c \in \{s, p, d\}} \sum_{i=1}^{S} D_{KL}(\tilde{\lambda}_{n,i}^{c}, \lambda_{n,i}^{c}) \, t_i    (3)

D_{KL}(\tilde{\lambda}_{n,i}^{c}, \lambda_{n,i}^{c}) \approx (\tilde{\omega}_0 - \omega_0) \log\frac{\tilde{\omega}_0}{\omega_0} + (\tilde{\omega}_1 - \omega_1) \log\frac{\tilde{\omega}_1}{\omega_1} + \frac{1}{2} \mathrm{tr}\Big\{ \tilde{\omega}_1 (\tilde{\Sigma}_i \Sigma_i^{-1} - I) + \omega_1 (\Sigma_i \tilde{\Sigma}_i^{-1} - I) + (\omega_1 \Sigma_i^{-1} + \tilde{\omega}_1 \tilde{\Sigma}_i^{-1})(m_i - \tilde{m}_i)(m_i - \tilde{m}_i)^{T} \Big\} + \frac{1}{2} (\tilde{\omega}_1 - \omega_1) \log\frac{|\tilde{\Sigma}_i|}{|\Sigma_i|}    (4)

where u_n is one of the candidate units for the n-th phone and LL(u_n, \lambda_n) is the log likelihood of candidate unit u_n; w_o and w_d are the likelihood weights of the acoustic observation and the phone duration; Q_n and \lambda_n^{dur} denote the state allocation and the duration model for the n-th phone respectively; S is the number of states; c \in \{s, p, d\} indexes the spectrum, F0 and duration streams respectively, and t_i is the duration of state i; \omega_0, \tilde{\omega}_0 and \omega_1, \tilde{\omega}_1 are the prior probabilities of the discrete and continuous sub-spaces (for the spectrum and duration streams, \omega_0 = \tilde{\omega}_0 = 0 and \omega_1 = \tilde{\omega}_1 = 1); and N(m_i, \Sigma_i) and N(\tilde{m}_i, \tilde{\Sigma}_i) denote the probability density functions of state i for models \lambda_n and \tilde{\lambda}_n respectively.
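As an illustration of the distance term, the following sketch evaluates the state-level divergence of Eq. (4) for the simplified case of diagonal covariances and a single Gaussian per state; the variable names and the diagonal-covariance assumption are ours for illustration, not part of the system description.

```python
import numpy as np

def msd_state_kld(w1, m, v, w1_t, m_t, v_t, eps=1e-12):
    """Hedged, diagonal-covariance sketch of the state-level divergence in Eq. (4).
    (w1, m, v): continuous-space prior, mean vector and variance vector of the target
    model's state; (w1_t, m_t, v_t): the same for the candidate unit's model."""
    w0, w0_t = 1.0 - w1, 1.0 - w1_t                      # discrete (e.g. unvoiced) priors
    d = (w0_t - w0) * np.log((w0_t + eps) / (w0 + eps)) \
        + (w1_t - w1) * np.log((w1_t + eps) / (w1 + eps))
    diff = m - m_t
    # trace term of Eq. (4) for diagonal covariances
    d += 0.5 * np.sum(w1_t * (v_t / v - 1.0) + w1 * (v / v_t - 1.0)
                      + (w1 / v + w1_t / v_t) * diff * diff)
    # log-determinant term for diagonal covariances
    d += 0.5 * (w1_t - w1) * np.sum(np.log(v_t / v))
    return float(d)

# toy example with two-dimensional Gaussian states
print(msd_state_kld(0.9, np.array([1.0, 2.0]), np.array([0.5, 0.4]),
                    0.8, np.array([1.2, 1.9]), np.array([0.6, 0.5])))
```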
To further speed up the subsequent search, three pruning techniques [12], namely context pruning, beam pruning and histogram pruning, are employed in the pre-selection process. A dynamic programming search is then applied to find the optimal unit sequence in the above maximum likelihood sense. Finally, the cross-fade technique [13] is adopted to smooth the phase discontinuity at the concatenation points, and the waveforms of every two consecutive units in the optimal sequence are concatenated to generate the synthesized speech.
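A minimal sketch of the cross-fade at concatenation points is given below, assuming the selected units are already available as waveform arrays at a common sampling rate; the linear ramps and the overlap length are illustrative choices rather than the settings of [13].

```python
import numpy as np

def crossfade_concat(units, overlap=80):
    """Linearly cross-fade consecutive waveform units over `overlap` samples
    (an illustrative value) to smooth discontinuities at the joins."""
    out = np.asarray(units[0], dtype=np.float64)
    for seg in units[1:]:
        seg = np.asarray(seg, dtype=np.float64)
        n = min(overlap, len(out), len(seg))
        fade_out = np.linspace(1.0, 0.0, n)          # ramp applied to the tail of `out`
        fade_in = 1.0 - fade_out                     # ramp applied to the head of `seg`
        mixed = out[-n:] * fade_out + seg[:n] * fade_in
        out = np.concatenate([out[:-n], mixed, seg[n:]])
    return out

# toy usage: concatenate two short sine-wave "units"
t = np.arange(400) / 16000.0
unit_a = np.sin(2 * np.pi * 200 * t)
unit_b = np.sin(2 * np.pi * 220 * t)
print(crossfade_concat([unit_a, unit_b]).shape)      # (720,)
```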

Figure 2: Results of MOS on speaker similarity for EH1 (all listeners).

4. Results and Discussion

This section discusses the evaluation results of our system in Blizzard Challenge 2013 in detail. Our system is identified as M, whereas systems A, B and C are benchmark systems: A is natural speech, B is the Festival unit selection benchmark system, and C is the HTS statistical parametric benchmark system.

Figure 3: Results of MOS on speaker similarity for EH2 (all listeners).

4.1. Similarity test

Figures 2 and 3 show the similarity scores of all systems for EH1 and EH2. It can be seen that our system achieves the best similarity to the original speaker for both EH1 and EH2. Moreover, Wilcoxon signed rank tests further show that the difference between system M and any other system on similarity is significant at the 1% level for EH1. The high similarity score of our system can be attributed to the use of original segments from a large corpus, even though there are no modifications to adapt the concatenated units to new contexts.

4.2. Naturalness test

Figures 4 and 5 show the MOS results on naturalness of all systems for EH1 and EH2. As can be seen, our system achieves the best naturalness among all the participating systems (excluding the natural speech system A). The Wilcoxon signed rank tests also show that the difference between M and any other participating system on naturalness is significant.

4.3. Intelligibility test

Figures 6 and 7 show the overall word error rate (WER) results of all systems for EH1 and EH2. Our system achieves the 3rd and 4th lowest WER among all the systems for EH1 and EH2 respectively. As in previous Blizzard Challenge evaluations, HMM-based parametric synthesis usually achieves better intelligibility than unit selection methods.

4.4. Paragraph test

In addition to the three tests above, 60-point mean opinion scale (MOS) tests were further conducted to evaluate different aspects of the novel paragraphs, such as overall impression, pleasantness, speech pauses, stress, intonation, emotion, and listening effort. These evaluation results show that our system is the best system for EH1 and a top-3 system for EH2 in all seven aspects.

Figure 4: Results of MOS on naturalness of sentences for EH1 (all listeners).
Figure 5: Results of MOS on naturalness of sentences for EH2 (all listeners).

Figures 8 and 9 show the MOS results on overall impression of all systems for EH1 and EH2. It can be seen that our system gains more of an advantage over the other systems as the amount of speech data increases. This benefits from two aspects: first, our system can choose more clean speech data with accurate text from roughly labeled corpora containing several hundred hours of speech; second, rich syntactic contexts may model complex prosodic variations more accurately than the prosodic structure in the case of large corpora.

5. Conclusions

This paper introduces the development of the SHRC-Ginkgo speech synthesis system for Blizzard Challenge 2013. Many new techniques are exploited to construct our unit-selection speech synthesis system for the non-standard speech database. The system automatically cleans and labels large-scale corpora by means of speech recognition, text alignment and syntactic parsing. The evaluation results of Blizzard Challenge 2013 further indicate that our system can generate more natural synthesized speech in the novel domain than the other systems, especially in EH1. Some important problems of audiobook synthesis still need to be solved in future work, such as emotion expression for different roles and fast training of acoustic models on large corpora.

6. Acknowledgements

This work was supported in part by the National Basic Research Program of China (2013CB329304) and the National Natural Science Foundation of China (No. 61121002, No. 91120001).

7. References

[1] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in EUROSPEECH, 1999, pp. 2347-2350.
[2] R. E. Donovan, "Trainable speech synthesis," Ph.D. dissertation, Cambridge University, 1996.
[3] Z. J. Yan, Y. Qian, and F. K. Soong, "Rich-context unit selection (RUS) approach to high quality TTS," in ICASSP, 2010, pp. 4798-4801.
[4] Z. H. Ling and R. H. Wang, "HMM-based hierarchical unit selection combining Kullback-Leibler divergence with likelihood criterion," in ICASSP, 2007, pp. 1245-1248.
[5] X. G. Li, Z. H. Pang, and X. H. Wu, "Lightly supervised acoustic model training for Mandarin continuous speech recognition," Lecture Notes in Computer Science, vol. 7751, pp. 727-734, 2013.
[6] L. Lamel, J. Gauvain, and G. Adda, "Lightly supervised and unsupervised acoustic model training," Computer Speech and Language, vol. 16, pp. 115-129, 2002.
[7] Y. S. Yu, D. C. Li, and X. H. Wu, "Prosodic modeling with rich syntactic context in HMM-based Mandarin speech synthesis," in IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP), 2013.
[8] S. Petrov and D. Klein, "Improved inference for unlexicalized parsing," in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Rochester, NY, USA, April 2007, pp. 404-411.
[9] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, pp. 187-207, 1999.
[10] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Hidden Markov models based on multi-space probability distribution for pitch pattern modeling," in ICASSP, 1999, pp. 229-232.
[11] K. Tokuda and H. Zen, "Fundamentals and recent advances in HMM-based speech synthesis," in Tutorial of INTERSPEECH, Brighton, UK, 2009.
[12] Y. Qian, Z. J. Yan, Y. J. Wu, F. K. Soong, X. Zhuang, and S. Y. Kong, "An HMM trajectory tiling (HTT) approach to high quality TTS," in INTERSPEECH, 2010, pp. 422-425.
[13] T. Hirai and S. Tenpaku, "Using 5 ms segments in concatenative speech synthesis," in Proceedings of the Speech Synthesis Workshop, Pittsburgh, PA, USA, 2004, pp. 37-42.

Figure 6: Results of word error rate (WER) for EH1 (all listeners).
Figure 7: Results of word error rate (WER) for EH2 (all listeners).
Figure 8: Results of MOS on overall impression of audiobook paragraphs for EH1 (all listeners).
Figure 9: Results of MOS on overall impression of audiobook paragraphs for EH2 (all listeners).