Overview of SHRC-Ginkgo speech synthesis system for Blizzard Challenge 2013

Yansuo Yu, Fengyu Zhu, Xiangang Li, Yi Liu, Jun Zou, Yuning Yang, Guili Yang, Ziye Fan, Xihong Wu

Speech and Hearing Research Center, Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, 100871, China
{yuys}@cis.pku.edu.cn

Abstract

This paper introduces the SHRC-Ginkgo speech synthesis system for Blizzard Challenge 2013. A unit selection based approach is adopted to develop our speech synthesis system using an audiobook speech corpus. Aiming at roughly labeled corpora with several hundred hours of speech, our system adopts lightly-supervised acoustic model training for speech recognition to select clean speech data with accurate text. Moreover, rich syntactic contexts instead of prosodic structure are utilized to refine traditional acoustic models. Through automatic syntactic parsing, this approach can also label corpora of several tens or even hundreds of hours automatically, thus avoiding time-consuming and expensive manual prosodic annotation. In order to solve the problems of memory space expansion and running time burden in acoustic model training on large-scale corpora, a fast training method, which can ensure the accuracy of the acoustic models, is realized. Subjective evaluation results show that our system performs well in almost all evaluation tests, especially in the case of large-scale corpora.

Index Terms: speech synthesis, speech data selection, syntactic parsing, unit selection

1. Introduction

We have been investigating many aspects of speech synthesis technology for years, especially in Mandarin. We once attended the Mandarin tasks of Blizzard Challenge 2009, and this is our second entry to the Blizzard Challenge. This year's challenge involves many under-researched topics, such as suboptimal recordings, several hundred hours of the same speaker's speech without fine labeling, and novels with different styles in both dialogue and aside.
Aiming at this situation, many novel technologies, including lightly-supervised acoustic model training for speech data selection, speech labeling based on automatic syntactic analysis, and a fast model training approach with low resources, are developed to construct our unit selection based speech synthesis system.

The paper is organized as follows. Section 2 introduces the basic situation of the English tasks in Blizzard 2013. An overview of the system is discussed thoroughly in Section 3. The results of the evaluation are further described in Section 4. Finally, the conclusion is drawn in Section 5.

2. The English Tasks in Blizzard 2013

In Blizzard Challenge 2013, the English evaluation consists of two tasks:

EH1 - build a voice from the provided unsegmented audio; text is not provided, so it must be obtained by participants from the web and aligned with the audio.

EH2 - build a voice from the provided segmented audio; the accompanying aligned text may be used, or text may be obtained from the web.

For both the EH1 and EH2 tasks, the audiobook data is kindly provided by The Voice Factory, from a single female speaker. In the EH1 task, this year's challenge provides approximately 300 hours of chapter-sized mp3 files. In the EH2 task, approximately 19 hours of non-compressed wav files are prepared and further labeled by Lessac Technologies, Inc. This task remains the same as in previous challenges. In the following sections we introduce the whole process of constructing the speech synthesis system for both EH1 and EH2.

3. Overview of the System

Figure 1: Flowchart of SHRC-Ginkgo speech synthesis system (training: database, data selection, analysis, syntactic parsing, feature extraction, training of HMMs; synthesis: analysis, syntactic parsing, HMMs, synthesis database).

The overview of the text-to-speech (TTS) system, which consists of both training and synthesis parts, is shown in Figure 1. At the training stage, clean speech with accurate text is first chosen from the roughly labeled corpora of several hundred hours of speech by means of speech recognition and text alignment.
Afterwards, acoustic features including the spectral envelope and F0 are extracted from the chosen speech, and the corresponding text is labeled with both phone-related and syntax-related tags through text analysis and syntactic parsing respectively. Based on these acoustic features and the context-dependent labels, the corresponding HMMs are estimated in
the maximum likelihood (ML) sense [1]. At the synthesis stage, the context-dependent label sequence of the text to be synthesized is first predicted by the front-end text analysis and syntactic parsing. Then this label sequence is used to choose the optimal waveform segment sequence from the speech corpus under statistical criteria, such as maximum likelihood [2], minimum Kullback-Leibler divergence (KLD) [3], or a combination of both [4]. Finally, all consecutive waveform segments in the optimal sequence are concatenated to produce the synthesized speech. The following subsections introduce the whole system in detail.

3.1. Data Preparation

3.1.1. Data Selection

Through detailed analysis of the speech data of the EH1 task, it can be found that this corpus has the following characteristics: (1) all data come from a number of novels or stories read by the same person; (2) there are no accurate transcriptions along with the speech, and the texts downloaded from the Internet cannot be guaranteed to be consistent with the speech content; (3) the average gain varies from one speech segment to another due to different recording environments; (4) because the reader employs various timbres and rhythms to portray the fiction roles, such as their age, gender and mood, speech segments from the dialogue and the aside differ widely in both acoustic and prosodic aspects. Hence it is necessary to first choose clean speech data from the raw corpus in order to construct the speech synthesis system (here, clean speech data mainly means speech that has both relatively high quality and a fairly accurate transcription). Referring to [5, 6], the basic process of speech data selection is designed as follows. The whole process mainly adopts speech recognition based on a transcription-related language model (LM). This biased LM steers the recogniser towards the downloaded transcription, giving an effect similar to recognizing the training set itself.
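The subsequent filtering step compares the recognition hypothesis against the raw transcription sentence by sentence. A minimal sketch of that comparison is given below; the function and field names are illustrative, not taken from the system, and a production setup would use a proper alignment toolkit.

```python
# Illustrative sketch of WER-based data selection: score each sentence
# by the word-level edit distance between the downloaded transcription
# ("text") and the ASR hypothesis ("hyp"), then keep sentences whose
# WER does not exceed a threshold (0.0 keeps exact matches only).
# All names here are hypothetical, not from the paper.

def word_error_rate(ref, hyp):
    """Levenshtein distance over words (substitutions, insertions,
    deletions), normalized by the reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

def select_clean(sentences, wer_threshold=0.0):
    """Keep sentences whose hypothesis agrees with the transcription
    within the given WER threshold."""
    return [s for s in sentences
            if word_error_rate(s["text"], s["hyp"]) <= wer_threshold]
```

With a threshold of 0.0 this reproduces the paper's choice of retaining only sentences whose recognition result exactly matches the transcription.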
Based on this, if the recognition result is not identical to the raw transcription, it is likely that the transcription contains errors, such as insertion, deletion or substitution errors. Afterwards, the word error rate (WER) for every sentence is calculated through text alignment, and the corresponding speech durations at each WER level are also obtained, as listed in Table 1. Finally, all the aside sentences whose WERs are zero are chosen as the final training set for EH1.

Table 1: The WER result of text alignment as well as the corresponding speech durations.

Word error rate (%)    Duration (hours)
 0.0                   214.02
 0.1                   262.39
 5.0                   291.95
10.0                   292.26
20.0                   292.38

3.1.2. Syntactic Parsing

In general, prosodic context referring to prosodic structure is often selected as the labels for context-dependent HMMs. To some extent, this can capture the suprasegmental characteristics of prosodic parameters, but richer linguistic information cannot be fully recovered from this simple four-layer prosodic structure. Therefore we attempted to introduce more linguistic context from the syntactic tree to represent the context-dependent HMMs. In this paper, both the internal grammatical structure of the sentence and the internal collocation relations among the words [7] in the syntactic tree are fully adopted to refine traditional acoustic models. Two categories of syntactic features are considered, covering grammatical types and position relations for phrases of different levels. Grammatical types mainly involve the types of the father phrase, grandfather phrase and others for the previous, current and next words. For position relations, the relative and absolute positions among the father phrase, grandfather phrase and others of the current word are included. Finally, the conventional prosodic context is replaced by rich syntactic context in the process of modeling acoustic parameters for both EH1 and EH2. It is noted that our syntactic parsers are trained using the Berkeley parser [8], which achieves high performance across many languages.

3.2.
Model Training

At the training stage, spectrum (e.g., 39th-order mel-cepstral coefficients and their dynamic features) and excitation (e.g., F0 and its dynamic features) parameters are first extracted from the speech database using STRAIGHT [9] and modeled by the corresponding context-dependent HMMs. These parameters are separated into different streams, in which the mel-cepstral coefficients are modeled by continuous HMMs while the F0 observations are modeled by MSD-HMMs [10]. Specially, a single Gaussian distribution is adopted to model the distribution of state durations. Finally, the context-dependent HMMs for each stream are constructed using the decision-tree-based context-clustering method with the minimum description length (MDL) criterion.

Besides that, when the quantity of speech increases to one hundred or even several hundred hours, the conventional acoustic model training process [11] is no longer appropriate due to the problems of both memory space expansion and running time burden. First, the dramatically increased number of full-context HMMs leads to memory space expansion. Second, the running time burden mainly comes from both Baum-Welch re-estimation and decision-tree-based model clustering for every stream. In this paper, a fast training method, which can ensure the accuracy of the acoustic model, is realized through optimization of the conventional model training process.

3.3. Synthesis

A unit selection based approach, similar to [3, 4], is employed to construct our speech synthesis system for EH1 and EH2. For a whole sentence containing N phones, the selection criterion combining the unit likelihood with the distance criterion is adopted as in Equation (1). The unit likelihood mainly involves the probability of the acoustic observation o_n (spectrum and F0), including static and dynamic features, and the phone duration d_n for the n-th phone. Thus the optimal instance sequence u* can be determined using Equation (1):

u^* = \arg\max_u \sum_{n=1}^{N} [ LL(u_n, \lambda_n) - D(\tilde{\lambda}_n, \lambda_n) ]    (1)

LL(u_n, \lambda_n) = w_o \log P(o_n | \lambda_n, Q_n) + w_d \log P(d_n | \lambda_n^{dur})    (2)

D(\tilde{\lambda}_n, \lambda_n) = \sum_{c \in \{s,p,d\}} \sum_{i=1}^{S} D_{KL}(\tilde{\lambda}_i^c, \lambda_i^c) \, t_i    (3)
D_{KL}(\tilde{\lambda}_i^c, \lambda_i^c) \approx (\tilde{\omega}_0 - \omega_0) \log\frac{\tilde{\omega}_0}{\omega_0} + (\tilde{\omega}_1 - \omega_1) \log\frac{\tilde{\omega}_1}{\omega_1} + \frac{1}{2} \mathrm{tr}\{ \omega_1 (\tilde{\Sigma}_i \Sigma_i^{-1} - I) + \tilde{\omega}_1 (\Sigma_i \tilde{\Sigma}_i^{-1} - I) + (\omega_1 \Sigma_i^{-1} + \tilde{\omega}_1 \tilde{\Sigma}_i^{-1})(m_i - \tilde{m}_i)(m_i - \tilde{m}_i)^T \} + \frac{1}{2} (\omega_1 - \tilde{\omega}_1) \log\frac{|\tilde{\Sigma}_i|}{|\Sigma_i|}    (4)

where u_n is one of the candidate units for the n-th phone and LL(u_n, λ_n) is the log likelihood of candidate unit u_n; w_o and w_d are the likelihood weights of the acoustic observation and phone duration; Q_n and λ_n^dur denote the state allocation and the duration model for the n-th phone respectively; S is the number of states; c ∈ {s, p, d} represents the index of the spectrum, F0 and duration streams respectively, and t_i is the duration of state i; ω_0, ω̃_0 and ω_1, ω̃_1 are the prior probabilities of the discrete and continuous sub-spaces (for spectrum and duration, ω_0 = ω̃_0 = 0 and ω_1 = ω̃_1 = 1); N(m_i, Σ_i) and N(m̃_i, Σ̃_i) denote the probability density functions of state i for models λ and λ̃ respectively.

To further speed up the subsequent search process, three pruning techniques [12], including context pruning, beam pruning and histogram pruning, are employed during pre-selection. Dynamic programming search is then applied to find the optimal unit sequence in the above maximum likelihood sense. Finally, the cross-fade technique [13] is adopted to smooth phase discontinuities at the concatenation points, and the waveforms of every two consecutive units in the optimal sequence are concatenated to generate the synthesized speech.

Figure 2: Results of MOS on speaker similarity for EH1 (all listeners).

4. Results and Discussion

This section discusses the evaluation results of our system in Blizzard Challenge 2013 in detail. Our system is identified as M, whereas systems A, B and C are benchmark systems: A is natural speech, system B is the Festival unit selection benchmark system, and system C is the HTS statistical parametric benchmark system.
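For intuition about the distance term of Equation (4): in the purely continuous case (ω_0 = ω̃_0 = 0 and ω_1 = ω̃_1 = 1, which the text notes holds for the spectrum and duration streams), the log-weight and log-determinant terms vanish and the expression reduces to the symmetric Kullback-Leibler divergence between two Gaussians. A minimal sketch with diagonal covariances follows; the function name and interface are illustrative, not from the system.

```python
# Symmetric KLD between two diagonal-covariance Gaussians, i.e.
# Equation (4) specialized to omega_0 = 0, omega_1 = 1, which leaves
# 0.5 * tr{ S2 S1^-1 + S1 S2^-1 - 2I + (S1^-1 + S2^-1)(m1-m2)(m1-m2)^T }.
def symmetric_kld_diag(m1, v1, m2, v2):
    """m1, m2: mean vectors; v1, v2: per-dimension variances."""
    d = 0.0
    for a, p, b, q in zip(m1, v1, m2, v2):
        diff2 = (a - b) ** 2
        d += 0.5 * (p / q + q / p - 2.0 + (1.0 / p + 1.0 / q) * diff2)
    return d
```

The divergence is zero for identical distributions and grows with both mean separation and variance mismatch, which is what makes it usable as a model-space distance for scoring candidate units against the target model.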
Figure 3: Results of MOS on speaker similarity for EH2 (all listeners).

4.1. Similarity test

Figures 2 and 3 show the similarity scores of all systems for EH1 and EH2. It can be seen that our system achieves the best similarity to the original speaker for both EH1 and EH2. Moreover, Wilcoxon signed rank tests further show that the difference between system M and any other system on similarity is significant at the 1% level for EH1. The high similarity score of our system can be attributed to the use of original segments of a large corpus, even though there are no modifications to adapt concatenated units to new contexts.

4.2. Naturalness test

Figures 4 and 5 show the MOS results on naturalness of all systems for EH1 and EH2. As we can see, our system achieved the best performance (excluding the natural speech, system A) on naturalness among all the participating systems, and the Wilcoxon signed rank tests also show that the difference between M and any other participating system on naturalness is significant.

4.3. Intelligibility test

Figures 6 and 7 show the overall word error rate (WER) results of all systems for EH1 and EH2. The results show that our system achieves the 3rd and 4th lowest WER among all the systems for EH1 and EH2 respectively. As in previous Blizzard Challenge evaluations, HMM-based parametric synthesis methods usually achieve better intelligibility than unit selection methods.

4.4. Paragraph test

In addition to the three tests above, 60-point mean opinion scale (MOS) tests were further conducted to evaluate different aspects of novel paragraphs, such as overall impression, pleasantness, speech pauses, stress, intonation, emotion, and listening effort. These evaluation results show that our system is the best system for EH1 and a top-3 system for EH2 in all seven aspects.
Figure 4: Results of MOS on naturalness of sentences for EH1 (all listeners).

Figure 5: Results of MOS on naturalness of sentences for EH2 (all listeners).

Figures 8 and 9 show the MOS results on overall impression of all systems for EH1 and EH2. It can be seen that our system gains more of an advantage in performance over the other systems as the amount of speech data increases. This benefits from two aspects: first, our system can choose more clean speech data corresponding to accurate text from roughly labeled corpora with several hundred hours of speech; second, rich syntactic contexts may model complex prosodic variations more accurately than prosodic structure in the case of large corpora.

5. Conclusions

This paper introduces the development of the SHRC-Ginkgo speech synthesis system for Blizzard Challenge 2013. Many new technologies are exploited to construct our unit-selection speech synthesis system for the non-standard speech database. This system can automatically clean and label large-scale corpora by means of speech recognition, text alignment and syntactic parsing. The evaluation results of Blizzard Challenge 2013 further indicate that our system can generate more natural synthesized speech in the novel domain than the other systems, especially in EH1. Some important problems of audiobook synthesis still need to be solved in future work, such as the emotion expression of different roles and fast training of acoustic models on large corpora.

6. Acknowledgements

The work was supported in part by the National Basic Research Program of China (2013CB329304) and the National Natural Science Foundation of China (No. 61121002, No. 91120001).

7. References

[1] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in EUROSPEECH, 1999, pp. 2347-2350.

[2] R. E.
Donovan, "Trainable speech synthesis," Ph.D. dissertation, Cambridge University, 1996.

[3] Z. J. Yan, Y. Qian, and F. K. Soong, "Rich-context unit selection (RUS) approach to high quality TTS," in ICASSP, 2010, pp. 4798-4801.

[4] Z. H. Ling and R. H. Wang, "HMM-based hierarchical unit selection combining Kullback-Leibler divergence with likelihood criterion," in ICASSP, 2007, pp. 1245-1248.

[5] X. G. Li, Z. H. Pang, and X. H. Wu, "Lightly supervised acoustic model training for Mandarin continuous speech recognition," Lecture Notes in Computer Science, vol. 7751, pp. 727-734, 2013.

[6] L. Lamel, J. Gauvain, and G. Adda, "Lightly supervised and unsupervised acoustic model training," Computer Speech and Language, vol. 16, pp. 115-129, 2002.

[7] Y. S. Yu, D. C. Li, and X. H. Wu, "Prosodic modeling with rich syntactic context in HMM-based Mandarin speech synthesis," in IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP), 2013.

[8] S. Petrov and D. Klein, "Improved inference for unlexicalized parsing," in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Rochester, NY, USA, April 2007, pp. 404-411.

[9] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, pp. 187-207, 1999.

[10] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Hidden Markov models based on multi-space probability distribution for pitch pattern modeling," in ICASSP, 1999, pp. 229-232.

[11] K. Tokuda and H. Zen, "Fundamentals and recent advances in HMM-based speech synthesis," in Tutorial of INTERSPEECH, Brighton, UK, 2009.

[12] Y. Qian, Z. J. Yan, Y. J. Wu, F. K. Soong, X. Zhuang, and S. Y. Kong, "An HMM trajectory tiling (HTT) approach to high quality TTS," in INTERSPEECH, 2010, pp. 422-425.

[13] T. Hirai and S.
Tenpaku, "Using 5 ms segments in concatenative speech synthesis," in Proceedings of the Speech Synthesis Workshop, Pittsburgh, PA, USA, 2004, pp. 37-42.
Figure 6: Results of word error rate (WER) for EH1 (all listeners).

Figure 7: Results of word error rate (WER) for EH2 (all listeners).

Figure 8: Results of MOS on overall impression of audiobook paragraphs for EH1 (all listeners).

Figure 9: Results of MOS on overall impression of audiobook paragraphs for EH2 (all listeners).