Natural Indonesian Speech Synthesis by using CLUSTERGEN


2014 International Conference on Information, Communication Technology and System (ICTS)

Evan Tysmayudanto Gunawan, Dhany Arifianto
Department of Engineering Physics, Institut Teknologi Sepuluh Nopember, Surabaya 60111, INDONESIA
evan10@mhs.ep.its.ac.id, dhany@ep.its.ac.id

Abstract
This paper describes the development of a speech synthesis system for bahasa Indonesia. Statistical parametric synthesis with the CLUSTERGEN method in the FestVox build was used to create the Indonesian voices. CLUSTERGEN differs from previously developed methods mainly in its trajectory model. After the voices were synthesized, a subjective DMOS test was set up to evaluate their quality; the CLUSTERGEN voice obtained an average score of 2.11. To obtain a more natural synthesized voice, the voice was then rebuilt with STRAIGHT and moving segment labels in a newer version of FestVox, and this voice obtained an average DMOS score of 3.74. Further experiments on Indonesian speech synthesis are being carried out to obtain more natural synthesized speech.

Index Terms: speech synthesis, bahasa Indonesia, DMOS, STRAIGHT

1. Introduction

Nowadays the relations between countries have few limitations; mobility and business are the main aspects that have made this happen. Culture and language, however, still have to be taken into account, especially when the destination country does not share a language with the traveler. Language is the most fundamental part of a country's culture. A translator is one way to solve the problem, but it has limitations in certain conditions.

Indonesia has its own language, called bahasa Indonesia. Bahasa Indonesia is still an under-resourced language. An under-resourced language is a language that falls into all or some of the following categories: lack of a unique writing system or stable orthography, limited presence on the web, lack of linguistic expertise, and lack of electronic resources for speech and language processing, such as monolingual corpora, bilingual electronic dictionaries, transcribed speech data, pronunciation dictionaries, and vocabulary lists. Synonyms for the same concept are: low-density languages, resource-poor languages, low-data languages, and less-resourced languages. It is important to note that this is not the same as a minority language, which is a language spoken by a minority of the population of a territory [4]. For these reasons, people who do not speak Indonesian can have problems communicating with Indonesians who cannot speak English or another foreign language. Following the definition of under-resourced languages above, there are several ways to develop bahasa Indonesia; this paper develops bahasa Indonesia as an electronic resource for speech and language processing, focused on natural speech synthesis.

2. Bahasa Indonesia

Bahasa Indonesia is the official language of Indonesia. It is written with the Roman alphabet, and its pronunciation is governed by Indonesian phonology. Indonesian phonology is the study of the sounds of bahasa Indonesia; it is divided into phonemics and phonetics, depending on the purpose for which the sounds are observed. Phonetics is the branch of linguistics that studies pronunciation (based on Kamus Besar Bahasa Indonesia, 1997).
Indonesia is a multi-cultural country, and in daily life this influences Indonesian pronunciation, because every ethnic group has its own way of pronouncing the language. It is therefore important to define a standard pronunciation of bahasa Indonesia: anyone who learns the language directly from an Indonesian speaker is otherwise affected by this ethnic factor as well.

2.1 Phoneme

A phoneme is the smallest unit of pronunciation in a word or phrase; phonemes are the units from which words and phrases are constructed. Phonemes are the essential components from which a natural-sounding voice is produced according to the phonetic rules of bahasa Indonesia. Bahasa Indonesia has 33 phonemes, listed in Table 1 together with their approximate English equivalents (cf. IPA, the International Phonetic Alphabet) and example words.

Table 1. List of phonemes in bahasa Indonesia with approximate English equivalents

No.  Indonesian  English      Example
1    /q/         aa           father
2    /e/         ah, ae       ten
3    /ê/         ah, ax       learn
4    /i/         ih, iy, ix   see, happy
5    /o/         ow, ao       got, saw
6    /u/         uh, uw       put, too
7    /ay/        ay           five
8    /aw/        aw           now
9    /ey/        ey           say
10   /oy/        oy           boy
11   /b/         b            bad
12   /c/         ch           chain
13   /d/         d, dx, dh    did
14   /f/         f, v         fall, van
15   /g/         g            got
16   /h/         hh           hat
17   /j/         jh           jam
18   /k/         k            keep
19   /m/         m            man
20   /l/         l            leg
21   /N/         n            no
22   /P/         p            pen
23   /R/         r            red
24   /S/         s            so
25   /T/         t, th        tea
26   /W/         w            wet
27   /Y/         y            yes
28   /Z/         z, zh        zoo
29   /Kh/        -            -
30   /Ng/        ng           sing
31   /Ny/        -            -
32   /Sy/        -            share
33   /sil/       []           -

2.2 Syllable

Bahasa Indonesia is a language whose spelling system has been updated, so Indonesia has known both the old spelling (Ejaan Lama) and the revised spelling (Ejaan Yang Disempurnakan). This paper focuses on Ejaan Yang Disempurnakan (EYD), because Ejaan Lama is no longer used (it appears only in old books and literature). An Indonesian speech synthesis system (in this case an Indonesian text-to-speech system) needs a text input, but in bahasa Indonesia there are cases where the pronunciation does not follow the written form, for example the phonemes /e/ and /ê/. Both are written with the same letter, e, but they are pronounced differently. To obtain the desired synthesized voice, the syllabification rules presented in Table 2 are therefore used to obtain the correct spelling of bahasa Indonesia.

Table 2. Syllabification rules of bahasa Indonesia, based on vowel phonemes, diphthongs, consonant phonemes, and the group of particle phonemes [13]

1. If the middle of a word contains two adjacent vowels, the syllable break falls between them (example: /sa//at/).
2. Diphthongs (ai, au, and oi) are never split (example: /au//la/).
3. If the middle of a word contains a consonant or a consonant group between two vowels, the break falls before the consonant (example: /ba//pak/).
4. If the middle of a word contains a sequence of two consonants, the break falls between them (example: /man//di/).
5. If the middle of a word contains three or more consonants, the break falls after the first consonant (example: /bang//krut/).
6. A break can fall between an elementary word and a particle (affix, prefix, or modified particle) (example: /makan//an/).
7. If a word contains an infix, the break falls at the first phoneme of the infix (example: /te//lun//juk/).
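The paper's syllable script, aturan.py, described in Section 4.1, is not reproduced in the text. As a rough sketch only, and assuming a simplified single-letter phoneme inventory (so multi-letter phonemes such as ng and ny, and the affix rules 6 and 7, are ignored), rules 1 to 5 above could be applied along these lines:

```python
# Illustrative sketch of rules 1-5 of Table 2 (not the paper's aturan.py).
# Words are treated as plain letter strings; digraphs and affixes are ignored.

VOWELS = set("aeiou")            # the schwa /ê/ is also written with "e"
DIPHTHONGS = {"ai", "au", "oi"}  # rule 2: a diphthong is never split

def syllabify(word):
    """Greedily place syllable breaks according to rules 1-5."""
    syllables, current, i = [], "", 0
    while i < len(word):
        current += word[i]
        if word[i] in VOWELS:
            if word[i:i + 2] in DIPHTHONGS:            # rule 2
                current += word[i + 1]
                i += 1
            rest = word[i + 1:]                         # text after the vowel
            cluster = ""
            for c in rest:                              # leading consonants
                if c in VOWELS:
                    break
                cluster += c
            if not cluster and rest:                    # rule 1: V|V
                syllables.append(current)
                current = ""
            elif len(cluster) == 1 and len(rest) > 1:   # rule 3: V|CV
                syllables.append(current)
                current = ""
            elif len(cluster) >= 2 and len(rest) > len(cluster):
                current += cluster[0]                   # rules 4-5: keep 1st C
                i += 1
                syllables.append(current)
                current = ""
        i += 1
    if current:
        syllables.append(current)
    return syllables

print(syllabify("saat"), syllabify("aula"), syllabify("bapak"), syllabify("mandi"))
# -> ['sa', 'at'] ['au', 'la'] ['ba', 'pak'] ['man', 'di']
```

The full script additionally separates particles from the elementary word and writes the result to the pengucapan syllable file and the FestVox lexicon, as described in Section 4.1.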

3. Statistical Parametric Synthesis Text-to-Speech (TTS)

Speech synthesis technology has developed greatly. Unit selection was one of the first widely used synthesis techniques [1]. It is known as a very powerful technique for obtaining high-quality synthetic speech, but to reach that quality a voice must be built with many variations in speaking style and emotion, so a large database of high-quality recordings is needed [5]. As an alternative, Statistical Parametric Synthesis (SPS) was developed, for example in HTS [19]: a technique that synthesizes a voice from model parameters generated with the statistical theory of hidden Markov models (HMMs). SPS generates a synthetic voice by convolving a vocoder excitation (white noise) with a filter obtained from training. As a result, the sound generated by this technique contains some noise in its output [2], which degrades the naturalness of the voice. On the other hand, SPS has several advantages that make it interesting to develop: it produces smooth and stable synthetic voices, needs only a small amount of runtime data, and can generate a variety of voices [2]. Building on these advantages, and with the aim of increasing the naturalness of the output, CLUSTERGEN was developed as a new method that improves the SPS training system so that it generates better training parameters [5]. CLUSTERGEN differs from the HTS model in its trajectory modeling: in a set of experiments, "trajola", a trajectory model with overlap-and-add, performed better than the other trajectory models that were built [5]. CLUSTERGEN is included in the FestVox build [http://festvox.org/]. Newer versions of FestVox include not only CLUSTERGEN but also the STRAIGHT [10] and moving segment label [7] techniques.
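As an illustration only of the excitation-plus-filter idea described above (this is not the CLUSTERGEN, MLSA, or STRAIGHT vocoder, and the filter coefficients below are arbitrary stand-ins for trained spectral parameters), a frame-by-frame noise-excited filter can be sketched as follows:

```python
# Toy source-filter sketch: white-noise excitation passed through an
# all-pole filter per 5 ms frame.  Real SPS systems predict the filter
# parameters from the trained statistical model.
import numpy as np
from scipy.signal import lfilter

FS = 16000                       # sampling rate used in the paper
FRAME_LEN = int(0.005 * FS)      # 5 ms frames, as in CLUSTERGEN

def synthesize(filter_coeffs_per_frame):
    """filter_coeffs_per_frame: one array of all-pole coefficients per frame."""
    frames = []
    for a in filter_coeffs_per_frame:
        excitation = np.random.randn(FRAME_LEN)               # noise source
        frames.append(lfilter([1.0], np.concatenate(([1.0], a)), excitation))
    return np.concatenate(frames)

# two toy frames with arbitrary (stable) filter coefficients
audio = synthesize([np.array([-0.9]), np.array([-0.5])])
print(audio.shape)   # (160,) = 2 frames of 80 samples at 16 kHz
```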
STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weighted spectrum) is a procedure for manipulating speech signals based on pitch-adaptive spectral smoothing and instantaneous-frequency-based F0 extraction. The procedure is designed to eliminate the interference caused by the periodicity of the signal.

4. Experiments

This section explains the research procedure that was set up to produce natural Indonesian speech synthesis. The main idea of the research follows statistical parametric synthesis set up with CLUSTERGEN, as shown in Figure 1. The research is divided into three steps according to its requirements.

Figure 1. Systematic CLUSTERGEN framework [1].

4.1 Training

Following [17], the training stage needs several components before it can start: a database of original recordings, a script that produces the syllable form, a script of the phonemes of bahasa Indonesia, and labels.

The phoneme script lists all the phonemes of Table 1 and also defines the category of each phoneme (diphthong, schwa, etc.); this script becomes the reference for the labeling process.

The database used in this research consists of 1529 sentences, including 500 question sentences, and contains all the phonemes of Table 1 according to the phonetic rules of bahasa Indonesia. The database was created as .wav audio files and downsampled to a 16000 Hz sampling rate with the sox tool.

Figure 2. Distribution of phonemes in the database; the phoneme that appears most often is a.
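The paper does not list the preparation commands themselves; a minimal sketch of the downsampling step, assuming the recordings sit in a local wav_original directory (a placeholder name, not from the paper), might look like this:

```python
# Downsample every recording in the database to 16 kHz with sox.
# Directory names are placeholders for illustration only.
import subprocess
from pathlib import Path

SRC = Path("wav_original")   # original recordings
DST = Path("wav_16k")        # 16 kHz versions used for training
DST.mkdir(exist_ok=True)

for wav in sorted(SRC.glob("*.wav")):
    # "sox in.wav -r 16000 out.wav" writes a 16000 Hz copy of the file
    subprocess.run(["sox", str(wav), "-r", "16000", str(DST / wav.name)],
                   check=True)
```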

The syllable script was set up according to the syllabification rules of bahasa Indonesia in Table 2; implementing it as a script file, aturan.py, was expected to make it simple to evaluate. The main idea of the syllable script is to separate a word into its elementary word and particles, and then to split the elementary word into the phonemes it contains. To make it easy to segment the word according to the rules in Table 2, the word is interpreted as a structure of vowels and consonants. A variable holding the list of Indonesian particles is set up so that the structure of word plus particle can be recognized; each particle then becomes one syllable and is split into the phonemes it contains. The output of the syllable script is a syllable text file, pengucapan, and the words it contains are imported into the lexicon file in FestVox.

Figure 3. Flow of the syllable file.

EHMM [14] is a technique for obtaining the label files for the database: it creates labels by estimating the phonemes with fully connected and forward-connected HMM state models. This technique has been shown to give better log-likelihoods than labeling with a 5-state HMM sequence.

The training stage itself is done with the CLUSTERGEN method, which essentially consists of several parts. The first step is the extraction of F0 from the audio files in the database with the Speech Tools [http://www.cstr.ed.ac.uk/projects/speech_tools/] pda program [5]. The next step combines 24 MFCCs with the extracted F0, which gives a 25-dimensional vector for every 5 ms [5]. The last part of training is the clustering of the MFCCs of every sample, using the wagon CART tree builder contained in the Edinburgh Speech Tools [5]. The result of the training stage is the set of model parameters used in the synthesis stage.
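As a small sketch of the feature-assembly step described above (the F0 and MFCC extraction themselves are assumed to have been done already, for example with the Edinburgh Speech Tools, and the arrays below are random stand-ins), combining the F0 value with 24 MFCCs into one 25-dimensional vector per 5 ms frame amounts to:

```python
# Stack one F0 value and 24 MFCCs per 5 ms frame into 25-dimensional
# vectors, the representation that the wagon CART builder later clusters.
import numpy as np

def assemble_frames(f0, mfcc):
    """f0: shape (T,); mfcc: shape (T, 24) -> features of shape (T, 25)."""
    assert len(f0) == len(mfcc) and mfcc.shape[1] == 24
    return np.column_stack([f0, mfcc])

# toy example: 200 frames = 1 second of speech at a 5 ms frame shift
T = 200
features = assemble_frames(np.random.rand(T) * 200.0,   # fake F0 track (Hz)
                           np.random.randn(T, 24))      # fake MFCCs
print(features.shape)   # (200, 25)
```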
4.2 Synthesis

The core of the synthesis stage is to join the model parameters generated in the training stage into a filter, and then to convolve that filter with the vocoder excitation (white noise) to obtain the synthetic voice. The whole synthesis stage uses Festival (http://www.cstr.ed.ac.uk/projects/festival/). The result of this synthesis sequence is a synthetic voice produced by Festival; Figure 4 compares it with the original recording.

Figure 4. Comparison of the spectrogram and F0 of the original recording (top) and the voice synthesized by CLUSTERGEN (bottom) for the sentence "saya suka baju yang berwarna merah tua".

Figure 4 was obtained with WaveSurfer [https://www.speech.kth.se/wavesurfer/]. It shows that the duration of the sounds in the synthetic voice is shorter than in the original, which makes noise appear (visible at 0.2-0.3 s of the spectrogram), and that the synthetic voice lags behind in timing.

4.3 Optimization

The purpose of this step is to increase the naturalness of the synthetic voice that has been generated. STRAIGHT and moving segment labels are used to repair the voice synthesized with CLUSTERGEN. The result is shown in Figure 5.

Figure 5. Comparison of the spectrogram and F0 of the original recording (top), the voice synthesized by CLUSTERGEN (middle), and the CLUSTERGEN voice rebuilt with STRAIGHT and moving segment labels (bottom) for the sentence "saya suka baju yang berwarna merah tua".

From the spectrograms in Figure 5 it can be seen that rebuilding the synthetic voice with STRAIGHT and moving segment labels significantly repairs the duration of the F0 contour compared with the voice synthesized by CLUSTERGEN alone. The shortened F0 durations occur because the speech is reconstructed in the synthesis (testing) stage; in that case the F0 is recovered at most frequencies, but not at all of them. A timing lag is still present, however; it is probably caused by imperfectly generated syllables, which affects the timing visible in the spectrogram.

4.4 Degradation Mean Opinion Score (DMOS)

To evaluate the quality of the generated voices, a Degradation Mean Opinion Score (DMOS) test [16], a subjective test based on the opinions of respondents, was set up. The test compares the original recordings with the synthesized voices. In this test, 150 sample sounds were played for every category of synthesized voice, and the original recordings were played 300 times, to 10 respondents. The 150 samples consist of 20 training voices, 129 test voices, and 1 check voice. The respondents scored each sound using the categories in Table 3, and the setup of the test is illustrated in Figure 6.

Figure 6. Setup of the test used to obtain the subjective scores: listening to the reference speech means hearing the original recording, and listening to the assessed speech means hearing the synthesized voice [16].

Table 3. Degradation Category Rating (DCR) criteria for scoring DMOS, based on the degradation of the synthesized voice compared with the original recording [16]

Category                                    Score
Degradation is inaudible                      5
Degradation is audible but not annoying       4
Degradation is slightly annoying              3
Degradation is annoying                       2
Degradation is very annoying                  1

Table 3 lists the categories used for DMOS scoring. Each category is a subjective range that depends on the sensitivity of the respondent's ear, which makes the DMOS scores more varied. The results of the DMOS test are shown in Figure 7.

Figure 7. DMOS per speaker; the speech synthesized with STRAIGHT obtains better scores.

The results of the test show that the voice synthesized with the CLUSTERGEN method obtained an average score of 2.11, while the voice that also included STRAIGHT and moving segment labels obtained an average score of 3.74.

Based on these results, the quality of the voice can be increased further by improving the syllable script, because on average the respondents reported that the noise appears mostly on vowels, especially at the end of words; this may happen because the syllable script does not yet produce the correct spelling in every case. This is a first step in developing bahasa Indonesia as an electronic resource, and further research is being carried out to obtain natural speech.
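The DMOS figures quoted above are simple averages over listeners and test stimuli; a minimal sketch of that computation, with made-up ratings in place of the paper's actual listener data, is:

```python
# Average DMOS over 10 listeners and 129 test stimuli per system.
# The rating matrices below are random placeholders, not the paper's data.
import numpy as np

def dmos(scores):
    """scores: array (num_listeners, num_stimuli) of 1-5 DCR ratings (Table 3)."""
    return float(np.mean(scores))

rng = np.random.default_rng(0)
clustergen_scores = rng.integers(1, 4, size=(10, 129))   # fake ratings
straight_scores   = rng.integers(3, 6, size=(10, 129))   # fake ratings
print(round(dmos(clustergen_scores), 2), round(dmos(straight_scores), 2))
```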
5. Conclusions

Natural Indonesian speech synthesis has been developed with the CLUSTERGEN method. The synthesized speech obtained an average DMOS score of 2.11. To improve the naturalness of the synthesized speech, the STRAIGHT and moving segment label algorithms were added to the build, after which the synthesized speech obtained an average DMOS score of 3.74.

6. Acknowledgements

7. References

[1] G. K. Anumanchipalli and A. Black, "Adaptation Techniques for Speech Synthesis in Under-Resourced Languages," SLTU 2010, Penang, Malaysia, 2010.
[2] K. Tokuda, H. Zen, and A. Black, "An HMM-Based Speech Synthesis System Applied to English," Proc. 2002 IEEE SSW, Sept. 2002.

[3] Suyanto, "An Indonesian Phonetically Balanced Sentence Set for Collecting Speech Database," Jurnal Teknologi Industri, vol. XI, no. 1, Jan. 2007, pp. 59-68.
[4] L. Besacier, E. Barnard, A. Karpov, and T. Schultz, "Automatic Speech Recognition for Under-Resourced Languages: A Survey," Speech Communication, vol. 56, no. 1, 2014, pp. 85-100.
[5] A. Black, "CLUSTERGEN: A Statistical Parametric Synthesizer Using Trajectory Modeling," Interspeech 2006, Pittsburgh, PA, 2006.
[6] S. Kim, J. Kim, and M. Hahn, "HMM-Based Korean Speech Synthesis System for Hand-Held Devices," IEEE Transactions on Consumer Electronics, vol. 52, no. 4, Nov. 2006.
[7] A. Black and J. Kominek, "Optimizing Segment Label Boundaries for Statistical Speech Synthesis," ICASSP 2009, Taipei, Taiwan, 2009.
[8] K. Hashimoto, S. Takaki, K. Oura, and K. Tokuda, "Overview of NIT HMM-Based Speech Synthesis System for Blizzard Challenge 2011," Blizzard Challenge 2011, Sept. 2011.
[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech Parameter Generation Algorithms for HMM-Based Speech Synthesis," Proc. ICASSP, pp. 1315-1318, 2000.
[10] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring Speech Representations Using a Pitch-Adaptive Time-Frequency Smoothing and an Instantaneous-Frequency-Based F0 Extraction: Possible Role of a Repetitive Structure in Sounds," Speech Communication, vol. 27, pp. 187-207, 1999.
[11] H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko, A. W. Black, and K. Tokuda, "The HMM-Based Speech Synthesis System Version 2.0," Proc. ISCA SSW6, Bonn, Germany, Aug. 2007.
[12] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Multi-Space Probability Distribution HMM," IEICE Transactions on Information & Systems, vol. E85-D, no. 3, pp. 455-464, 2002.
[13] Panitia Pengembangan Bahasa Indonesia, Pedoman Umum Ejaan Yang Disempurnakan, Pusat Bahasa, Departemen Pendidikan Nasional, 2000.
[14] K. Prahallad, A. Black, and R. Mosur, "Sub-Phonetic Modeling for Capturing Pronunciation Variation in Conversational Speech Synthesis," Proc. ICASSP 2006, Toulouse, France, 2006.
[15] P. Taylor, "Analysis and Synthesis of Intonation Using the Tilt Model," Journal of the Acoustical Society of America, vol. 107, no. 3, pp. 1697-1714, 2000.
[16] ITU-T, Methods for Objective and Subjective Assessment of Quality, ITU-T Recommendation P.800, http://www.itu.int/rec/t-rec-P.800-199608-I/en.
[17] A. Black and K. Lenzo, FestVox: Building Voices in the Festival Speech Synthesis System, http://festvox.org/bsv/, 2000.
[18] K. Tokuda, H. Zen, J. Yamagishi, T. Masuko, S. Sako, T. Nose, and K. Oura, HMM-Based Speech Synthesis System (HTS), http://hts.sp.nitech.ac.jp.
[19] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech Parameter Generation Algorithms for HMM-Based Speech Synthesis," ICASSP 2000, Istanbul, Turkey, 2000.
