UNIT SELECTION VOICE FOR AMHARIC USING FESTVOX

Sebsibe H/Mariam, S P Kishore, Alan W Black, Rohit Kumar, and Rajeev Sangal

Language Technologies Research Center, International Institute of Information Technology, Hyderabad
Language Technologies Institute, Carnegie Mellon University
Institute for Software Research International, Carnegie Mellon University

Abstract

In this paper we describe the issues to be considered in developing a concatenative speech synthesizer for the Amharic language. We discuss the complexity of the syllable structure of the language, the phonetic nature of its script, and the results of a perceptual test of the synthesizer. Comments and recommendations for further research are included.

1. INTRODUCTION

Amharic is the official language of Ethiopia. Among the 73 languages registered in the country, Amharic is the most widely spoken; it is a Semitic language with its own script. The script is more or less an orthographic representation of the phonemes of the language. In this paper we discuss the development of a unit selection voice for Amharic using Festvox [1]. Festvox is a voice building framework that offers general tools for building unit selection voices in new languages. The unit selection paradigm is a cluster-based technique in which units of the same type are clustered according to their acoustic differences. The clusters are then indexed by higher-level phonetic and prosodic context. During synthesis, an appropriate unit is chosen from the multiple instances of that unit by minimizing the target cost and the concatenation (join) cost. Voices built with this system can be run in the Festival speech synthesis system [2].

This paper is organized as follows. Section 2 focuses on the nature of the Amharic script. Section 3 explains issues such as the representation of the phone set, the letter-to-sound rules, the syllable structure of the language, and the syllabification rules.
Section 4 describes the voice building process. Section 5 presents the results of the perceptual test conducted on the voice. Conclusions and recommendations are given in Section 6.

2. NATURE OF AMHARIC LANGUAGE SCRIPT

The script of the Amharic language is phonetic in nature. The language has 32 consonants and 7 vowels. Its orthographic representation is organized into orders: each of the 32 consonants has seven orders (derivatives), of which six are CV combinations and one is the consonant itself. In addition, there are extra orthographic symbols that are not organized in this way, so the total number of orthographic symbols exceeds 230. The phonetic features of these extra symbols have not been clearly studied. The vowels also occupy one line in the ordering list, except /e/ (the transliteration scheme, I-X notation, is given in Appendix A). For each consonant C, the orthographic ordering is as follows:

C/e/ C/u/ C/i/ C/a/ C/ie/ C C/o/

Unlike its orthographic form, spoken Amharic has one special property. The sixth-order orthographic symbols, which have no vowel associated with them in the written form (the CV transcription of the orthographic form), may take the vowel /ix/ in the spoken form (the CV sequence of the acoustic form). This vowel plays an important role in syllabification, because it allows impermissible consonant clusters to be split.

3. AMHARIC PHONE SET AND SYLLABIFICATION

3.1. Building Amharic Phone Set

To work with the Amharic script, we defined a transliteration scheme using ASCII characters (shown in Appendix A). The scheme is designed around the orthographic ordering of the script and the acoustic similarity of the letters. It also covers all phonemes under consideration in this work and avoids all possible ambiguities in sentence parsing.

In Festvox, the phone set of a language is described with features such as voicing, tongue position, tongue height, place of articulation, and manner of articulation. From the studies reported in [3], we derived a set of phonetic features for the 39 phones. The phones are listed in Table 1 (consonants) and Figure 1 (vowels).

Table 1. Consonants with their features (mainly adopted from [3])

Places of articulation: Labials, Alveolar, Palatals, Velars, Labio-Velar, Glottals
Stops: voiceless p, t, k, kx, ax; voiced b, d, g, gx; glottalized px, tx, q, qx
Fricatives: voiceless f, s, sx, h; voiced v, z, zx; glottalized xx, hx
Affricates: voiceless c; voiced j; glottalized cx
Nasals: voiced m, n, nx
Liquids: voiced l, r
Glides: w, y

Figure 1. Vowels with their features (mainly adopted from [3])

          Front   Mid   Back
High      ii      ix    u
Middle    ie      e     o
Low               a

3.2. Letter to Sound Rules

Amharic orthographic characters are written very similarly to the way they are spoken; in this sense Amharic is a phonetic language. The mapping between the written form and the spoken form is one to one, except for the epenthetic vowel /ix/ mentioned above. The syllabification rules for Amharic given in Section 3.4 decide the presence or absence of this vowel in the spoken form.

3.3. Syllable Structure

Amharic words are characterized by weak, indeterminate stress; the presence of glottal, palatal, and labialised consonants; frequent geminate consonants; a high frequency of central vowels; and the use of an automatic helping vowel /ix/ [4]. Although a strict definition of the syllable is difficult [5], a word in Amharic can be monosyllabic, like na ("come"), or polysyllabic, like al.me.ta.ciim ("she didn't come"), which consists of four syllables. All syllables have a vowel nucleus.
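As an illustration of the notation used above, the following Python sketch segments a transliterated word into phones and counts the vowel phones in each dot-separated syllable. The phone inventory is taken from Table 1 and Figure 1; the function names (`tokenize`, `nuclei_per_syllable`) and the backtracking strategy are assumptions made for this example and are not part of Festvox or of the system described in this paper.

```python
from functools import lru_cache

# Phone symbols in the ASCII transliteration, as listed in Table 1 and Figure 1.
VOWELS = {"e", "u", "ii", "a", "ie", "ix", "o"}
CONSONANTS = {
    "p", "t", "k", "kx", "ax", "b", "d", "g", "gx", "px", "tx", "q", "qx",
    "f", "s", "sx", "h", "v", "z", "zx", "xx", "hx", "c", "j", "cx",
    "m", "n", "nx", "l", "r", "w", "y",
}
PHONES = VOWELS | CONSONANTS


def tokenize(text):
    """Split a transliterated string into phones; return None if impossible.

    Hypothetical helper: two-character symbols are tried first, with
    backtracking, so "axx" is read as a + xx rather than ax + (invalid) x.
    """
    @lru_cache(maxsize=None)
    def solve(pos):
        if pos == len(text):
            return ()
        for width in (2, 1):
            symbol = text[pos:pos + width]
            if len(symbol) == width and symbol in PHONES:
                rest = solve(pos + width)
                if rest is not None:
                    return (symbol,) + rest
        return None

    result = solve(0)
    return None if result is None else list(result)


def nuclei_per_syllable(dotted_word):
    """Count the vowel phones in each dot-separated syllable of a word."""
    counts = []
    for syllable in dotted_word.split("."):
        phones = tokenize(syllable)
        if phones is None:
            raise ValueError(f"cannot segment syllable: {syllable!r}")
        counts.append(sum(phone in VOWELS for phone in phones))
    return counts


print(tokenize("ciim"))                      # ['c', 'ii', 'm']
print(nuclei_per_syllable("al.me.ta.ciim"))  # [1, 1, 1, 1]
```

Backtracking is used instead of a purely greedy match because several symbols end in x: a greedy reading of "axx" would consume the glottal stop ax and then fail on the remaining x.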
Several researchers have studied the syllable structure of Amharic and arrived at different syllable templates. For example, [3] gives six possible syllable structures (V, VC, VCC, CV, CVC, and CVCC), whereas [4] gives only CV and CVC. In this paper we use the following templates for Amharic speech synthesis: V, VC, VCC, CV, CVC, and CVCC. In addition, an initial cluster can occur, rarely, when the second consonant of the cluster is a liquid (forming CCV and CCVC). Depending on the context, the nucleus may be simple or complex. Syllabifying a given Amharic word requires:

- Compression of two successive vowels into one nucleus
- Insertion of the epenthetic vowel /ix/

[4] has also pointed out the possibility of consecutive vowels. If a back rounded vowel (/o/, /u/) appears before the low central vowel /a/ within the same morpheme boundary, the preceding consonant is labialised, as in samuat, ruac, huala, fuafuatie, quanqua, and txuat. In this case both vowel phonemes (/ua/) act as the nucleus of the corresponding syllable and are compressed into one vowel. In all other cases where two successive vowels occur, the first vowel is the nucleus of the left syllable and the second is the nucleus of the next syllable, for example se.at, me.at, be.hua.la, and te.sxua.mi.

3.4. Syllabification Rules

A recursive algorithm is used to identify the syllables of a word. The algorithm assumes that the leftmost syllable is independent of the remaining syllables. After successive vowels have been compressed into one according to the compression rule above, the leftmost syllable is identified using the following basic rules of the language:

1. A consonant between two vowels is always the onset of the second vowel.
2. Words with the phone sequence VC, VCC, or CVC are monosyllabic. Other words that start with a vowel take the initial vowel as the leftmost syllable.
3. A word that consists only of CC phonemes takes the epenthetic vowel and forms the monosyllabic word CVC.
4. If the leftmost part of the word matches the template CVCCV, then CVC is taken as the leftmost syllable.
5. If the leftmost part of the word matches the template CVCC, then the leftmost syllable may be CVC or CV, depending on the sonority of the final consonant cluster.
6. A word with the sequence CCVC, where the second consonant is a liquid, is monosyllabic.
7. All words with a consonant cluster whose second consonant is a liquid have a leftmost syllable of type CCV.
8. In all other cases, the leftmost syllable is a CV syllable; this case covers the larger portion of the syllable distribution of the language. It may require insertion of the epenthetic vowel if a consonant cluster exists.

We used a simple stress pattern of 1 (primary stress) for the initial syllable and 0 (secondary stress) for all remaining syllables in the word.

4. BUILDING THE VOICE

4.1. Creation of Speech Database

To build the Amharic speech database, the prompt list was selected from different sources: newspapers (Addis Zemen, Reporter, Sixmixax xxdixq, etc.), fiction (Fikir Eske Mekabir, Keadmas Bashager, etc.), and publications (various publications of Addis Ababa University (AAU) and other institutes), all of which exist in hard copy. Selection was done manually to obtain complete phone coverage of the language. The total number of phone instances in the training corpus is 27,153, excluding silences. The most frequent phones in the spoken form of the corpus are /e/, /a/, and the epenthetic vowel /ix/. The training corpus contains a total of 29,480 diphone instances made up of 801 unique diphones, which covers 52.3% of the theoretically possible diphones in Amharic. Of these unique diphones, 14% occur only once in the corpus. The corpus also contains a total of 12,724 syllable instances and 1,317 unique syllables, of which the hundred most frequent cover 70% of the total distribution. Among the 12,724 syllable instances, 316 are monosyllables, 3,752 are front (word-initial) syllables, 4,904 are middle (word-medial) syllables, and 3,752 are back (word-final) syllables. This shows that the language has a small number of monosyllabic words and that most words consist of at least 3 syllables.

A male speaker recorded the set of prompts using a normal microphone in a quiet room. 183 sentences were recorded at 22050 Hz, yielding a speech corpus of 40 minutes. The 183 utterances were hand-labeled at the phone level using the EMU Labeler tool.

4.2. Feature Extraction and Clustering

The labeled speech database was processed by applying simple power normalization to each utterance. The maximum and minimum pitch values of the speaker were determined using the KTH Wavesurfer Free Pitch Marker Tool.
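The power normalization step can be pictured with a short sketch. The Python below scales an utterance so that its root-mean-square (RMS) amplitude matches a common target. The target value, the function names, and the choice of RMS as the power measure are assumptions made for illustration; the procedure actually used in the Festvox tools may differ in detail.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def power_normalize(samples, target_rms=0.1):
    """Scale an utterance so that its RMS amplitude equals target_rms.

    Empty or completely silent input is returned unchanged, since no
    gain can be computed for it. target_rms is an assumed value.
    """
    if not samples:
        return []
    current = rms(samples)
    if current == 0.0:
        return list(samples)
    gain = target_rms / current
    return [s * gain for s in samples]


# One second of a quiet 220 Hz tone at the 22050 Hz rate used for the corpus.
utterance = [0.02 * math.sin(2 * math.pi * 220 * n / 22050) for n in range(22050)]
normalized = power_normalize(utterance)
print(round(rms(normalized), 3))  # 0.1
```

Applying the same target to every utterance removes loudness differences between recording sessions, so that units taken from different utterances can be joined without audible level jumps.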
The Festvox pitch extraction parameters were adjusted to this pitch range to obtain pitch features for the utterances. Mel Frequency Cepstral Coefficients were also extracted. The unit selection Amharic voice was built by applying the unit clustering algorithm to the units of the database. Further details of this algorithm can be found in [6].

5. PERCEPTUAL EVALUATION OF AMHARIC VOICE

To evaluate the quality of the Amharic synthesizer, we conducted perceptual tests with 11 college students who are native speakers of the language: 2 females and 9 males, all between 20 and 30 years old. Each subject listened to 5 sentences and rated the naturalness and intelligibility of the speech. The subjects rated the quality of the speech output on the following scale:

Table 2: Perceptual Evaluation Categories

Category    Measure
Excellent   5
Very good   4
Good        3
Fair        2
Poor        1
Very poor   0

The results show that the average score of the Amharic synthesizer is 2.9 (which is categorized as good). The results are summarized in Table 3 and Table 4.

6. CONCLUSION AND RECOMMENDATION

In continuation of our efforts to build synthesizers and recognizers for new languages, in this paper we discussed the development of a unit selection voice for the Amharic language. We defined a transliteration scheme to work with the Amharic script and incorporated the Amharic phone set, syllabification rules, and letter-to-sound rules into Festvox. We selected the prompt list from various sources and built a unit selection voice for Amharic. Perceptual evaluation of the synthesizer showed that the quality of the voice is good (as categorized in the section above).

The following recommendations would further improve the quality of the Amharic synthesizer.

The epenthetic vowel is used mainly to split impermissible consonant clusters. There is scope for improving the algorithm we have used for handling the epenthetic vowel. The duration of the epenthetic vowel is usually much shorter than that of the same vowel when it is present in the written form of the text. Modeling the identification of the epenthetic vowel would improve the speech synthesis process as well as the automatic syllabification of speech waveforms.

Though the quality of the speech synthesizer is not high, it can be improved by:

- Proper selection of the unit. Since the language is phonetic, the syllable as a basic unit may outperform the phone as a basic unit.
- Optimal selection of the corpus. A corpus that proportionally covers all basic units and their variations will give better quality.

7. REFERENCES

[1] Alan W. Black and Kevin A. Lenzo, Building Synthetic Voices, for FestVox 2.0 Edition, 2003. http://www.festvox.org/bsv/
[2] Alan W. Black, Paul Taylor and Richard Caley, The Festival Speech Synthesis System, Edition 1.4, 1999. http://www.speech.cs.cmu.edu/festival/
[3] Getahun Amare, ዘመናዊ የአማርኛ ሰዋስው በቀላል አቀራረብ (Modern Amharic Grammar in a Simple Approach), 96.
[4] Mulugeta Seyoum, The Syllable Structure and Syllabification in Amharic, Master of Philosophy in General Linguistics thesis, Department of Linguistics, Trondheim, Norway, 2001.
[5] Andrew Radford et al., Linguistics: An Introduction, Cambridge University Press, 1999.
[6] Alan W. Black and Paul Taylor, Automatically clustering similar units for unit selection in speech synthesis, in Proceedings of EUROSPEECH 97, pp. 601-604, 1997.

Table 3. Result by sentence

Rank         Excellent   Very good   Good   Fair   Poor   Very poor
Sentence 1   0           1           6      2      2      0
Sentence 2   0           3           2      5      1      0
Sentence 3   0           4           4      3      0      0
Sentence 4   2           3           4      2      0      0
Sentence 5   0           3           4      3      1      0

Table 4. Result by total average

Rank                  Excellent   Very good   Good    Fair    Poor   Very poor
Number of sentences   2           14          20      15      4      0
Percentage            3.6%        25.5%       36.4%   27.3%   7.3%   0%
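The totals in Table 4 and the reported average can be cross-checked from the per-sentence counts in Table 3. The sketch below is plain arithmetic in Python; the variable names are illustrative. It sums the ratings per category, weights them by the measures of Table 2, and computes the share of each category.

```python
# Measures from Table 2, in the order
# Excellent, Very good, Good, Fair, Poor, Very poor.
SCALE = [5, 4, 3, 2, 1, 0]

# Rating counts per sentence, copied from Table 3 (same column order).
TABLE_3 = {
    "Sentence 1": [0, 1, 6, 2, 2, 0],
    "Sentence 2": [0, 3, 2, 5, 1, 0],
    "Sentence 3": [0, 4, 4, 3, 0, 0],
    "Sentence 4": [2, 3, 4, 2, 0, 0],
    "Sentence 5": [0, 3, 4, 3, 1, 0],
}

# Column sums give the number of ratings in each category.
totals = [sum(column) for column in zip(*TABLE_3.values())]
n_ratings = sum(totals)

# Weighted mean of the measures, and percentage share per category.
mean_score = sum(s * c for s, c in zip(SCALE, totals)) / n_ratings
shares = [round(100 * c / n_ratings, 1) for c in totals]

print(totals)                # [2, 14, 20, 15, 4, 0]
print(n_ratings)             # 55
print(round(mean_score, 1))  # 2.9
print(shares)                # [3.6, 25.5, 36.4, 27.3, 7.3, 0.0]
```

With the 55 ratings of Table 3 (11 subjects, 5 sentences each), the weighted mean is 160/55, or about 2.9, and the Poor category accounts for 4/55, or about 7.3%, of the ratings.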

APPENDIX A (I-X NOTATION): Amharic Phonetic List, IPA Equivalence and its ASCII Transliteration Table

Consonants

IPA     Transcription
[p]     [p]
[t]     [t]
[k]     [k]
[ʔ]     [ax]
[b]     [b]
[d]     [d]
[g]     [g]
[pʼ]    [px]
[tʼ]    [tx]
[tʃʼ]   [cx]
[q]     [q]
[f]     [f]
[s]     [s]
[ʃ]     [sx]
[h]     [h]
[sʼ]    [xx]
[tʃ]    [c]
[dʒ]    [j]
[m]     [m]
[n]     [n]
[ɲ]     [nx]
[l]     [l]
[r]     [r]
[j]     [y]
[w]     [w]
[v]     [v]
[z]     [z]
[ʒ]     [zx]

Vowels

IPA     Transcription
[E]     [e]
[U]     [u]
[I]     [ii]
[A]     [a]
[e]     [ie]
[ɨ]     [ix]
[o]     [o]