Vowel classification based approach for Telugu Text-to-Speech System using symbol concatenation

International Journal of Computational Intelligence and Informatics, Vol. 1 : No. 4, January - March 2012


Pamela Chaudhury, K. Vinod Kumar
Department of CSE, ITER, SOA University, Bhubaneswar, India
Email: pamela.chaudhury@gmail.com, kvinod@gmail.com

Abstract: Telugu is one of the oldest languages of India. This paper describes the development of a Telugu Text-to-Speech (TTS) system using vowel classification. Vowels are the most important class of sounds in most Indian languages; the duration of a vowel is longer than that of a consonant and is highly significant. Here vowels are categorized as starting, middle and end according to their position of occurrence in a word. The algorithm we developed analyzes a sentence into words and then into symbols consisting of combinations of pure consonants and vowels. Wave files are merged as required to generate the modified consonants influenced by deergalu (vowel signs) and yuktaksharas, and thereby to generate speech from text. A speech unit database consisting of vowels (starting, middle and end) and consonants was developed. We evaluated the TTS using the Mean Opinion Score (MOS) for intelligibility and voice quality, with and without vowel classification, from sixty-five listeners, and obtained better results with vowel classification.

Key words: Indian Script Code for Information Interchange (ISCII); letter-to-phoneme mapping; vowel classification; unit selection; symbol concatenation.

INTRODUCTION

Text processing and speech generation are the two main components of a text-to-speech system. The objective of the text processing component is to process the given input text and produce an appropriate sequence of phonemic units. These phonemic units are realized by the speech generation component either by synthesis from parameters or by selection of units from a large speech corpus [1]. For a language such as English, a large pronunciation dictionary is used along with letter-to-sound rules to handle unseen words. Indian languages are phonetic in nature [2], so the letter-to-sound rules are comparatively easy to derive. For Telugu there is a good correspondence between the written text and the spoken language. However, for some Indian languages such as Hindi, Oriya and Bengali, the rules for mapping letters to phonemes are not so straightforward. Developing a Telugu TTS is easier than for Hindi, Oriya or Bengali because Telugu does not require schwa deletion [3]: the vowel /a/ that inherently accompanies every consonant is called the schwa, and in those languages it is suppressed depending on the context in which the consonant is used (kamala, for example, is pronounced kamal in Hindi).

LANGUAGE PROCESSING UNIT

The objective of the language processing unit, or text processing unit, is to process the given input text and produce an appropriate sequence of phonemic units. These phonemic units are realized by the speech generation component by selecting units from a large speech corpus. For natural-sounding speech synthesis, it is essential that the language processing unit produce an appropriate sequence of phonemic units for an arbitrary input text.

A. Text-to-phoneme conversion

Generating the sequence of phonetic units for a given written word is referred to as the letter-to-phoneme (or text-to-phoneme) rule. The complexity of these rules and their derivation depends upon the nature of the language. In our Telugu TTS the input is Telugu text encoded in the Indian Script Code for Information Interchange (ISCII).
The text may be typed in through an ISCII keyboard or read from a pre-stored ISCII file. As an example, the sequence of ISCII codes corresponding to the input text Manamanta bharateeyulam ("We all are Indians") is parsed into the following sequence of basic units: ma, na, ma, n, ta, blank, bha, ra, tee, yu, la, m. The basic units of the writing system in Indian languages are aksharas; they are the orthographic representation of speech sounds. For Telugu we likewise divide the character set into vowels and consonants, and the vowel signs are known as deergalu. Vowels play a major role in the pronunciation of any word. A simplified sketch of this parsing step is shown below.
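As a rough illustration of this parsing step, the sketch below splits a romanized word into basic units of the form C, V, CV, CCV and so on. It is only a sketch: it assumes a simplified romanized input rather than ISCII codes, and the vowel and consonant inventories are small illustrative subsets of the Telugu character set.

```python
# Minimal parsing sketch (assumed romanized input instead of ISCII codes;
# the inventories below are illustrative subsets, not the full Telugu set).

# Multi-letter symbols are listed first so that the longest match wins.
VOWELS = ["aa", "ai", "ae", "ee", "oo", "oa", "a", "i", "u", "e", "o"]
CONSONANTS = ["bh", "kh", "gh", "ch", "th", "dh", "k", "g", "t", "d",
              "n", "p", "b", "m", "y", "r", "l", "v", "s", "h"]

def next_token(text, inventory):
    """Return the first symbol in `inventory` that prefixes `text`, or None."""
    for sym in inventory:
        if text.startswith(sym):
            return sym
    return None

def parse_word(word):
    """Split a romanized word into basic units such as C, V, CV, CCV, CVV."""
    units, i = [], 0
    while i < len(word):
        unit = ""
        # consume one or more pure consonants (conjuncts yield CC...)
        while (c := next_token(word[i:], CONSONANTS)) is not None:
            unit += c
            i += len(c)
        # consume an optional vowel / vowel sign
        v = next_token(word[i:], VOWELS)
        if v is not None:
            unit += v
            i += len(v)
        if not unit:        # unknown character: skip it
            i += 1
            continue
        units.append(unit)
    return units

print(parse_word("bharateeyulam"))  # ['bha', 'ra', 'tee', 'yu', 'la', 'm']
print(parse_word("manamanta"))      # ['ma', 'na', 'ma', 'nta']
```

A full implementation would work on the ISCII codes directly and would handle vowel signs, the anusvara and conjunct consonants explicitly (so that, for instance, manamanta yields the pure consonant n followed by ta rather than a single nta unit).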

Vowels are the most interesting class of sounds in any language, and their duration within a word is also highly significant. Indian languages originated from Sanskrit, and the articulation of vowels is described in the Vedas. Vowels play a major role in the pronunciation of any word. Each vowel is classified as starting, middle or end according to its position of occurrence in a word. The Telugu vowels are /a/, /aa/, /i/, /ee/, /u/, /oo/, /e/, /ae/, /ai/, /o/ and /oa/, and the consonants run from /k/ to /h/. Vowels are always voiced sounds, produced with the vocal cords in vibration, while consonants may be either voiced or unvoiced. Vowels have considerably higher amplitude than consonants and are also more stable and easier to analyze and describe acoustically. Because consonants involve very rapid changes, they are more difficult to synthesize properly.

SIGNAL PROCESSING UNIT

Given the sequence of phones, the objective of the signal processing unit is to synthesize the acoustic waveform. While the articulatory model suffers from inadequate modeling of the motions of the articulators, parametric models require a large number of rules to capture coarticulation and prosody. An alternative is to concatenate pre-recorded speech segments [4]. For the development of the text-to-speech synthesizer for Telugu we have used concatenation of pre-recorded speech units. Connecting pre-recorded natural utterances is one of the easiest ways to produce intelligible and natural-sounding synthetic speech. However, concatenative synthesizers are usually limited to one speaker and one voice, and usually require more memory than other methods. Current state-of-the-art speech synthesizers generate natural-sounding speech by using an inventory of a large number of speech units; storing a large number of units and retrieving them in real time is feasible because of cheap memory and computational power.

A. Concatenation Technique

One of the most important aspects of concatenative synthesis is choosing the right speech unit length. The selection is usually a trade-off between longer and shorter units. With longer units, high naturalness, fewer concatenation points and good control of coarticulation are achieved, but the number of required units increases and the speech unit database grows. With shorter units, less memory is needed, but the sample collection and labeling procedures become more difficult and complex; naturalness is reduced, the number of concatenation points increases, and coarticulation is not captured. In present TTS systems the units used are usually words, syllables, phonemes, diphones, and sometimes even triphones. The approach of using an inventory of speech units is referred to as the unit selection approach. We have selected phones [6] (phonemes, diphones and triphones) as the basic speech units for the Telugu text-to-speech synthesizer. The signal processing unit concatenates at the symbol level. A symbol can be C, V, CV, CCV, VC, VCC or CVV, where C is a consonant and V is a vowel. If the symbol is a bare C or V then a phoneme is concatenated. If the C is associated with a single vowel sign then a diphone is selected for concatenation. If the C is associated with two vowel signs then a triphone is selected for concatenation. With symbol-based concatenation, conjuncts can be easily uttered and understood by listeners. Wave files are merged as required to generate the modified consonants influenced by deergalu (vowel signs) and yuktaksharas, and thus to generate speech from text. A sketch of this symbol-to-unit mapping and waveform merging is given below.
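The sketch below illustrates this symbol-level concatenation. The unit_type function encodes the stated rule (a bare C or V selects a phoneme, one vowel sign a diphone, two vowel signs a triphone), and concatenate_units merges pre-recorded unit wave files. The file names, the units/ directory and the assumption that every unit is stored as a mono PCM .wav file with identical recording parameters are illustrative assumptions, not the paper's actual database layout.

```python
# Illustrative sketch of symbol-level unit selection and waveform concatenation.
import wave

def unit_type(num_consonants, num_vowel_signs):
    """Map a symbol's structure to the kind of stored unit to retrieve."""
    if num_consonants == 0 or num_vowel_signs == 0:
        return "phoneme"   # bare V or bare C
    if num_vowel_signs == 1:
        return "diphone"   # consonant modified by one vowel sign
    return "triphone"      # consonant with two vowel signs (e.g. CVV)

def concatenate_units(unit_files, out_path):
    """Merge the pre-recorded unit wave files into a single output waveform."""
    frames, params = [], None
    for path in unit_files:
        with wave.open(path, "rb") as w:
            if params is None:
                params = w.getparams()     # all units share rate/width/channels
            frames.append(w.readframes(w.getnframes()))
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        out.writeframes(b"".join(frames))

# Example with hypothetical unit files for the word "kavita":
# concatenate_units(["units/ka.wav", "units/vi.wav", "units/ta.wav"], "kavita.wav")
```

In practice some smoothing at the join points is usually needed as well; plain frame-level concatenation is the simplest possible baseline.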
Pronunciation of tetraphones and pentaphones is not yet possible with the synthesizer. Vowels are additionally categorized into three types depending upon their position in the word: starting, middle and end. During the concatenation stage the appropriate vowel variants are selected from the speech unit database at run time. This results in better speech quality than using uncategorized vowels.

B. Generation of Speech Unit Database

The speech unit database is of primary importance for a good text-to-speech system. The quality of the database depends primarily on selecting utterances that cover all possible units and on recording those utterances with a good voice talent. The selection of utterances is linked to the choice of unit size: the larger the unit, the larger the number of utterances needed to cover all units. The natural speech must be recorded so that all used units (phonemes) within all possible contexts (allophones) are included. After this, the units must be labeled or segmented from the spoken speech data, and finally the most appropriate units must be chosen.

Gathering the samples from natural speech is very time-consuming. Acoustic quality is maintained by recording the voice in a noise-free environment. The recorded data is then digitized using the WaveSurfer software. The present system employs 16-bit data, and every phoneme is recorded at the same fixed sampling rate, which is sufficient to maintain voice quality. The phoneme database consists of all the consonants and vowels.

I. VOWEL CLASSIFICATION

Vowels are the most important class of sounds in most Indian languages and are longer in duration than consonant sounds. Our hypothesis is that if vowel sounds can be synthesized perfectly by the machine then the achieved sound quality will be better [5]. We have therefore categorized vowels as starting, middle and end according to their position of occurrence in a word [7]. As the vowels are dominant in the utterance, they are stored for the different durations with which they occur in the word. For this purpose each vowel is recorded and then segmented into starting, middle and end parts. Segmentation is done in such a way that each segment not only represents the vowel but also defines whether it belongs to the starting, middle or end of a word. There is a distinctive difference between the same vowel when it occurs at the starting, middle and end of a word, so concatenating the right segment of the vowel clearly improves the quality of the speech.

Fig 1: Waveform of vowel ee at the start of a word, as in the word Vinod
Fig 2: Waveform of vowel ee in the middle of a word, as in the word kavita
Fig 3: Waveform of vowel ee at the end of a word, as in the word kavi

The speech unit database therefore consists of three sets of each vowel and three sets of each vowel sign. Each vowel and vowel sign follows a specific naming convention; this convention makes it possible to retrieve the exact segment of the vowel at run time, resulting in a clean concatenation. A sketch of this position-dependent lookup is given below.
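Since the paper does not spell out the naming convention itself, the sketch below uses a hypothetical scheme such as units/ee_start.wav, units/ee_mid.wav and units/ee_end.wav to show how a position-dependent vowel unit could be resolved at run time.

```python
# Hypothetical position-dependent vowel lookup (the file-name scheme is assumed,
# not taken from the paper).

def vowel_position(index, n_units):
    """Classify a vowel occurrence by where it falls in the word's unit sequence."""
    if index == 0:
        return "start"
    if index == n_units - 1:
        return "end"
    return "mid"

def vowel_unit_file(vowel, index, n_units, db_dir="units"):
    """Return the database file for this vowel occurrence, e.g. units/ee_mid.wav."""
    return f"{db_dir}/{vowel}_{vowel_position(index, n_units)}.wav"

# "kavita" parsed as ka-vi-ta: the /i/ is a middle vowel, the final /a/ an end vowel
print(vowel_unit_file("i", 1, 3))   # units/i_mid.wav
print(vowel_unit_file("a", 2, 3))   # units/a_end.wav
```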

II. EVALUATION

The developed TTS has good voice quality. We evaluated it using the Mean Opinion Score (MOS) for intelligibility and voice quality, with and without vowel classification, from sixty-five listeners. The first two charts show the performance without vowel classification and the last two the performance with it.

Fig 4: MOS of TTS intelligibility without vowel classification
Fig 6: MOS of TTS intelligibility with vowel classification
Fig 5: MOS of TTS voice quality without vowel classification
Fig 7: MOS of TTS voice quality with vowel classification

The MOS for both intelligibility and voice quality was higher with vowel classification than without it.
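For reference, the MOS itself is simply the average of the listeners' ratings on the chosen opinion scale; the sketch below shows the computation with made-up ratings, since the individual listener scores are not reproduced here.

```python
# MOS = mean of per-listener ratings (the rating values below are hypothetical).

def mean_opinion_score(ratings):
    """Average of the listeners' ratings on the chosen opinion scale."""
    return sum(ratings) / len(ratings)

intelligibility_with_vc = [4, 5, 4, 4, 3, 5, 4]      # hypothetical listener scores
intelligibility_without_vc = [3, 3, 4, 2, 3, 3, 4]

print(round(mean_opinion_score(intelligibility_with_vc), 2))     # 4.14
print(round(mean_opinion_score(intelligibility_without_vc), 2))  # 3.14
```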

III. CONCLUSION AND FUTURE WORK

The text-to-speech synthesizer for Telugu has been successfully designed and developed using concatenation of pre-recorded speech units. Classifying vowels by their position is a relatively simple technique, and it improves the quality of the speech to a great extent. Symbol-based concatenation is better than techniques such as diphone-based concatenation because the speech unit database is smaller, which reduces the memory requirements, and conjuncts can be uttered more clearly. Vowel classification aims at synthesizing vowels in a near-perfect way depending upon their position in a word. The beauty of this TTS is that, with very minor changes to the software, a text-to-speech system can be developed for most Indian languages. The generated speech currently has no prosody; adding prosody is the subject of our future work.

REFERENCES

[1] Anand Arokia Raj, Tanuja Sarkar, Satish Chandra Pammi, Santhosh Yuvaraj, Mohit Bansal, Kishore Prahallad, Alan W. Black, "Text Processing for Text-to-Speech Systems in Indian Languages," 6th ISCA Workshop on Speech Synthesis, Bonn, Germany, August 2007.
[2] Anil Kumar Singh, "A Computational Phonetic Model for Indian Language Scripts," in online proceedings of Constraints on Spelling Changes: Fifth International Workshop on Writing Systems, Nijmegen, The Netherlands, October 2006.
[3] Monojit Choudhury, "Rule Based Grapheme to Phoneme Mapping for Hindi Speech Synthesis," 90th Indian Science Congress of ISCA (abstract published), Bangalore, India, 2003.
[4] Anupam Basu, Debasish Sen, Shiraj Sen and Soumen Chakraborty, "An Indian Language Speech Synthesizer: Techniques and Applications," National Systems Conference, Indian Institute of Technology, Kharagpur, December 17-19, 2003.
[5] Susan Choge, "Understanding Kiswahili Vowels," The Journal of Pan African Studies, March 2009.
[6] Hunt, A. J. and Black, A. W., "Unit selection in a concatenative speech synthesis system using a large speech database," in Proceedings of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 373-376, 1996.
[7] Carlson, R. and Nord, L., "Vowel dynamics in a text-to-speech system: some considerations," in Proceedings of Eurospeech '93, pp. 1911-1914, Berlin, 1993.