Regional Winner paper in CSI-YITPA(E) 2002

Bengali text-to-speech synthesis system, a novel approach for crossing literacy barrier

Shyamal Kr. DasMandal & Barnali Pal
Electronics Research & Development Center of India, Calcutta
A Scientific Society of the Ministry of Communications and Information Technology, Government of India
Plot E-2/1, Block GP, Sector-V, Bidhannagar, Kolkata 700 091, India

Abstract: In this age of information technology, information exchange methodologies that overcome the barrier of human limitations have gained importance. Since speech is a primary mode of communication among human beings, it is natural for people to expect to be able to carry out spoken dialogue with computers. This involves the integration of speech input/output technologies and language technologies. Speech synthesis is the automatic generation of an artificial speech signal by a computer. In the last few years, this technology has become widely available for several languages on different platforms, ranging from personal computers to stand-alone systems. If the vocabulary is very limited, very natural speech is possible by merely concatenating stored speech units. Most voice response systems, such as those for paying bills by telephone, apply this method, which is much simpler than a real speech synthesizer. A true text-to-speech (TTS) system should be able to accept any input text in the chosen language, including new words and typographical errors. This paper reports the successful development of a text-to-speech synthesis system for the Bengali language at ER&DCI, Calcutta. The various steps involved and the problems encountered in the development of such solutions are highlighted. The paper concludes with a description and demonstration of reading an online newspaper from a website, which is one typical application of this technology.
This system can help the general population overcome the literacy barrier, empower the visually impaired population, and increase the possibilities for improved man-machine interaction through online newspaper reading from the Internet.
1 Introduction:

Voice technology applications have created a growing demand for multi-lingual, multi-voice, multi-style speech synthesis systems and, above all, for a natural-sounding voice close to the quality of prerecorded speech. An unlimited continuous speech synthesizer is one that can convert any given text into continuous speech within the realm of a language and is not restricted by vocabulary or syntax. Many techniques are available for speech synthesis, such as formant synthesis, concatenative synthesis and articulatory synthesis. Our application uses the concatenative approach. The concatenative model uses prerecorded samples of different lengths derived from natural speech, and is probably the easiest way to produce intelligible and natural-sounding synthetic speech.

One of the most important aspects of concatenative synthesis is finding the correct unit length for the speech components. The selection is usually a trade-off between longer and shorter units. With longer units, high naturalness, fewer concatenation points and good control of co-articulation are achieved, but the number of required units and the memory needed increase. With shorter units, less memory is needed, but the sample collection and labeling procedures become more difficult and complex. In concatenative synthesis the speech units are usually words, syllables, demi-syllables, phonemes, and sometimes even triphones. In the present system, partnemes are mainly used as units. The advantage of using the partneme as the basic unit over all others is the simplicity of introducing intonation and prosodic rules into the synthesized speech signal.

Using this technology, the developed TTS system delivers good-quality speech output and can be deployed in many kinds of applications, such as online newspaper reading from the Internet, overcoming the literacy barrier, empowering the visually impaired population, and enhancing other information systems.
2 Bengali Text-to-Speech Synthesis system:

Fig-1 gives a schematic block diagram of a Bengali TTS. The system consists of two main blocks: a) Text analyzer b) Synthesizer.

[Block diagram labels. Text Analyzer block: Bengali text; text analyzer; phonological rules and exceptional word list; prosodic and intonation information; phoneme string with prosody and intonation parameters. Synthesizer block: phonetic synthesizer; partneme signal dictionary; segmentation of the speech signal; synthesized speech output.]
Fig.1

2.1 Text Analyzer

The input text is essentially a string of characters. It might be data from a word processor, standard ASCII or ISCII text from e-mail, online newspaper text, or scanned text from newspapers. The first task faced by the text analyzer is the conversion of the input text into a linguistic representation, i.e. grapheme-to-phoneme conversion. This conversion is highly language dependent and requires language-specific phonological, prosodic and intonation rules. Text containing digits and numerals is converted into full words based on number system rules.

2.2 Synthesizer

In a concatenative synthesis system, speech is produced by taking the phoneme string and the information for intonation and prosody as input. In this approach the quality of the synthesizer output depends mainly on the quality of the information contained in the basic building block, i.e. the partneme dictionary, which contains parts of phonemes as basic sound units. Partnemes include vowels, consonants, consonant-vowel transitions, vowel-consonant transitions and vowel-vowel transitions. The technique implemented here for concatenating these basic sound units to produce synthetic speech is called the Epoch Synchronous Non-Overlap Add method (ESNOLA). While generating partnemes, two major aspects are considered: (i) the pitch of each unit should ideally remain the same, and (ii) amplitude normalization is to be done depending on the nature of the vowel part of the signal. To satisfy the first criterion, pitch detection and the necessary modification are done.

Intonation, stress, rhythm, duration etc. are called prosodic or suprasegmental elements, and are required for introducing naturalness into the synthesized speech. They are in turn related to fundamental frequency, segmental duration and also complexity variation.
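The role of the unit dictionary and of amplitude normalization described above can be illustrated with a minimal sketch. It is not ESNOLA: it performs no epoch (pitch-mark) detection or pitch modification, and it normalizes every unit to a common peak rather than according to the vowel part of the signal. The unit names, sample values and function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def normalize_peak(samples: List[float], target_peak: float = 1.0) -> List[float]:
    """Scale a unit so that its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty unit: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def concatenate_units(unit_names: List[str],
                      dictionary: Dict[str, List[float]],
                      target_peak: float = 1.0) -> List[float]:
    """Join stored units end to end, without overlap, after normalization."""
    output: List[float] = []
    for name in unit_names:
        if name not in dictionary:
            raise KeyError(f"unit {name!r} is missing from the dictionary")
        output.extend(normalize_peak(dictionary[name], target_peak))
    return output


# Hypothetical toy dictionary: a consonant, a consonant-vowel transition
# and a vowel, each stored as a short list of samples.
TOY_DICTIONARY = {
    "k": [0.0, 0.2, -0.1],
    "k-a": [0.1, 0.4, -0.4],
    "a": [0.5, -0.25, 0.5, -0.25],
}

signal = concatenate_units(["k", "k-a", "a"], TOY_DICTIONARY, target_peak=0.8)
```

Because every unit is scaled to the same peak before joining, abrupt loudness jumps at the concatenation points are reduced; a real system would additionally align the joins with pitch epochs, which this sketch leaves out.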
To make a computer speak like a human while reading a text, it is necessary to make the computer understand the intended meaning and the emotional and physical state of the speaker, using some form of artificial intelligence. Through duration and the specification of fundamental frequency contours, we can introduce prosody in synthetic speech.

3 Applications of Bengali TTS system

The developed speech synthesizer finds many applications. Some of them are listed below:
(i) Reading a newspaper from the Internet.
(ii) Conveying information to people over the telephone or over a local broadcasting system (village center). The information may be the arrival or departure of a train/plane, share or commodity prices, weather, the name and address against any telephone number, etc.
(iii) Children's fable reading.
(iv) Helping barely literate people, or people not conversant with English, to receive information from a computer.
(v) Empowering the visually impaired population.
(vi) Reading e-mail over telephone/cell phone.
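Application (i) requires recovering the readable text from the HTML source of a web page before it is passed to the synthesizer. A minimal sketch of that extraction step is given here, using Python's standard `html.parser` module. The class name, function name and sample page are hypothetical, and any site-specific decoding of the Bengali text is not covered.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text chunks, ignoring script and style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def extract_text_blocks(html_source: str) -> List[str]:
    """Return the text chunks of a page in document order."""
    parser = TextExtractor()
    parser.feed(html_source)
    parser.close()
    return parser.chunks


# Hypothetical sample page used only to exercise the function.
SAMPLE_PAGE = (
    "<html><head><style>h1{color:red}</style></head><body>"
    "<h1>Headline one</h1><p>First   block of text.</p>"
    "<script>var x = 1;</script><p>Second block.</p></body></html>"
)

blocks = extract_text_blocks(SAMPLE_PAGE)
```

Keeping the chunks in document order is what would allow a reader program to step through them one block at a time.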
3.1 Reading News paper from Internet

The newspaper is a very important medium for gathering information about recent events. If a newspaper can be read out by a machine, that helps this objective considerably. Our system currently provides this facility in a very intelligible manner. It provides the following features for reading an on-line Bengali newspaper from the Internet:
a) Headline reading, by which one can listen to only the headlines of the newspaper.
b) Block-wise reading of the newspaper in the following fashion: i) First block ii) Next block iii) Previous block.
c) Stop and Resume of the reading.

Using hyperlinks, the reader can also go to the relevant news details, which can then be read out by the synthesizer. The program first decodes the text information from the HTML source code of the newspaper website, then uses our Bengali text-to-speech synthesis system to read out the text of the selected portion of the web page. The schematic block diagram of the integrated system for reading an on-line newspaper from a website is shown below.

[Block diagram stages: newspaper site; downloading the newspaper at the client site; extracting text information from the HTML source code; decoding the text information; Bengali TTS system; synthesized output.]

Fig.2 Block Diagram of Integrated system for on-line news paper reading

4 Conclusion

The output of the Bengali synthesizer developed by ER&DCI, Calcutta is fairly clear, with a good degree of phonetic naturalness. The integrated system for reading the on-line AnandaBazar Patrika will enable people to get information through the Internet conveniently and efficiently. This will attract more people to come forward and take the benefit of IT-enabled services.
This system will also be helpful for the physically impaired population who cannot use speech as their primary means of communication. With some modifications this system may be used for other languages.

Acknowledgements

We thank Mr. C. N. Ajit and Mr. A. Bandyopadhyay, ER&DCI, Calcutta, for their valuable suggestions in preparing the document.

References

[1] A. Bandyopadhyay, Some Important Aspects of Bengali Speech Synthesis System, The Indo-European Conference on Multilingual Communication Technologies (IEMCT), June 2002, Tata McGraw-Hill, pp. 95-100.
[2] T. Dutoit, Introduction to Text-to-Speech Synthesis, Kluwer Academic Publishers, Netherlands, 1997.
[3] A. K. Datta, N. R. Ganguly, B. Mukherjee, Intonation in segment-concatenated-speech, Proc. ESCA Workshop on Speech Synthesis, Sep 1990, France, pp. 153-156.