Speech Synthesis: An Alternative Approach to a Different Problem

Hans Kull, Member, IEEE

Hans Kull is with Informatic Technologies Pty Ltd, Geelong, Vic., Australia (telephone: +61 3 5222 1030, e-mail: kull@inmatic.com). ISBN 1-86467-114-9.

Abstract: Current speech synthesis applications and tools are built to generate speech from text automatically, without the need for human intervention. Whilst speaking to people in the multimedia and games industry, I became aware of a demand for speech synthesis applications in which users are able to manipulate the generated speech in various ways. This paper describes the design of the user interface and the current prototype that was developed to address these needs. The prototype shows how the user can set and change voices and manipulate the text, pronunciation and stress, prosody and volume of the generated speech. Furthermore, it shows how the user can modify segments of the speech to produce additional effects like echo or telephone. The software architecture showing how this functionality was implemented is presented too. The last chapter describes the speech engine under development for the definitive program version, and it also presents ideas on how to express emotions like happiness, fear or anger.

Index Terms: Speech Synthesis, Text-to-Speech, Speech Processing, Acoustic Signal Processing

I. INTRODUCTION

Imagine you intend to produce a radio feature. For every role in your play you need to find an actor with the desired voice. You need to arrange a production time that suits all actors, you need to find a sound studio and a technician to operate it, and of course this date has to fit into your own schedule too, so you can direct the play. This sixty-year-old approach has become very expensive. That is the reason why so few new radio features are produced nowadays. The multimedia, games and other industries face similar problems when it comes to including speech in their products. This has led to a culture of avoiding, or at least minimising, the need for speech in multimedia applications. The games Myst and Riven (both trademarks of Cyan, Inc.) of Brøderbund Software, Inc. are famous examples of this approach.

Imagine you could do all of that on your computer. All you need is a software package in which you can set up your actors (voices), import the text, assign the text passages to the voices and start directing the play.

[Fig. 1. Speech synthesiser user interface.]

Figure 1 illustrates what is meant by directing the play. The user is presented with an editor that allows him to manipulate text, phonetics, prosody and volume. He can press the play button and check the generated speech. If he is not totally satisfied, he can modify whatever he needs to change, play it again and so forth, until he has worked himself through the entire text, in a similar manner as he previously did with his actors.

There are more possibilities for manipulating the output of the program. For example, effects like phone or echo can be added, and subsequent sections describe ways in which the user can specify emotions like happiness, anger and fear. However, before going into further detail, an overview of the components of the application is provided.

II. PROGRAM COMPONENTS

A. Overview

The concept behind the application is described in [1]. The most important parts of our speech synthesiser program are the speech editor and the speech synthesiser. If the user chooses to play a part of the text, the speech editor passes speaker parameters, prosody and pronunciation information to the speech synthesiser. This in turn generates the speech signal and directs it to the audio device of the computer or into a file, depending on the user's output options. For the editor to work properly, a number of helper modules are needed. The user is supported with as much default functionality as possible, so that he need not deal with trivialities and can concentrate on more important tasks.

[Fig. 2. Overall application architecture: Text Import, Dictionary, Speaker Editor, Prosody Generation, Speech Engine, Effects and Mixer.]

B. The Dictionary Tool

The dictionary tool allows the user to import dictionaries, create new dictionaries from scratch or by modifying an existing one, and change pronunciation and basic prosody for every word in the dictionary. For words with more than one pronunciation, like "read" or "lead", he can define additional pronunciation and prosody pairs, and he can store hints on the use of a particular pair depending on the given situation. The most common uses of the dictionary tool are to add new words to a given dictionary and to create a new dictionary from an existing one. To define effects like accents or dialects, the user can copy a dictionary and then modify the pronunciation of all the words in it by entering modification rules. As an overly simplified example, he could replace the pronunciation of "th" in English (i.e. /θ/) by the pronunciation for "z" (/s/) to generate a German accent. The user can modify pronunciation and prosody of a single word directly in the editor, and he can always store his modification in the dictionary or keep it local to this particular instance in the text.

An important function of the dictionary tool is to provide default pronunciation and prosody for every word entered in the editor. If the dictionary finds more than one pronunciation of a word, it has to decide which one to choose. On the other hand, if it does not find a given word, it has to generate pronunciation and prosody based on rules which are language-dependent. Therefore, every dictionary stores its base language and its modifiers.
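
A minimal Python sketch of this lookup behaviour is given below. It is illustrative only: the class and method names (PronDict, lookup, derive) and the data layout are assumptions made for the example, not parts of the prototype. The sketch returns a stored pronunciation and prosody pair where one exists, uses a stored hint to choose between alternatives, falls back to placeholder letter-to-sound rules for unknown words, and derives an accent dictionary by rewriting phonemes, e.g. /θ/ to /s/.

    class PronDict:
        def __init__(self, language, entries=None):
            self.language = language
            # word -> list of (phonemes, prosody, hint) alternatives
            self.entries = dict(entries or {})

        def add(self, word, phonemes, prosody, hint=None):
            self.entries.setdefault(word, []).append((phonemes, prosody, hint))

        def lookup(self, word, context=None):
            """Return a (phonemes, prosody) pair, falling back to letter-to-sound rules."""
            alternatives = self.entries.get(word)
            if alternatives:
                # With several pronunciations, prefer the one whose hint matches the context.
                for phonemes, prosody, hint in alternatives:
                    if hint is None or hint == context:
                        return phonemes, prosody
                return alternatives[0][:2]
            # Unknown word: generate pronunciation by language-dependent rules.
            return self._letter_to_sound(word), "default"

        def _letter_to_sound(self, word):
            # Placeholder for the real, language-dependent rules.
            return list(word.lower())

        def derive(self, rules):
            """Create an accent or dialect dictionary by rewriting all pronunciations."""
            new = PronDict(self.language)
            for word, alternatives in self.entries.items():
                for phonemes, prosody, hint in alternatives:
                    new.add(word, [rules.get(p, p) for p in phonemes], prosody, hint)
            return new

    # Example: a crude "German accent" dictionary replacing /θ/ with /s/.
    english = PronDict("en")
    english.add("think", ["θ", "ɪ", "ŋ", "k"], "default")
    german_accent = english.derive({"θ": "s"})
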
C. The Speaker Tool

This tool is used to define and modify speakers. A speaker is defined by voice parameters (see chapter III). A dictionary is also assigned to the speaker. To define a new speaker, the user defines a voice, usually by selecting a standard voice (child, young female, elderly female, young male or elderly male) and then modifying its parameters. The user can modify just the basic parameters like pitch, speed and volume, or he can go into the extended dialog and modify all the parameters he wants to. To get good feedback, the user can play a standard text after every modification he has made. The extended mode is also helpful for creating unnatural voices, for example for comic figures, robots, computers or aliens.

A further extension of the speaker tool could be a tool which allows the creation of new voices from natural voices. The speaker would have to talk into a microphone for a certain amount of time. Depending on the type of speech synthesiser used, the recorded speech would then be analysed, the speaker parameters extracted and the results used to create a new speaker.
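
As a rough illustration of how such speaker definitions might be represented, here is a small Python sketch. The field names, preset values and the new_speaker helper are assumptions made for the example; they are not the prototype's actual voice model.

    from dataclasses import dataclass, field, replace

    @dataclass
    class Voice:
        pitch_hz: float                                # base pitch
        speed: float                                   # speaking-rate multiplier
        volume: float                                  # 0.0 .. 1.0
        extended: dict = field(default_factory=dict)   # e.g. formant shifts, breathiness

    # Standard voices the user starts from (values are placeholders).
    STANDARD_VOICES = {
        "child":          Voice(pitch_hz=300.0, speed=1.1,  volume=0.8),
        "young female":   Voice(pitch_hz=220.0, speed=1.0,  volume=0.8),
        "elderly female": Voice(pitch_hz=190.0, speed=0.9,  volume=0.7),
        "young male":     Voice(pitch_hz=120.0, speed=1.0,  volume=0.8),
        "elderly male":   Voice(pitch_hz=100.0, speed=0.85, volume=0.7),
    }

    @dataclass
    class Speaker:
        name: str
        voice: Voice
        dictionary: str    # name of the pronunciation dictionary assigned to the speaker

    def new_speaker(name, base="young male", dictionary="en", **changes):
        """Create a speaker from a standard voice with basic or extended modifications."""
        return Speaker(name, replace(STANDARD_VOICES[base], **changes), dictionary)

    # e.g. an unnatural, robot-like voice for a comic figure:
    robot = new_speaker("Robot", pitch_hz=80.0, speed=0.8,
                        extended={"formant_shift": 0.7})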

D. Prosody Generation

At the end of every sentence entered into the editor, the basic prosody of the sentence (given by the basic prosody of its words) is modified to produce the default prosody of the sentence. To do this, standard techniques are used as described in the literature, e.g. in [2] and [3].

E. Text Import

The text import tool is not only for importing plain text; it also helps to import texts that already contain speaker information, such as a play. In this case one can specify that a predefined name given at the beginning of a paragraph should translate into a speaker object with the same name. As another example, the user can specify that all text of a given font should translate into headings. Headings are parts of the text that are not passed to the speech engine and therefore remain silent. However, if necessary, you can ask the editor to pass the heading information to the synthesiser too. A predefined speaker, the narrator, then speaks the headings.

F. The Editor

Only headings do not have a speaker assigned in the editor; to all other text a speaker is assigned first. As soon as a particular word knows its speaker, it knows the dictionary it belongs to. The dictionary then delivers the phonetic and prosodic information needed to complete the word's information. At the end of every sentence the prosody generator is called. It modifies the basic prosody provided by the dictionary for every word.

After all assignments of speakers to text, the speech synthesiser can be provided with enough information to produce good quality speech. But this is not good enough for the intended user of our application. He wants to make his mark on the spoken text and let the speaker give much more expression to parts of it than can be generated automatically. The user can edit not only the text, but the phonetics and the prosody too. For words with more than one pronunciation, the user can look them up and simply select one of them. If he wants to modify the phonetics of a word, he can decide whether this pronunciation should be stored in the dictionary as a replacement of the existing one or as an additional one. If the word is not stored in the dictionary and its phonetics and prosody were generated by rule, he can store it in the dictionary too.

The user can change the prosody as well: he can change the pitch and duration of every note within a given range. Furthermore, the user can change the volume. In our prototype volume is visualised by the font size, and italics stand for whispering. Prosody and volume control give the user the possibility to address problems that come from the different meanings of a statement like "I want this error to be fixed today!", which has a completely different meaning depending on whether you stress, for example, "this" or "today". Emotions are specified in a similar way. In our prototype they are visualised by a coloured background of the text: blue stands for happiness, green for jealousy, red for anger and yellow for fear.
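
To make this concrete, the following Python sketch shows the kind of annotated unit the editor could pass to the synthesiser for each word. All field names and values are hypothetical; they merely mirror the controls described above (phonetics, per-note pitch and duration, volume, whispering and an emotion tag).

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Note:
        pitch_hz: float          # editable within a given range
        duration_ms: float

    @dataclass
    class WordUnit:
        text: str
        phonemes: List[str]              # from the dictionary, or edited by the user
        notes: List[Note]                # prosody, one note per syllable
        volume: float = 1.0              # shown as font size in the prototype
        whisper: bool = False            # shown as italics
        emotion: Optional[str] = None    # "happiness", "jealousy", "anger" or "fear"
        speaker: Optional[str] = None    # None only for silent headings

    # Stressing "this" in "I want this error to be fixed today!":
    stressed = WordUnit(text="this", phonemes=["ð", "ɪ", "s"],
                        notes=[Note(pitch_hz=180.0, duration_ms=260.0)],
                        volume=1.3, speaker="Narrator")
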
G. Effects

Additional effects, like background noise, talk or music, echo or a filter to emulate a telephone line, can be defined too. Although some of these effects could be added at a later stage, i.e. after generation of the speech signal, this functionality is provided in the editor to enable proper synchronisation of speech and effects.

H. Mixer

The mixer is a post-processing stage to the speech synthesiser. Effects like echo or filters are post-processing stages applied to the speech signal, additional sources like background noise are added, and in the mixer tool it is possible to adjust their volume. Effects and the mixer are not implemented in the prototype; it would be possible to use a standard tool readily available on the market to do their job. This functionality will nevertheless be provided to make sure everything is properly synchronised. For Internet use a stand-alone speech synthesiser is planned. This program will contain the mixer too, although with no user interface provided. The intent of this program is to provide the client with an application which turns the data stream passed to the speech synthesiser into audio output on the client's computer. Therefore additional effect and sound information, like background noise and its volume, will be passed on to this stand-alone synthesiser along with the speech information.
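
As an illustration of such post-processing, the sketch below adds a simple echo and mixes in a background source at an adjustable volume. It is a minimal example using numpy, not the planned mixer implementation, and the parameter values are arbitrary.

    import numpy as np

    def add_echo(signal, sample_rate, delay_s=0.25, decay=0.4):
        """Add one delayed, attenuated copy of the signal to itself."""
        delay = int(delay_s * sample_rate)
        out = np.concatenate([signal, np.zeros(delay)])
        out[delay:] += decay * signal
        return out

    def mix(speech, background, background_volume=0.2):
        """Mix a background source into the speech at a given volume."""
        n = max(len(speech), len(background))
        out = np.zeros(n)
        out[:len(speech)] += speech
        out[:len(background)] += background_volume * background
        return out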

III. THE SPEECH SYNTHESISER

A. Current Technologies

Currently there are two principal technologies used to generate automated speech: speech concatenation and formant synthesis. Speech synthesis based on concatenation uses recorded pieces of real speech. In text-to-speech applications, these recorded pieces are short utterances, usually containing combinations of two phonemes. Simply put, the synthesiser then, for example, concatenates the utterances for /ha/ and /at/ to create the utterance for /hat/. Arguably the best synthesiser based on this technology is AT&T's new text-to-speech synthesiser, see www.naturalvoices.com. Formant synthesis, on the other hand, uses a mathematical model of the human vocal tract to create speech. One of the well-known models is the Klatt synthesiser [4], on which, for example, the DECtalk speech synthesiser is based [5]. However, there is no longer a clear distinction between the two technologies, as we will see later.

B. Comparison

1) Basic Functionality

At first glance both technologies seem suitable for our purpose. Both have the same pre-processing stages, consisting of text parsing, letter-to-sound translation and prosody generation. In our application, these parts are done in the editor to give the user maximum control over the speech generated. Existing speech synthesis software packages do these processing steps automatically, and existing development kits like the Microsoft Speech SDK or the AT&T Speech SDK give little control over this process. This means that a specialist speech synthesiser has to be developed too, but the fact that these pre-processing steps are common to both technologies indicates that both can be used for the final synthesis steps.

2) Voice Creation

One important property a speech synthesiser has to provide in our application is the creation and modification of voices. In a speech concatenation synthesiser, recorded speech is used to extract the utterances needed. For AT&T's speech synthesiser it takes approximately 40 hours of recorded speech to reproduce a specific voice [6]. Although they hope this could be reduced to a few hours, this still requires too much effort. On the other hand, for our purpose it would be good enough in most cases if the new voice in some way resembles the original one, without any need for the listener to identify the original speaker. In a formant synthesiser, voices are stored as parameter sets. New voices can be created by modifying an existing parameter set. This sounds easier than it is, because a voice is described by many parameters that are not completely independent. Therefore, modifying just one parameter can lead to a very unnatural sounding voice. Creating a completely new voice from recorded speech could perhaps be achieved by using a modified version of a speaker identification algorithm as described in [7], which could provide the parameters needed.

3) Emotions

Although our prototype currently allows the user to enter emotions, there is no functionality implemented yet in the synthesiser to process this information. Part of the necessary modifications could be made in the pre-processing stage. Sadness or depression, for example, is expressed by a monotone, low voice; this is easily achieved by simply modifying prosody and volume. Sadness or depression is also characterised by the speaker letting his head hang down and almost speaking to himself, as opposed to standing straight and talking with a smile when he is happy. The positions of head and chest, or the smile, all change the speaker parameters. So if these changes are known, it is possible to modify the speaker parameters in formant synthesis. With a concatenation synthesiser (as used in the prototype), on the other hand, it could prove very difficult to generate the desired result. However, expression of emotions with the voice alone has its limitations. I believe it will work fine as long as the emotion expressed is supported by the spoken text. It will most likely still work if you want to tell a joke with a sad voice. However, producing a paradoxical message, e.g. a sad text spoken with a happy voice, has its limitations, because in life such messages depend not only on the auditory information but on the body language as well.
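
The "monotone, low voice" for sadness can be pictured with the small sketch below: lower the pitch, flatten the pitch range, slow the notes down slightly and reduce the volume. The factors are arbitrary examples chosen for illustration, not values used in the prototype.

    def apply_sadness(notes, volume, pitch_drop=0.85, flatten=0.4, volume_drop=0.7):
        """notes: list of (pitch_hz, duration_ms) pairs for one phrase."""
        mean_pitch = sum(pitch for pitch, _ in notes) / len(notes)
        new_notes = []
        for pitch, duration in notes:
            # Pull every pitch towards the mean to make the phrase monotone...
            flat = mean_pitch + flatten * (pitch - mean_pitch)
            # ...then lower it and lengthen the note a little.
            new_notes.append((pitch_drop * flat, duration * 1.1))
        return new_notes, volume * volume_drop
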
C. Hybrid Models

In a hybrid model, rather than storing phonemes or other parts of speech as a signal, the parameters of these signals are stored. These parameters are usually extracted from real spoken text. Then, instead of concatenating the speech signals, the signal is generated for every unit by means of its parameters (see Fig. 3), and rules are used to modify the parameters in the transitional steps. In an additional processing step after the parameter extraction, the parameters are normalised. Since the use of a hybrid model is intended, this normalisation process could prove crucial, because only a good standardisation will allow the modification of the parameters with the speaker, the prosody and other information.

D. Implementation with Sinusoidal Synthesis

As described in [7], in some respects sinusoidal synthesis has similarities to the filter bank representation used in the vocoder. However, since the use of the discrete Fourier transform (DFT) renders a highly adaptive filter bank, I prefer to use the basic idea of this method, with adaptations to our needs. Fig. 3 describes the basic architecture of the synthesiser we intend to implement. Parameters are generated for every frame from phonetic and prosody information, the speaker model and modifiers which, for example, express emotions. This step is described in more detail in the next chapter.

[Fig. 3. Speech synthesiser architecture: parameter generation from phonetics, prosody, the speaker model and modifiers; n (phase, frequency, amplitude) triples per frame; frame-to-frame interpolation with phase unwrapping; a noise generator and sine generators summed to form the speech output.]

The parameters are then used to generate a set of n phase, frequency and amplitude triples. For vowels, the duration information obtained from the melody line is used to determine the length of the frame. For every frame and parameter triple, the information is unwrapped and interpolated. A sine wave generator receives the frequency and phase and generates the sine wave, which is scaled by the amplitude. Every triple thus defines a signal, and these signals are summed to produce the synthetic speech output signal.

To avoid discontinuities at the boundaries of the frames, some provisions must be made to smoothly interpolate the parameters from one frame to the next. For this purpose, sine wave tracks for frequencies are established, where every frequency of the current frame is attached to the closest match of the previous one. If it is not possible to establish a good match, a track may die and later a new one may be born. How these tracks are constructed is described e.g. in [7], but there are other ways of interpolation, see for example [8] or [9]. A more difficult problem is the matching of the phases. Although computationally expensive, it is intended to use cubic phase interpolation, since this method produces the best results. These transition procedures are described in detail in [10].
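
The following heavily simplified Python sketch shows the core idea for one frame transition: matched (frequency, amplitude, phase) triples are interpolated across the frame and each partial is rendered by a sine generator, then summed. For simplicity it interpolates linearly and accumulates phase, whereas the synthesiser described above matches sine wave tracks and uses cubic phase interpolation as in [10]; the function and parameter names are invented for the example.

    import numpy as np

    def synth_transition(partials_a, partials_b, duration_s, sample_rate=16000):
        """partials_*: lists of (frequency_hz, amplitude, phase) triples, matched by index."""
        n = int(duration_s * sample_rate)
        alpha = np.linspace(0.0, 1.0, n)      # interpolation weight across the frame
        out = np.zeros(n)
        for (f0, a0, ph0), (f1, a1, _) in zip(partials_a, partials_b):
            freq = (1 - alpha) * f0 + alpha * f1            # linear frequency track
            amp = (1 - alpha) * a0 + alpha * a1             # linear amplitude track
            phase = ph0 + 2 * np.pi * np.cumsum(freq) / sample_rate
            out += amp * np.sin(phase)                      # one sine generator per triple
        return out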

E. Parameter Generation

1) The Normalisation Procedure

To understand the concept of parameter generation, I first have to explain how I intend to normalise the frame parameters for every frame. As mentioned earlier, normalisation is crucial to our synthesiser. Fig. 4 gives an example of the speech parameters extracted for a particular frame. The extraction process is described in detail e.g. in [7]. For every peak value a parameter triple is generated.

[Fig. 4. Spectral magnitude of a speech element (amplitude over frequency).]

Normalisation has to be such that the frames belonging to a speech element store the parameters in a form that is independent of the speaker. To do this, it is intended to record the speech elements from a natural speaker and then to normalise them by means of the speaker parameters. Furthermore, from the speaker parameters a standardisation window will be created, see Fig. 5.

[Fig. 5. Standardisation window (amplitude over frequency, with formants F1, F2 and F3 marked).]

For every frequency, the amplitude evaluated after recording is divided by the respective amplitude of the standardisation window. Then the frequencies are shifted so that the first formant F1 is at 100 Hz. Obviously, for the same speaker one will get the same frequencies and amplitudes back by shifting the frequencies back so that the 100 Hz frequency goes to the first formant, and afterwards multiplying, for every frequency, the amplitudes by the respective amplitude of the standardisation window.

2) Parameter Generation

As we have seen, the input to parameter generation consists of the standardised frame parameters, obtained via the phonetics from the frames table, the speaker model, the prosody and the modifiers. In a first step, the speaker parameters are modified for every frame: to match the prosody, the first formant is shifted to the frequency given by the prosody. The adaptation for emotions will be more complex, but will essentially be the modification of all the formants in frequency and amplitude, except for the first one, where the prosody takes precedence. Using the modified speaker parameters, the modifier window is built in the same way as the standardisation window was built for normalisation. Then the frequencies are shifted and the amplitudes multiplied to obtain the final parameters.

There is no need to use random noise for standard speech synthesis, see [7], but other authors, for example [11], apply such a model, although they admit that it is not valid from a speech production point of view. I am still considering which model to use. However, in both cases there will be some need for a noise generator to express whispering: there the amplitude for all frequencies is reduced and random noise is added.
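
A minimal Python sketch of the normalisation just described and its inverse is given below. It assumes the standardisation window is available as a vectorised function mapping frequency to amplitude and that the speaker's first formant frequency F1 is known; the function names are invented for the example.

    import numpy as np

    def normalise(triples, standardisation_window, f1_hz):
        """triples: array of (frequency, amplitude, phase) rows for one frame."""
        out = np.array(triples, dtype=float)
        out[:, 1] /= standardisation_window(out[:, 0])   # divide by the window amplitude
        out[:, 0] += 100.0 - f1_hz                       # shift so that F1 sits at 100 Hz
        return out

    def denormalise(triples, standardisation_window, f1_hz):
        """Inverse operation: shift 100 Hz back to F1, then multiply by the window."""
        out = np.array(triples, dtype=float)
        out[:, 0] += f1_hz - 100.0
        out[:, 1] *= standardisation_window(out[:, 0])
        return out
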
ACKNOWLEDGEMENTS

The author thanks Adam Pitts and Peter Brdar for their valuable input.

REFERENCES

[1] H. Kull, "Device and Method for Digital Voice Processing," PCT patent application, international publication number WO 00/16310, 2000.
[2] R. Linggard, Electronic Synthesis of Speech. Cambridge: Cambridge University Press, 1985, pp. 131-133.
[3] T. Dutoit, An Introduction to Text-to-Speech Synthesis. Dordrecht, Boston: Kluwer Academic Publishers, 1997.
[4] D. H. Klatt, "Software for a Cascade/Parallel Formant Synthesizer," Journal of the Acoustical Society of America, vol. 67, pp. 971-975, 1980.
[5] W. J. Hallahan, "DECtalk Software: Text-to-Speech Technology and Implementation," Digital Technical Journal, vol. 7, no. 4, pp. 5-19, 1995.
[6] E. Vonderheld, "Speech Synthesis Offers Realism for Voices of Computers, Automobiles and Yes, Even VCRs," The Institute, vol. 26, no. 3, March 2002.
[7] T. F. Quatieri, Discrete-Time Speech Signal Processing. Upper Saddle River, NJ: Prentice Hall, 2002.
[8] F. Valerio and O. Böffard, "A Hybrid Model for Text-to-Speech Synthesis," IEEE Trans. on Speech and Audio Processing, vol. 6, no. 5, pp. 426-434, September 1998.
[9] M. Banbrook, S. McLaughlin and I. Mann, "Speech Characterisation and Synthesis by Nonlinear Methods," IEEE Trans. on Speech and Audio Processing, vol. 7, no. 1, pp. 1-17, January 1999.
[10] R. J. McAulay and T. F. Quatieri, "Speech Analysis/Synthesis Based on a Sinusoidal Representation," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 34, no. 4, pp. 744-754, August 1986.
[11] Y. Stylianou, "Applying the Harmonic Plus Noise Model in Concatenative Speech Synthesis," IEEE Trans. on Speech and Audio Processing, vol. 9, no. 1, pp. 21-29, January 2001.