Soft-computing Methods for Text-to-Speech Driven Avatars

MARIO MALCANGI
DICo - Dipartimento di Informatica e Comunicazione
Università degli Studi di Milano
Via Comelico 39, 20135 Milano, ITALY
malcangi@dico.unimi.it
http://dsprts.dico.unimi.it

Abstract: - This paper presents a new approach for driving avatars with text-to-speech synthesis that uses pure text as an information source. The goal is to move lips and face muscles on the basis of the phonetic nature of the utterance and the related expression. Several methods came together to define this solution. Rule-based text-to-speech synthesis generates the phonetic and expression transcription of the text to be uttered by the avatar. The phonetic transcription is used to train two artificial neural networks, one for text-to-phone transcription and the other for phone-to-viseme mapping. Two fuzzy-logic engines were then tuned for smoothed control of lip and face movements.

Key-Words: - phone-to-viseme conversion, text-to-speech synthesis, artificial neural networks, fuzzy logic

1 Introduction

Speech communication can be considered a single medium with a multimodal representation of the information. When a person utters speech, the information communicated to another is not only semantic and syntactic but also emotional, expressive, gestural, and so forth. In lip-synching applications based on direct synchronization of uttered speech with lip and face movements [1], information embedded in speech is often lost because it is too difficult to extract information like emotion or gesture. Only a few general speech parameters, such as amplitude and pitch variability, can be measured and tracked. However, these low-level measurements fall far short of what is needed to drive an avatar with the full information content of the uttered speech. This approach leads to very good results for lip synchronization, but only a greatly impoverished expression can be driven onto the avatar, resulting in very limited naturalness.
To overcome this problem, text-based synthetic speech (text-to-speech) can be used instead of natural speech to drive the avatar. Text-to-speech synthesis is currently used to drive avatars' lip movements, but only for text-reading tasks. The avatar's face seems unnatural during utterance because no emotion or gesture information is provided by current text-to-speech systems. Text-to-viseme may be the right approach to control an avatar for natural utterance. The text-to-viseme process can translate text into the appropriate viseme and supplement this basic information with other related information such as emotion or gesture [2][3][4]. Rule-based text-to-viseme synthesis has been successfully implemented by considering emotion as an additional item of information [5] and for direct visual-speech synthesis [6]. In these approaches, speech synthesis and face-control synthesis are separate tasks, although in human utterance behavior they belong to an integrated task. Artificial-neural-network-based text-to-viseme synthesis has also been explored [7][8], demonstrating that greater naturalness can be achieved with a soft-computing rather than a hard-computing approach. Fuzzy logic has proven highly effective in smoothing the action of the logical control rules that move an avatar's face muscles during emotional behavior [9]. This research combines the use of artificial neural networks and fuzzy logic to generate phoneme and viseme information that drives face movements during the utterance of a text, as humans do. Our goal is to use pure text to feed the whole process, as a human does when reading a text. Reading text aloud consists of a complex set of tasks. The lowest level of these tasks involves correctly uttering each word in the text according to a set of hidden pronunciation rules. Our research tries to solve the problem of reading the words of a pure text aloud by generating both the speech and the related whole-avatar face motion.
2 Process framework

To design the expressive synchronized-speech and face-synthesis system, a two-phase process framework was built. The whole process can be considered a general-purpose model for designing an integrated system of expressive, avatar-based speech communication in human-computer interfaces. The first phase involves training and tuning two artificial neural networks (ANNs), for text-to-phone and phone-to-viseme synthesis, respectively. Two fuzzy-logic engines are also used to smooth speech and face-muscle control. As shown in Figure 1, a rule-based text-to-phone/expression transcriber trains the ANN-based text-to-phone generator and the ANN-based text-to-viseme generator. Using such a transcriber, only pure ASCII text is used to train the ANNs. Ancillary data for speech and facial
ISSN: 1790-2769 288 ISBN: 978-960-474-124-3
expressiveness is automatically extracted from the text by means of regular-expression-based description rules. The two fuzzy-logic engines are manually tuned using a fuzzy-logic development environment. This enables us to edit the fuzzy rules and membership functions according to expert experience. (The tuning task can also be performed by a genetic algorithm.) A formant-based speech synthesizer and a viseme generator comprise the additional components of the test process. The formant-based synthesizer allows full control of all speech parameters, so any modulation of speech can be achieved. The viseme generator allows control of face movements and expression during utterance.

Figure 1. Training and tuning process of the ANNs and the fuzzy-logic engines.

The second phase consists of testing the speech synthesis in a synchronous execution with face motion, as shown in Figure 2.

Figure 2. Testing process for expressive speech synthesis and face-motion control.

3 Text-to-phone/expression transcription by rules

Text-to-phone/expression transcription consists of a series of processing steps applied to the text. The text is first preprocessed to convert non-alphabetical elements such as numbers, sequences, abbreviations, and special ASCII symbols into the corresponding expanded text. Punctuation and word boundaries are processed by a set of rules that encodes the expression. Each word in the text is converted into phone/expression streams by a language-specific set of rules. The rules have the following format:

C(A)D = B    (1)

A is the text transformed into the phonetic/expression output B if the text to which it belongs matches A in the sequence CAD. C is a pre-context string and D is a post-context string. To compile the rules, the following classes of elements were defined:

(!)
(#) ([AEIOUY]+)
(:) ([^AEIOUY]*)
(+) ([EIY])
($) ([^AEIOUY])
(.) ([BDGJMNRVWZ])
(^) ([NR])    (2)
For each class, a regular expression is used for compact encoding of the rules.
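The C(A)D = B matching described above can be sketched in code. The following is a minimal illustration: the class-symbol table and the mini rule set are hypothetical examples, not the paper's actual language-specific rules.

```python
import re

# Class symbols (a subset of Eq. (2)) mapped to their regular expressions.
CLASSES = {
    "#": "[AEIOUY]+",    # one or more vowels
    ":": "[^AEIOUY]*",   # zero or more consonants
    "+": "[EIY]",        # a front vowel
    "$": "[^AEIOUY]",    # a consonant
}

def expand(context):
    """Expand class symbols in a C or D context string into a regex."""
    return "".join(CLASSES.get(ch, re.escape(ch)) for ch in context)

def rule_applies(word, i, pre, target, post):
    """True if rule C(A)D = B matches the target A at position i of the word."""
    return (word.startswith(target, i)
            and re.search(expand(pre) + "$", word[:i]) is not None
            and re.match(expand(post), word[i + len(target):]) is not None)

def transcribe(word, rules):
    """Left-to-right scan: emit B of the first rule whose CAD context matches."""
    phones, i = [], 0
    while i < len(word):
        for pre, target, post, phone in rules:
            if rule_applies(word, i, pre, target, post):
                phones.append(phone)
                i += len(target)
                break
        else:
            i += 1   # no rule matched this character: skip it
    return phones

# Hypothetical mini rule set: (C, A, D, B) tuples
rules = [
    ("", "C", "+", "s"),   # C before a front vowel -> /s/
    ("", "C", "", "k"),    # C elsewhere -> /k/
    ("", "A", "", "ae"),
    ("", "E", "", "eh"),
]
print(transcribe("CELL", rules))   # ['s', 'eh'] (L has no rule in this toy set)
print(transcribe("CAT", rules))    # ['k', 'ae']
```

Note how the post-context class (+) lets the same letter C receive different phones in CELL and CAT, which is exactly what the pre/post-context strings in Eq. (1) are for.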
4 Artificial neural-network architecture

The two ANNs used for text-to-phone/expression transcription and for phone/expression-to-viseme conversion are both three-layer, feed-forward, back-propagation architectures (FFBP-ANN).

Figure 3. Architecture of the FFBP-ANN.

The first ANN takes text as input and yields a phone/expression transcription. This output is the input for the second ANN, whose output is the viseme encoding. A linear activation function controls the connections at the input- and hidden-layer nodes. A non-linear (sigmoid) activation function connects the hidden-layer nodes to the output layer. The non-linear activation function is:

s_i = 1 / (1 + e^(-I_i)),    I_i = Σ_j w_ij s_j

where:
s_i is the output of the i-th unit
I_i is its total input
w_ij is the weight from the j-th to the i-th unit

The first ANN's input is a text window of nine consecutive characters. This window slides from right to left. The current output encodes the phone and the expression that correspond to the middle character in the input-layer string, taking into account the pre-context and post-context of the current input character.

Figure 4. Sliding window.

The rule-based text-to-phone/expression transcription system is used to train the ANN for text-to-phone/expression transcription. It generates the ANN input-output training patterns for a large variety of texts, so the ANN learns how to read an unknown text with expression. Training the second ANN proceeds in similar fashion, but it is conducted only after the first ANN has been fully trained. The first ANN's output is used as input for the second ANN, employing the same sliding-window strategy. A basic viseme set is used as reference for ANN training during the error back-propagation process.

5 Fuzzy-logic engines for controlling smoothed speech and face movement

The two trained ANNs are able to drive the speech synthesizer and the avatar face.
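The forward pass of the FFBP-ANN described in Section 4 can be sketched as follows. This is a minimal illustration with an assumed one-hot window encoding, assumed layer sizes, and random (untrained) weights; it shows the shape of the computation, not the trained transcriber.

```python
import math
import random

# Sketch of the FFBP-ANN forward pass: linear summation into the hidden
# layer, sigmoid activation toward the output layer. The one-hot window
# encoding, layer sizes, and random weights are illustrative assumptions.
WINDOW = 9                                 # nine consecutive characters
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def encode_window(text, center):
    """One-hot encode the 9-character window centered on position `center`."""
    half = WINDOW // 2
    vec = []
    for k in range(center - half, center + half + 1):
        ch = text[k] if 0 <= k < len(text) else " "   # pad outside the text
        vec.extend(1.0 if ch == c else 0.0 for c in ALPHABET)
    return vec

def sigmoid(x):
    # s_i = 1 / (1 + e^(-I_i))
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """I_i = sum_j w_ij * s_j at each layer; sigmoid only toward the output."""
    hidden = [sum(w * s for w, s in zip(row, x)) for row in w_hidden]
    return [sigmoid(sum(w * s for w, s in zip(row, hidden))) for row in w_out]

# Untrained network with assumed sizes: 9*27 inputs, 16 hidden, 8 outputs
random.seed(0)
n_in, n_hid, n_out = WINDOW * len(ALPHABET), 16, 8
w_hidden = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hid)]
w_out = [[random.uniform(-0.1, 0.1) for _ in range(n_hid)] for _ in range(n_out)]

y = forward(encode_window("hello world", 5), w_hidden, w_out)
print(len(y), all(0.0 < v < 1.0 for v in y))   # 8 True
```

Sliding the window one character at a time over the text reproduces the right-to-left scan described above; training would adjust w_hidden and w_out by error back-propagation against the rule-based transcriber's output.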
However, to give greater naturalness to speech utterance and face movement, a smoothing action needs to be performed on the ANNs' outputs before they are applied to the speech synthesizer and the avatar's face controller. Two fuzzy-logic engines were tuned to accomplish this. The two fuzzy subsystems must convert the ANN-output expression state into control levels for speech dynamics and for face muscles. Crisp information (intensity, level, etc.) about expression was transformed into fuzzy rules. The resulting crisp control level comes from an appropriate defuzzifying process.
The two fuzzy subsystems have identical engine structure and differ only in their settings (knowledge base). They consist of a fuzzifying front end, a rule-based inference engine, and a defuzzifying back end.

The first step in the fuzzy-engine tuning process consists of modeling crisp intensity and level information as fuzzy measurements. This is done by modeling seven fuzzy sets:

Imperceptibly low
Very low
Moderately low
Medium
Moderately high
Very high
Strongly high

Triangular and trapezoidal membership functions are used to implement these fuzzy sets. Their shapes and the relations among them are qualitatively reported in Figure 5. Tuning is accomplished by an expert who uses a fuzzy-logic development environment to simulate and evaluate the resulting membership degrees for each crisp input.

The second step consists of editing and tuning a set of inference rules such as:

IF x AND y THEN z

where x and y are membership grades for the intensity and level of speech and facial expression we intend to smooth before they are applied as controls, and z is the degree of control to be applied.

The third step consists of defuzzifying the inferred control grade. To do this, a set of singleton membership functions and a weighted-average calculation (center of gravity) are used to convert the control degree into a crisp control:

Control = Σ_i (A_i × B_i) / Σ_i A_i

where A_i is the membership grade inferred for the i-th rule and B_i is the position of its output singleton. Figure 6 illustrates the membership-function shapes used to defuzzify the inferred smoothed controls.

Figure 6. Singleton membership functions used to defuzzify controls.

6 Speech synthesis model

The speech synthesizer model we refer to emulates the human vocal tract. This choice was made because unlimited utterances need to be generated. Naturalness in speech production by this speech-synthesis model is achieved by means of dynamic control of its processing elements: filters, generators, and modulators. Coarticulation, phonetic articulation rate, and inflection (pitch) are all controllable, in static or dynamic mode.
Speech nature (male, female, child, etc.) and alteration (bass, baritone, etc.) can also be controlled.

Figure 5. Fuzzy modeling of speech synthesis and facial control inputs.

7 Facial control modeling

Speech intensity is used to control two different components of facial modeling: the lips and facial modifications during expressive utterance. Lips and facial expression are controlled in terms of mouth opening and the strength of expression-control muscles.
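The fuzzy-engine steps of Section 5, applied to a single facial control such as lip opening, can be sketched as follows. The set shapes, rule table, and singleton positions are illustrative assumptions, not the paper's actual expert tuning.

```python
# Illustrative sketch of one fuzzy engine: triangular fuzzification of crisp
# intensity/level inputs, IF-AND-THEN rules (AND taken as min), and
# weighted-average (center-of-gravity) defuzzification over output singletons.

def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Three of the seven fuzzy sets, with assumed shapes on a normalized [0, 1] axis
SETS = {
    "low":    lambda x: tri(x, -0.5, 0.0, 0.5),
    "medium": lambda x: tri(x, 0.0, 0.5, 1.0),
    "high":   lambda x: tri(x, 0.5, 1.0, 1.5),
}
# Assumed singleton positions B_i for the lip-opening control output
SINGLETON = {"closed": 0.1, "half": 0.5, "open": 0.9}
# IF intensity IS x AND level IS y THEN lip-opening IS z
RULES = [("low", "low", "closed"),
         ("medium", "medium", "half"),
         ("high", "high", "open")]

def smooth_control(intensity, level):
    """Infer each rule grade A_i = min(x, y); Control = sum(A_i*B_i)/sum(A_i)."""
    grades = [(min(SETS[x](intensity), SETS[y](level)), SINGLETON[z])
              for x, y, z in RULES]
    num = sum(a * b for a, b in grades)
    den = sum(a for a, _ in grades)
    return num / den if den else 0.0

print(smooth_control(0.75, 0.75))   # ≈ 0.7, between "half" and "open"
```

Because neighboring fuzzy sets overlap, the crisp control glides between singleton positions rather than jumping, which is the smoothing effect applied to the ANN outputs.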
The fuzzy, smoothed control produces variable dynamics during the utterance of stationary speech units such as phonemes and allophones. This dynamic control is used to modulate the amplitude of the lip-opening strength, resulting in more natural movement. Expression-control muscles are also dynamically controlled to produce modifications such as:

Facial muscles stretching/relaxing
Eyebrows frowning
Forehead wrinkling
Nostrils extending/contracting

8 Conclusion

Preliminary results of this research demonstrate that soft computing offers a good solution for the smoothed control of avatars during the expressive utterance of text. Using pure text as input information, correct expressive utterance of each word (letter sequence) was achieved. Furthermore, the related expressive avatar face movements were synchronized. The next step will apply a similar approach to the automatic extraction of high-level expression information related to word sequences.

References:
[1] M. Malcangi, R. de Tintis, Audio based real-time speech animation of embodied conversational agents, in A. Camurri, G. Volpe (Eds.), Gesture-Based Communication in Human-Computer Interaction, selected revised papers of the 5th International Workshop on Gesture and Sign Language based Human-Computer Interaction, GW 2003, Lecture Notes in Artificial Intelligence LNAI 2915 (Subseries of Lecture Notes in Computer Science), Springer-Verlag, Berlin Heidelberg, 2004.
[2] T. Masuko, T. Kobayashi, M. Tamura, J. Masubuchi, K. Tokuda, Text-to-visual speech synthesis based on parameter generation from HMM, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 6, May 12-15, 1998, pp. 3745-3748.
[3] W. Gao, L. Xu, B. Yin, Y. Liu, Y. Song, J. Yan, J. Zhou, H. Chen, A text-driven sign language synthesis system, Proceedings of CAD & Graphics 97, December 2-5, 1997, Shenzhen, China.
[4] M. A. Zliekha, S. Al-Moubayed, O. Al-Dakkak, N. Ghneim, Emotional audio visual Arabic text to speech, in Proceedings of Eusipco 2006, 2006.
[5] J. Beskow, Rule-based visual speech synthesis, Proceedings of Eurospeech 95, Madrid, September 1995.
[6] E. Agelfors, J. Beskow, B. Granstrom, M. Lundeberg, G. Salvi, K. Spens, T. Ohman, Synthetic visual speech driven from auditory speech, in Proceedings of AVSP 99, 1999.
[7] G. Zoric, I. S. Pandzic, Real-time language independent lip synchronization method using a genetic algorithm, Signal Processing, Vol. 86, Issue 12, December 2006, pp. 3644-3656.
[8] D. W. Massaro, J. Beskow, M. M. Cohen, C. L. Fry, T. Rodriguez, Picture my voice: Audio to visual speech synthesis using artificial neural networks, Proceedings of AVSP 99, Santa Cruz, California, 1999.