Soft-computing Methods for Text-to-Speech Driven Avatars

MARIO MALCANGI
DICo - Dipartimento di Informatica e Comunicazione
Università degli Studi di Milano
Via Comelico 39, 20135 Milano, ITALY
malcangi@dico.unimi.it    http://dsprts.dico.unimi.it

Abstract: - This paper presents a new approach for driving avatars with text-to-speech synthesis that uses pure text as its information source. The goal is to move the lips and face muscles on the basis of the phonetic nature of the utterance and the related expression. Several methods came together to define this solution. Rule-based text-to-speech synthesis generates the phonetic and expression transcription of the text to be uttered by the avatar. The phonetic transcription is used to train two artificial neural networks, one for text-to-phone transcription and the other for phone-to-viseme mapping. Two fuzzy-logic engines were then tuned for smoothed control of lip and face movements.

Key-Words: - phone-to-viseme conversion, text-to-speech synthesis, artificial neural networks, fuzzy logic

1 Introduction
Speech communication can be considered a single medium with a multimodal representation of the information. When a person utters speech, the information communicated to another is not only semantic and syntactic but also emotional, expressive, gestural, and so forth.

In lip-synching applications based on direct synchronization of uttered speech with lip and face movements [1], information embedded in speech is often lost because it is too difficult to extract information such as emotion or gesture. Only a few general speech parameters, such as amplitude and pitch variability, can be measured and tracked. These low-level measurements fall far short of what is needed to drive an avatar with the full information content of the uttered speech. This approach leads to very good results for lip synchronization, but only a greatly impoverished expression can be driven onto the avatar, resulting in very limited naturalness.

To overcome this problem, text-based synthetic speech (text-to-speech) can be used instead of natural speech to drive the avatar. Text-to-speech synthesis is currently used to drive avatars' lip movements, but only for text-reading tasks. The avatar's face seems unnatural during utterance because no emotion or gesture information is provided by current text-to-speech systems.

Text-to-viseme may be the right approach to controlling an avatar for natural utterance. The text-to-viseme process can translate text into the appropriate visemes and supplement this basic information with other related information such as emotion or gesture [2][3][4]. Rule-based text-to-viseme synthesis has been successfully implemented by considering emotion as an additional item of information [5] and for direct visual-speech synthesis [6]. In these approaches, speech synthesis and face-control synthesis are separate tasks, although in human utterance behavior they belong to an integrated task. Artificial-neural-network-based text-to-viseme synthesis has also been explored [7][8], demonstrating that greater naturalness can be achieved with a soft-computing rather than a hard-computing approach. Fuzzy logic has proven highly effective in smoothing the action of the logical control rules that move an avatar's face muscles during emotional behavior [9].

This research combines artificial neural networks and fuzzy logic to generate the phoneme and viseme information that drives face movements during the utterance of a text, as humans do.
Our goal is to use pure text to feed the whole process, as a human does when reading a text. Reading text aloud consists of a complex set of tasks. The lowest level of these tasks involves correctly uttering each word in the text according to a set of hidden pronunciation rules. Our research tries to solve the problem of reading the words of a pure text aloud by generating both the speech and the related whole-avatar face motion.

2 Process framework
To design the expressive synchronized-speech and face-synthesis system, a two-phase process framework was built. The whole process can be considered a general-purpose model for designing an integrated system of expressive, avatar-based speech communication in human-computer interfaces.

The first phase involves training and tuning two artificial neural networks (ANNs), one for text-to-phones and one for phones-to-viseme synthesis. Two fuzzy-logic engines are also used to smooth the speech and face-muscle control. As shown in Figure 1, a rule-based text-to-phone/expression transcriber trains the ANN-based text-to-phone generator and the ANN-based text-to-viseme generator. Using such a transcriber, only pure ASCII text is needed to train the ANNs. Ancillary data for speech and facial expressiveness is automatically extracted from the text by means of regular-expression-based description rules. The two fuzzy-logic engines are manually tuned using a fuzzy-logic development environment. This enables us to edit the fuzzy rules and membership functions according to expert experience. (The tuning task can also be performed by a genetic algorithm.) A formant-based speech synthesizer and a viseme generator are the additional components of the test process. The formant-based synthesizer allows full control of all speech parameters, so any modulation of speech can be achieved. The viseme generator allows control of face movements and expression during utterance.

Figure 1. Training and tuning process of the ANNs and the fuzzy-logic engines.

The second phase consists of testing the speech synthesis in synchronous execution with face motion, as shown in Figure 2.

Figure 2. Testing process for expressive speech synthesis and face-motion control.
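To make the two-phase framework concrete, the following is a minimal sketch of the run-time (second-phase) data flow. Every function here is a stand-in with placeholder behaviour: in the real system the first two would be the trained ANNs, the smoothing step the tuned fuzzy-logic engines, and the last two the formant synthesizer and the viseme generator; none of these names or behaviours are the paper's implementation.

    # Illustrative data flow through stand-in components (Python).
    def ann_text_to_phone(text):
        # Stand-in: map each character to a dummy phone/expression symbol.
        return [("ph_" + c.lower(), "neutral") for c in text if c.isalpha()]

    def ann_phone_to_viseme(phones):
        # Stand-in: map each phone/expression pair to a dummy viseme label.
        return ["vis_" + p for p, _ in phones]

    def fuzzy_smooth(values):
        # Stand-in for the fuzzy-logic smoothing of control levels.
        return [0.5 for _ in values]

    def formant_synthesizer(phones, controls):
        return list(zip(phones, controls))    # placeholder speech control track

    def viseme_generator(visemes, controls):
        return list(zip(visemes, controls))   # placeholder face control track

    def read_text_aloud(text):
        """Second-phase pipeline: text -> phones/expression -> visemes -> smoothed controls."""
        phones = ann_text_to_phone(text)            # first ANN
        visemes = ann_phone_to_viseme(phones)       # second ANN
        speech = formant_synthesizer(phones, fuzzy_smooth(phones))
        face = viseme_generator(visemes, fuzzy_smooth(visemes))
        return speech, face

    speech_track, face_track = read_text_aloud("Hello")
    print(face_track[0])    # -> ('vis_ph_h', 0.5)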

3 Text-to-phone/expression transcription by rules
Text-to-phone/expression transcription consists of a series of processing steps applied to the text. The text is first preprocessed to convert non-alphabetical elements such as numbers, sequences, abbreviations, and special ASCII symbols into the corresponding expanded text. Punctuation and word boundaries are processed by a set of rules that encodes the expression. Each word in the text is then converted into phone/expression streams by a language-specific set of rules. The rules have the following format:

    C(A)D = B    (1)

where A is the text transformed into the phonetic/expression string B if the text to which it belongs matches A in the sequence CAD; C is a pre-context string and D is a post-context string. To compile the rules, the following classes of elements were defined:

    (!)
    (#)  ([AEIOUY]+)
    (:)  ([^AEIOUY]*)
    (+)  ([EIY])
    ($)  ([^AEIOUY])
    (.)  ([BDGJMNRVWZ])
    (^)  ([NR])    (2)

For each class, a regular expression is used for compact encoding of the rules.
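As a rough illustration of how a C(A)D = B rule can be expanded into a regular expression and matched against a word, here is a minimal Python sketch. The class symbols follow the list in (2); the example rule itself ('A' followed by a non-vowel and a front vowel reads as "ey") and the phone labels are purely illustrative assumptions, not the paper's actual rule set.

    import re

    # Character classes used in rule contexts (symbol -> regex fragment), from (2).
    CLASSES = {
        "#": "[AEIOUY]+",     # one or more vowels
        ":": "[^AEIOUY]*",    # zero or more non-vowels
        "+": "[EIY]",         # a front vowel
        "$": "[^AEIOUY]",     # a single non-vowel
        ".": "[BDGJMNRVWZ]",  # a voiced consonant
        "^": "[NR]",
    }

    def expand(context: str) -> str:
        """Translate a rule context (mix of literals and class symbols) into a regex fragment."""
        return "".join(CLASSES[c] if c in CLASSES else re.escape(c) for c in context)

    def apply_rule(word: str, pre: str, target: str, post: str, phones: str):
        """Return the phone string for the first match of C(A)D in `word`, else None."""
        pattern = expand(pre) + "(" + re.escape(target) + ")" + expand(post)
        return phones if re.search(pattern, word) else None

    # Illustrative (hypothetical) rule: (A)$+ = ey
    print(apply_rule("MAKE", "", "A", "$+", "ey"))   # -> 'ey'
    print(apply_rule("CAT", "", "A", "$+", "ey"))    # -> None (post-context does not match)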

4 Artificial neural-network architecture
The two ANNs used for text-to-phone/expression transcription and for phone/expression-to-viseme conversion are both three-layer, feed-forward, back-propagation architectures (FFBP-ANN).

Figure 3. Architecture of the FFBP-ANN.

The first ANN takes text as input and yields the phone/expression transcription. This output is the input to the second ANN, whose output is the viseme encoding. A linear activation function controls the connections at the input- and hidden-layer nodes. A non-linear (sigmoid) activation function connects the hidden-layer nodes to the output layer. The non-linear activation function is

    s_i = 1 / (1 + e^{-I_i}),    I_i = \sum_j w_{ij} s_j

where s_i is the output of the i-th unit, I_i is the total input to the i-th unit, and w_{ij} is the weight from the j-th to the i-th unit.

The first ANN's input is a text window of nine consecutive characters. This window slides from right to left. The current output encodes the phone and the expression that correspond to the middle character of the input-layer string, taking into account the pre-context and post-context of the current input character.

Figure 4. Sliding window.

The rule-based text-to-phone/expression transcription system is used to train the ANN for text-to-phone/expression transcription. It generates the ANN input-output training patterns for a large variety of texts, so the ANN learns how to read an unknown text with expression. Training the second ANN proceeds in similar fashion, but it is conducted only after the first ANN has been fully trained. The first ANN's output is used as input for the second ANN, employing the same sliding-window strategy. A basic viseme set is used as the reference for ANN training during the error back-propagation process.
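The nine-character sliding window and the activation scheme can be sketched as follows (Python/NumPy). The one-hot character encoding, alphabet, layer sizes and random weights are assumptions for demonstration only, and the back-propagation training step is omitted.

    import numpy as np

    ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    WINDOW = 9
    PAD = " "

    def one_hot(ch):
        v = np.zeros(len(ALPHABET))
        v[ALPHABET.index(ch)] = 1.0
        return v

    def windows(text):
        """Yield nine-character windows centred on each character of `text`."""
        half = WINDOW // 2
        padded = PAD * half + text.upper() + PAD * half
        for i in range(len(text)):
            yield np.concatenate([one_hot(c) for c in padded[i:i + WINDOW]])

    def sigmoid(x):
        # s_i = 1 / (1 + e^{-I_i})
        return 1.0 / (1.0 + np.exp(-x))

    def forward(x, w_hidden, w_out):
        """Three-layer forward pass: linear input/hidden connections, sigmoid output layer."""
        hidden = w_hidden @ x            # linear activation
        return sigmoid(w_out @ hidden)   # non-linear (sigmoid) activation

    # Toy example: random weights, one window taken from the word "HELLO".
    rng = np.random.default_rng(0)
    x = next(windows("HELLO"))
    w_hidden = rng.normal(scale=0.1, size=(32, x.size))
    w_out = rng.normal(scale=0.1, size=(40, 32))   # e.g. 40 phone/expression symbols
    print(forward(x, w_hidden, w_out).shape)        # -> (40,)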

5 Fuzzy-logic engines for controlling smoothed speech and face movement
The two trained ANNs are able to drive the speech synthesizer and the avatar face. However, to give greater naturalness to speech utterance and face movement, a smoothing action needs to be performed on the ANNs' outputs before they are applied to the speech synthesizer and the avatar's face controller. Two fuzzy-logic engines were tuned to accomplish this. The two fuzzy subsystems must convert the ANN-output expression state into control levels for the speech dynamics and for the face muscles. Crisp information (intensity, level, etc.) about expression is fuzzified and processed by fuzzy rules; the resulting crisp control level comes from an appropriate defuzzifying process.

The two fuzzy subsystems have an identical engine structure and differ only in their settings (knowledge base). They consist of a fuzzifying front end, a rule-based inference engine, and a defuzzifying back end.

The first step in the fuzzy-engine tuning process consists of modeling the crisp intensity and level information as fuzzy measurements. This is done by modeling seven fuzzy sets:

Imperceptibly low
Very low
Moderately low
Medium
Moderately high
Very high
Strongly high

Triangular and trapezoidal membership functions are used to implement these fuzzy sets. Their shapes and mutual relations are qualitatively reported in Figure 5. Tuning is accomplished by an expert who uses a fuzzy-logic development environment to simulate and evaluate the resulting membership degrees for each crisp input.

Figure 5. Fuzzy modeling of speech synthesis and facial control inputs.

The second step consists of editing and tuning a set of inference rules such as

    IF x AND y THEN z

where x and y are the membership grades for the intensity and level of the speech and facial expression we intend to smooth before they are applied as controls, and z is the degree of control to be applied.

The third step consists of defuzzifying the control output grade. To do this, a set of singleton membership functions and a weighted-average calculation (center of gravity) is used to convert the control degree into a crisp control:

    Control = \sum_i (A_i B_i) / \sum_i A_i

where A_i is the inferred grade of the i-th singleton and B_i is its position. Figure 6 illustrates the membership function shapes used to defuzzify the inferred smoothed controls.

Figure 6. Singleton membership functions used to defuzzify the controls.
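A minimal sketch of this smoothing step is given below (Python). It uses only triangular membership functions for the seven linguistic sets, a few illustrative IF-AND-THEN rules, and weighted-average (singleton) defuzzification; the set breakpoints, rules and singleton positions are assumptions, not the paper's tuned knowledge base.

    import numpy as np

    def tri(x, a, b, c):
        """Triangular membership function with feet a, c and peak b."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x < b else (c - x) / (c - b)

    # Seven fuzzy sets over a normalised [0, 1] axis (illustrative breakpoints).
    SETS = {
        "imperceptibly_low": (0.00, 0.05, 0.15),
        "very_low":          (0.05, 0.18, 0.32),
        "moderately_low":    (0.20, 0.35, 0.48),
        "medium":            (0.35, 0.50, 0.65),
        "moderately_high":   (0.52, 0.65, 0.80),
        "very_high":         (0.68, 0.82, 0.95),
        "strongly_high":     (0.85, 0.95, 1.00),
    }

    # Singleton output positions (crisp control levels) for each rule consequent.
    SINGLETONS = {"weak": 0.2, "normal": 0.5, "strong": 0.8}

    # Illustrative rules: IF intensity AND level THEN control.
    RULES = [
        (("very_low", "moderately_low"), "weak"),
        (("medium", "medium"), "normal"),
        (("very_high", "strongly_high"), "strong"),
    ]

    def smooth_control(intensity, level):
        """Infer a crisp control value from crisp intensity/level inputs."""
        grades, outputs = [], []
        for (set_x, set_y), consequent in RULES:
            # AND implemented as the minimum of the two membership grades.
            g = min(tri(intensity, *SETS[set_x]), tri(level, *SETS[set_y]))
            grades.append(g)
            outputs.append(SINGLETONS[consequent])
        grades, outputs = np.array(grades), np.array(outputs)
        if grades.sum() == 0.0:
            return 0.0  # no rule fires
        # Weighted average of singletons: Control = sum(A_i * B_i) / sum(A_i)
        return float((grades * outputs).sum() / grades.sum())

    print(round(smooth_control(0.5, 0.5), 3))   # -> 0.5 (only the 'normal' rule fires)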

6 Speech synthesis model
The speech synthesizer model we refer to emulates the human vocal tract. This choice was made because unlimited utterances need to be generated. Naturalness in speech production by this synthesis model is achieved by means of dynamic control of its processing elements: filters, generators, and modulators. Coarticulation, phonetic articulation rate, and inflection (pitch) are all controllable, in static or dynamic mode. Speech nature (male, female, child, etc.) and alteration (bass, baritone, etc.) can also be controlled.

7 Facial control modeling
Speech intensity is used to control two different components of facial modeling: the lips and the facial modifications that occur during expressive utterance. Lips and facial expression are controlled in terms of mouth opening and the strength of the expression-control muscles.

The fuzzy, smoothed control produces variable dynamics during the utterance of stationary speech units such as phonemes and allophones. This dynamic control is used to modulate the amplitude of the lip-opening strength, resulting in more natural movement. The expression-control muscles are also dynamically controlled to produce modifications such as:

Facial muscles stretching/relaxing
Eyebrows frowning
Forehead wrinkling
Nostrils extending/contracting

8 Conclusion
Preliminary results of this research demonstrate that soft computing offers a good solution for the smoothed control of avatars during the expressive utterance of text. Using pure text as the input information, correct expressive utterance of each word (letter sequence) was achieved, and the related expressive avatar face movements were synchronized with it. The next step will apply a similar approach to the automatic extraction of high-level expression information related to word sequences.

References:
[1] M. Malcangi, R. de Tintis, Audio based real-time speech animation of embodied conversational agents, in A. Camurri, G. Volpe (Eds.), Gesture-Based Communication in Human-Computer Interaction, selected revised papers of the 5th International Workshop on Gesture and Sign Language based Human-Computer Interaction, GW 2003, Lecture Notes in Artificial Intelligence LNAI 2915 (subseries of Lecture Notes in Computer Science), Springer-Verlag, Berlin Heidelberg, 2004.
[2] T. Masuko, T. Kobayashi, M. Tamura, J. Masubuchi, K. Tokuda, Text-to-visual speech synthesis based on parameter generation from HMM, in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 6, pp. 3745-3748, 12-15 May 1998.
[3] W. Gao, L. Xu, B. Yin, Y. Liu, Y. Song, J. Yan, J. Zhou, H. Chen, A text-driven sign language synthesis system, in Proceedings of CAD & Graphics '97, Shenzhen, China, December 2-5, 1997.
[4] M. A. Zliekha, S. Al-Moubayed, O. Al-Dakkak, N. Ghneim, Emotional audio-visual Arabic text to speech, in Proceedings of EUSIPCO 2006, 2006.
[5] J. Beskow, Rule-based visual speech synthesis, in Proceedings of ESCA Eurospeech '95, Madrid, September 1995.
[6] E. Agelfors, J. Beskow, B. Granstrom, M. Lundeberg, G. Salvi, K. Spens, T. Ohman, Synthetic visual speech driven from auditory speech, in Proceedings of AVSP '99, 1999.
[7] G. Zoric, I. S. Pandzic, Real-time language independent lip synchronization method using a genetic algorithm, Signal Processing, Vol. 86, No. 12, pp. 3644-3656, December 2006.
[8] D. W. Massaro, J. Beskow, M. M. Cohen, C. L. Fry, T. Rodriguez, Picture my voice: Audio to visual speech synthesis using artificial neural networks, in Proceedings of AVSP '99, Santa Cruz, California, 1999.