
Building sensori-motor prototypes from audiovisual exemplars

Gérard BAILLY
Institut de la Communication Parlée, INPG & Université Stendhal
46, avenue Félix Viallet, 383 Grenoble Cedex, France
web: http://www.icp.grenet.fr/bailly - e-mail: bailly@icp.grenet.fr

Abstract

This paper shows how an articulatory model, able to produce acoustic signals from articulatory motion, can learn to speak, i.e. coordinate its movements in such a way that it utters meaningful sequences of sounds belonging to a given language. This complex learning procedure is accomplished in four major steps: (a) a babbling phase, where the device builds up a model of the forward transforms, i.e. the articulatory-to-audiovisual mapping; (b) an imitation stage, where it tries to reproduce a limited set of sound sequences produced by a distal teacher; (c) a shaping stage, where phonemes are associated with the most efficient sensori-motor representation; and finally (d) a rhythmic phase, where it learns the appropriate coordination of the activations of these sensori-motor targets.

[Figure 1: General framework for articulatory control. Blocks: linguistic description -> planning -> distal score -> execution -> proximal trajectories -> plant -> audible and visible speech.]

1. Introduction

The generation of synthetic speech from articulatory movements faces two main challenges: (a) the classical problem of generating a continuous flow of command parameters from a discrete sequence of symbols, and (b) the adequate use of the degrees of freedom in excess of the articulatory-to-acoustic transform. An efficient solution is to separate planning from execution (cf. Fig. 1): planning parametrises the linguistic task in adequate representation spaces, whereas execution converts these distal specifications into actual commands for the articulatory synthesiser. In contrast to the Task Dynamics approach [13], where the distal objects of speech production are taken to be constrictions in the vocal tract, our approach makes use of those distal representation spaces best adapted to the sound to be uttered: exteroceptive, haptic or proprioceptive information is collected in the course of the movement so that the planning process can exploit the most appropriate feedback.

2. Emergence of representations

2.1. The control model

The control model used here has been developed within the Speech Maps project [12]. The so-called Articulotron is based on the following principles:

- a positional coding of targets: each sensori-motor region associated with a percept is modelled as an attractor which generates, in all speech representation spaces, a force field attracting the current frame towards that region;
- a back-projection of these force fields to the motor space of the plant: the controller implements a pseudo-inversion of all proximal-to-distal Jacobians;
- a composite and superpositional control: each sensori-motor target has an emergence function which can overlap those of adjacent targets. Force fields generated in each representation space are weighted and added, then back-projected.

These motor force fields are then combined and integrated to determine the actual articulatory movement. When computed in different representation spaces, back-projected fields may contradict each other. The strategy for resolving such conflicts is essential in motor control; our current strategy is described in section 5.

3. Audiovisual inversion

The simplest way to give our speech robot, the Articulotron, the gift of speech is to imitate an audiovisual speech synthesizer via a global inversion.
The audiovisual characterisation is delivered by an audiovisual perceptron. This perceptron may deliver a continuous distal specification as in [13], or sample these audiovisual specifications at salient events as proposed by []. We adopted the distal-to-proximal inversion proposed by Jordan [11], where the inverse Jacobian of the forward articulatory-to-audiovisual transform is used to convert the distal gradient into a proximal one. (This work was supported by EC ESPRIT/BR no. 6975 Speech Maps.) The proximal gradient is augmented by a smoothness criterion with a forgetting factor; this smoothness favours solutions which minimise jerk. Thus, starting from an initial articulatory configuration, articulatory movements progressively converge towards gestures producing the appropriate exteroceptive information with minimal jerk.

[Figure 2: The two first discriminant spaces. From left to right: acoustic, geometric and articulatory spaces; (a) ten vowels, (b) three occlusives. CDA classification rates: vowels 96.5% (acoustic), 86.54% (geometric), 94.3% (articulatory); occlusives 96.5% (acoustic), 98.7% (geometric).]

3.1. The plant - proximal parameters

The plant has been elaborated using a database of 6 X-rays obtained from a reference subject [2]. Eight degrees of freedom [8] are used here. The model intrinsically couples jaw rotation and translation, controls upper and lower lip relative position and protrusion, controls larynx and velum position, and has four degrees of freedom for the tongue midsagittal section.

3.2. Distal characterisation

The perceptron delivers here continuous formant and lip-area trajectories of the sounds emitted by some distal teacher. In the following, the distal teacher is the same subject who was X-rayed to build the articulatory model. This avoids normalisation procedures, which are beyond the scope of this paper.

3.3. Forward modelling

The forward proximal-to-distal transform is learned in the babbling phase. This many-to-one transform from eight articulatory parameters to the first four formants and the area of the lips is modelled by a polynomial interpolator. The four formants were estimated from the area functions delivered by the plant using [1]. The order of each interpolator was set experimentally to 4. The interpolator was initially estimated using the set of 6 configurations of the X-ray database, augmented by a random generation of the articulatory parameters. The actual database has 7368 frames.

3.4. The corpus

Our French speaker pronounced two sets of V1CV2 sequences where C is a voiced plosive: (a) a symmetric context (V1 = V2) with the ten French vowels, and (b) an asymmetric context where V1 and V2 are one of the extreme vowels /a, i, u, y/. The set of audiovisual stimuli which enables our control model to build internal representations of speech sounds thus consists of 78 stimuli, comprising 78 exemplars of voiced plosives and 56 vowels.

3.5. Distal-to-proximal inversion

The inversion procedure is carried out for the whole set of speech items described above. Inversion results have been assessed in two cases: (a) a static case, where prototypic articulatory vocalic configurations obtained by gradient descent towards speaker-specific prototypic acoustic configurations are compared with both the articulatory targets extracted from the X-ray database and well-known structural constraints [7]; and (b) kinematic inversion, where results on the inversion of VCV sequences are compared with the X-ray data at well-defined time landmarks. The results published in [7, 5, 3] show that such simple global optimisation techniques are able to recover accurate and reliable articulatory movements.
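A minimal sketch of this inversion loop may help make the procedure concrete. It assumes only a differentiable forward model: the stand-in quadratic map below replaces the polynomial interpolator of section 3.3, and all names and dimensions are illustrative.

```python
# Sketch of the distal-to-proximal inversion loop, under stated assumptions:
# the forward model is a stand-in quadratic map from the eight articulatory
# parameters to five distal ones (four formants plus lip area).
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(5, 8)), 0.1 * rng.normal(size=(5, 8))

def forward(a):
    """Stand-in articulatory-to-audiovisual map."""
    return W1 @ a + W2 @ (a ** 2)

def jacobian(a, eps=1e-5):
    """Numerical Jacobian of the forward map at articulation a."""
    J = np.zeros((5, 8))
    for i in range(8):
        da = np.zeros(8)
        da[i] = eps
        J[:, i] = (forward(a + da) - forward(a - da)) / (2 * eps)
    return J

def invert(distal_targets, step=0.5, lam=0.2, iters=50):
    """Recover an articulatory trajectory whose distal output tracks the targets.
    lam damps frame-to-frame articulatory changes, a crude stand-in for the
    jerk-minimising smoothness criterion described above."""
    a, trajectory = np.zeros(8), []
    for target in distal_targets:
        for _ in range(iters):
            err = target - forward(a)                         # distal error
            a = a + step * np.linalg.pinv(jacobian(a)) @ err  # proximal update
        if trajectory:                                        # smooth towards previous frame
            a = (1 - lam) * a + lam * trajectory[-1]
        trajectory.append(a.copy())
    return np.array(trajectory)

# Usage: invert a short synthetic distal trajectory frame by frame.
arts = np.linspace(0.0, 1.0, 20)[:, None] * rng.normal(size=8)
targets = np.array([forward(a) for a in arts])
print(invert(targets).shape)                                  # (20, 8) articulatory frames
```

The pseudo-inverse resolves the excess degrees of freedom locally, and the damping term stands in for the jerk-minimising smoothness criterion; in the actual system the fitted polynomial interpolator would take the place of forward().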
4. Building sensori-motor spaces

Once inversion of the whole set of items has been successfully performed, the imitation stage is achieved. The sensori-motor representations obtained by inversion were augmented with VCV sequences from the original X-ray database, i.e. 78 vowels and 8 occlusives. The so-called Articulotron is now supposed to have sufficient sensori-motor representations of context-dependent exemplars of the sounds. These internal representations are sampled at the temporal landmarks delivered by the perceptron. We selected two landmarks:

- vocalic targets, defined as points of maximum spectral stability;
- consonantal targets, defined as points of maximal occlusion.

4.1. Characterising targets

Targets are defined as compact regions of the sensori-motor space. We suppose that separate control channels are built for different classes of sounds: here two channels, one for the vowels and one for the voiced plosives. On these control channels, phonemic targets have been implemented as simple Gaussians: the force field is created by the derivative of the probability function (see section 5). A simple Gaussian has the advantage of generating a simple force field with no singularities, and it intrinsically builds in a compactness constraint. The sensori-motor space is divided into three sub-spaces:

- an articulatory space consisting of the 8 articulatory parameters;
- a geometric space consisting of 5 parameters: the area of the lips (Al), the area (Ac) and location (Xc) of the main constriction, and two mid-sagittal distances, namely the minimum distances of the tongue tip (TT) and tongue dorsum (TD) to the palate (these two latter parameters are similar to those used in [3]);
- an acoustic space consisting of the first three formants.

4.2. Sensori-motor sub-spaces

A Canonical Discriminant Analysis was performed and the vocalic and consonantal targets were projected onto the first discriminant planes (see Fig. 2).

Vowels. The examination of the structure of the projections and of the identification scores demonstrates that vowels are best defined in acoustic terms. Some additional arguments may be given in favour of an acoustic control of vocalic trajectories:

- The most successful procedure for predicting vocalic systems [9] uses a basic criterion of maximal acoustic dispersion of vocalic targets. Although a perceptual weighting of the solutions improves the prediction of the most frequent systems up to 9 vowels, articulatory or geometric data only shape and weight the dimensions of the maximal space.
- Recent perturbation experiments show that speakers tend to reach the same perceptual/acoustic goals with articulatory strategies that differ greatly from the unperturbed case [14].
- Vocalic trajectories tend to be linear in the acoustic space when it is re-analysed in terms of resonances [4].

Occlusives. On the other hand, the voiced occlusives are best defined in terms of place of articulation. When the acoustic information is sampled at the vocalic onset as proposed by [5], the identification score is just above chance, while the geometric score still rates 97%. Of course, the paradigm of relational invariance may hold, but a context-independent target is no longer available.
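Canonical discriminant analysis is closely related to multi-class linear discriminant analysis. The sketch below uses scikit-learn's LinearDiscriminantAnalysis as a stand-in for the CDA actually performed, with synthetic placeholder data, to show how each sub-space can be projected onto its first discriminant plane and scored:

```python
# Sketch: one discriminant analysis per sub-space (acoustic, geometric,
# articulatory), on synthetic class-separated data standing in for the
# measured phoneme targets.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_per_class, classes = 40, ["a", "i", "u", "y"]            # illustrative subset of vowel targets

def synthetic_subspace(dim):
    """Random class-separated samples standing in for measured targets."""
    X, y = [], []
    for label in classes:
        centre = rng.normal(scale=3.0, size=dim)           # one cluster per phoneme
        X.append(centre + rng.normal(scale=1.0, size=(n_per_class, dim)))
        y += [label] * n_per_class
    return np.vstack(X), np.array(y)

for name, dim in [("acoustic", 3), ("geometric", 5), ("articulatory", 8)]:
    X, y = synthetic_subspace(dim)
    lda = LinearDiscriminantAnalysis(n_components=2)       # first discriminant plane
    Z = lda.fit_transform(X, y)                            # projection used for Fig. 2-style plots
    rate = lda.score(X, y)                                 # identification-rate analogue
    print(f"{name:12s}: projected to {Z.shape[1]}-D, classification rate {100 * rate:.1f}%")
```

On the real targets, this per-sub-space comparison of identification rates is what motivates the acoustic characterisation of vowels and the geometric characterisation of the occlusives.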
[Figure 3: A simple modulation of the acoustic force field generating [u] from the neutral posture. From left to right, top: F2/F3 versus F1 for vowels, and the TT/TD distances versus Al for occlusives; note the quasi-linear trajectory in the F1/F2 plane. Bottom: resulting formant trajectories (thick lines, on a Bark scale) superposed with the emergence functions (thin lines), and the resulting articulatory gesture. Here the tongue dorsum and jaw raise whereas the tongue tip lowers and the lips close; the tongue body is pulled back.]

5. Voluntary motion by modulating force fields

Once sensori-motor representations of sound targets have been built, we have to verify that sound sequences can effectively be generated using a composite and superpositional control of attractor fields.

5.1. Vowels

First, we have to verify that vocalic sounds may be produced and chained adequately, and that force fields generated in a structured acoustic space still pull articulatory gestures towards prototypical articulatory targets. The movement equation is

\dot{a} = \mathrm{pinv}(J_{a \to A}) \cdot \vec{F}_A,

where \vec{F}_A and \dot{a} are respectively the resulting driving acoustic force and the back-propagated articulatory velocity. The driving force equals the sum of the gradients of the probability functions of each vowel V, weighted by its emergence k_V(t). Each probability function is defined by its mean \mathrm{mean}_V and covariance matrix \mathrm{cov}_V. Only the acoustic characteristics [\mathrm{mean}_V]_A of the vocalic targets are considered, as follows:

\vec{F}_A(t) = \sum_V k_V(t) \, \mathrm{cov}_V^{-1} \, ([\mathrm{mean}_V]_A - A(t)), \qquad \text{with} \quad \sum_V k_V(t) = 1.
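A rough numerical reading of this movement equation is sketched below, under stated assumptions: the Jacobian, the target means and covariances, and the emergence schedule are illustrative stand-ins in normalised units, not values fitted to the articulatory plant.

```python
# Sketch of the composite attractor dynamics of section 5.1: the acoustic driving
# force is the emergence-weighted sum of Gaussian gradients, back-projected
# through the pseudo-inverse of the articulatory-to-acoustic Jacobian.
import numpy as np

rng = np.random.default_rng(2)
J = rng.normal(size=(3, 8))                          # stand-in Jacobian: 8 articulators -> 3 formants
targets = {                                          # (mean, covariance) of each acoustic attractor
    "neutral": (np.zeros(3),                 0.1 * np.eye(3)),
    "u":       (np.array([-1.0, -2.0, 0.5]), 0.1 * np.eye(3)),
}

def emergence(t, T=1.0):
    """Overlapping emergence functions k_V(t) of the two attractors, summing to one."""
    k_u = min(max(t / T, 0.0), 1.0)
    return {"neutral": 1.0 - k_u, "u": k_u}

a, dt = np.zeros(8), 0.01                            # articulatory state and time step
for n in range(150):
    A = J @ a                                        # current acoustic frame
    k = emergence(n * dt)
    # driving force: F_A = sum_V k_V(t) cov_V^{-1} ([mean_V]_A - A(t))
    F = sum(k[v] * np.linalg.inv(cov) @ (mean - A) for v, (mean, cov) in targets.items())
    a = a + dt * np.linalg.pinv(J) @ F               # back-projected articulatory velocity
print(J @ a)                                         # acoustic frame ends close to the /u/ target mean
```

Because the driving force is built from Gaussian gradients the field has no singularities, and the weighting by overlapping emergence functions is what lets adjacent targets blend into a single articulatory gesture.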

Fig. 3 shows the acoustic, geometric and articulatory trajectories produced by modulating the force field from the neutral attractor towards the /u/ vowel.

[Figure 4: Starting from the neutral posture, the acoustic force field generating [a] is perturbed by the [b] geometric attractor, characterised by its first principal axis at Al = 0.]

5.2. Occlusives

We have shown above how articulation may be driven by a back-propagated modulation of an acoustic field. We suppose here that this carrier acoustic gesture is primarily modulated by vocalic targets, whose emergence functions are characterised by slow, overlapping transition functions whose sum equals one. We have shown that this carrier gesture can react to unexpected articulatory perturbations [6]. Occlusives may be seen as voluntary perturbations (see Fig. 4): the geometric trajectory deviates from the one produced by the acoustic driving field because of the emergence of plosive-specific geometric attractors.

6. Conclusions

We have described a strategy for giving an articulatory model the gift of speech, i.e. a learning paradigm that enriches its internal representations from experience. These internal sensori-motor representations are emergent because they are by-products of a first global audiovisual-to-articulatory inversion. Thanks to an appropriate selective use of these representations, the controller produces skilled actions and reacts to unexpected perturbations. Consonants may be seen as planned perturbations. We have to extend this paradigm to consonants other than those studied here. The next step is the learning and control of timing: how temporal relationships can be implemented, both in terms of sequential and dynamic constraints, and how the phasing between articulation and phonation can be handled.

7. References

1. Badin, P. and Fant, G. Notes on vocal tract computations. STL-QPSR /3, 538, 1984.
2. Badin, P., Gabioud, B., Beautemps, D., Lallouache, T., Bailly, G., Maeda, S., Zerling, J.P., and Brock, G. Cineradiography of VCV sequences: articulatory-acoustic data for a speech production model. In International Congress on Acoustics, pages 34935, Trondheim, Norway, 1995.
3. Badin, P., Mawass, K., Bailly, G., Vescovi, C., Beautemps, D., and Pelorson, X. Articulatory synthesis of fricative consonants: data and models. In ETRW on Speech Production, pages 4, Autrans, France, 1996.
4. Bailly, G. Characterisation of formant trajectories by tracking vocal tract resonances. In Sorin, C., Mariani, J., Méloni, H., and Schoentgen, J., editors, Levels in speech communication: relations and interactions, pages 9. Elsevier, Amsterdam, 1995.
5. Bailly, G. Recovering place of articulation for occlusives in VCVs. In International Congress of Phonetic Sciences, pages 333, Stockholm, Sweden, 1995.
6. Bailly, G. Sensori-motor control of speech movements. In ETRW on Speech Production Modelling, Autrans, 1996.
7. Bailly, G., Boë, L.J., Vallée, N., and Badin, P. Articulatory-acoustic prototypes for speech production. In Proceedings of the European Conference on Speech Communication and Technology, pages 9396, Madrid, 1995.
8. Beautemps, D., Badin, P., Bailly, G., Galvàn, A., and Laboissière, R. Evaluation of an articulatory-acoustic model based on a reference subject. In ETRW on Speech Production, pages 4548, Autrans, France, 1996.
9. Boë, L.J., Schwartz, J.L., and Vallée, N. The prediction of vowel systems: perceptual contrast and stability. In Keller, E., editor, Fundamentals of speech synthesis and speech recognition, pages 854. John Wiley and Sons, Chichester, 1994.
10. Honda, M. and Kaburagi, T. A dynamical articulatory model using potential task representation. In International Conference on Speech and Language Processing, pages 7984, Yokohama, Japan, 1994.
11. Jordan, M.I. Supervised learning and systems with excess degrees of freedom. COINS Tech. Rep. 88-7, University of Massachusetts, Computer and Information Sciences, Amherst, MA, 1988.
12. Morasso, P. and Sanguineti, V. Representation of space and time in motor control. In Bailly, G., editor, SPEECH MAPS - WP3: Dynamic constraints and motor controls, chapter Deliverable: Learning with the Articulotron I, pages 4-86. Institut de la Communication Parlée, Grenoble, France, 1994.
13. Saltzman, E.L. and Munhall, K.G. A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1(4):6563, 1989.
14. Savariaux, C., Perrier, P., and Orliaguet, J.P. Compensation strategies for the perturbation of the rounded vowel [u] using a lip-tube: A study of the control space in speech production. Journal of the Acoustical Society of America, 5:4844, 1995.
15. Sussman, H.M., McCaffrey, H.A., and Matthews, S.A. An investigation of locus equations as a source of relational invariance for stop place categorization. Journal of the Acoustical Society of America, 90(3):3935, 1991.

Sound File References: [qaba.wav] [qabi.wav] [qada.wav] [qaga.wav]