GREEK EMOTIONAL DATABASE: CONSTRUCTION AND LINGUISTIC ANALYSIS

Panagiotis Zervas, Nikos Fakotakis, Irini Geourga, George Kokkinakis
University of Patras

Key words: emotional speech, speech synthesis, prosody, fundamental frequency, pitch contour, declination phenomenon, duration, speech intensity

1 Introduction

When compared to human speech, synthesized speech is distinguished by insufficient intelligibility, inappropriate prosody and inadequate expressiveness. These are serious drawbacks for conversational human-machine interfaces. Prosody, that is intonation (melody) and rhythm, clarifies syntactic structure, disambiguates meaning and helps control the flow of discourse. Moreover, expressiveness, or affect, provides information about the speaker's mental state and intentions beyond what is revealed by word content.

The quality of synthetic speech has been greatly improved by continuous research in speech science. Nevertheless, most of these improvements were aimed at simulating natural speech as uttered by a professional announcer reading text in a neutral speaking style. Because it mimics this style, the synthetic voice turns out rather monotonous: suitable for some man-machine applications, but not for a vocal prosthesis device such as the communicators used by disabled people. In short, synthesized speech is mainly distinguished by lower intelligibility, unnatural prosody and a lack of expressiveness, which are important drawbacks for human-computer speech communication.

Our work comprises a systematic study of speech with emotional expression, in order to model the effects of emotion at the signal level. The aim of this research is to improve the naturalness of the voice in text-to-speech systems. Emotions serve three main functions: they reflect the result of a concrete stimulus in relation to the needs and preferences of the individual, they prepare the organism bodily and psychologically for concrete actions, and they communicate the person's psychological state to the surrounding environment. The major obstacle in research on human emotions is the difficulty of describing them strictly (i.e. there is a degree of subjectivity).

The Greek emotional speech database was recorded under laboratory conditions; the speech corpus was declaimed by a professional Greek actress following a standard data recording procedure. This was necessary in order to systematically record the same utterance with different emotional content. It is shown in (Montero et al. 1998) that recordings with actors are good approximations to true emotional speech. To prevent the semantic meaning of a sentence from interfering with a listener's decision on its emotional content, we attempted to construct semantically neutral sentences. In this work we give a detailed description of the composition of an emotional speech database for Greek.

2 Database Construction

For the study and analysis of prosody, we first chose a number of sentences to compose our corpus. The corpus was designed so that each phoneme occurs in various positions in a word (initial, medial, final); in that way phonemes can be extracted and used as structural elements in a text-to-speech (TTS) system inventory. Sentences were extracted from passages and newspapers or were composed by a professional linguist. The final corpus comprised ten single words, twenty short sentences, twenty-five long sentences and twelve passages of fluent speech (ranging from three to five sentences each). All sentences were emotionally neutral, meaning that they do not convey any emotional charge through lexical, syntactic or semantic means.

The thirty-year-old speaker recorded for the database has the standard Greek accent as spoken in Athens and has been a professional actress for almost ten years. She was instructed to read all the utterances with one emotion, then change emotion and start over again. In that way we ensured that the speaker did not have to change emotion more than five times (expressing sadness, anger, fear, joy and neutral).

3 Evaluation of the Natural Voice

Following the recordings, a listening test was performed to check whether ordinary listeners could identify the emotion that characterized the recorded utterances. Six qualified listeners took part, both men and women, of different ages and from several social environments; none of them was accustomed to synthetic speech. The stimuli for the evaluation were five neutral-content sentences (twenty recordings per listener), played in random order. The evaluation took place in two parts. First, a free response test was held, in which the listeners labeled each utterance with whatever emotion they found appropriate; second, they were forced to choose among the four emotions that were included in our database. The results are tabulated in Table 1.

Emotion    Free Response Test    Forced Response Test
Sadness    97.1%                 97.5%
Anger      97.8%                 98.2%
Joy        84%                   89%
Fear       68%                   74%

Table 1: Free and forced response test results

4 Parameters for emotional speech description

With a view to describing phonetic behaviour under the effect of concrete emotional states, contemporary researchers have studied various parameters (effect on the F0 contour, variation in the number of pauses, length of pauses, ratio of pause duration to total phonation time, speech rate, the median fundamental frequency, the average pitch range and the rate of F0 change) (Murray and Arnott, 1995). Taking all the above into account, we settled on a set of features for the description of each emotional state, composed of:

- Fundamental frequency (F0)
- Speech intensity
- Speech duration at various levels (sentence, word, phoneme)

The above parameters were adopted as the most efficient and most important factors for the recognition and variation of the emotions recorded in our database. The following sections give a detailed description and statistical analysis of the measured variations.

4.1 Fundamental Frequency Parameter

For adding emotional characteristics to synthetic speech, the analysis, modeling and finally the generation of the pitch contour are essential. The fundamental frequency (F0) contour was extracted for each sentence in our corpus. We started with the analysis of the F0 of the neutral session and then proceeded to the analysis of each emotional counterpart. The F0 contour of each emotional session was compared with its neutral counterpart. The quantitative description of the F0 contours for each emotional state was conducted using the declination phenomenon. The values of B start, B end and B slope of the neutral sessions were compared with those of their emotional versions. These values characterize the baseline of an F0 contour.

Emotion    B start Variation (Raise)    B end Variation (Raise)    B slope Variation
Sadness    17.83%                       18.56%                     2.2% (raise)
Joy        54.70%                       11.67%                     20% (decrement)
Anger      33.43%                       11.20%                     11.1% (decrement)
Fear       20.12%                       18.5%                      2.3% (decrement)

Table 2: Emotional / Neutral speech fundamental frequency parameters variation

Comparison of the B start, B end and B slope values showed that B start rises for all emotional states relative to its neutral equivalent. B end also rises in the emotional versions of the utterances. As regards B slope, there was no clear tendency across the emotional states.

4.1.1 Comparing F0 Contours

Inspection of the F0 contours of neutral utterances and their emotional versions led us to the following conclusions:

- The emotional version of each utterance had a contour similar to its neutral counterpart, but shifted to higher frequencies.
- Pitch accent phenomena were still present, but to a higher degree.
- Emotional versions (mostly anger and joy) seem to have a higher speech rate. For example, pitch accents such as L*+H (Arvaniti and Baltazani 2000) were transformed to H* because of the higher speech rate.
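The baseline comparison of Section 4.1 can be illustrated with a small sketch. The text does not say how the baseline was obtained, so the sketch assumes a least-squares line through the local minima of the voiced F0 contour; the function names fit_baseline and percent_change are hypothetical. B start and B end are read off the fitted line at the first and last time stamp, and B slope is its gradient.

```python
from typing import List, Tuple


def fit_baseline(times: List[float], f0: List[float]) -> Tuple[float, float, float]:
    """Fit a straight baseline to the valleys of a voiced F0 contour.

    Returns (b_start, b_end, b_slope): baseline value in Hz at the first and
    last time stamp, and the slope in Hz per second.
    Assumption: the baseline is a least-squares line through F0 local minima;
    the paper does not specify its own procedure.
    """
    if len(times) != len(f0) or len(f0) < 2:
        raise ValueError("need at least two paired (time, F0) points")
    # Interior points that are not higher than either neighbour count as valleys.
    idx = [i for i in range(1, len(f0) - 1)
           if f0[i] <= f0[i - 1] and f0[i] <= f0[i + 1]]
    if len(idx) < 2:
        # Too few valleys: fall back to the two end points of the contour.
        idx = [0, len(f0) - 1]
    xs = [times[i] for i in idx]
    ys = [f0[i] for i in idx]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * times[0], intercept + slope * times[-1], slope


def percent_change(emotional: float, neutral: float) -> float:
    """Relative change of an emotional value against its neutral counterpart, in percent."""
    return 100.0 * (emotional - neutral) / abs(neutral)
```

Calling fit_baseline on a neutral utterance and on its emotional version, and then passing the paired values to percent_change, yields variations of the kind reported in Table 2.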

Picture 1: Emotional / Neutral speech pitch contour

4.3 Speech Intensity Parameter

In order to verify whether there are non-random differences in the intensity of emotional speech, we calculated the energy per window (256 samples). We then calculated the change in energy of each window relative to the mean energy of the corresponding utterance. By inspecting the resulting graphs we concluded that the distribution of intensity relative to the mean energy of the utterance is the same for emotional and neutral speech. To interpret the behaviour of intensity in each emotional state, we examined phoneme energy. One category of phonemes (fricatives, plosives) showed unbalanced behaviour, in some cases having almost zero energy and in others exaggerated values. The main reason was that the behaviour of these phonemes was a function of the recording conditions.
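The windowed energy measure described in Section 4.3 could be computed roughly as follows. This is a minimal sketch under stated assumptions: windows are non-overlapping, energy is the sum of squared samples, and the "change against the mean" is expressed as a ratio to the mean window energy of the utterance. Only the window length of 256 samples comes from the text; the function names are hypothetical.

```python
from typing import List

WINDOW = 256  # window length in samples, as stated in the text


def window_energies(samples: List[float], window: int = WINDOW) -> List[float]:
    """Energy (sum of squared samples) of each non-overlapping window.

    A trailing fragment shorter than one window is ignored (an assumption).
    """
    return [
        sum(s * s for s in samples[start:start + window])
        for start in range(0, len(samples) - window + 1, window)
    ]


def relative_energy(energies: List[float]) -> List[float]:
    """Each window energy divided by the mean window energy of the utterance."""
    if not energies:
        return []
    mean = sum(energies) / len(energies)
    if mean == 0.0:
        # A completely silent utterance has no meaningful ratio.
        return [0.0 for _ in energies]
    return [e / mean for e in energies]
```

Plotting the output of relative_energy for a neutral utterance next to that of its emotional version gives a comparison of the two intensity distributions.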

Picture 2: Neutral/Joy Intensity Distribution Examples

4.4 Speech Rate Parameter

Speech rate is known to be a variable affecting timing in a speech signal, but one that is difficult to quantify. Absolute measures of duration in text tell little about the relative lengths of segments, and account must be taken of all other factors involved if relative values such as long, short, fast or slow are to be applied. Picture 3 depicts the mean duration of the phonemes of our database for the neutral session.

Picture 3: Neutral session phonemes mean duration

For the measurement of duration at the sentence level we obtained the following results. Regarding anger, we had a 60% decrease of sentence duration with a 958%. In fear, we had a 90% decrease with a 7%. For the sadness session, there was a 100% raise of duration with a 135%. In joy, there was no clear tendency toward a rise or a decrease of duration. Picture 4 depicts these observations.
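A duration comparison of this kind, whether at sentence or phoneme level, can be sketched as below. The sketch assumes that each neutral unit is paired with the same unit in the emotional session, and that the reported figures are the share of units that became shorter or longer together with the mean relative change among those units. That reading, and the name duration_shift, are assumptions and are not taken from the paper.

```python
from typing import Dict, List


def duration_shift(neutral: List[float], emotional: List[float]) -> Dict[str, float]:
    """Summarise how paired durations change from neutral to emotional speech.

    All returned values are percentages. Durations may be in any unit as long
    as both lists use the same one and neutral durations are positive.
    """
    if len(neutral) != len(emotional) or not neutral:
        raise ValueError("need two non-empty lists of equal length")
    # Relative change of every unit against its neutral counterpart.
    changes = [100.0 * (e - n) / n for n, e in zip(neutral, emotional)]
    shorter = [c for c in changes if c < 0]
    longer = [c for c in changes if c > 0]
    total = len(changes)
    return {
        "share_shorter": 100.0 * len(shorter) / total,
        "mean_decrease": -sum(shorter) / len(shorter) if shorter else 0.0,
        "share_longer": 100.0 * len(longer) / total,
        "mean_increase": sum(longer) / len(longer) if longer else 0.0,
    }
```

Applied to phoneme durations, share_shorter and mean_decrease correspond to statements of the form "X% of the phonemes showed a decrease of duration by Y%".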

Picture 4: Sentence level emotional sessions duration

A further analysis of emotional speech duration was carried out by measuring it at the phoneme level. From this analysis of our data we obtained the following results. Regarding anger, 69.8% of the phonemes showed a decrease in duration of 16.2% against the neutral counterpart. In the fear session, 77.5% of the phonemes showed a decrease of 17%. For the emotion of sadness, 82.1% showed a rise in duration of 22%. In joy, 56.2% of the phonemes showed a decrease in duration of 15.1% relative to their neutral equivalents.

Picture 5: Phoneme level emotional sessions duration

5 Conclusion

The recorded emotional speech database represents a good base for emotional speech analysis and is also usable for emotional speech synthesis. Possible improvements include undercover recording of real emotions in natural environments, automation of the post-processing phase (labeling, segmentation) and additional recordings of amateur speakers for emotional consistency analysis. A close inspection of the results of our research supports our initial hypothesis that emotional variation of speech can be achieved, up to a point, by slight manipulation of the three fundamental parameters we analyzed, namely pitch, speech rate and speech intensity (Murray and Arnott, 1995).

References

Arvaniti, A. and Baltazani, M., GREEK ToBI: A System for the Annotation of Greek Speech Corpora, LREC 2000, Vol. II, 555-562.
Banse, R. and Scherer, K. R., Acoustic Profiles in Vocal Emotion Expression, Journal of Personality and Social Psychology, 70(3):614-636, 1996.
Hillenbrand, J., Perception of aperiodicities in synthetically generated voices, JASA, 83:2361-70, June 1988.
Kienast, M., Paeschke, A. and Sendlmeier, W. F., Articulatory Reduction in Emotional Speech, Proc. Eurospeech, Budapest, 1:117-120, 1999.
Klatt, D. H. and Klatt, L. C., Analysis, Synthesis and Perception of Voice Quality Variations among Female and Male Talkers, JASA, 87(2):820-856, 1990.
Montero, L. M., Gutierrez-Arriola, J., Palazuelos, S., Enriquez, E., Aguilera, S. and Pardo, J. M., Emotional Speech Synthesis: From Speech Database to TTS, ICSLP 1998.
Murray, I. R. and Arnott, J. L., Implementation and testing of a system for producing emotion-by-rule in synthetic speech, Speech Communication, 16:369-390, 1995.
Murray, I. R. and Arnott, J. L., Towards the Simulation of Emotion in Synthetic Speech: A Review of the Literature on Human Vocal Emotion, JASA, 93(2):1097-1108, 1993.
Rank, E. and Pirker, H., Generating Emotional Speech with a Concatenative Synthesizer, Proc. ICSLP, Sydney, 975-978, 1998.
Vroomen, J., Collier, R. and Mozziconacci, S., Duration and intonation in emotional speech, Institute for Perception Research, Eindhoven.
