Determining Emotion in Speech
- Frederick Caldwell
1 Determining Emotion in Speech
Charles Van Winkle, University of Washington
2 Reviewed Literature
- "Toward Detecting Emotions in Spoken Dialogs" (2005), Chul Min Lee and Shrikanth S. Narayanan
- "Detecting emotional state of a child in a conversational computer game" (2010), Serdar Yildirim, Shrikanth Narayanan, and Alexandros Potamianos
3 Toward Detecting Emotions in Spoken Dialogs: Observations & Claims
- The role of spoken language interfaces in human-computer interaction applications has increased, so automatically recognizing emotions from human speech has grown in importance.
- Research in understanding and modeling human emotions is increasingly attracting attention from the engineering community.
- There is an increasing need to know not only what information a user conveys but how it is conveyed.
- Emotions are important in human communication and decision-making; an intelligent human-machine interface should be able to accommodate human emotions in an appropriate way.
4 Toward Detecting Emotions in Spoken Dialogs: Challenges & Claims
- It is difficult to define precisely what "emotion" means, and there is disagreement on the number of emotion categories.
- It may not be necessary or practical to recognize a large variety of emotions when developing algorithms for conversational interfaces.
- Long-term properties such as moods must be reconciled with short-term emotional states.
- Previous studies show promise in using higher-level linguistic information for emotion recognition.
5 Toward Detecting Emotions in Spoken Dialogs: The Old, Acoustic Signal Pattern Recognition
- Maximum-likelihood Bayes classification
- Kernel regression
- K-nearest neighbor methods
- Fisher linear discriminant methods
- Ensembles of neural networks
6 Toward Detecting Emotions in Spoken Dialogs: The Old, Acoustic Features
- Pitch-related features: fundamental frequency (aka pitch, aka F0) and other formant frequencies; pitch contour
- Energy
- Timing features: speech rate; boundaries of phrases/words/phonemes
- Spectral information: voiced and unvoiced portions
7 Toward Detecting Emotions in Spoken Dialogs: The Old, Discourse Information
- Has been used in conjunction with acoustic correlates: topic and/or sub-dialog, repetition, correction information, use of swear words, negation.
- How to combine the different information sources (e.g. acoustic & discourse)?
- Fusion at the feature level suffers from potential dimensionality issues in classification as feature sizes increase.
8 Toward Detecting Emotions in Spoken Dialogs: The New
- Favor the notion of application-dependent emotions: examine a reduced space of negative emotions (anger and frustration in human speech) vs. non-negative emotions (the complement).
- Data set: speech signals derived from a commercially deployed automatic call center dialog system.
- Combine various aspects of spoken language information: acoustic, lexical, and discourse.
- Intended use: detection of negative emotions as a strategy to improve the quality of service in automated call center applications.
9 Toward Detecting Emotions in Spoken Dialogs: The Plan, Acoustic
- Leverage previously published results and use a number of acoustic correlates.
- Systematically reconcile them through feature selection and feature reduction.
10 Toward Detecting Emotions in Spoken Dialogs: The Plan, Discourse
- Separate users' responses into 5 categories:
  - Rejection (found more often in negative-emotion utterances)
  - Repetition
  - Rephrase
  - Ask-Start Over
  - None of the Above (mostly factual responses to voice prompts, e.g. giving the name of a person or place in the corpus)
11 Toward Detecting Emotions in Spoken Dialogs: The Plan, Language
- Introduce a new method for estimating the emotion information conveyed by words (and by sequences of words): automatically calculate the emotional salience of words in the specific (constrained) data corpus.
- Emotional salience is a measure of how much information a word provides about a given emotion category.
- How to combine the various information sources? Fusion at the decision level, using linear discriminant classifiers with Gaussian class-conditional probabilities and k-nearest neighbor classifiers.
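The definition of emotional salience above ("how much information a word provides about an emotion category") reads naturally as the mutual information between a word and the emotion classes. A minimal sketch of that computation, which is my reading rather than the authors' code:

```python
import math
from collections import Counter, defaultdict

def emotional_salience(utterances):
    """Estimate each word's salience as the mutual information (in bits)
    between the word's presence and the emotion class.
    `utterances` is a list of (list_of_words, emotion_label) pairs."""
    class_counts = Counter(emotion for _, emotion in utterances)
    total = sum(class_counts.values())
    prior = {e: n / total for e, n in class_counts.items()}

    # Count, for each word, how often it appears in each emotion class.
    word_class = defaultdict(Counter)
    for words, emotion in utterances:
        for w in set(words):
            word_class[w][emotion] += 1

    salience = {}
    for w, counts in word_class.items():
        n_w = sum(counts.values())
        s = 0.0
        for e, n in counts.items():
            p_e_given_w = n / n_w
            s += p_e_given_w * math.log2(p_e_given_w / prior[e])
        salience[w] = s
    return salience
```

On a toy corpus, a word that occurs only in negative utterances ("damn") gets high salience, while a word split evenly across classes ("no" here) gets zero; thresholding these scores is what builds the salient-word dictionary described later.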
12 Toward Detecting Emotions in Spoken Dialogs: The Data, Observations & Claims
- Most studies of emotion recognition in speech have used actors' voices: single utterances of archetypal emotions in non-dialog settings. Results from these may not generalize to human-machine interaction scenarios.
- Real data suffers from coverage problems: vast amounts of data are needed to characterize various emotions in various contexts.
- A limited-domain approach allows in-depth focus on a finite set of emotions, using significant amounts of data obtained from realistic human-machine interactions.
13 Toward Detecting Emotions in Spoken Dialogs: The Data
- Speech data: 8 kHz, 8-bit, µ-law compression.
- Obtained from real users engaged in spoken dialog with a machine agent in a commercially deployed call center application.
- 1187 calls, each with an average of 6 utterances; about 7200 total utterances.
- The database was whittled down from thousands of calls to include only the fraction with potentially negative emotions; the authors used some automatic pre-processing and subjective tagging by 4 different human listeners.
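The corpus audio is 8 kHz, 8-bit µ-law, i.e. standard telephone speech. As a reminder of what that companding does to the signal, here is the continuous µ-law curve (µ = 255, as in G.711) in a minimal sketch:

```python
import math

MU = 255  # standard value for 8-bit mu-law telephony audio

def mulaw_encode(x):
    """Compress a sample in [-1, 1] with the mu-law companding curve:
    small amplitudes are boosted before 8-bit quantization."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_decode(y):
    """Invert the companding curve."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
```

The round trip is lossless in this continuous form; the actual 8-bit codec additionally quantizes the encoded value, which is where the compression loss comes from.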
14 Toward Detecting Emotions in Spoken Dialogs: Acoustic
- Fundamental frequency (F0): mean, median, standard deviation, maximum, minimum, range, and linear regression coefficient
- Energy: mean, median, standard deviation, maximum, minimum, range, and linear regression coefficient
- Duration: speech rate, ratio of duration of voiced and unvoiced regions, and duration of the longest voiced speech
- Formants: first and second formant frequencies (F1, F2) and their bandwidths (BW1, BW2); also the mean of each feature
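Each of the F0 and energy statistics above is an utterance-level functional of a frame-wise contour. A small sketch, assuming the "linear regression coefficient" is the least-squares slope over the frame index:

```python
import numpy as np

def contour_stats(contour):
    """The seven utterance-level statistics listed above, computed over
    a frame-wise contour (e.g. F0 or energy values per frame)."""
    x = np.asarray(contour, dtype=float)
    t = np.arange(len(x))
    slope = np.polyfit(t, x, 1)[0]  # first-order fit; [0] is the slope
    return {
        "mean": x.mean(),
        "median": float(np.median(x)),
        "std": x.std(),
        "max": x.max(),
        "min": x.min(),
        "range": x.max() - x.min(),
        "lin_reg_coeff": slope,
    }
```

In practice F0 statistics are computed over voiced frames only, since F0 is undefined in unvoiced regions.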
15 Toward Detecting Emotions in Spoken Dialogs: Acoustic
- Forward selection to reduce dimensionality, producing two sets of rank-ordered selected features: the 10 best and the 15 best.
- Principal component analysis to possibly reduce dimensionality further.
- Male 15-best: ratio of duration of voiced and unvoiced regions, energy std. dev., energy median, F0 regression coeff., F0 median, energy regression coeff., energy max, energy min, energy range, duration of the longest voiced speech, F0 mean, BW1, F0 max, BW2
- Female 15-best: ratio of duration of voiced and unvoiced regions, energy median, F0 regression coeff., speech rate, energy min, duration of the longest voiced speech, energy regression coeff., F0 median, F0 mean, F1, energy mean, energy max, F0 max, energy range, energy std. dev.
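Forward selection is a greedy wrapper method: starting from an empty set, repeatedly add the feature whose inclusion most improves a classification criterion. The sketch below uses leave-one-out 1-nearest-neighbor accuracy as an illustrative criterion; the paper's actual classifier and scoring may differ:

```python
import numpy as np

def loo_nn_accuracy(X, y):
    """Leave-one-out 1-nearest-neighbor accuracy, used here as the
    selection criterion (a stand-in for the paper's classifier)."""
    correct = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        correct += y[np.argmin(d)] == y[i]
    return correct / len(X)

def forward_select(X, y, k_best):
    """Greedy sequential forward selection: at each step, add the
    feature whose inclusion maximizes the criterion. Returns the
    rank-ordered indices of the selected features."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k_best and remaining:
        best = max(remaining,
                   key=lambda j: loo_nn_accuracy(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because the procedure is greedy and rank-ordered, the "10-best" set is simply a prefix of the "15-best" set, which matches the two rank-ordered lists on the slide.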
16 Toward Detecting Emotions in Spoken Dialogs: Lexical

  Word      Salience  Emotion
  Wrong     0.72      Negative
  Computer  0.72      Negative
  Damn      0.72      Negative
  No        0.45      Negative
  Arrival   0.33      Non-Negative
  Phoenix   0.33      Non-Negative
  Delayed   0.21      Non-Negative
  Baggage   0.20      Non-Negative

- After salience calculation, a salient-word-pair dictionary was constructed by retaining only word pairs whose salience values exceed a pre-chosen threshold, optimized on held-out data. Gender-independent.
17 Toward Detecting Emotions in Spoken Dialogs: Lexical [figure]
18 Toward Detecting Emotions in Spoken Dialogs: Discourse
[Table: counts of Negative vs. Non-Negative utterances per discourse tag (Rejection, Repeat, Rephrase, Ask-Startover, None), broken out by Male, Female, and Total; values not preserved in the transcription]
- Labeling was performed by one person, based on utterance transcriptions.
- A rephrase is an imperfect repeat, and eventually becomes the same category.
19 Toward Detecting Emotions in Spoken Dialogs: Error
- k = 8 for male; k = 4 for female
20 Toward Detecting Emotions in Spoken Dialogs: Error, M/F [figure]
21 Toward Detecting Emotions in Spoken Dialogs: Error, Male [figure]
22 Toward Detecting Emotions in Spoken Dialogs: Error, Female [figure]
24 computer game: Observations & Claims
- Over the last few years, attention to automatic recognition of users' communicative styles within spoken dialog system frameworks has increased.
- It is important to know not only what was said but also how it was communicated to a dialog system.
- Enabling automatic emotion recognition within a multimodal dialog system is an emerging trend; detecting the user's emotion can help make such systems more natural and responsive.
25 computer game: Observations & Claims
- Currently deployed spoken dialog interfaces are limited in handling the rich information contained in speech, so their scope in supporting natural human-machine interaction is limited as well.
- Much of the work on emotion analysis focuses on databases of acted speech. This provides certain useful knowledge, but it is preferable to work on data that is directly representative of, and suitable for, the target domain application.
26 computer game: Challenges & Claims
- Most research on emotion recognition is targeted primarily toward adult users, yet greater variability exists in the acoustic and linguistic characteristics of children's speech, and these parameters change with age and gender.
- Automatic recognition from speech is itself a difficult problem, and it may be difficult to elicit acted speech from children.
- Children are among the potential beneficiaries of computers with spoken interfaces, e.g. for educational applications and games.
- It is important to identify emotionally salient features as a function of gender and age group; it is not necessary to recognize a large set of emotions.
27 computer game: Survey of the Corpora
- Databases of children's speech are mostly used for acoustic analysis and modeling; some are read-speech corpora.
- Recent databases of child-machine spontaneous speech interaction:
  - Open-ended spoken dialog interaction between children and animated characters in a game setting
  - Data from children spontaneously communicating with the AIBO robot (emotional labeling for this corpus is available)
  - A corpus of child-machine spoken dialog interaction in a game setting (used in this paper)
28 computer game: Previous Acoustic Techniques
- Acoustic signal pattern recognition has been used to separate the emotional coloring present in (children's) speech.
- Popular features: phoneme-, syllable-, and word-level statistics corresponding to F0, energy, duration, spectral parameters, and voice quality parameters.
29 computer game: Previous Aggregate Techniques
- Previous studies show that younger children use fewer overt politeness markers and express more frustration than older children.
- Using speech and language features to predict student emotions in human-computer tutoring dialogs has been shown to improve accuracy.
- There are promising results from the combined use of acoustic, spectral, and language information for detecting confidence, puzzlement, and hesitation in child-machine dialog tasks.
- Language model features might be poor predictors of frustration.
- Emotion recognition performance can be improved by using contextual information in addition to acoustic features.
30 computer game: Proposal
- Focus on two attitudinal states, polite and frustrated; the authors believe this is well suited to the domain of child-computer interfaces.
- Data set: the Children's Interactive Multimedia Project (ChIMP) database.
- Combine various aspects of spoken language information (acoustic and language) and extend the notion of emotional salience.
- Intended use: detection of polite and frustrated states in children of different age groups and genders.
31 computer game: The Data
- Spontaneous child-machine spoken dialog interaction in a game setting: the task was to play "Where in the USA is Carmen Sandiego?", whose goal is to identify and arrest a cartoon criminal.
- Children had to interact with several animated characters to obtain clues; most children played the game twice.
- Contains speech data collected from 160 boys and girls (ages 6-14) using a Wizard-of-Oz technique; over 50,000 utterances.
32 computer game: The Data
- Researchers tagged speech from 103 of the 160 players as Neutral, Polite, or Frustrated.
- Results are presented as a function of age group and gender.
[Table: number of subjects per age group and gender; values not preserved in the transcription]
33 computer game: The Data
- Goals: identify age and gender trends in emotional state; identify lexical, semantic, and pragmatic markers of emotional state.
- Only utterances on which both labelers agree are used.
[Table: number of instances (speaker turns) per emotional class (Neutral, Polite, Frustrated) for each gender and age group; values not preserved in the transcription]
34 computer game: The Data
- Goals: identify age and gender trends in emotional state; identify lexical, semantic, and pragmatic markers of emotional state.
- Only utterances on which both labelers agree are used.

  Group        Neutral  Polite  Frustrated  Total
  [age group]  -        17%     14%         37%
  [age group]  -        20%     6%          35%
  [age group]  -        15.8%   16%         28%
  Male         72%      15%     13%         53%
  Female       69%      20%     11%         47%
  Total        70%      18%     12%         100%

(Percent of instances, i.e. speaker turns, per emotional class for each gender and age group; the age-group labels and some values were not preserved in the transcription.)
35 computer game: Lexical and Pragmatic Markers
- Polite: explicit markers ("please", "thank you", "excuse me") and implicit markers ("may I", "could you", "would you"); usage of explicit vs. implicit varies with age.
- Frustrated: typical lexical markers ("shut up", "oh man", "hurry", "oops", "heck").
- Pragmatic markers: repetition, or getting stuck in the same dialog state for multiple turns, often indicated that a child was experiencing difficulty with the task and getting frustrated.
36 computer game: Feature Extraction, Acoustic
- 384 features were extracted with the openSMILE feature-extraction toolkit.
- The features comprise utterance-level statistics of low-level descriptors (LLDs): pitch frequency, RMS energy, zero-crossing rate, harmonics-to-noise ratio, and MFCCs 1-12. Delta coefficients were also computed for each LLD.
- Twelve statistics from each LLD and its delta coefficients: mean, standard deviation, skewness, kurtosis, maximum and minimum value, relative position, range, and two linear regression coefficients with their mean square error.
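The twelve statistics can be sketched as functionals applied to each LLD (and delta) contour. One plausible reading of the list above, counting the relative positions of the maximum and minimum as two of the twelve, is the following; the exact definitions in the openSMILE configuration may differ slightly:

```python
import numpy as np

def functionals(lld):
    """Twelve utterance-level statistics of one LLD (or delta) contour.
    Skewness/kurtosis are the standardized 3rd/4th moments; the
    regression terms come from a least-squares line over the
    normalized frame position."""
    x = np.asarray(lld, dtype=float)
    n = len(x)
    t = np.arange(n) / max(n - 1, 1)          # normalized frame position
    mu, sd = x.mean(), x.std()
    z = (x - mu) / sd if sd > 0 else np.zeros(n)
    slope, offset = np.polyfit(t, x, 1)       # the two regression coefficients
    mse = np.mean((offset + slope * t - x) ** 2)
    return {
        "mean": mu, "std": sd,
        "skewness": np.mean(z ** 3), "kurtosis": np.mean(z ** 4),
        "max": x.max(), "min": x.min(), "range": x.max() - x.min(),
        "maxpos": np.argmax(x) / max(n - 1, 1),   # relative positions
        "minpos": np.argmin(x) / max(n - 1, 1),
        "slope": slope, "offset": offset, "mse": mse,
    }
```

With 16 contour streams (the LLDs plus their deltas) and 12 functionals each, this yields 16 × 12 × 2 = 384 features, matching the count on the slide.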
37 computer game: Feature Extraction, Lexical
- Certain words are associated with specific emotions and attitudes. Two modeling approaches are proposed, both used widely in the field:
  - Information-theoretic analysis for lexical feature selection, in conjunction with Bayesian classifiers: calculate emotional salience, then create a Bayesian classifier.
  - Latent semantic analysis (LSA) to transform the feature space, with cosine distance metrics to compute the emotional distance between utterances.
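The LSA branch can be sketched with a plain truncated SVD of a term-by-utterance count matrix, followed by cosine similarity in the latent space (a generic illustration, not the authors' pipeline):

```python
import numpy as np

def lsa_transform(term_doc, k):
    """Project a term-by-document count matrix into a k-dimensional
    latent space via truncated SVD; each document becomes one row
    of S_k @ V_k^T transposed (a k-vector per document)."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T          # shape: (docs, k)

def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Utterances that share salient vocabulary land close together in the latent space, so an unseen utterance can be scored by its cosine distance to class centroids.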
38 computer game: Feature Extraction
- Examples of salient word pairs by class (the Male/Female column split was lost in the transcription):
  - Frustrated: drop it, hey you, do it, no thank, find the, get me, you there, stop miss, not that, pick that, shut up, someone talk, need this, my pad, go talk, stop this, you repeat, I don't, you pick, to issue
  - Polite: stop please, you mind, hello there, doing mister, hello I'd, you good, suspect can, please show, you have, thanks can, please tell, person please, very much, please take, would you, the phone, you can, you get, look that, where'd she
- After salience calculation, a salient-word-pair dictionary was constructed by retaining only word pairs whose salience values exceed a pre-chosen threshold, optimized on held-out data.
39 computer game: Feature Extraction, Discourse and Contextual Information Modeling
- Model the relationship between emotional state and dialog state with a simple Bayesian model.
- Assume the emotional state depends directly on the dialog-state history (the past three states); context matters because emotions are persistent.
- Use the derivative of the acoustic features as an extra parameter.
- Examples for 5 of the 9 possible dialog states:

  User utterance                      Dialog state
  Can I talk to him please?           Talk2Him
  Tell me about the suspect           TellmeAbout
  Can I see my choices for height?    EnterFeature
  Tall for height                     EnterFeature
  Thank you                           CloseBook
  Tell me where did the suspect go    WhereDid
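A toy version of the "simple Bayesian model" over dialog-state history might look like the following. The add-one smoothing and the naive independence assumption across the past states are my simplifications, not necessarily the authors' formulation:

```python
from collections import Counter, defaultdict

class ContextModel:
    """Minimal sketch of a contextual model: estimate
    P(emotion | recent dialog-state history) by counting, assuming
    naive independence across history positions, with add-one smoothing."""

    def __init__(self, history_len=3):
        self.h = history_len
        self.emo_counts = Counter()
        self.state_counts = defaultdict(Counter)  # (offset, state) -> emotions

    def fit(self, dialogs):
        # dialogs: iterable of (dialog_states, emotions), aligned per turn
        for states, emotions in dialogs:
            for i, emo in enumerate(emotions):
                self.emo_counts[emo] += 1
                window = states[max(0, i - self.h + 1): i + 1]
                for offset, s in enumerate(reversed(window)):  # 0 = current
                    self.state_counts[(offset, s)][emo] += 1

    def predict(self, history):
        emos = list(self.emo_counts)
        total = sum(self.emo_counts.values())
        best, best_p = None, -1.0
        for e in emos:
            p = self.emo_counts[e] / total
            for offset, s in enumerate(reversed(history[-self.h:])):
                c = self.state_counts[(offset, s)]
                p *= (c[e] + 1) / (self.emo_counts[e] + len(emos))
            if p > best_p:
                best, best_p = e, p
        return best
```

Trained on turns where repeated dialog states co-occur with frustration, the model learns to raise P(frustrated) whenever the recent history contains those states, which is exactly the "stuck in the same dialog state" cue from the pragmatic-markers slide.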
40 computer game: Fusion of Classifiers
- Decision-level fusion of the acoustic, lexical, and contextual information sources.
- If the classifiers are statistical and calculate posterior probabilities, fusion can use the average of decisions or the product of decisions (assuming independence).
- The acoustic classifier does not fit that description: use a distance metric instead of a decision, then apply a sigmoidal transform.
- Two-way classification (politeness is more of a speaking style than an emotional state) and three-way classification.
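The two fusion rules and the sigmoidal transform for the distance-based acoustic classifier can be sketched directly (the scaling constant in the sigmoid is an illustrative knob, in practice tuned on held-out data):

```python
import math

def sigmoid_posterior(distance, alpha=1.0):
    """Map a classifier's distance score to a pseudo-posterior in (0, 1)
    so it can be fused with true posteriors; smaller distance -> higher
    pseudo-posterior."""
    return 1.0 / (1.0 + math.exp(alpha * distance))

def fuse_average(posteriors):
    """Average-of-decisions fusion over per-classifier posterior dicts."""
    classes = posteriors[0]
    return {c: sum(p[c] for p in posteriors) / len(posteriors) for c in classes}

def fuse_product(posteriors):
    """Product-of-decisions fusion (assumes classifier independence)."""
    classes = posteriors[0]
    return {c: math.prod(p[c] for p in posteriors) for c in classes}
```

The fused class is then the argmax of the combined scores; the product rule punishes any single confident dissenter harder than the average rule does.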
41 computer game: Acoustic Evaluation

  UAR (%)      MFCC   F0     RMS energy  Voicing  ZCR
  Male         67.9   45.4   42.9        40.3     41.1
  Female       70.4   51.3   44.9        45.8     46.3
  [age group]  -      51.7   44.2        45.7     44.2
  [age group]  -      47.9   45.2        44.0     43.6
  [age group]  -      49.3   44.0        43.0     44.9

- A k-nearest neighbor classifier (k-NN) with k = 3 was used; classification results for the three categories (neutral, polite, frustrated) are computed using 10-fold cross-validation.
- (The labels and MFCC values of the last three rows were not preserved in the transcription.)
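The metric in these tables, unweighted average recall (UAR), is simply the mean of the per-class recalls, so every class counts equally despite the heavy neutral-class imbalance noted earlier. A sketch:

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls: each class contributes equally,
    regardless of how many samples it has."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))
```

On imbalanced data, a classifier that always predicts the majority class gets high plain accuracy but poor UAR, which is why UAR is the standard choice for emotion recognition evaluations.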
42 computer game: Two-Way Classification, Polite vs. Others
[Table: unweighted average recall (%) for Male and Female per single information source: Acoustic, Lex1, Lex2, LSA, Context; values not preserved in the transcription]

43 computer game: Two-Way Classification, Polite vs. Others
[Table: UAR (%) for Male and Female per pairwise fusion: Acou + Lex1, Acou + Lex2, Acou + LSA, Acou + Ctxt; values not preserved]

44 computer game: Two-Way Classification, Polite vs. Others
[Table: UAR (%) for Male and Female per three-source fusion: Acou + Lex1 + C, Acou + Lex2 + C, Acou + LSA + C; values not preserved]

45 computer game: Two-Way Classification, Frustrated vs. Others
[Table: UAR (%) for Male and Female per single information source: Acoustic, Lex1, Lex2, LSA, Context; values not preserved]

46 computer game: Two-Way Classification, Frustrated vs. Others
[Table: UAR (%) for Male and Female per pairwise fusion: Acou + Lex1, Acou + Lex2, Acou + LSA, Acou + Ctxt; values not preserved]

47 computer game: Two-Way Classification, Frustrated vs. Others
[Table: UAR (%) for Male and Female per three-source fusion: Acou + Lex1 + C, Acou + Lex2 + C, Acou + LSA + C; values not preserved]

48 computer game: Three-Way Classification
[Table: three-way classification results in UAR (%) for Male and Female per single information source: Acoustic, Lex1, Lex2, LSA, Context; values not preserved]

49 computer game: Three-Way Classification
[Table: three-way UAR (%) for Male and Female per pairwise fusion: Acou + Lex1, Acou + Lex2, Acou + LSA, Acou + Ctxt; values not preserved]

50 computer game: Three-Way Classification
[Table: three-way UAR (%) for Male and Female per three-source fusion: Acou + Lex1 + C, Acou + Lex2 + C, Acou + LSA + C; values not preserved]
More informationSARDNET: A Self-Organizing Feature Map for Sequences
SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu
More informationA Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language
A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language Z.HACHKAR 1,3, A. FARCHI 2, B.MOUNIR 1, J. EL ABBADI 3 1 Ecole Supérieure de Technologie, Safi, Morocco. zhachkar2000@yahoo.fr.
More informationCS Machine Learning
CS 478 - Machine Learning Projects Data Representation Basic testing and evaluation schemes CS 478 Data and Testing 1 Programming Issues l Program in any platform you want l Realize that you will be doing
More informationAutomatic Pronunciation Checker
Institut für Technische Informatik und Kommunikationsnetze Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich Ecole polytechnique fédérale de Zurich Politecnico federale
More informationAcoustic correlates of stress and their use in diagnosing syllable fusion in Tongan. James White & Marc Garellek UCLA
Acoustic correlates of stress and their use in diagnosing syllable fusion in Tongan James White & Marc Garellek UCLA 1 Introduction Goals: To determine the acoustic correlates of primary and secondary
More informationEnglish Language and Applied Linguistics. Module Descriptions 2017/18
English Language and Applied Linguistics Module Descriptions 2017/18 Level I (i.e. 2 nd Yr.) Modules Please be aware that all modules are subject to availability. If you have any questions about the modules,
More informationMachine Learning and Data Mining. Ensembles of Learners. Prof. Alexander Ihler
Machine Learning and Data Mining Ensembles of Learners Prof. Alexander Ihler Ensemble methods Why learn one classifier when you can learn many? Ensemble: combine many predictors (Weighted) combina
More informationWE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT
WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT PRACTICAL APPLICATIONS OF RANDOM SAMPLING IN ediscovery By Matthew Verga, J.D. INTRODUCTION Anyone who spends ample time working
More information*Net Perceptions, Inc West 78th Street Suite 300 Minneapolis, MN
From: AAAI Technical Report WS-98-08. Compilation copyright 1998, AAAI (www.aaai.org). All rights reserved. Recommender Systems: A GroupLens Perspective Joseph A. Konstan *t, John Riedl *t, AI Borchers,
More informationProbability and Statistics Curriculum Pacing Guide
Unit 1 Terms PS.SPMJ.3 PS.SPMJ.5 Plan and conduct a survey to answer a statistical question. Recognize how the plan addresses sampling technique, randomization, measurement of experimental error and methods
More informationAN INTRODUCTION (2 ND ED.) (LONDON, BLOOMSBURY ACADEMIC PP. VI, 282)
B. PALTRIDGE, DISCOURSE ANALYSIS: AN INTRODUCTION (2 ND ED.) (LONDON, BLOOMSBURY ACADEMIC. 2012. PP. VI, 282) Review by Glenda Shopen _ This book is a revised edition of the author s 2006 introductory
More informationRobust Speech Recognition using DNN-HMM Acoustic Model Combining Noise-aware training with Spectral Subtraction
INTERSPEECH 2015 Robust Speech Recognition using DNN-HMM Acoustic Model Combining Noise-aware training with Spectral Subtraction Akihiro Abe, Kazumasa Yamamoto, Seiichi Nakagawa Department of Computer
More informationThe Effect of Discourse Markers on the Speaking Production of EFL Students. Iman Moradimanesh
The Effect of Discourse Markers on the Speaking Production of EFL Students Iman Moradimanesh Abstract The research aimed at investigating the relationship between discourse markers (DMs) and a special
More informationDialog Act Classification Using N-Gram Algorithms
Dialog Act Classification Using N-Gram Algorithms Max Louwerse and Scott Crossley Institute for Intelligent Systems University of Memphis {max, scrossley } @ mail.psyc.memphis.edu Abstract Speech act classification
More informationThink A F R I C A when assessing speaking. C.E.F.R. Oral Assessment Criteria. Think A F R I C A - 1 -
C.E.F.R. Oral Assessment Criteria Think A F R I C A - 1 - 1. The extracts in the left hand column are taken from the official descriptors of the CEFR levels. How would you grade them on a scale of low,
More informationIndividual Component Checklist L I S T E N I N G. for use with ONE task ENGLISH VERSION
L I S T E N I N G Individual Component Checklist for use with ONE task ENGLISH VERSION INTRODUCTION This checklist has been designed for use as a practical tool for describing ONE TASK in a test of listening.
More informationMETHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS
METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS Ruslan Mitkov (R.Mitkov@wlv.ac.uk) University of Wolverhampton ViktorPekar (v.pekar@wlv.ac.uk) University of Wolverhampton Dimitar
More informationLearning From the Past with Experiment Databases
Learning From the Past with Experiment Databases Joaquin Vanschoren 1, Bernhard Pfahringer 2, and Geoff Holmes 2 1 Computer Science Dept., K.U.Leuven, Leuven, Belgium 2 Computer Science Dept., University
More informationLip reading: Japanese vowel recognition by tracking temporal changes of lip shape
Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Koshi Odagiri 1, and Yoichi Muraoka 1 1 Graduate School of Fundamental/Computer Science and Engineering, Waseda University,
More informationLongest Common Subsequence: A Method for Automatic Evaluation of Handwritten Essays
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 6, Ver. IV (Nov Dec. 2015), PP 01-07 www.iosrjournals.org Longest Common Subsequence: A Method for
More informationMultilingual Sentiment and Subjectivity Analysis
Multilingual Sentiment and Subjectivity Analysis Carmen Banea and Rada Mihalcea Department of Computer Science University of North Texas rada@cs.unt.edu, carmen.banea@gmail.com Janyce Wiebe Department
More informationAn Empirical Analysis of the Effects of Mexican American Studies Participation on Student Achievement within Tucson Unified School District
An Empirical Analysis of the Effects of Mexican American Studies Participation on Student Achievement within Tucson Unified School District Report Submitted June 20, 2012, to Willis D. Hawley, Ph.D., Special
More informationGenerative models and adversarial training
Day 4 Lecture 1 Generative models and adversarial training Kevin McGuinness kevin.mcguinness@dcu.ie Research Fellow Insight Centre for Data Analytics Dublin City University What is a generative model?
More informationGuru: A Computer Tutor that Models Expert Human Tutors
Guru: A Computer Tutor that Models Expert Human Tutors Andrew Olney 1, Sidney D'Mello 2, Natalie Person 3, Whitney Cade 1, Patrick Hays 1, Claire Williams 1, Blair Lehman 1, and Art Graesser 1 1 University
More informationMath-U-See Correlation with the Common Core State Standards for Mathematical Content for Third Grade
Math-U-See Correlation with the Common Core State Standards for Mathematical Content for Third Grade The third grade standards primarily address multiplication and division, which are covered in Math-U-See
More informationIEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH 2009 423 Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition George
More informationInternational Journal of Computational Intelligence and Informatics, Vol. 1 : No. 4, January - March 2012
Text-independent Mono and Cross-lingual Speaker Identification with the Constraint of Limited Data Nagaraja B G and H S Jayanna Department of Information Science and Engineering Siddaganga Institute of
More informationSEGMENTAL FEATURES IN SPONTANEOUS AND READ-ALOUD FINNISH
SEGMENTAL FEATURES IN SPONTANEOUS AND READ-ALOUD FINNISH Mietta Lennes Most of the phonetic knowledge that is currently available on spoken Finnish is based on clearly pronounced speech: either readaloud
More informationCourse Law Enforcement II. Unit I Careers in Law Enforcement
Course Law Enforcement II Unit I Careers in Law Enforcement Essential Question How does communication affect the role of the public safety professional? TEKS 130.294(c) (1)(A)(B)(C) Prior Student Learning
More informationDesign Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm
Design Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm Prof. Ch.Srinivasa Kumar Prof. and Head of department. Electronics and communication Nalanda Institute
More informationSpeech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence
INTERSPEECH September,, San Francisco, USA Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence Bidisha Sharma and S. R. Mahadeva Prasanna Department of Electronics
More informationSouth Carolina English Language Arts
South Carolina English Language Arts A S O F J U N E 2 0, 2 0 1 0, T H I S S TAT E H A D A D O P T E D T H E CO M M O N CO R E S TAT E S TA N DA R D S. DOCUMENTS REVIEWED South Carolina Academic Content
More informationADVANCED MACHINE LEARNING WITH PYTHON BY JOHN HEARTY DOWNLOAD EBOOK : ADVANCED MACHINE LEARNING WITH PYTHON BY JOHN HEARTY PDF
Read Online and Download Ebook ADVANCED MACHINE LEARNING WITH PYTHON BY JOHN HEARTY DOWNLOAD EBOOK : ADVANCED MACHINE LEARNING WITH PYTHON BY JOHN HEARTY PDF Click link bellow and free register to download
More informationProbability estimates in a scenario tree
101 Chapter 11 Probability estimates in a scenario tree An expert is a person who has made all the mistakes that can be made in a very narrow field. Niels Bohr (1885 1962) Scenario trees require many numbers.
More informationSpeaker recognition using universal background model on YOHO database
Aalborg University Master Thesis project Speaker recognition using universal background model on YOHO database Author: Alexandre Majetniak Supervisor: Zheng-Hua Tan May 31, 2011 The Faculties of Engineering,
More informationSpecification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments
Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,
More informationListening and Speaking Skills of English Language of Adolescents of Government and Private Schools
Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools Dr. Amardeep Kaur Professor, Babe Ke College of Education, Mudki, Ferozepur, Punjab Abstract The present
More informationAxiom 2013 Team Description Paper
Axiom 2013 Team Description Paper Mohammad Ghazanfari, S Omid Shirkhorshidi, Farbod Samsamipour, Hossein Rahmatizadeh Zagheli, Mohammad Mahdavi, Payam Mohajeri, S Abbas Alamolhoda Robotics Scientific Association
More informationIntroduction to the Common European Framework (CEF)
Introduction to the Common European Framework (CEF) The Common European Framework is a common reference for describing language learning, teaching, and assessment. In order to facilitate both teaching
More informationarxiv: v1 [cs.cl] 2 Apr 2017
Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings Junki Matsuo and Mamoru Komachi Graduate School of System Design, Tokyo Metropolitan University, Japan matsuo-junki@ed.tmu.ac.jp,
More informationAGENDA LEARNING THEORIES LEARNING THEORIES. Advanced Learning Theories 2/22/2016
AGENDA Advanced Learning Theories Alejandra J. Magana, Ph.D. admagana@purdue.edu Introduction to Learning Theories Role of Learning Theories and Frameworks Learning Design Research Design Dual Coding Theory
More informationSpeech Translation for Triage of Emergency Phonecalls in Minority Languages
Speech Translation for Triage of Emergency Phonecalls in Minority Languages Udhyakumar Nallasamy, Alan W Black, Tanja Schultz, Robert Frederking Language Technologies Institute Carnegie Mellon University
More informationThe NICT/ATR speech synthesis system for the Blizzard Challenge 2008
The NICT/ATR speech synthesis system for the Blizzard Challenge 2008 Ranniery Maia 1,2, Jinfu Ni 1,2, Shinsuke Sakai 1,2, Tomoki Toda 1,3, Keiichi Tokuda 1,4 Tohru Shimizu 1,2, Satoshi Nakamura 1,2 1 National
More informationAGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS
AGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS 1 CALIFORNIA CONTENT STANDARDS: Chapter 1 ALGEBRA AND WHOLE NUMBERS Algebra and Functions 1.4 Students use algebraic
More information