Recognizing Natural Emotions in Speech, Having Two Classes

Niels Visser
University of Twente
P.O. Box 217, 7500AE Enschede
The Netherlands
n.s.visser@student.utwente.nl

ABSTRACT
Emotion recognition is a useful way to filter important information, especially in situations where statistics are kept to track one's performance during work time, such as for workers in a call centre (for example, a call might be recorded when a customer sounds angry, so that the patience of an employee can be tracked). However, little research has been conducted on how emotion recognition methods behave in practice, compared to the work that has been done using data from actors. This paper focuses on the input data (sound samples), requiring them to be natural instead of acted. It shows that training and testing on the same dataset yields high accuracy rates (around 90% on acted data, and between 60% and 70% on natural data). When cross-trained and tested with the acted set as training data and the natural set as testing data, the results barely differ from randomly labelling the samples. However, if the natural set is used as training data and the acted set as testing data, the SVM and AdaBoost classifiers yield very high accuracy rates (78.6% and 74.2% respectively), while the Bayesian classifier classifies all samples as anger/frustration.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. 15th Twente Student Conference on IT, June 20th, 2011, Enschede, The Netherlands. Copyright 2011, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science.

1. INTRODUCTION
A lot of research has been conducted on emotion recognition in speech during the past few years [1, 11, 13]. Some of these experiments have already shown accuracy rates of more than 90% [4, 7, 12]. However, the datasets used consist mostly of acted data [1], so the same speech-based emotion recognition methods (a combination of a feature set and a classification algorithm, hereafter referred to as ERM) that achieve high accuracy rates on these datasets may achieve much lower accuracy rates in practice. A reason to believe this is that the currently used datasets could contain exaggerated or otherwise unnatural emotions (since the data in these sets is acted), so the speech characteristics of these datasets may differ from non-acted (natural) data.

One of the reasons why datasets containing acted data are used is that human bias does not play a role in the annotation of the sound samples, because an actor is instructed to portray a certain emotion. (Of course, the quality of a dataset with acted data then depends on the skills of the actor, rather than on the skills of the annotators.) How we deal with the problem of human bias is explained in section 3 of this document. Another reason to choose acted data over natural data is that it is easier to obtain high-quality samples: the number of speakers, the length of the samples and the intensity of the emotions, for example, can be defined in advance, rather than having to search for suitable samples, which is a very time-consuming process.

A few experiments have been conducted on emotion recognition in natural speech, and their accuracies vary widely. Depending on the dataset, accuracies between 37% and 94.75% have been found using MFCC and pitch features, Gaussian mixture models as classifiers, and samples from Swedish telephone services and English meetings [9], or using Teager energy operators and MFCC as a feature set, Gaussian mixture models and neural networks as classifiers, and two different datasets: SUSAS and ORI [4]. Remarkable in the latter research is that the two datasets score very differently, even though both contain emotions recorded in non-acted environments. SUSAS (containing three classes: high stress, moderate stress and neutral) scores between 61% and 95% (note that 33% would statistically be expected when the emotion recognition method does not work at all and just randomly classifies the samples). ORI (containing angry, happy, anxious, dysphoric and neutral) yielded a much lower score: the accuracy of the ERM ranged from 37% to 57% on this dataset. This difference could be caused by different levels of arousal between both datasets [4]. It should be noted that both datasets contain more than two emotions.

In this document, we will use two classes (anger/frustration and other) instead of one class per emotion (other will consist of multiple emotions). An example of an application where this setup could be used is deciding whether or not to record a phone call in a call centre. In this case, only the presence or absence of one set of emotions (anger or frustration) may be relevant. This way, the system can detect upset customers and record how their problem is handled by the employee. Another possible application is computer gaming: when the difficulty is too high, the system may reduce it when the player gets frustrated or angry. (Of course, voice interaction with the program is needed in this case.)

There are well-tested ERMs available that yield high accuracy rates on acted data [12, 4, 7]. Tests on natural data show high accuracy rates in some cases [9, 4] and lower ones in others [4]. These studies use more than two (sets of) emotions, while there are applications where only two (sets of) emotions are needed because of their binary nature (like the call centre application). We may assume that the accuracy rate of a classifier, when offered two sets of emotions, differs from a situation where it is offered each emotion separately, because the classifier might generalize features over different emotions or could get confused by the differences in features between emotions. Therefore, we formulate the following question:
- How do classifiers that yield high accuracy rates on a data set containing multiple emotions perform when having to choose from two sets of emotions?
It is clear that, even if a natural data set is used, there is much difference in accuracy rates between data sets [9, 4]. We can say that the data set is an important factor in the success of an ERM. We might wonder how comparable different datasets with the same (sets of) emotions are, and what the best way is to train a classifier. Therefore, we formulate the following questions:
- What will be the difference in accuracy rate of emotion recognition methods when using a dataset consisting of audio samples recorded in non-acted situations as opposed to a dataset in which all the data is acted, in recognizing anger/frustration and other emotions?
- Is it possible to use features from an acted dataset to build a model that is able to recognize the same (sets of) emotions in a natural environment?
In this paper, a corpus containing 50% anger/frustration and 50% other emotions will be tested against a set of three different ERMs. The exact composition of the corpus can be found in section 3.1 of this paper. The samples classified as other may contain any emotion, but are limited to containing the speech of one person at a time.

2. APPROACH
This research focuses mainly on the dataset used for the ERMs, meaning that ERMs that have already been used in emotion recognition will be reused. The main parts of the research are:
- Composing a natural corpus, as there are no natural corpora freely available
- Acquiring an acted corpus for comparison with the natural corpus
- Training on the natural corpus, testing on the natural corpus
- Training on the natural corpus, testing on the acted corpus
- Training on the acted corpus, testing on the natural corpus
- Training on the acted corpus, testing on the acted corpus
(Note that the terms corpus and dataset are used interchangeably in this paper.) This way, we can see how uniform both data sets are (by training and testing on the same data set) and how similar both data sets are (by training on one dataset and testing on the other). We might also be able to draw conclusions on how best to choose a training set (taking into account that it is useful to be able to train on acted data, since acted data is easier to obtain).
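To make the four train/test combinations listed above concrete, the sketch below shows how such a comparison could be organised. This is only an illustrative sketch under assumptions: the feature matrices are random placeholders standing in for real openSMILE features, the corpus sizes are taken from section 3, and a single scikit-learn SVC stands in for the Weka classifiers described in section 3.4.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def placeholder_corpus(n_samples, n_features=100):
    # Placeholder features and labels; in the real experiment these come from
    # openSMILE feature extraction and the annotated corpora (section 3).
    X = rng.normal(size=(n_samples, n_features))
    y = rng.integers(0, 2, size=n_samples)  # 0 = other, 1 = anger/frustration
    return X, y

corpora = {
    "natural": placeholder_corpus(110),  # 110 selected natural samples (section 3.1.2)
    "acted": placeholder_corpus(252),    # 252 balanced Emo-DB samples (section 3.2)
}

clf = SVC()  # any classifier will do here; the paper compares three (section 3.4)
for train_name, (X_tr, y_tr) in corpora.items():
    for test_name, (X_te, y_te) in corpora.items():
        if train_name == test_name:
            # Same-corpus scenario: 10-fold cross-validation (section 3.1.2).
            acc = cross_val_score(clf, X_tr, y_tr, cv=10).mean()
        else:
            acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
        print(f"train on {train_name}, test on {test_name}: accuracy {acc:.1%}")
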
3. METHODS
3.1 Natural Dataset
3.1.1 Sources
Since audio from call centre calls is not freely available, alternate sources that contain similar emotions are used: the two Dutch programs De rijdende rechter and TROS Radar. De rijdende rechter is a program in which quarrels are dealt with in a court-like way, and TROS Radar is a program in which unsatisfied consumers can criticize the companies involved. Since there are two opposing parties in both programs, and the cases dealt with are most often personal, we may expect emotions like anger and frustration to be abundant. Furthermore, the emotions of the participants are not acted. Therefore, only samples from the participants of the programs, excluding the presenters, are used. These points make these two programs an excellent source for our sound samples.

From these programs, a total of 176 audio samples is extracted, each containing the voice of one speaker. Before selection, the sound samples found to be eligible range from 1.18 s to 15.96 s, with an average of 5.42 s and a standard deviation of 3.13 s. The relatively high standard deviation can be explained by the varying content of the samples (sometimes utterances of 2 seconds contain enough information for humans to detect a certain emotion) and by the constraint that only a single voice may be present in a sample (sometimes it is not possible to find multiple samples longer than a few seconds of the same speaker without another voice in them). The criteria on which the samples are selected during the first selection (extraction from the programs) are:
- Only one speaker at a time
- The presence of words (samples containing only non-verbal sounds are excluded)
- Balance between the two classes (for example: if there are 20 suitable samples containing anger or frustration and 100 suitable samples containing other emotions, not all of these 100 samples will be selected)
99 of these samples are extracted from episodes of De rijdende rechter and 77 from TROS Radar.

3.1.2 Annotation
To minimize the personal influence on the labelling of the sound samples, a group of six independent people was asked to rate each of the samples. Before the list of samples was presented to the annotators, it was shuffled to ensure that not all samples of one program were played consecutively; if that were the case, people might rate the samples relative to the program they originate from. The annotators were given two options, anger/frustration and other, when rating each sample (as can be seen in figure 1).

Figure 1: Annotation GUI

The agreement on each sample was calculated by dividing the number of votes for the option with the most votes by the total number of annotators (six). For example: 6/6 means that all annotators agreed on one emotion, while 5/6 means that one annotator disagreed. The results are:

Table 1: Agreements on the sound samples
Agreement:      6/6   5/6   4/6   3/6
# of samples:    68    42    50    16

(Please note that 2/6 equals 4/6, 1/6 equals 5/6 and 0/6 equals 6/6 by definition in this context, since we do not split the table by emotion, and the annotators can choose from only two options.)

Of these samples, the clearest groups (the groups with the least disagreement) are used to form a dataset of at least 100 samples; in this case, these are the groups 6/6 and 5/6. This minimum is set to ensure that the classifier has some base to train and test on. (The classifier trains and tests in folds of 10%, which equals 10 samples in this case, so it uses 90 samples to train on and 10 samples to test on. This is repeated 10 times, so every sample is trained upon 9 times and tested upon 1 time.) The composition of the corpus is as follows:

Table 2: Composition of the corpus
Emotion:          Anger/frustration       Other
Source \ group      6/6       5/6        6/6     5/6
Rechter              26        15         15      10
Radar                  6         7         21      10
Total:               32        22         36      20
                          54                  56

As can be seen above, the corpus consists of 54 samples labelled as anger/frustration and 56 samples labelled as other. The difference between the target ratio anger/frustration : other = 1 : 1 and the actual ratio of the corpus (1 : 0.964) is negligible.
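The agreement measure used here is simple majority agreement. The snippet below is a small illustration (not the annotation tooling actually used) of how the agreement scores and the counts in Table 1 could be computed from raw votes; the votes shown are made up.

from collections import Counter

# One vote per annotator per sample: "A" = anger/frustration, "O" = other.
# These votes are made-up examples, not the real annotations.
votes_per_sample = [
    ["A", "A", "A", "A", "A", "A"],  # 6/6 agreement
    ["A", "A", "O", "A", "A", "A"],  # 5/6 agreement
    ["O", "A", "O", "A", "O", "A"],  # 3/6 agreement
]

def agreement(votes):
    # Number of votes for the majority option over the total number of annotators.
    majority_count = Counter(votes).most_common(1)[0][1]
    return f"{majority_count}/{len(votes)}"

print(Counter(agreement(v) for v in votes_per_sample))
# Counter({'6/6': 1, '5/6': 1, '3/6': 1})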

3.2 Acted Dataset
As acted dataset the German Emo-DB is chosen, because of its strongly acted nature (10 speakers, who each speak 10 sentences in 7 different emotions) and its easy accessibility (it is available for download at http://pascal.kgw.tu-berlin.de/emodb/). This dataset contains 127 samples of anger, and about 40 to 70 samples of each other emotion (anxiety/fear, disgust, happiness, boredom, neutral and sadness). To create a balanced corpus containing anger and other, we randomly chose 21 samples from each of the other six emotions and randomly deleted one of the samples containing anger, so that we obtain a symmetrical corpus of 252 samples.

3.3 Feature Extraction
As a feature set, we combine Mel-frequency cepstral coefficients (MFCC), voice quality, intensity (loudness), pitch (F0), and spectral (energy) features. MFCC is chosen since it has proven itself as a decent feature set when it comes to emotion recognition [9, 8]; the other features are chosen since they have been identified as related to the expression of emotional states [10]. The features are extracted using the open-source program openSMILE, using the configuration previously used by the INTERSPEECH 2010 challenge, by Florian Eyben. This configuration is packed with the openSMILE download, available at opensmile.sourceforge.net. openSMILE was also used as the official feature extractor in the INTERSPEECH 2009 challenge [3].
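Extracting these features with openSMILE is done by pointing the SMILExtract binary at a configuration file and a wave file. The sketch below shows how a whole corpus could be processed in a batch; it is a hedged illustration, not the exact commands used in this paper: the configuration file name (IS10_paraling.conf) and the binary location are assumptions that depend on the openSMILE version installed.

import subprocess
from pathlib import Path

# Assumed names: check the config/ directory of your openSMILE installation for
# the INTERSPEECH 2010 configuration file actually shipped with your version.
SMILEXTRACT = "SMILExtract"
CONFIG = "config/IS10_paraling.conf"

def extract_features(wav_dir, output_file):
    # Runs openSMILE once per sample; each run appends one feature vector
    # (one row) for the input file to the output file.
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        subprocess.run(
            [SMILEXTRACT, "-C", CONFIG, "-I", str(wav), "-O", output_file],
            check=True,
        )

extract_features("corpus/natural", "natural_features.arff")
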
3.4 Classification
The classification algorithms used in this paper are:
- SVM (SMO from Weka) [6]
- A Bayesian classifier (Naïve Bayes from Weka) [2]
- AdaBoost (AdaBoostM1 from Weka) [5]
These classifiers have been chosen because of their earlier use in emotion recognition. From [5] we can see that SVM, Naïve Bayes and AdaBoost reach high accuracy rates compared to the rest, which is the reason these three are chosen. The tool Weka is used for running the classifiers, with the default parameters for each classification algorithm as specified in Weka. Before we train and test on the different data sets, all values in both sets are normalized using z-scores, so that the mean of every feature is zero and its standard deviation equals one.
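As an illustration of this setup, the sketch below uses scikit-learn stand-ins for the Weka classifiers (SVC for SMO, GaussianNB for Naïve Bayes, AdaBoostClassifier for AdaBoostM1), applies z-score normalization and runs the 10-fold cross-validation described in section 3.1.2. The feature matrix is a random placeholder, and the stand-ins' default parameters are not identical to Weka's defaults.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder features and labels; in the real experiment these are the
# openSMILE features (section 3.3) and the annotated classes (section 3.1.2).
rng = np.random.default_rng(0)
X = rng.normal(size=(110, 100))
y = rng.integers(0, 2, size=110)  # 0 = other, 1 = anger/frustration

classifiers = {
    "SVM (stand-in for Weka's SMO)": SVC(),
    "Bayesian (stand-in for Weka's Naive Bayes)": GaussianNB(),
    "AdaBoost (stand-in for Weka's AdaBoostM1)": AdaBoostClassifier(),
}

for name, clf in classifiers.items():
    # z-score normalization (zero mean, unit variance per feature), then 10-fold CV.
    model = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: mean accuracy {scores.mean():.1%}")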

4. RESULTS
We might expect that training and testing on the same dataset yields the highest accuracy rates, because samples with the same emotions in the same data set are likely to be most alike, since they originate from the same (type of) source. The results, however, show otherwise:

Table 3: Summary of the results
Training set    Classifier    Test set: Natural    Test set: Acted
Natural         SVM           71.8%                78.6%
Natural         Bayesian      70.0%                50.0%
Natural         AdaBoost      61.8%                74.2%
Acted           SVM           54.5%                92.4%
Acted           Bayesian      55.5%                85.7%
Acted           AdaBoost      56.3%                90.1%

As can be seen in table 3, accuracy rates are indeed high when trained and tested on the same dataset. However, there are two remarkable results: SVM and AdaBoost yield an even higher accuracy when trained on the natural set and tested on the acted set than when trained and tested on the natural set. This is remarkable because the two datasets originate from entirely different sources, and because testing in the opposite direction (training on acted data, testing on natural data) yields accuracies only negligibly higher than chance. Detailed results are available in appendix A.

Therefore, we take a closer look at the classifications of these two classifiers. The dataset that is tested on can be split into three different sets of classes: emotions, speakers and spoken sentences. It is possible that one set can easily be recognized by a classifier, while others (as would have been expected based on the results when testing the other way around) are more difficult to recognize. There are a few classes that differ notably from the rest. The results show that boredom, neutral, sadness and anger are best recognized (minimum 81.0%, maximum 100%) by both classifiers. Worst are anxiety/fear (SVM, 33.3%) and happiness (AdaBoost, 23.8%). Anger is most likely the main cause of the high overall accuracy rate, since 81.7% of the anger samples have been correctly classified. There is only one speaker for whom both classifiers yield an accuracy higher than 80%, namely 83.3% (SVM) and 87.5% (AdaBoost). Since these values are close to the overall accuracy rates of both classifiers, the results of this speaker are most likely irrelevant. The same applies to the different sentences: there are two sentences that are classified correctly more than 80% of the time. Detailed results can be found in appendix B.
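A hedged sketch of this kind of breakdown is shown below: given the per-sample correctness of the binary decision and the emotion, speaker and sentence metadata that Emo-DB provides for each file, the accuracy per group follows from a simple grouping. The sample records here are illustrative placeholders, not the paper's actual outputs.

from collections import defaultdict

# Illustrative placeholder records: metadata per acted test sample plus whether
# the binary prediction (anger/frustration vs. other) was correct for it.
samples = [
    {"emotion": "anger", "speaker": "03", "sentence": "a01", "correct": True},
    {"emotion": "anger", "speaker": "08", "sentence": "a02", "correct": True},
    {"emotion": "happiness", "speaker": "03", "sentence": "a01", "correct": False},
    {"emotion": "boredom", "speaker": "09", "sentence": "b01", "correct": True},
]

def accuracy_by(key):
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        total[s[key]] += 1
        correct[s[key]] += s["correct"]
    return {group: correct[group] / total[group] for group in total}

for key in ("emotion", "speaker", "sentence"):
    print(key, accuracy_by(key))
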
5. CONCLUSIONS AND DISCUSSION
First, we look at the results when classifiers are trained and tested on the same dataset. In this case, accuracy rates reach around 90% when an acted dataset is used, whereas rates vary from 60% to around 70% when a natural dataset is used. This difference may be explained by an acted dataset probably being more consistent than a natural dataset, meaning that samples annotated as containing the same emotion are very much alike, in contrast to samples in a natural dataset. Looking at the achievements of the individual classifiers, none performs much better than the rest when the two situations are considered separately. However, SVM performs best in both situations, while the classifier that performs worst is not the same in both situations. We may conclude that there is a slight difference between the achievements of the classifiers, with SVM being the most suitable for recognizing emotions with the current feature set when training and testing on the same dataset.

Second, we look at the results when one dataset is used for training and another for testing. It shows that the achievements of the classifiers drop radically: there is not much difference between randomly labelling the samples and letting one of the classifiers do the work. Since the conclusions stated above show that both datasets do contain information about the two classes, there are a few possible explanations. Different definitions of the emotions may have been used when annotating the datasets. This may be a fault, but it can also be an indication that there are multiple definitions of the same emotion (perhaps even cultural differences, since a Dutch and a German dataset are used). If it is the latter, it may be interesting for future research to determine how significant these differences are, and how many different classes of the same emotion exist when looking at different cultures.

There are, however, two remarkable results: SVM and AdaBoost both score very high when training on the natural data set and testing on the acted set. When looking more closely at how the different samples are classified by these two classifiers, there are no individual features that can be identified as the cause of these remarkably high accuracy rates. (In this situation we might expect that a few classes or other aspects are rated abnormally high (between 95 and 100 per cent), and that the scores of the remaining classes do not differ much from random labelling (around 55%, as seen when training on the acted set and testing on the natural set).) But this is not the case; instead, most of the classes are rated with about the same accuracy as the overall accuracy of both classifiers. Therefore, we may conclude that in this case and similar cases (meaning when training and testing on datasets originating from the same types of source) natural data are good input for building a model that needs to classify acted data.

During the introduction, three questions were formulated. The first question was: How do classifiers that yield high accuracy rates on a data set containing multiple emotions perform when having to choose from two sets of emotions? We may say that the accuracy of the classifiers depends heavily on the data set being used. Earlier research showed that accuracy rates of up to 94.75% can be reached. However, [5] shows that, using the three classifiers SVM, Naïve Bayes and AdaBoost, accuracy rates of respectively 71.42%, 66.67% and 71.42% have been reached. That research uses prosodic and acoustic features and an acted dataset with multiple emotions. From this, we may conclude that much higher accuracy rates can be obtained when using the two sets anger/frustration and other instead of delight, flow, confusion and frustration (as used by [5]).

The second question is: What will be the difference in accuracy rate of emotion recognition methods when using a dataset consisting of audio samples recorded in non-acted situations as opposed to a dataset in which all the data is acted, in recognizing anger/frustration and other emotions? We may answer this question by stating that there is a significant difference in accuracy rate between data sets with acted data and data sets with natural data, namely that data sets with acted data are much more consistent (meaning that samples labelled with the same emotion share more mutual characteristics). For this reason, the accuracy rate on a set of acted data is higher than that on a natural data set.

The third question is: Is it possible to use features from an acted dataset to build a model that is able to recognize the same (sets of) emotions in a natural environment? As stated before in this conclusion, it is likely that there is a difference in the definition of an emotion when annotating different data sets. In this case, the natural set differed from the acted set in such a way that it is not feasible to train on an acted set and use the obtained model in a natural environment. However, from the results obtained when training on the natural data set and testing on the acted data set, we can see that it is possible to obtain very high accuracies (even higher than when trained and tested on a natural set) when training on a natural set and testing on an acted set. Therefore, we may expect that when acted sources are chosen more specifically to serve a natural goal, it is possible to obtain higher accuracy rates, since the results show that there are mutual characteristics between the acted and natural data sets.

6. FUTURE WORK
6.1 Other sets of classes
In this research, anger and frustration are grouped together, as are the remaining emotions. The results have shown that, using this binary partition, very high accuracy rates can be obtained. But of course there are applications that need other classes as input, and it might be that anger and frustration are very distinct from other emotions, which makes it easier for a classifier to distinguish between the two classes. It might be interesting to see whether the same applies to other groups of emotions.

6.2 Cultural definition differences
Since a Dutch and a German data set are used in this research, it is possible that cultural differences in the definition of emotions play a role in the annotation of both data sets. These differences can play an important role when ERMs are used and sold internationally. Therefore, it might be interesting to look at how significant these differences are, if they are present at all. This can be done by letting groups of different nationalities annotate the same set of data.

6.3 Use other natural or acted sources
The results obtained from training on acted data and testing on natural data versus training on natural data and testing on acted data show remarkable differences. Since the cause of these differences is not yet entirely clear (there are some mutual characteristics), it can be interesting to see whether the same differences appear when using another natural data set and the same acted data set, or the same natural data set and another acted data set.

REFERENCES
[1] Ayadi, M.E., Kamel, M.S., Karray, F., Survey on speech emotion recognition: Features, classification schemes, and databases, Pattern Recognition 44, pp. 572–587 (2011)
[2] Barra-Chicote, R., Fernandez, F., Lutfi, S.L., Lucas-Cuesta, J.M., Macias-Guarasa, J., Montero, J.M., San-Segundo, R., Pardo, J.M., Acoustic Emotion Recognition using Dynamic Bayesian Networks and Multi-Space Distributions, 10th Annual Conference of the International Speech Communication Association, pp. 336–339 (2009)
[3] Eyben, F., Wöllmer, M., Schuller, B., openSMILE – The Munich Versatile and Fast Open-Source Audio Feature Extractor, Proceedings of the 18th International Conference on Multimedia 2010, pp. 1459–1462 (2010)
[4] He, L., Lech, M., Maddage, N., Memon, S., Emotion Recognition in Spontaneous Speech within Work and Family Environments, Proceedings of the 3rd International Conference on Bioinformatics and Biomedical Engineering, pp. 1–4 (2009)
[5] Hoque, M.E., Yeasin, M., Louwerse, M.M., Robust Recognition of Emotion from Speech, Lecture Notes in Computer Science Volume 4133/2006, pp. 42–53 (2006)
[6] Keerthi, S.S., Shevade, S.K., Bhattacharyya, C., Murthy, K.R.K., Improvements to Platt's SMO Algorithm for SVM Classifier Design, Neural Computation, pp. 637–649 (2001)
[7] Lin, Y.L., Wei, G., Speech Emotion Recognition based on HMM and SVM, 2005 International Conference on Machine Learning and Cybernetics (ICMLC), pp. 4898–4901 (2005)
[8] Luengo, I., Navas, E., Hernáez, I., Sánchez, J., Automatic Emotion Recognition using Prosodic Parameters, 9th European Conference on Speech Communication and Technology, pp. 493–496 (2005)
[9] Neiberg, D., Elenius, K., Laskowski, K., Emotion Recognition in Spontaneous Speech Using GMMs, INTERSPEECH 2006 – ICSLP, pp. 809–812 (2006)
[10] Nwe, T.L., Foo, S.W., De Silva, L.C., Speech emotion recognition using hidden Markov models, Speech Communication 41, pp. 603–623 (2003)
[11] Sebe, N., Cohen, I., Gevers, T., Huang, T.S., Multimodal Approaches for Emotion Recognition: A Survey, Internet Imaging VI, Proceedings of the SPIE, Volume 5670, pp. 56–67 (2004)
[12] Wu, S., Falk, T.H., Chan, W.Y., Automatic speech emotion recognition using modulation spectral features, Speech Communication Volume 53, Issue 5, pp. 768–785 (2010)
[13] Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S., A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence 31, pp. 39–58 (2009)

APPENDIX A: OVERVIEW OF THE RESULTS OF ALL TESTED SCENARIOS
In the confusion matrices below, A = anger/frustration and B = other; the rows represent the classes as annotated by the annotators and the columns the classes as assigned by the classifiers. The accuracies match those in Table 3.

Scenario 1: Training: Emo-DB, Testing: Emo-DB
SVM (accuracy 92.4%)
        A      B
A     119      7
B      12    114
Bayesian (accuracy 85.7%)
        A      B
A     118      8
B      28     98
AdaBoost (accuracy 90.1%)
        A      B
A     115     11
B      14    112

Scenario 2: Training: Natural, Testing: Natural
SVM (accuracy 71.8%)
        A      B
A      37     17
B      14     42
Bayesian (accuracy 70.0%)
        A      B
A      35     19
B      14     42
AdaBoost (accuracy 61.8%)
        A      B
A      35     19
B      23     33

Scenario 3: Training: Emo-DB, Testing: Natural
SVM (accuracy 54.5%)
        A      B
A      24     30
B      20     36
Bayesian (accuracy 55.5%)
        A      B
A      16     38
B      11     45
AdaBoost (accuracy 56.3%)
        A      B
A      25     29
B      19     37

Scenario 4: Training: Natural, Testing: Emo-DB
SVM (accuracy 78.6%)
        A      B
A     103     23
B      31     95
Bayesian (accuracy 50.0%)
        A      B
A     126      0
B     126      0
AdaBoost (accuracy 74.2%)
        A      B
A     103     23
B      42     84
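For reference, the overall accuracies above follow directly from these matrices as (correctly classified A + correctly classified B) divided by the total number of samples. A small worked example using the Scenario 1 SVM matrix:

import numpy as np

# Rows: annotated class, columns: predicted class (A = anger/frustration, B = other).
# Values taken from the Scenario 1 SVM matrix above.
confusion = np.array([[119, 7],
                      [12, 114]])

accuracy = np.trace(confusion) / confusion.sum()       # (119 + 114) / 252 = 0.9246
recall_per_class = np.diag(confusion) / confusion.sum(axis=1)
print(accuracy, recall_per_class)                      # reported as 92.4% in Table 3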

APPENDIX B: COMPARISON OF SVM AND ADABOOSTM1 WHEN TRAINING ON THE NATURAL SET AND TESTING ON THE ACTED SET
The tables should be read as follows: correctly classified : wrongly classified = percentage of correctly classified samples. The first column of results belongs to SVM (overall accuracy 78.6%), the second to AdaBoostM1 (overall accuracy 74.2%).

Results by emotion
Emotion          SVM (78.6%)        AdaBoost (74.2%)
Anxiety/Fear     7:14 = 33.3%       12:9 = 57.1%
Disgust          18:3 = 85.7%       13:8 = 61.9%
Happiness        9:12 = 42.9%       5:16 = 23.8%
Boredom          21:0 = 100%        17:4 = 81.0%
Neutral          21:0 = 100%        17:4 = 81.0%
Sadness          19:2 = 90.5%       20:1 = 95.2%
Anger            103:23 = 81.7%     103:23 = 81.7%

Results by speaker
Speaker          SVM (78.6%)        AdaBoost (74.2%)
03 (male)        24:5 = 82.8%       22:7 = 75.9%
08 (female)      23:2 = 92.0%       16:9 = 64.0%
09 (female)      20:4 = 83.3%       21:3 = 87.5%
10 (male)        9:4 = 69.2%        8:5 = 61.5%
11 (male)        16:8 = 66.7%       17:7 = 70.8%
12 (male)        11:5 = 68.8%       13:3 = 81.3%
13 (female)      26:4 = 86.7%       22:8 = 73.3%
14 (female)      24:7 = 77.4%       23:8 = 74.2%
15 (male)        21:7 = 75.0%       22:6 = 78.6%
16 (female)      24:8 = 75.0%       23:9 = 71.9%

Results by sentence
Sentence         SVM (78.6%)        AdaBoost (74.2%)
A01              16:3 = 84.2%       16:3 = 84.2%
A02              26:5 = 83.9%       21:10 = 67.7%
A04              20:4 = 83.3%       21:3 = 87.5%
A05              22:4 = 84.6%       19:7 = 73.1%
A07              17:8 = 68.0%       16:9 = 64.0%
B01              19:6 = 76.0%       19:6 = 76.0%
B02              16:7 = 69.6%       19:4 = 82.6%
B03              21:9 = 70.0%       21:9 = 70.0%
B09              21:4 = 84.0%       16:9 = 64.0%
B10              20:4 = 83.3%       19:5 = 79.2%