Agreement and Disagreement Utterance Detection in Conversational Speech by Extracting and Integrating Local Features


INTERSPEECH 2015

Atsushi Ando, Taichi Asami, Manabu Okamoto, Hirokazu Masataki, Sumitaka Sakauchi
NTT Media Intelligence Laboratories, NTT Corporation, Japan
{ando.atsushi,asami.taichi,okamoto.manabu,masataki.hirokazu,sakauchi.sumitaka}@lab.ntt.co.jp

Abstract

This paper presents a novel framework for automatically detecting agreement and disagreement utterances in natural conversation. Such a function is critical for conversation understanding tasks such as meeting summarization. One of the difficulties of agreement and disagreement utterance detection in natural conversation is ambiguity in the utterance unit. Utterances are usually segmented by short pauses; in conversations, however, multiple sentences are often uttered in one breath. Such utterances exhibit the characteristics of agreement and disagreement only in some parts, not over the whole utterance. This makes conventional methods problematic, since they assume each utterance is just one sentence and extract global features from the whole utterance. To deal with this problem, we propose a detection framework that utilizes only local prosodic/lexical features. The local features are extracted from short windows that cover just a few words. Posteriors of agreement, disagreement and others are estimated window by window and integrated to yield a final decision. Experiments on free discussion speech show that the proposed method, through its use of local features, offers significantly higher accuracy in detecting agreement and disagreement utterances.

Index Terms: agreement and disagreement utterance detection, paralinguistics, conversational speech, local features

1. Introduction

One of the important applications of automatic speech recognition is extracting the structure of a conversation. The results should help the participants recall what they discussed.
However, conversations contain many extraneous utterances, and it is inefficient to show each and every one. Hence the automatic detection of decision-making utterances such as agreement and disagreement is important; it is also useful for conversation understanding tasks like automatic meeting summarization [1]. In this paper, our purpose is the automatic detection of agreement and disagreement utterances in natural conversation.

Detection of agreement and disagreement utterances can be regarded as the task of dividing continuous speech into utterances and classifying each into one of three classes: agreement, disagreement and others. Several conventional methods address this task. Hillard et al. [2] proposed dividing speech into utterances based on short pauses and classifying utterances using lexical and prosodic features of each utterance. Galley et al. [3] introduced adjacency pairs into this method to consider inter-utterance relations. Hahn et al. [4] used a contrast classifier to deal with the label imbalance problem. Wang et al. [5, 6] used CRFs to model sentence-to-sentence context, and Bousmalis et al. [7] utilized social attitudes such as head actions and body postures.

One difficulty of this task is the ambiguity of the utterance units. Conventional methods are usually based on utterances segmented by pause length. However, participants often speak continuously, and multiple sentences are uttered in one breath. Such utterances exhibit agreement or disagreement in parts of the utterance, not the entire utterance. This renders conventional methods questionable, since they assume that each utterance is just one sentence and that the characteristics of agreement and disagreement appear over the whole utterance; accordingly, they use global features extracted from the whole utterance. To solve this problem, Germesin et al. [8] proposed an approach that attempts to improve utterance segmentation.
It identifies criteria that can split speech into utterances consisting of just one sentence each, using the result of dialog label estimation based on lexical criteria. One good point of this approach is that it allows conventional methods to be used for classifying each utterance. However, identifying the ideal criteria is problematic, because conversational speech often does not follow strict grammar rules, and speech recognition errors directly trigger utterance-split errors.

Hence we take another approach to this problem: we capture the local characteristics of an utterance and utilize them to detect agreement and disagreement utterances. This approach has the advantage that conventional pause-based speech splitting can still be used. However, existing studies do not confirm whether agreement and disagreement can be detected from local features, or how to integrate local characteristics into an utterance-level result.

In this paper, we propose a new agreement and disagreement utterance detection framework based on local prosodic/lexical features. Posteriors of agreement, disagreement and others are estimated for each short window that covers several words. The proposed method employs prosodic features, lexical features, and combinations of both as local features to calculate the posteriors. The final decision for each utterance is obtained by integrating these posteriors. Experiments on free discussion speech show that the proposed method significantly improves the detection accuracy of agreement and disagreement utterances, which indicates that local changes in both prosody and lexicon are effective for detecting such utterances.

2. Dataset

In this section, we describe the dataset used in this research and the ratio of utterances that exhibit agreement and disagreement in only some parts of each utterance.
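The utterances examined below are obtained by pause-based segmentation: speech is split wherever the gap between words exceeds a threshold (0.5 seconds in this dataset, following the spurt of [2]). As a minimal sketch of that kind of splitter — the timestamps and helper name are hypothetical, not from the paper:

```python
# Hedged sketch of pause-based utterance segmentation ("spurts"):
# consecutive words are grouped until the silence gap to the previous
# word exceeds the pause threshold (0.5 s in the paper).

PAUSE_THRESHOLD = 0.5  # seconds

def split_into_spurts(words, threshold=PAUSE_THRESHOLD):
    """words: list of (token, start_sec, end_sec), sorted by start time."""
    spurts = []
    current = []
    prev_end = None
    for token, start, end in words:
        # Start a new spurt when the silence gap exceeds the threshold.
        if prev_end is not None and start - prev_end > threshold:
            spurts.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        spurts.append(current)
    return spurts

# Toy word timestamps (hypothetical): the 0.8 s gap before "but" splits them.
words = [("yes", 0.0, 0.2), ("I", 0.3, 0.4), ("agree", 0.45, 0.8),
         ("but", 1.6, 1.8), ("still", 1.85, 2.1)]
print(split_into_spurts(words))  # [['yes', 'I', 'agree'], ['but', 'still']]
```

A real system would take these timestamps from a forced alignment or recognizer output rather than hand-written values.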
We plan to apply our research to Japanese conversation, but there is no Japanese conversational speech dataset labeled with agreement, disagreement and others. Thus we newly collected Japanese conversational speech and manually labeled the utterances.

Copyright 2015 ISCA. September 6-10, 2015, Dresden, Germany.

Simulated conversations were held and the participants' speech was recorded. An overview of the simulated conversations is as follows. Each conversation had two or four participants. A subject for discussion was given, and participants selected their positions, either approval or disapproval, so as to balance the positions. All participants argued their positions alternately; after that, a 10-minute discussion and a 5-minute conclusion followed. We recorded the speech of the discussion and conclusion parts. The participants were four males and four females. There was no interference during recording, because the conversations were conducted over a video conference system with each participant in a separate soundproof booth. Slightly over twenty hours of conversational speech were recorded. After speech segmentation, we were left with 2987 utterances occupying 7.2 hours. We define an utterance as a period of speech containing no pause longer than 0.5 seconds, the same as a spurt in [2].

Two types of labels were assigned: utterance labels and interval labels. An utterance label was given to each utterance; an interval label was given to each interval in which labelers perceived agreement or disagreement characteristics. Both label types have three classes: agreement, disagreement and others. Utterances and intervals labeled neither agreement nor disagreement were regarded as others. However, utterances consisting of only a single word were regarded as backchannels and were not labeled. There were three labelers, none of whom participated in any conversation. We call the utterance and interval labels assigned in common by two or more labelers the majority utterance labels and majority interval labels. We investigate the characteristics of the labels in this dataset below.

Figure 1: Frequency distribution of the agreement and disagreement utterances whose interval labels cover some part of each utterance.
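The inter-labeler consistency reported next is measured with the Kappa coefficient [9]. A minimal sketch of the two-labeler (Cohen's) form on toy labels — the study itself has three labelers, so the reported figures are averages over labeler pairs:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Expected agreement if both labelers assigned classes independently
    # according to their own marginal distributions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy labels for six utterances (illustrative only, not the actual dataset).
a = ["agree", "agree", "other", "disagree", "other", "other"]
b = ["agree", "other", "other", "disagree", "other", "agree"]
print(round(cohens_kappa(a, b), 3))  # 0.455
```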
First, we use the Kappa coefficient [9] to measure the consistency of labels between labelers. The average Kappa coefficients between labelers were 0.47 for utterance labels and 0.48 for interval labels, while those between the majority labels and the three individual labelers were 0.71 for utterance labels and 0.64 for interval labels. Using the majority labels as the correct labels thus seems more valid than using the labels of any one of the three labelers in isolation; hence the majority labels are used hereafter.

Second, we report the utterance rates of agreement, disagreement, backchannel, and others: 9%, 7%, 26%, and 58%, respectively. These rates are similar to those of the datasets used in previous works [2-4], which indicates that this dataset has the same characteristics and is thus reliable.

Finally, we determine the rate of utterances exhibiting agreement and disagreement in only some parts. Figure 1 shows the frequency distribution of the agreement and disagreement utterances whose interval labels cover some part of each utterance. The horizontal axis plots the rate of interval label length, calculated as the sum of the interval label lengths divided by the whole utterance length for each agreement and disagreement utterance. For example, the rightmost value, 100%, means that the interval label covers the whole utterance. Figure 1 shows that in less than half of the agreement and disagreement utterances do the interval labels cover the whole utterance, and that in over 40% of agreement and over 30% of disagreement utterances the interval labels occupy less than half of the utterance length. These results demonstrate that it is important to handle utterances that exhibit agreement or disagreement in only some parts.

3. Proposed method

We propose a new agreement and disagreement utterance detection method based on local characteristics.

Figure 2: Overview of the proposed method.
Our method consists of two steps: a local class estimation step and an utterance class estimation step. In the first step, we estimate a class in each local window using local features. In the second step, we integrate all the first-step results within an utterance to estimate the utterance class. Figure 2 shows the overview of the proposed method.

To use local features effectively, it is important to set the local window length appropriately so as to estimate short-term agreement and disagreement. Long windows risk containing more than two classes in the interval, which decreases the accuracy of agreement and disagreement estimation. Note that even humans need a certain length of speech to judge agreement and disagreement, so very short windows are also inappropriate. Taking this into consideration, the proposed method uses windows of several word intervals.

3.1. Overview

Detecting agreement and disagreement utterances is the task of estimating the class L of each utterance S:

\hat{L} = \arg\max_L P(L | S)    (1)

where \hat{L} is the estimated class, taken from the three classes agreement, disagreement, and others.

We assume that agreement and disagreement appear over several continuous word intervals. We represent the subintervals of the utterance S as s_1, ..., s_K; these correspond to the intervals of the words w_1, ..., w_K included in the utterance, where K is the total number of words in the utterance. In the local class estimation step, local features f_k are extracted from a short window covering {s_{k-N}, ..., s_k, ..., s_{k+N}} and used to obtain the estimated local class \hat{l}_k and the local class posteriors:

\hat{l}_k = \arg\max_l P(l | f_k)    (2)

where N is a parameter controlling the local window length. The set of local classes is the same as the set of utterance classes. The posteriors on the right side of Eq. (2) are trained from local features and correct local classes. Correct local classes are made from the interval labels: the class that is dominant in each word interval is regarded as the correct local class. An example of making correct local classes is shown in Figure 3.

Figure 3: An example of making correct local classes from manually annotated interval labels and words.

In the utterance class estimation step, we obtain the estimated utterance class \hat{L} by integrating all the estimated local classes and local class posteriors. We represent the local class posteriors obtained in the k-th short window as p_k, which includes the posteriors of agreement, disagreement and others. All the estimated local classes and local class posteriors are represented as \hat{l} = {\hat{l}_1, ..., \hat{l}_K} and P = {p_1, ..., p_K}:

\hat{L} = \arg\max_L P(L | \Phi(\hat{l}, P))    (3)

where \Phi(\hat{l}, P) means taking the following statistics of these values: the total number of local windows, the occurrences of each class, and the mean and standard deviation of the posteriors of each class. The posteriors in Eq. (3) are trained from manually annotated utterance labels and local class estimation results.

3.2. Features

We use both prosodic and lexical features as local features for the detection of agreement and disagreement.

Prosodic features are obtained by concatenating prosodic statistics calculated in each word interval. We use them because they express prosodic characteristics in greater detail than features comprising prosodic statistics calculated over a combination of several intervals. The amount of training data available for local label estimation is larger than that for utterance label estimation, which enables us to use such detailed prosodic features. The prosodic statistics calculated from each word interval are shown in Table 1; they are used in analyses of emphasized speech [10, 11], which are related to our research. We use not only word-unit statistics but also phoneme-unit statistics, because it has been shown that agreement and disagreement are delineated by changes in phoneme intervals in human-machine voice interaction [12], and those changes are also likely to be present in human-to-human interaction such as conversations. F0 statistics are extracted from vowel intervals only, and F0 values are normalized per speaker and per conversation in order to regularize across speakers.

Table 1: Statistics of prosodic values calculated per word interval.

  Type      | Unit          | Feature
  F0        | word          | mean, std, min, max, slope, range of F0 over the word interval
            | first phoneme | mean, std, min, max, slope, range of F0 over the first phoneme interval of the word
            | last phoneme  | mean, std, min, max, slope, range of F0 over the last phoneme interval of the word
  Intensity | word          | mean, std, min, max, slope, range of intensity over the word interval
            | first phoneme | mean, std, min, max, slope, range of intensity over the first phoneme interval of the word
            | last phoneme  | mean, std, min, max, slope, range of intensity over the last phoneme interval of the word
  Duration  | word          | duration, speech rate of the word
            | first phoneme | duration of the first phoneme of the word
            | last phoneme  | duration of the last phoneme of the word
  Pause     | word          | pause between the word and the previous word

Lexical features used in this method are similar to those in [2].
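The integration statistics \Phi of Eq. (3) can be sketched concretely. This is a hedged illustration, not the authors' code: the utterance-level feature vector is built from the window count, the occurrences of each locally most probable class, and the per-class mean and standard deviation of the posteriors:

```python
import statistics

CLASSES = ["agree", "disagree", "other"]

def integrate(local_posteriors):
    """Build the utterance-level feature vector Phi from window-level
    posteriors: number of windows, per-class occurrence counts of the
    locally most probable class, and mean/std of each class posterior."""
    n = len(local_posteriors)
    local_classes = [max(p, key=p.get) for p in local_posteriors]
    features = [float(n)]
    for c in CLASSES:
        features.append(float(local_classes.count(c)))
    for c in CLASSES:
        values = [p[c] for p in local_posteriors]
        features.append(statistics.mean(values))
        features.append(statistics.pstdev(values))
    return features

# Toy posteriors for a 3-window utterance (a real system would obtain
# these from the trained local classifier).
posts = [{"agree": 0.7, "disagree": 0.1, "other": 0.2},
         {"agree": 0.6, "disagree": 0.1, "other": 0.3},
         {"agree": 0.2, "disagree": 0.1, "other": 0.7}]
phi = integrate(posts)
print(phi)  # 1 window count + 3 occurrence counts + 3*(mean, std) = 10 features
```

The resulting vector would then be fed to the utterance-level classifier that models P(L | \Phi).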
These lexical features consist of the number of agreement and disagreement keywords, and the perplexities and posteriors of 2-gram LMs trained on each of the three label classes. Agreement and disagreement keywords are words that appear more than five times and whose frequency of assignment to the agreement or disagreement class, divided by their total frequency, is greater than 0.6. These features are calculated from the word sequence in the local feature window. The same training set used to train the estimators in Eqs. (2) and (3) is employed to obtain the keywords and 2-gram LMs.

4. Experiments

To evaluate the proposed method, we conducted experiments on detecting agreement and disagreement utterances in conversations, using 10-fold cross validation. A total of 37111 local class labels were present in the dataset, of which 1323 (3.6%) were agreement and 2918 (7.9%) were disagreement. The same training set was used to train the posteriors in Eq. (2) and Eq. (3): the local label estimator was trained first, and the integrator was then trained using the local class estimation results and the utterance labels. In both training steps, oversampling [13] was used because of the imbalance in the training classes. We used hand transcripts to obtain words and word intervals, but these are usually not available in practice; using speech recognition output is future work. Neural networks were used in both the local class estimation step and the utterance class estimation step, with two hidden layers of 256 nodes and one hidden layer of 32 nodes, respectively. Frame-level F0 and intensity values were extracted by openSMILE [14] with a 50 ms frame length and a 10 ms frame shift. We use [2], which relies on global features, as the baseline method for detecting agreement and disagreement utterances.

The detection accuracies of agreement and disagreement utterances and the estimation accuracies of local labels are shown in Table 2 and Table 3. Pros, Lex, and Pros+Lex denote the results achieved using only prosodic features, only lexical features, and both prosodic and lexical features, respectively. Window length is the number of words covered by a local window. Bold marks the maximum value in each column.

Table 2: Detection accuracies of agreement and disagreement utterances.

  window length                  | Pros (Total/Agree/Disagr.) | Lex (Total/Agree/Disagr.) | Pros+Lex (Total/Agree/Disagr.)
  Baseline                       | 49.4 / 71.2 / 74.3         | 53.8 / 76.0 / 73.7        | 52.7 / 76.6 / 71.9
  Proposed, 1 (self only)        | 50.5 / 74.3 / 72.5         | -                         | -
  Proposed, 3 (self ± 1 word)    | 57.6 / 77.5 / 77.1         | 77.5 / 88.0 / 89.3        | 73.7 / 84.9 / 88.0
  Proposed, 5 (self ± 2 words)   | 58.2 / 75.1 / 80.7         | 77.4 / 87.8 / 89.5        | 68.4 / 82.6 / 84.3
  Proposed, 7 (self ± 3 words)   | 59.5 / 75.8 / 81.0         | 76.5 / 86.7 / 89.3        | 62.4 / 78.6 / 81.8
  Proposed, 9 (self ± 4 words)   | 52.7 / 73.9 / 75.3         | 75.7 / 86.1 / 89.1        | 62.0 / 78.4 / 81.1

Table 3: Estimation accuracies of local labels.

  window length                  | Pros (Total/Agree/Disagr.) | Lex (Total/Agree/Disagr.) | Pros+Lex (Total/Agree/Disagr.)
  Proposed, 1 (self only)        | 43.7 / 76.2 / 64.9         | -                         | -
  Proposed, 3 (self ± 1 word)    | 38.7 / 75.1 / 60.8         | 65.7 / 87.1 / 77.1        | 62.7 / 85.0 / 76.0
  Proposed, 5 (self ± 2 words)   | 41.1 / 78.1 / 60.5         | 75.3 / 90.2 / 84.1        | 57.4 / 84.3 / 71.5
  Proposed, 7 (self ± 3 words)   | 43.0 / 77.2 / 63.2         | 77.8 / 90.9 / 86.0        | 53.2 / 81.0 / 70.3
  Proposed, 9 (self ± 4 words)   | 44.9 / 77.8 / 64.6         | 78.9 / 91.5 / 86.7        | 55.5 / 82.2 / 71.5
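The per-interval statistics of Table 1 can be computed from the frame-level values openSMILE provides. A hedged sketch for one word interval follows; the slope here is a least-squares line fit over the frame index, since the paper does not specify its exact slope definition:

```python
import statistics

def prosodic_stats(frames):
    """mean/std/min/max/slope/range of frame-level values (e.g. F0 in Hz)
    within one word interval; slope is a least-squares line fit of the
    value against the frame index."""
    n = len(frames)
    mean = statistics.mean(frames)
    std = statistics.pstdev(frames)
    lo, hi = min(frames), max(frames)
    # Least-squares slope: covariance with the frame index over its variance.
    x_mean = (n - 1) / 2
    denom = sum((x - x_mean) ** 2 for x in range(n))
    slope = (sum((x - x_mean) * (f - mean) for x, f in enumerate(frames)) / denom
             if denom else 0.0)
    return {"mean": mean, "std": std, "min": lo, "max": hi,
            "slope": slope, "range": hi - lo}

# Toy F0 contour (hypothetical values) for one word at a 10 ms frame shift.
f0 = [180.0, 185.0, 190.0, 195.0, 200.0]
print(prosodic_stats(f0))
```

Concatenating these statistics for the word, its first phoneme, and its last phoneme (plus duration and pause values) would give the per-word feature vector described in Section 3.2.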
The accuracy with lexical features was not calculated for the single-word window, since 2-gram perplexities and posteriors cannot be computed there. With prosodic features, each total accuracy of the proposed method is better than that of the baseline; the maximum improvement over the baseline is 10.1 points, obtained with 3 prior and 3 following words. This indicates that the local prosodic characteristics of an utterance are effective in detecting agreement and disagreement. Focusing on the window length, increasing the length of the local windows tends to raise accuracy, but when a window includes 4 prior and 4 following words, utterance accuracy decreases. We believe the reasons are that long windows often include units unrelated to agreement/disagreement, and that the increase in feature dimensions makes the classifier less robust. These results show that the local prosodic characteristics accompanying agreement and disagreement have relatively long durations, spanning several word intervals, but that excessively long windows decrease detection accuracy. This corresponds with the discussion in Section 3.

We also examine the results with lexical features. The utterance label accuracies of the proposed method with lexical features exceeded the baseline accuracy by over 20 points, and the improvements over the baseline are greater than those obtained with prosodic features. These results indicate that local lexical characteristics are also effective in detecting agreement and disagreement utterances. Utterance estimation accuracy decreases as the window widens even though local estimation accuracy increases, which indicates that the current method of integrating the local results is not optimal for lexical features.

The proposed method with both lexical and prosodic features also demonstrated utterance label estimation performance superior to the baseline.
However, its local label estimation accuracy decreased as the window widened, unlike the results with lexical or prosodic features alone. This indicates that the simple combination of prosodic and lexical features described here may not be suitable, and that more advanced combination methods, such as variable subinterval lengths, remain a problem for future work.

5. Conclusions

In this paper, we proposed a new agreement and disagreement utterance detection framework for conversational speech that uses local prosodic/lexical features. To detect agreement and disagreement utterances whose characteristics appear in only some part of the utterance, we utilize local features extracted from short windows that cover several words. Local labels are estimated from those features, and the posteriors of the local labels are integrated to estimate the utterance label. Experiments on free discussion speech showed that the proposed method improves the accuracy of detecting agreement and disagreement utterances; its performance is due to the use of the local prosodic/lexical characteristics of utterances.

One piece of future work is improving the integration of the local estimation results: the proposed method uses the mean and standard deviation of the local results and so does not utilize sequential information, which would seem to be effective in detecting agreement and disagreement utterances. Other future work includes better combinations of prosodic and lexical features, evaluation with automatic speech recognition results, and evaluation on the conversational speech datasets used in previous studies.

6. References

[1] C. Lai and S. Renals, "Incorporating Lexical and Prosodic Information at Different Levels for Meeting Summarization," in Proc. of INTERSPEECH 2014, 2014.
[2] D. Hillard, M. Ostendorf and E. Shriberg, "Detection of Agreement vs. Disagreement in Meetings: Training with Unlabeled Data," in Proc. of HLT-NAACL 2003, vol. 2, pp. 34-36, 2003.
[3] M. Galley, K. McKeown, J. Hirschberg and E. Shriberg, "Identifying Agreement and Disagreement in Conversational Speech: Use of Bayesian Networks to Model Pragmatic Dependencies," in Proc. of the 42nd Annual Meeting of the ACL, pp. 669-676, 2004.
[4] S. Hahn, R. Ladner and M. Ostendorf, "Agreement/Disagreement Classification: Exploiting Unlabeled Data using Contrast Classifiers," in Proc. of HLT-NAACL, pp. 53-56, 2006.
[5] W. Wang, S. Yaman, K. Precoda, C. Richey and G. Raymond, "Detection of Agreement and Disagreement in Broadcast Conversation," in Proc. of the 49th Annual Meeting of the ACL, pp. 374-378, 2011.
[6] W. Wang, K. Precoda, C. Richey and G. Raymond, "Identifying Agreement/Disagreement in Conversational Speech: A Cross-lingual Study," in Proc. of INTERSPEECH 2011, 2011.
[7] K. Bousmalis, L. P. Morency and M. Pantic, "Modeling Hidden Dynamics of Multimodal Cues for Spontaneous Agreement and Disagreement Recognition," in Proc. of Automatic Face & Gesture Recognition and Workshops, pp. 746-752, 2011.
[8] S. Germesin and T. Wilson, "Agreement Detection in Multiparty Conversation," in Proc. of the International Conference on Multimodal Interfaces, pp. 7-14, 2009.
[9] S. Siegel and N. J. Castellan, Nonparametric Statistics for the Behavioral Sciences, McGraw-Hill, 1988.
[10] V. K. R. Sridhar, A. Nenkova, S. Narayanan and D. Jurafsky, "Detecting Prominence in Conversational Speech: Pitch Accent, Givenness and Focus," in Proc. of Speech Prosody, 2008.
[11] E. Strangert, "Emphasis by Pausing," in Proc. of the 15th ICPhS, pp. 2477-2480, 2003.
[12] S. Fujie, D. Yagi, H. Kikuchi and T. Kobayashi, "Prosody-based Attitude Recognition with Feature Selection and Its Application to Spoken Dialog Systems as Para-Linguistic Information," in Proc. of ICSLP 2004, vol. 4, pp. 2841-2844, 2004.
[13] N. Japkowicz, "The Class Imbalance Problem: Significance and Strategies," in Proc. of the 2000 International Conference on Artificial Intelligence, pp. 111-117, 2000.
[14] F. Eyben, M. Wöllmer and B. Schuller, "openSMILE - the Munich Versatile and Fast Open-Source Audio Feature Extractor," in Proc. of ACM Multimedia, pp. 1459-1462, 2010.