Utilizing gestures to improve sentence boundary detection

Lei Chen · Mary P. Harper
Springer Science+Business Media, LLC 2009

L. Chen, School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47905, USA
M. P. Harper, Department of Computer Science, University of Maryland, College Park, MD 20742, USA
M. P. Harper, Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21211, USA
Present address: L. Chen, Educational Testing Service (ETS), Princeton, NJ 08541, USA

Abstract  An accurate estimation of sentence units (SUs) in spontaneous speech is important for (1) helping listeners to better understand speech content and (2) supporting other natural language processing tasks that require sentence information. There has been much research on automatic SU detection; however, most previous studies have used only lexical and prosodic cues, not nonverbal cues such as gesture. Gestures play an important role in human conversations, including providing semantic content, expressing emotional status, and regulating conversational structure. Given the close relationship between gestures and speech, gestures may provide additional contributions to automatic SU detection. In this paper, we investigate the use of gesture cues for enhancing SU detection. In particular, we focus on: (1) collecting multimodal data resources involving gestures and SU events in human conversations, (2) analyzing the collected data sets to enrich our knowledge about the co-occurrence of gestures and SUs, and (3) building statistical models for detecting SUs using speech and gestural cues. Our data analyses suggest

that some gesture patterns influence a word boundary's probability of being an SU. On the basis of the data analyses, a set of novel gestural features was proposed for SU detection. A combination of speech and gestural features was found to provide more accurate SU predictions than using only speech features in discriminative models. Findings in this paper support the view that human conversations are processes involving multimodal cues, and so they are more effectively modeled using information from both verbal and nonverbal channels.

Keywords: Gesture · Nonverbal communication · Multimodal processing · Sentence boundary detection

1 Introduction

When communicating, people use not only speech but also nonverbal cues, e.g., hand gestures. Nonverbal communication plays many roles in human conversation [1, 17]: (1) expressing emotions and personal attitudes, (2) expressing information related to verbal content, and (3) regulating conversations. For some tasks, nonverbal communication plays a dominant role. For example, Mehrabian [37] demonstrated that a speaker's body language is the most important cue for determining whether he/she likes a listener or not. However, in a large percentage of current conversation analysis research, only speech cues are utilized, and important nonverbal cues in human communication are often neglected. With the rapid development of technologies for recording, storing, transferring, and processing nonverbal (especially visual) cues, it has become more feasible to incorporate nonverbal cues to support a deeper analysis of human conversations. As a result, some research [7, 9, 10, 15, 16, 40] has emerged recently on utilizing nonverbal cues in language processing.

In this paper (Footnote 1), we describe our research on incorporating gestural cues to improve the accuracy of sentence boundary detection. Accurate identification of the sentence structure of spontaneous speech has the potential to help human readers of speech transcripts understand the content more effectively and efficiently. It would also support many language processing tasks such as parsing, language modeling, and question answering in dialog systems. Due to its importance, much research has been conducted on sentence boundary detection [32, 34, 45, 47, 48]. However, most of this research has focused only on lexical and prosodic cues in speech signals. Gesture is a primary nonverbal communicative cue that is highly related to speech [6, 36].

Footnote 1: This paper supersedes the work done in [9, 10]. There are substantial differences in that much of the modeling work was redone and measurement studies were added; the speech baseline, for example, was substantially improved in this paper. The two previous papers were preliminary results for the first author's Ph.D. thesis proposal, supervised by the second author. This paper reports on the work in the final thesis document.

Our work is based on the theory [36] that gesture and speech stem from a single underlying mental process and that they are related both temporally and semantically. Gestures play an important role in human communication but use quite different expressive mechanisms than spoken language. Given the close relationship between gestures and speech and the special expressive capacity of gestures, gestures are likely to provide additional important information that can be exploited when modeling sentence structure.

In this research, we use a data-driven methodology, which includes collecting multimodal data resources, analyzing the multimodal data, and building statistical models to detect sentence boundaries. This paper is organized according to this data-driven approach: Section 2 reviews previous research on sentence boundary detection that uses only speech features, as well as previous research on incorporating gestural cues. Section 3 describes our multimodal data collection effort. Section 4 describes our measurement studies of gesture patterns for signaling sentence boundaries. Section 5 introduces our automatic sentence boundary detection evaluation plan. Section 6 describes the multimodal features designed for sentence boundary detection. Section 7 describes the statistical models evaluated in this paper. Section 8 reports our experimental results. Section 9 summarizes our research findings and draws conclusions.

2 Previous research

Most of the previous research on automatic sentence structure detection uses lexical cues from text [2, 48] and prosodic cues (e.g., pause duration, F0, and energy) from audio signals [8, 21, 47]. In the influential work of Shriberg et al. [47], the authors built a lexical model (a language model augmented with special tokens representing sentence boundaries) and a prosodic model (a decision tree trained from automatically extracted prosodic features), and combined these two models using a Hidden Markov Model (HMM) approach. Because the prosodic model is less influenced by word errors from automatic speech recognition, it is more robust than the lexical model and provides important complementary information for detecting sentence boundaries. As a result, the combined lexical and prosodic model shows significantly improved accuracy compared to the lexical model. Building on this foundation [47], some new techniques have been applied to the sentence structure detection task. For example, an ensemble machine learning approach, bagging, has been used to improve prosodic decision trees trained from imbalanced data sets [31]. In addition, discriminative modeling approaches, such as maximum entropy (ME) and conditional random field (CRF) models, were investigated to address weaknesses of the generative HMM approach [32].

As an important nonverbal communicative cue in face-to-face conversations, gesture has been found to be related to language production [36]. There have been some previous efforts to utilize gestures for analyzing conversation structure. For example, speakers often make metaphoric gestures to indicate the introduction of a new topic [36]. Discourse boundaries have been found to be predictable from gesture information, such as the symmetry and oscillation of the speaker's two hands [41]. A limited number of previous efforts [10, 14] have attempted to incorporate gesture features into a model for sentence boundary detection. Using the KDI Wombat data [42], Chen et al. [10] extracted gestural features from hand positions that were automatically tracked from video and trained decision trees to detect sentence

boundaries using the extracted gestural features. The obtained gestural model was integrated with the speech model using an HMM approach. Their experimental results showed that jointly using textual, prosodic, and gestural information achieved the lowest error rate for sentence boundary detection. However, compared with the speech-only model, the multimodal model did not achieve a significant improvement on the KDI Wombat data. Following this approach, Eisenstein and Davis [14] incorporated gestures in sentence boundary detection on their own multimodal monologue data. Compared with [10], they used the pause duration between two words as the sole prosodic feature, and they used gesture features derived from their manual annotations. They reported that gestural cues can help SU (sentence unit) detection, but their work did not produce a statistically significant improvement.

Previous investigations on utilizing gestures for sentence boundary detection were somewhat preliminary. First, there was no deep analysis of gesture patterns at sentence boundaries. Second, the previous studies on automatic sentence boundary detection using both gestures and speech were limited by (1) small data sets with a limited number of words and speakers, and (2) comparison to an HMM speech baseline that was not state of the art relative to previous research. These limitations affect our ability to draw conclusions about the impact of incorporating gestures for the sentence boundary detection task. Therefore, a deeper investigation was carried out and is reported in this paper.

3 VACE multimodal meeting corpus

The corpus used in this research was collected with the support of the Advanced Research and Development Activity (ARDA) Video Analysis and Content Exploration (VACE) project [11]. The VACE meeting corpus is a collection of meetings based on war games, which elicit rich multimodal behaviors. Examples of scenarios include planning a rocket launch, planning humanitarian assistance, planning foreign material exploitation, and selecting recipients for a scholarship award. A lecture room was modified to collect time-synchronized audio, video, and motion data. For audio recording, similar to other meeting audio data collections [20, 39], we recorded each meeting participant's speech using an individual wireless microphone and the group's speech using table-mounted microphones. For video recording, we recorded each meeting participant's body movements using calibrated stereo cameras and a 3D motion tracking system [53]. The collected audio, video, and motion signals were processed as described in the rest of this section.

3.1 Audio processing

We manually transcribed and force-aligned the meeting speech files to obtain timing information for the spoken words. First, we segmented the audio data into speech and silence regions using an automatic multi-step segmentation method [22]. Then, the speech segments were transcribed using the Quick Transcription (QTR) methodology developed by LDC [27], which offers a balanced tradeoff between the accuracy and the speed of transcription. We force-aligned the word transcription and speech segments to obtain timing information for all of the words spoken in the meeting. To obtain the forced alignment, we used ISIP's automatic speech recognition (ASR)

system with a triphone acoustic model trained on more than 60 hours of spontaneous speech data [51].

Given the time-aligned word transcriptions, we used the EARS MDE annotation specification V6.2 [28] to annotate sentence units (SUs). An SU is defined as the complete expression of a speaker's thought or idea [50]. It can be either a complete sentence or a semantically complete smaller unit. SUs need not be complete sentences; e.g., it is common to answer a question with a phrase rather than a sentence. There are four types of SUs:

Statement: A complete SU that functions as a declarative statement.
Question: A complete SU that functions as an interrogative.
Backchannel: A word or phrase said by the non-dominant speaker when listening to the dominant speaker, indicating that he/she follows the dominant speaker's controlling thread and does not want to obtain the conversational floor. Typical backchannel words include, for example: hmm, right, sure, yeah, oh, really.
Incomplete SU: Sometimes a speaker does not finish expressing a complete thought or idea. Possible reasons are that (1) the speaker generates a new speech plan and stops the ongoing speech, or (2) the speaker is interrupted and then does not continue with the old utterance. The incompleteness is with respect to not finishing what the speaker had planned to say at the outset.

In this research, due to the limited size of our multimodal corpus, we focus on investigating the presence of SU boundaries without specifying their subtypes.

3.2 Gesture annotations

To support visualizing and coding nonverbal communication, researchers at Virginia Tech designed MacVisSTA [46], which displays video content from multiple viewing angles along with frame-aligned speech transcriptions and a wide variety of other annotations. Using MacVisSTA and referring to the time-aligned speech transcriptions, researchers at the University of Chicago annotated the gestures of each participant in the meetings. Gesture onset and offset, as well as the semiotic properties of the gesture as a whole, were coded in relation to the accompanying speech [11]. In our investigation, we use only the information provided by the starting and ending boundaries of the annotated gestures.

3.3 Motion tracking result processing

Special markers were placed on the wrists (for hand position) of each meeting participant. The motion traces of these markers were tracked by a Vicon motion tracking system. From the tracked hand positions, some raw gesture features were computed: effort, defined as the kinetic energy of hand motion; holds, defined as states without noticeable hand movement; and the rest area, defined as the location where the speaker rests his/her hands when not gesturing. More details on computing these raw features can be found in Section 4.1 and Section 4.2.

Table 1  A comparison between the VACE meeting data set and the previous data sets used for multimodal SU detection

  Data       # Participants   Type      # Words   # SUs
  MIT [14]   9                Monolog   2,…       …
  KDI [10]   3                Dialog    3,…       …
  VACE       14               Meeting   24,566    3,170

3.4 Data

In the VACE multimodal corpus, meetings were named based on their recording dates. Three fully annotated meetings, named Jan07, Mar18, and Apr25, were used in our SU detection experiments. Each meeting provides synchronized multimodal signals: multi-channel audio, time-aligned words, annotated SUs, Vicon motion traces, and gesture annotations. Table 1 compares the data used in this study with the data used in previous efforts. Compared with the previous multimodal SU studies (e.g., [10] and [14]), the data set used in this research is much larger (more speakers, longer speech files, and more words), although the data size is far more limited than what has been used in SU studies involving only speech.

Table 2 provides some basic statistics on each meeting participant's use of words and gestures, including the number of words spoken, the number of SUs produced, the ratio of SUs to words, the duration of all words (Dur_words), the duration of all gestures (Dur_gestures), and the ratio of Dur_gestures to Dur_words. Meeting participants show different percentages of gesture use. In both the Jan07 and Apr25 meetings, there is one participant with a comparatively low percentage of gesture use (D in the Jan07 meeting and F in the Apr25 meeting). Also note that most of the participants in the Mar18 meeting have a low percentage of gesture use, except for F. A possible reason for this is that the participants spent a lot of time manipulating documents on the conference table and therefore had fewer opportunities to make gestures. In addition, among the 14 speakers in these three VACE meetings, only participant C in the Jan07 meeting was a female speaker.

Table 2  Statistics on words, SUs, and gestures for each participant in the VACE meeting corpus

  Participant   # Words   # SUs   SU (%)   Dur_words   Dur_gestures   Ratio (%)
  Jan07_C       1,…       …       …        …           …              …
  Jan07_D       2,…       …       …        …           …              …
  Jan07_E       3,…       …       …        …           …              …
  Jan07_F       2,…       …       …        …           …              …
  Jan07_G       1,…       …       …        …           …              …
  Mar18_C       2,…       …       …        …           …              …
  Mar18_D       1,…       …       …        …           …              …
  Mar18_E       1,…       …       …        …           …              …
  Mar18_F       1,…       …       …        …           …              …
  Mar18_G       1,…       …       …        …           …              …
  Apr25_C       1,…       …       …        …           …              …
  Apr25_D       3,…       …       …        …           …              …
  Apr25_F       2,…       …       …        …           …              …
  Apr25_G       …         …       …        …           …              …

4 Analysis of gesture patterns of SUs

In a multimodal conversation, some word boundaries in the speech channel have co-occurring gestures in the visual channel. The variation of gestural properties (e.g., effort, location, and starting and ending points) may provide constraints for signaling the presence of an SU boundary. Hence, in order to better characterize the relationship between gestures and SUs, we conducted the following analyses on the three meetings in the VACE multimodal corpus.

Effort information of gestures and SUs: In Section 4.1, we examine the likelihood of an SU boundary at word boundaries corresponding to different levels of effort change. According to [25, 36], the stroke phase (the most effortful part) of a gesture tends to co-occur with, or just before, the important parts of the speech content. Therefore, when a speaker greatly changes his or her hand gesture's effort, e.g., finishing the stroke (the most rapid movement of a hand gesture) or retracting his/her hands back to the rest area, he or she may also be finishing an SU in the speech domain. One might expect that a word boundary at which a speaker's gesture has a large effort change is more likely to be an SU boundary than a word boundary at which the gesture has a small effort change.

Location information of gestures and SUs: In Section 4.2, we examine the likelihood of an SU boundary at word boundaries corresponding to different hand positions. When a speaker makes a gesture, his/her hands are normally away from the rest area [36]. Therefore, one might expect that word boundaries at which a speaker's hands are away from the rest area are less likely to be SU boundaries than word boundaries at which his/her hands are near the rest area.

Timing information of gestures and SUs: In Section 4.3, we examine the likelihood of an SU boundary at word boundaries corresponding to different temporal portions of a gesture. According to [36], gesture and language are generated simultaneously from the same entity (the growth point, GP) in the mental process. Therefore, when a speaker makes a gesture, his/her language production is still ongoing (the GPs in the speaker's mental process are still being unpacked into visual and speech activities). Based on this, one might expect that word boundaries in the middle of a gesture are less likely to be SU boundaries than word boundaries close to the beginning or end of a gesture.

4.1 Effort information of gestures and SUs

Based on the tracked hand positions, effort is computed using a sliding window approach. Given that P(i) = {x(i), y(i), z(i)} is the hand position at frame i, the effort (RMS energy) E(i) of a window of width N at frame i can be computed as:

E(i) = \frac{1}{N} \sum_{j=i-N/2}^{i+N/2} \| P(j) - \hat{P}(i) \|^2,   (1)

where \hat{P}(i) is the average hand position in the window:

\hat{P}(i) = \frac{1}{N} \sum_{j=i-N/2}^{i+N/2} P(j).

From the computed effort values of hand motion, intervals of hold are classified using an adaptive thresholding and temporal filtering approach based on the dominant motion rule [41].

Given the data and gesture annotations described in Section 3.2, we identified all word boundaries co-occurring with gestures (i.e., all word boundaries that occur inside the time span of a gesture). Then, based on the absolute difference of the averaged effort values on the word immediately preceding and the word immediately following the word boundary, we grouped all of these word boundaries into two types: (1) LARGE-CHANGE word boundaries, at which the speaker's gesture effort changes greatly (the absolute difference of the averaged effort values across the word boundary is above a threshold), and (2) SMALL-CHANGE word boundaries, at which the speaker's gesture effort does not change greatly (the absolute difference is equal to or below the threshold). The threshold was selected at the 75% quantile (0.2 mm per frame) of all effort changes; that is, only word boundaries in the top 25% of effort changes were treated as LARGE-CHANGE type.
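To make the computation concrete, the following is a minimal sketch, not the original implementation, of the windowed effort measure of Eq. (1) and the LARGE-/SMALL-CHANGE grouping just described. The array layout, the 15-frame window width, and the function names are illustrative assumptions; the 0.2 threshold corresponds to the 75% quantile reported above.

    import numpy as np

    def effort(positions, window=15):
        # positions: assumed (n_frames, 3) array of tracked hand positions (mm);
        # the window width in frames is an assumption, not a value from the paper.
        half = window // 2
        n_frames = len(positions)
        e = np.zeros(n_frames)
        for i in range(n_frames):
            seg = positions[max(0, i - half):min(n_frames, i + half + 1)]
            # mean squared distance to the window's average hand position, as in Eq. (1)
            e[i] = np.mean(np.sum((seg - seg.mean(axis=0)) ** 2, axis=1))
        return e

    def effort_change_type(effort_values, prev_word_span, next_word_span, threshold=0.2):
        # word spans are (start_frame, end_frame) pairs around the word boundary
        p0, p1 = prev_word_span
        n0, n1 = next_word_span
        change = abs(effort_values[n0:n1].mean() - effort_values[p0:p1].mean())
        return "LARGE-CHANGE" if change > threshold else "SMALL-CHANGE"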

Figure 1 depicts a boxplot of different speakers' tendencies to produce SU boundaries at these two types of word boundaries. We found that LARGE-CHANGE type and SMALL-CHANGE type word boundaries have different frequencies of being SU boundaries. Such a finding suggests that hand effort change information may be helpful for signaling SU boundaries.

Fig. 1  Frequencies of SU boundaries at LARGE-CHANGE type and SMALL-CHANGE type word boundaries among all speakers

4.2 Location information of gestures and SUs

The rest area is a gesture characteristic derived from the hand position traces. In a simplified view, a gesture involves two important hand positions: (1) the rest area and (2) the gesture space. The rest area is where people put their hands when not making hand gestures; the gesture space is where people make gestures. During typical gestures, speakers move their hands away from the rest area during the preparation phase and retract their hands back to the rest area after a stroke. Rest areas are not fixed and tend to change during a conversation, especially a long one. For this reason, instead of using one fixed rest area, we determined the rest area for each gesture. After careful empirical observation of our VACE meeting data, we formed a heuristic rule for rest area estimation: when a time interval that is not coded as a gesture exceeds a pre-defined threshold, we treat the location where the speaker puts his/her hands as a rest area. For each gesture, we found the preceding non-gesture interval that was longer than a predefined duration (2.0 s) and then used the hand positions averaged over a range (0.5 s) before that non-gesture interval's end as the rest area (a code sketch of this heuristic appears below).

For each word boundary that co-occurs with a gesture, we calculated the distance between the average hand location during the word immediately prior to the word boundary and the rest area. Then, according to a predefined distance threshold, we grouped these word boundaries into two types: (1) AWAY word boundaries, at which the speaker's hands were away from the rest area (the distance to the rest area was equal to or larger than the threshold), and (2) REST word boundaries, at which the speaker's hands were near the rest area (the distance to the rest area was lower than the threshold). The threshold was selected at the 75% quantile (185.7 mm) of all distances between hand locations and the rest area; that is, only word boundaries in the top 25% of distances were treated as AWAY type.

Figure 2 depicts a boxplot, over all speakers, of the frequencies of SU boundaries at these two types of word boundaries. We found that AWAY type and REST type word boundaries have quite different tendencies of being SU boundaries. Such a finding suggests that hand location information may be helpful for signaling SU boundaries.

4.3 Timing information of gestures and SUs

Word boundaries co-occurring with gestures were grouped into two types according to the distance between the word boundary and the gesture's start or end boundary: (1) if the distance was lower than a predefined threshold (Dur_boundary), we called them boundary type (BND), and (2) if the distance was equal to or higher than the predefined threshold, we called them inside type (IN). Dur_boundary was selected at the 25% quantile (0.2 s) of all distances; that is, only word boundaries in the top 25% of the shortest distances to a gesture's start or end boundary were treated as BND type.
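The rest-area heuristic of Section 4.2 can be sketched as follows. This is a hedged re-implementation under stated assumptions, not the original code: gesture annotations are assumed to be time-sorted (start_frame, end_frame) spans, positions is an (n_frames, 3) array, and the fallback used when no qualifying gap exists is our own choice.

    import numpy as np

    def rest_area_for_gesture(positions, gestures, idx, fps,
                              min_gap_s=2.0, avg_window_s=0.5):
        min_gap = int(min_gap_s * fps)
        win = int(avg_window_s * fps)
        # non-gesture gaps that end no later than this gesture's start
        gaps = [(0, gestures[0][0])] + [
            (gestures[i - 1][1], gestures[i][0]) for i in range(1, idx + 1)]
        for gap_start, gap_end in reversed(gaps):
            if gap_end - gap_start >= min_gap:
                # average hand position over the window just before the gap ends
                return positions[max(gap_start, gap_end - win):gap_end].mean(axis=0)
        # assumption: fall back to the region before the first gesture
        return positions[:gestures[0][0]].mean(axis=0)

    def away_or_rest(positions, rest_area, prev_word_span, threshold_mm=185.7):
        # distance between the average hand location during the word before the
        # boundary and the rest area; AWAY if it reaches the 75% quantile threshold
        s, e = prev_word_span
        mean_pos = positions[s:e].mean(axis=0)
        return "AWAY" if np.linalg.norm(mean_pos - rest_area) >= threshold_mm else "REST"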

Fig. 2  Frequencies of SU boundaries at AWAY type and REST type word boundaries among all speakers

Fig. 3  Frequencies of SU boundaries at BND type and IN type word boundaries among all speakers

Figure 3 depicts a boxplot of the frequencies of SU boundaries at BND type and IN type word boundaries. We found that BND type and IN type word boundaries have quite different frequencies of being SU boundaries. Such a finding suggests that the timing placement of a gesture may be helpful for signaling SU boundaries.

4.4 A summary of findings

Our measurement studies show that gesture information related to word boundaries provides information that allows us to separate non-SU boundaries from SU boundaries. Cues from a gesture's velocity, location, and temporal relation to speech predict different likelihoods of there being SUs. Compared to the location and temporal relation to speech, a gesture's effort change is somewhat less predictive of SUs. Results from these studies not only support our view that gestures can play a role in predicting sentence structure, but also suggest potentially useful features for SU detection algorithms that use gesture cues, which will be discussed and evaluated in the rest of this paper.

5 Evaluation plan

SU detection has previously been treated as a classification task [32, 47] and will be treated in this manner in our experiments. For each inter-word boundary, a multimodal classifier decides whether there is an SU boundary or not based on somewhat independent knowledge sources (i.e., lexical, prosodic, and gestural cues). Our multimodal SU classification task can be generalized as follows:

\hat{E} = \arg\max_{E} P(E \mid W, F, G)

where E denotes the inter-word boundary event sequence (SU boundary or not), W denotes the corresponding lexical feature vector, F denotes the prosodic feature vector, and G denotes the corresponding gestural feature vector. Our goal is to determine the boundary classifications that have the highest probability given the observed lexical, prosodic, and gestural features.

5.1 Evaluation metrics

In our evaluation, we use the performance metric defined by NIST for evaluating SU detection in the DARPA EARS program [13]. To calculate the NIST error rate (ERR), the estimated SU string is compared with the gold standard SU reference string to determine the number of misclassified boundaries per reference SU boundary. Since SU boundaries may be incorrectly deleted or inserted, the insertion rate (INS) and deletion rate (DEL) are also calculated to allow us to compare patterns of insertions and deletions among different models. INS is the number of incorrect insertions of an SU in the estimated SU string that do not appear in the gold standard, per reference SU boundary, whereas DEL is the number of incorrect deletions of an SU that

appears in the gold standard string, per reference SU boundary. The formulas for these three metrics are as follows:

INS = (number of inserted SUs) / (total number of reference SUs)
DEL = (number of deleted SUs) / (total number of reference SUs)
NIST error rate = INS + DEL

The error rate in the NIST metric can be greater than 100%. The following example shows a system SU hypothesis aligned with the reference SUs:

Reference: w1 w2 w3 / w4
System   : w1 / w2 w3 w4
               INS    DEL

where w_i is a word and / indicates an SU boundary. There are two misclassified boundaries in the example above: one insertion error (INS) and one deletion error (DEL). Since there is only one reference SU boundary, the NIST SU error rate for this system output is 200%. If a system hypothesized a non-event boundary at each inter-word boundary, then the NIST error rate would be 100% for the boundary detection task, all due to deletion errors.

Because we would also like to compare pairs of models to find out whether one model is significantly better or worse than another, we also use the sign test [29] on the classification error rate (CER). The CER is defined as the number of incorrectly classified samples divided by the total number of samples (not just SUs):

CER = (number of incorrect boundaries) / (total number of word boundaries)   (2)

In the example above, there are four word boundaries, among which two are misclassified; therefore, the CER is 2/4 = 50%.
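A minimal sketch of these metrics is given below, assuming the reference and hypothesis SU decisions are available as parallel 0/1 sequences over inter-word boundaries (including the boundary after the final word, matching the count of four boundaries in the example); the function name is illustrative.

    def su_metrics(ref, hyp):
        # ref, hyp: parallel lists of 0/1 labels per boundary, 1 = SU boundary
        ins = sum(1 for r, h in zip(ref, hyp) if h == 1 and r == 0)
        dels = sum(1 for r, h in zip(ref, hyp) if h == 0 and r == 1)
        n_ref_su = sum(ref)            # assumed > 0, as in the NIST scoring setup
        return {"INS": ins / n_ref_su,
                "DEL": dels / n_ref_su,
                "NIST": (ins + dels) / n_ref_su,
                "CER": (ins + dels) / len(ref)}

    # the example above: reference w1 w2 w3 / w4, system w1 / w2 w3 w4
    print(su_metrics(ref=[0, 0, 1, 0], hyp=[1, 0, 0, 0]))
    # -> INS = 1.0, DEL = 1.0, NIST = 2.0 (200%), CER = 0.5 (2/4)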

5.2 Evaluation setup

The focus of our research is to determine whether gestural information provides additional helpful information for SU detection beyond lexical and prosodic information. Because of this, we first need to build a strong speech-only statistical model; we then add gestural features to the speech model to determine their impact. Our basic evaluation procedure therefore consists of (1) building and testing a statistical model using only speech features, and then (2) testing the impact of adding gestural information to the speech-based statistical model.

When building a model, some parameters need to be selected to achieve optimal performance. Therefore, we hold out a subset of data for parameter optimization purposes. From the data of the 14 speakers in the three VACE meetings (Jan07, Mar18, and Apr25), we held out the data from two speakers as a development set for this purpose. In particular, we selected speaker C in the Jan07 meeting and speaker F in the Mar18 meeting; both speakers had typical gesture frequencies. Data from the remaining 12 speakers were used for training and testing.

Although we significantly increased the multimodal data size for the experiments reported here, the available multimodal data resource is still quite limited when compared to the speech data available for SU detection research. For example, the RT04S audio corpus used in the DARPA EARS program for SU detection contains about 480K words for training and about 64K words for testing. In contrast, the entire multimodal corpus used in this study contains only about 26K words (Footnote 2).

Footnote 2: The multimodal data used in this study was much more expensive to collect and annotate since it required processing both audio and video events.

Due to the limited size of our multimodal data set, we utilized two evaluation approaches. The first approach involved cross-validation: when model training used features extracted from the VACE corpus, we used a leave-one-out evaluation method among the 12 participants not in the development set. In particular, among these 12 participants, we iteratively tested instances belonging to one participant after training on the remaining 11 participants. The second approach used additional available large speech corpora: when model training used features extracted from these speech corpora, we tested the trained models on all 12 speakers in the VACE corpus.

6 Multimodal features

In this section, we describe the multimodal features used in the experiments reported in Section 8. Section 6.1 describes lexical features, Section 6.2 describes prosodic features, and Section 6.3 describes gestural features.

6.1 Lexical features

The words from speech transcriptions constitute a primary knowledge source for SU detection. Syntactic classes of words (e.g., part-of-speech (POS) tags) provide additional lexical information for SU detection; for example, verbs in English, a subject-verb-object language, tend to be in the middle of sentences. Words in human conversations can be obtained from an automatic speech recognition system or from human-generated transcripts. In our experiments, we used human-generated transcripts for several reasons. First, the focus of our experiments is to investigate possible benefits from incorporating gestural features in SU detection. Second, using human-generated transcripts allows us to test gesture's impact on a strong speech-only model. Finally, previous studies of SU detection using speech-only cues have shown that insights gained using human-generated transcripts generalize well to the automatic recognition case [32].

6.2 Prosodic features

Prosodic features are based on duration, pitch, and energy patterns in regions around the word boundaries, following the work in [47].

In this study, we used the Purdue Prosodic Feature Extraction (PPFE) tool [23] to extract prosodic features. The prosodic features computed using the PPFE tool are briefly described below.

Duration features: The durations of words, phones, and vowels are extracted from the time-aligned word transcriptions. Since pause duration is an important cue, we compute pause duration both before and after each word boundary. Since a possible indicator of an SU boundary is pre-boundary lengthening, we also extract features such as the duration of the last vowel or stressed vowel in a multisyllabic word, as well as its normalization, e.g., normalization according to the feature's range. More details about the normalization can be found in [23].

F0 features: We use Praat's autocorrelation-based pitch tracker to obtain the raw pitch contour. The pitch baseline, top-line, and pitch range are computed based on the mean and variance of the log F0 values. Voiced regions are identified and the pitch curve is stylized over each voiced segment. Using the stylized pitch contour, we compute several different types of F0 features:

Range features: These features reflect the pitch range of a single word or of windows around a word boundary. They include the minimum, maximum, mean, and last F0 values for each word boundary. These features are normalized by baseline F0 values using a linear difference and a log difference. Speakers are expected to be more likely to fall near the bottom of their pitch range at a sentence boundary.

Movement features: These features reflect the pitch movement on a single word or in windows around a word boundary. They are obtained from the stylized F0 contours of the voiced regions of the word preceding and the word following a word boundary. Examples of such movement features are the minimum, maximum, and mean F0 values, and the starting or ending stylized F0 values, using various normalization methods.

Slope features: These features, e.g., rising or falling pitch contours, reflect slopes of the stylized pitch contour within a word or within a predefined window around a word boundary. We also considered the slope across a boundary to capture local pitch variation. A continuous trajectory is more likely to correlate with non-boundaries, whereas a broken trajectory tends to indicate a boundary of some type.

Energy features: We compute energy features based on the intensity contour produced by Praat. Similar to the F0 features, a variety of energy-related range features, movement features, and slope features are computed, using various normalization methods.

Additional features: We include the gender of each meeting participant as an additional feature. In our research, we do not automatically determine the gender of the speaker. Since all 12 speakers from the VACE data set who were selected for training and testing were male, this feature doesn't play a role when using a prosodic model trained on VACE data. However, when using a prosody model trained from other available large speech corpora containing both male and female speakers, this feature becomes useful.
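As an illustration of the duration features, the sketch below computes pause durations around each inter-word boundary from the forced-alignment output. It assumes simple (word, start_time, end_time) tuples and covers only the pause cue; it is one plausible reading of "pause duration before and after each word boundary", not the PPFE implementation.

    def pause_features(aligned_words):
        # aligned_words: list of (word, start_time, end_time) tuples, sorted by time
        feats = []
        for i in range(len(aligned_words) - 1):
            _, _, cur_end = aligned_words[i]
            _, next_start, _ = aligned_words[i + 1]
            # silent pause at the boundary between word i and word i+1
            pause_at_boundary = max(0.0, next_start - cur_end)
            # pause at the preceding boundary, used here as the "before" context
            if i == 0:
                pause_before = 0.0
            else:
                _, _, prev_end = aligned_words[i - 1]
                _, cur_start, _ = aligned_words[i]
                pause_before = max(0.0, cur_start - prev_end)
            feats.append({"PAUSE_AFTER_BOUNDARY": pause_at_boundary,
                          "PAUSE_BEFORE_BOUNDARY": pause_before})
        return feats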

Fig. 4  Feature selection windows used to extract effort-related and location-related gestural features

6.3 Gestural features

According to growth point theory [36], an ongoing gesture tends to indicate that the accompanying speech production is not finished. Therefore, word boundaries with corresponding gestural patterns indicating that the gesture is ongoing (e.g., a large effort change, a large distance to the rest area, or being mid-gesture) are unlikely to be SU boundaries. Our data analysis described in Section 4 also supports this expectation. Since pause duration has proven to be an effective prosodic feature for SU detection [47], one might expect that a hold, the gestural correlate of a pause, would also be highly informative for SU detection; holds have been found to be related to audio pauses [18].

Gesture feature extraction involves (1) determining the parameters required for the feature extraction and (2) selecting relevant gestural features from a large pool of possible features. To tackle these two tasks, we used information gain (IG) analysis. Information gain is defined as the entropy reduction given the presence of the features. To measure the predictive ability of a feature, we use the information gain ratio (IGR), the ratio of the information gain to the event's entropy. A higher IGR means more uncertainty reduction about the predicted event given the presence of the features. Parameters and features are selected based on their IGRs on the development set described in Section 5.2.

6.3.1 Gestural features related to effort (E) and location (L)

As depicted in Fig. 4, the following feature extraction regions were used for extracting the effort-related and location-related features:

previous word: the word immediately preceding a word boundary
following word: the word immediately following a word boundary (Footnote 3)
previous window: a fixed-length window (window length 0.2 s) immediately preceding a word boundary (Footnote 4)
following window: a fixed-length window immediately following a word boundary

Footnote 3: If the interval from the current word boundary to the end of the following word is not more than 2.0 s, we use this interval; otherwise we use only the interval of 2.0 s after the current word boundary.
Footnote 4: The window length was selected to be the one that provided the optimal IGR on the development set, from window lengths ranging from 0.1 to 0.5 s.

Effort values are normalized so that the value range of a speaker's effort is between 0.0 and 1.0. The normalized effort value e_normalized is derived from its original value e_original as (e_original - e_MIN)/(e_MAX - e_MIN), where e_MIN and e_MAX are the minimum and maximum effort values belonging to a speaker. Average effort values over those four feature extraction regions were extracted to

be EFFORT_PREV_WORD, EFFORT_NEXT_WORD, EFFORT_PREV_WIN, and EFFORT_NEXT_WIN. In addition, differences across words and windows were extracted as EFFORT_DIFF_WORD and EFFORT_DIFF_WIN.

Location measurement uses the distance between the hand positions and the rest area. Average distances over those four feature extraction regions were extracted as LOCATION_PREV_WORD, LOCATION_NEXT_WORD, LOCATION_PREV_WIN, and LOCATION_NEXT_WIN. In addition, similar to the effort-related features, differences across words and windows were extracted as LOCATION_DIFF_WORD and LOCATION_DIFF_WIN.

6.3.2 Gestural features related to hold (H)

As depicted in Fig. 5, we computed two features related to the hold duration around a word boundary. HOLD_OVERLAP_WORD is the ratio of hold frames within the time interval marked by the beginning of the word immediately preceding a word boundary up to the end of the word immediately following the word boundary. The grey bar under the hold displayed in Fig. 5 shows the hold frames used to compute this feature. The idea behind this feature is that the end of a visual expression, suggested by the presence of a hold, may help signal the presence of an SU boundary. HOLD_OVERLAP_PAUSE is the ratio of hold frames that overlap with a silent pause at a word boundary, a potentially helpful feature for signaling SU boundaries. The grey bar above the words displayed in Fig. 5 shows the hold frames used to compute this feature. The idea behind this feature is that cues corresponding to the end of both the visual and the vocal expression may help predict more accurate SU boundaries.

Fig. 5  Hold-related gestural features
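A hedged sketch of the two hold features follows. Hold intervals are assumed to be (start, end) times in seconds, and the normalization (dividing the overlapped hold time by the length of the reference interval) is our reading of "ratio of hold frames"; the paper's exact frame-level bookkeeping may differ.

    def overlap(a_start, a_end, b_start, b_end):
        # length of the intersection of two time intervals
        return max(0.0, min(a_end, b_end) - max(a_start, b_start))

    def hold_overlap_word(holds, prev_word_start, next_word_end):
        # share of the [previous word start, following word end] span covered by holds
        span = next_word_end - prev_word_start
        covered = sum(overlap(h0, h1, prev_word_start, next_word_end) for h0, h1 in holds)
        return covered / span if span > 0 else 0.0

    def hold_overlap_pause(holds, prev_word_end, next_word_start):
        # share of the inter-word silent pause that is covered by holds
        pause = next_word_start - prev_word_end
        if pause <= 0:
            return 0.0
        covered = sum(overlap(h0, h1, prev_word_end, next_word_start) for h0, h1 in holds)
        return covered / pause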

6.3.3 Gestural features related to timing (T)

We defined four features related to the timing of gestures. TIME_GESTURE_PART represents a word boundary's temporal proximity to a gesture. All word boundaries can be classified into three types according to their temporal proximity to nearby gestures: (1) near boundary, meaning that a word boundary is close to a gesture's boundary (the distance from the word boundary to a gesture boundary is less than 0.5 s, a value selected based on the best IGR on the development set); (2) internal, meaning that a word boundary is internal to a gesture and its distance to any gesture boundary is larger than 0.5 s; and (3) outside, meaning that a word boundary belongs to neither the near-boundary case nor the internal case. On the data set used for training and testing (containing 12 speakers), word boundaries of these three proximity types have different likelihoods of being an SU boundary: near boundary (8.67%), internal (5.20%), and outside (17.31%).

The other three features use a speech model's estimations, because gestures in general provide only a coarse indication of the exact location of SU boundaries; we therefore use each gesture's timing information as an additional constraint on the speech model's estimation to provide a more precise indication of the presence of an SU (Footnote 5). Figure 6 illustrates these features. All three features are binary; a feature with a true value indicates that the corresponding word boundary's posterior probability, as estimated by the speech model, is a local maximum within one of three kinds of gesture-based regions: (1) word boundaries within 1.0 s (a value obtained from the development set) of the gesture's start, (2) word boundaries within 1.0 s of the gesture's end, and (3) word boundaries between two adjacent gestures. Since two adjacent gestures may be associated with two different SUs, an SU boundary might exist among the word boundaries within the interval marked by the end of the previous gesture and the start of the following gesture. The corresponding features based on these three regions are TIME_LOCALMAX_START, TIME_LOCALMAX_END, and TIME_LOCALMAX_GAP. On the data set used for training and testing (containing 12 speakers), the SU frequencies of word boundaries for which these features have a true value are 38.6%, 46.8%, and 63.9%, respectively.

Fig. 6  Time-information-related gestural features constrained by the speech model's estimations (most likely SU boundary around a gesture boundary; most likely SU boundary within a gap between two adjacent gestures)

Footnote 5: In our experiment, we used an HMM model trained from a large audio corpus, which will be described in Section 7.1, as the speech-only SU model.

Using all of the gestural features described above, we calculated each feature's information gain ratio and selected only features with an IGR of at least 2.0% for use in our SU detection experiments. The 2.0% threshold was selected to remove features with little predictive ability for SUs while preserving useful features covering a wide range of aspects of gestures. Table 3 lists all of the gestural features considered in our experiments along with their IGR values on the development set; the last column reports whether a feature is used in our automatic SU detection experiments. Most of the effort-related features are not predictive of an SU, except for EFFORT_DIFF_WORD, which reflects the importance of the change in velocity across word boundaries. Hold-related features have large IGR values, and some of the timing-related features (e.g., TIME_LOCALMAX_GAP) also produce large IGR values.
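The selection criterion can be sketched as follows, using the paper's definition of IGR (information gain divided by the entropy of the SU event itself, rather than the split information used in some decision-tree literature). The function names and the assumption of already-discretized feature values are illustrative.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain_ratio(su_labels, feature_values):
        # su_labels: 0/1 SU events per boundary; feature_values: discrete values
        h_event = entropy(su_labels)
        if h_event == 0.0:
            return 0.0
        n = len(su_labels)
        h_cond = 0.0
        for v in set(feature_values):
            subset = [l for l, f in zip(su_labels, feature_values) if f == v]
            h_cond += (len(subset) / n) * entropy(subset)
        return (h_event - h_cond) / h_event

    # e.g., keep a feature only if information_gain_ratio(su_labels, feat) >= 0.02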

Table 3  A list of gestural features for the SU detection experiments

  Feature                Type       IGR (%)   Selected
  EFFORT_PREV_WORD       Effort     <2.0      No
  EFFORT_NEXT_WORD       Effort     <2.0      No
  EFFORT_PREV_WIN        Effort     <2.0      No
  EFFORT_NEXT_WIN        Effort     <2.0      No
  EFFORT_DIFF_WORD       Effort     2.2       Yes
  EFFORT_DIFF_WIN        Effort     <2.0      No
  LOCATION_PREV_WORD     Location   4.5       Yes
  LOCATION_NEXT_WORD     Location   3.0       Yes
  LOCATION_PREV_WIN      Location   2.6       Yes
  LOCATION_NEXT_WIN      Location   2.7       Yes
  LOCATION_DIFF_WORD     Location   3.9       Yes
  LOCATION_DIFF_WIN      Location   2.7       Yes
  HOLD_OVERLAP_WORD      Hold       34.1      Yes
  HOLD_OVERLAP_PAUSE     Hold       40.6      Yes
  TIME_GESTURE_PART      Time       3.12      Yes
  TIME_LOCALMAX_START    Time       3.05      Yes
  TIME_LOCALMAX_END      Time       4.81      Yes
  TIME_LOCALMAX_GAP      Time       11.0      Yes

7 Statistical modeling

This section overviews the models used in our experiments. Specific details of how these models are applied appear in Section 8.

7.1 Speech-only HMM model

Following methods widely used in previous research [30, 47], we constructed a lexical model, a prosodic model, and an HMM speech model that uses both lexical and prosodic features.

7.1.1 Language model

A hidden event LM is trained based on the co-occurrence of words and SUs. Let W represent the string of spoken words (W_1, W_2, ...), and E represent the sequence of inter-word SU labels (E_1, E_2, ...). The hidden event LM describes the joint distribution of words and SUs, P(W, E) = P(w_1, e_1, w_2, e_2, ..., w_n, e_n). In this research, the LM training tool designed by SRI [49] was utilized. To train a hidden event LM from the word and SU transcriptions, each SU boundary is represented by a special token; for example, in the string "any suggestion SU okay", the SU token indicates the presence of an SU boundary.

Using a trained hidden event LM, the most likely SU boundary event sequence E can be estimated from the word sequence W in an HMM model. In this model, a word-event pair corresponds to a hidden state and the words to observations, with the transition probabilities given by the hidden event LM. A

forward-backward dynamic programming algorithm is used to compute the posterior probability P(E_i | W) of an event E_i at word i. The SU event label E_i is chosen to maximize the posterior probability P(E_i | W) at each individual boundary.

7.1.2 Prosody model

Using the extracted prosodic features (F) described in Section 6.2, a decision tree [43] is constructed, which serves as the prosody model that estimates the likely SU boundary sequence (E) at all word boundaries. The Classification and Regression Tree (CART) algorithm in the NASA-IND package [5] was used to train and prune decision trees in our experiments. The prosody model provides the estimate P(E_i | F_i), where E_i is the SU label at word boundary i and F_i is the corresponding prosodic feature vector.

7.1.3 The HMM model integrating lexical and prosodic features

Given the word sequence W and prosodic features F, the most likely SU event sequence \hat{E} in the HMM framework is calculated as follows:

\hat{E} = \arg\max_E P(E \mid W, F)   (3)
       = \arg\max_E P(W, E, F)   (4)
       = \arg\max_E P(W, E) P(F \mid W, E)   (5)

The conditional independence of the prosodic features F and the word sequence W given the event sequence E is then assumed, that is,

P(F \mid W, E) \approx P(F \mid E) = \prod_{i=1}^{n} P(F_i \mid E_i),

so the computation of \hat{E} can be simplified as:

\hat{E} \approx \arg\max_E P(W, E) P(F \mid E)   (6)

P(W, E) is related to the hidden state transition probabilities, and P(F \mid E) corresponds to the observation probabilities. The transition probabilities are computed using the hidden event n-gram language model (LM) described in Section 7.1.1. The observation probabilities are obtained as follows:

P(F_i \mid E_i) = \frac{P(E_i \mid F_i)\, P(F_i)}{P(E_i)}

The term P(E_i \mid F_i) is given by the posterior probability from the prosody model (decision tree) described in Section 7.1.2. Since P(F_i) does not depend on E_i, the most likely sequence \hat{E} given W and F can thus be obtained as follows:

\hat{E} = \arg\max_E P(E \mid W, F) \approx \arg\max_E P(W, E) \prod_{i=1}^{n} \frac{P(E_i \mid F_i)}{P(E_i)}   (7)

To find the optimal event label sequence \hat{E}, the forward-backward algorithm for HMMs [44] is used, since this generally minimizes the boundary error rate.
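The sketch below illustrates the posterior decoding of Eq. (7) under a simplifying assumption: the full word/event hidden-event LM is replaced by a plain 2-state event transition matrix, so this is not the authors' model, only the combination pattern. `prosody_post` holds the decision-tree posteriors P(E_i = SU | F_i), `prior_su` the class prior P(E_i = SU), and the use of the class priors as the initial state distribution is also an assumption.

    import numpy as np

    def boundary_posteriors(prosody_post, trans, prior_su):
        n = len(prosody_post)
        priors = np.array([1.0 - prior_su, prior_su])            # [non-SU, SU]
        # observation scores proportional to P(F_i | E_i) = P(E_i | F_i) / P(E_i)
        obs = np.stack([(1.0 - prosody_post) / priors[0],
                        prosody_post / priors[1]], axis=1)        # shape (n, 2)
        fwd = np.zeros((n, 2))
        bwd = np.zeros((n, 2))
        fwd[0] = priors * obs[0]
        fwd[0] /= fwd[0].sum()
        for t in range(1, n):
            fwd[t] = (fwd[t - 1] @ trans) * obs[t]
            fwd[t] /= fwd[t].sum()                                # scale to avoid underflow
        bwd[-1] = 1.0
        for t in range(n - 2, -1, -1):
            bwd[t] = trans @ (obs[t + 1] * bwd[t + 1])
            bwd[t] /= bwd[t].sum()
        post = fwd * bwd
        return post / post.sum(axis=1, keepdims=True)             # P(E_t | observations)

    # toy usage over three inter-word boundaries; `trans` is an illustrative
    # event-transition matrix standing in for the hidden-event LM
    trans = np.array([[0.85, 0.15],
                      [0.60, 0.40]])
    post = boundary_posteriors(np.array([0.2, 0.9, 0.1]), trans, prior_su=0.15)
    su_decisions = post[:, 1] > 0.5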

7.2 Incorporating gesture information in the HMM model

In our initial attempts at incorporating gestures for SU detection, we modeled gestures similarly to prosody. Gestural features (i.e., those listed in Table 3) were extracted as described in Section 6.3. Using an approach similar to that used to train the prosody models, we then trained decision-tree-based gestural models that output posterior probabilities P(E_i | G_i), where E_i is the SU event at word boundary i and G_i is the corresponding gesture feature vector.

To incorporate the gesture features (G) into the SU detection model using the HMM modeling approach, we augmented the observations to include gesture features. The most likely event sequence \hat{E} estimated from words (W), prosody (F), and gestures (G) can therefore be obtained as follows:

\hat{E} = \arg\max_E P(E \mid W, F, G)
       = \arg\max_E P(W, E, F, G)
       = \arg\max_E P(W, E) P(F, G \mid W, E)
       \approx \arg\max_E P(W, E) \prod_{i=1}^{n} P(F_i, G_i \mid E_i)

The observation likelihoods P(F_i, G_i \mid E_i) are computed as follows:

P(F_i, G_i \mid E_i) = \frac{P(E_i \mid F_i, G_i)\, P(F_i, G_i)}{P(E_i)}

Because P(F_i, G_i) is irrelevant to E_i, we can ignore its value:

\hat{E} = \arg\max_E P(E \mid W, F, G) \approx \arg\max_E P(W, E) \prod_{i=1}^{n} \frac{P(E_i \mid F_i, G_i)}{P(E_i)}   (8)

We used two approaches to obtain P(E_i \mid F_i, G_i): (1) training a decision tree model using the prosodic features (F) and gesture features (G) together, or (2) training the prosody and gesture models separately and then interpolating the two models' estimates to approximate the joint model's estimate:

P(E_i \mid F_i, G_i) \approx \lambda P(E_i \mid F_i) + (1 - \lambda) P(E_i \mid G_i),

where \lambda is a weight, set using the development set, that combines the prosody and gesture models. The joint model can only be trained using the data in the VACE multimodal corpus: a large speech corpus with considerably more prosodic instances, e.g., the RT04S CTS corpus, cannot be used to train the joint model because it lacks gestural features. In contrast, the prosody model can be trained on a larger available speech corpus and then integrated with a gesture model trained on the smaller multimodal set. Both methods for integrating prosody and gestures will be evaluated.
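The second integration option reduces to a simple interpolation of the two decision trees' posteriors. A hedged sketch (with λ = 0.8, the value reported later in Section 8.1) is:

    import numpy as np

    def interpolate_posteriors(prosody_post, gesture_post, lam=0.8):
        # approximate P(E_i | F_i, G_i) from separately trained prosody and gesture
        # models; lam is tuned on the development set
        prosody_post = np.asarray(prosody_post, dtype=float)
        gesture_post = np.asarray(gesture_post, dtype=float)
        return lam * prosody_post + (1.0 - lam) * gesture_post

    # The result can replace `prosody_post` in the decoding sketch given after
    # Section 7.1.3, yielding the combined lexical, prosodic, and gestural HMM.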

7.3 Conditional models for SU detection

The multimodal HMM SU model is a generative model: a stochastic process with hidden event states (the SU sequence E) that produces the observations O, including the word sequence W, prosodic features F, and gestural features G. The standard HMM training method optimizes the joint probability of the hidden event states and the observations, P(E, O), whereas the criterion used to test the HMM's performance is the conditional probability P(E | O). Clearly, there is a mismatch between the criteria used in training and in testing this generative model. In addition, as a generative model, the HMM cannot easily handle correlated input features, which makes it less effective for integrating highly correlated inputs.

To address these two limitations of the HMM, researchers have recently turned to conditional models. Instead of estimating the joint probability P(E, O), as in training a generative model, a conditional model is trained by estimating the conditional probability P(E | O). For model training, the conditional likelihood is used as the objective function:

CL(\theta, D) = P(E \mid O) = \prod_{\{e,o\} \in D} P(e \mid o)   (9)

where D is the training data set containing the labeled instances {e, o}, CL is the conditional likelihood, and \theta represents the model. The conditional likelihood P(e \mid o) is closely related to the individual event posterior probability used for classification, enabling explicit optimization of the model's discrimination of event labels. We next describe two commonly used conditional models: the Maximum Entropy (ME) model and the Conditional Random Field (CRF) model.

The Maximum Entropy (ME) model [3] has been successfully applied to a variety of natural language processing tasks, including structural event detection in spontaneous speech [32, 33]. The ME model provides one possible solution for estimating the conditional probability P(E | O): it models what is known and makes no assumptions about what is unknown. The constraints in the ME model are obtained from the training set, i.e., the empirical distribution of a feature is required to equal the expected value of the feature with respect to the model's conditional probabilities P(E | O). The ME model finds the probability distribution that satisfies its constraints and has the maximum conditional entropy:

H(E \mid O) = -\sum_{o} p(o) \sum_{e} p(e \mid o) \log p(e \mid o)   (10)

The conditional model obtained through this optimization has the exponential form:

P(e \mid o) = \frac{1}{Z_{\lambda}(o)} \exp\left( \sum_{k} \lambda_k g_k(e, o) \right)   (11)

where Z_{\lambda}(o) is the normalization term:

Z_{\lambda}(o) = \sum_{e} \exp\left( \sum_{k} \lambda_k g_k(e, o) \right)   (12)

To find the parameters {\lambda_k}, the log-likelihood \sum_i \log P(e_i \mid o_i) over the training data is maximized. In this study, the L-BFGS parameter estimation method is used, with Gaussian-prior smoothing [12] to avoid overfitting. The rationale behind Gaussian priors is to force the parameters \lambda to be distributed according to a Gaussian distribution with mean \mu and variance \sigma^2. This prior expectation penalizes parameters that drift away from their prior mean (\mu is usually 0). When Gaussian smoothing is used, a penalty term is added:

\hat{CL}(\lambda) = CL(\lambda) + \sum_i \log\left( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\lambda_i^2}{2\sigma^2} \right) \right)   (13)

The ME SU model using lexical, prosodic, and gestural features takes the following form:

P(E_i \mid W, F, G) = \frac{1}{Z_{\lambda}(W, F, G)} \exp\left( \sum_{k} \lambda_k g_k(E_i, W, F, G) \right)   (14)

where Z_{\lambda}(W, F, G) is the normalization term:

Z_{\lambda}(W, F, G) = \sum_{E_i} \exp\left( \sum_{k} \lambda_k g_k(E_i, W, F, G) \right)

CRFs are random fields that are globally conditioned on an observation sequence o. CRFs have been successfully utilized in text processing tasks [26], structural event detection [33], and computer vision [38]. A CRF is an undirected graph in which the states of the model correspond to the event labels e; the observations associated with the states are o. Let \Lambda = \{\lambda_k\} \in \mathbb{R}^K be a parameter vector and \{g_k(e_t, e_{t-1}, o)\}_{k=1}^{K} be a set of indicator functions. A linear-chain conditional random field [26] defines a distribution p(e \mid o) of the form:

P(e \mid o) = \frac{1}{Z_{\lambda}(o)} \exp\left( \sum_{t} \sum_{k} \lambda_k g_k(e_t, e_{t-1}, o) \right)   (15)

where Z_{\lambda}(o) is an instance-specific normalization function:

Z_{\lambda}(o) = \sum_{e} \exp\left( \sum_{t} \sum_{k} \lambda_k g_k(e_t, e_{t-1}, o) \right)   (16)

The CRF model is trained to maximize the conditional log-likelihood of a given training set, and the most likely sequence e is found using the Viterbi algorithm. As with training an ME model, Gaussian smoothing is utilized to avoid overfitting. The CRF SU model using lexical, prosodic, and gestural features takes the following form:

P(e \mid o) = \frac{1}{Z_{\lambda}(o)} \exp\left( \sum_{t} \sum_{k} \lambda_k g_k(e_t, e_{t-1}, o) \right)   (17)

where o = \{W, F, G\} and Z_{\lambda}(o) is the instance-specific normalization function defined as in (16).
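For binary SU events, the ME model of Eqs. (11)-(14) with the Gaussian prior of Eq. (13) reduces to penalized logistic regression. The sketch below is an illustration under that reduction, not the toolkit used in the experiments: it assumes indicator features defined for the SU class only and uses plain gradient ascent in place of L-BFGS; all names are illustrative.

    import numpy as np

    def me_posteriors(X, w):
        # P(E_i = SU | o_i) = exp(w . x_i) / (1 + exp(w . x_i)) for binary events
        return 1.0 / (1.0 + np.exp(-(X @ w)))

    def train_me(X, y, sigma2=1.0, lr=0.1, iters=500):
        # X: n x K matrix of indicator features g_k; y: 0/1 SU labels;
        # sigma2: variance of the Gaussian prior on the weights
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            p = me_posteriors(X, w)
            # gradient of the conditional log-likelihood plus the Gaussian prior penalty
            grad = X.T @ (y - p) - w / sigma2
            w += lr * grad
        return w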

8 Experiments

8.1 HMM models

First, we describe the models that use the HMM for model combination. According to the methods used for model training, we group these models into the following four sets. Table 4 lists all of the HMM-integrated models.

Table 4  Models used in the HMM integration approach

  Model           Description
  LM              Hidden event LM trained on RT04S CTS data using SRI-LM
  PM_CTS          Prosody model trained on RT04S CTS data using IND
  PM_VACE         Prosody model trained on VACE data using IND
  GM              Gesture model trained on VACE data using IND
  PGM_VACE        Joint prosody and gesture model trained on VACE data using IND
  PM_CTS_GM       Interpolated PM_CTS and GM with λ = 0.8
  PM_VACE_GM      Interpolated PM_VACE and GM with λ = 0.8
  LM_PM_CTS       HMM-integrated LM and PM_CTS
  LM_PM_VACE      HMM-integrated LM and PM_VACE
  LM_PM_CTS_GM    HMM-integrated LM, PM_CTS, and GM
  LM_PM_VACE_GM   HMM-integrated LM, PM_VACE, and GM
  LM_PGM_VACE     HMM-integrated LM and PGM_VACE

Language model: We created the hidden event language model using the SRI-LM toolkit [49], in which SU boundaries are treated as special tokens during language model training. From the joint word and SU label sequence (W, E), a 4-gram word-based hidden event LM is trained with Kneser-Ney smoothing. Due to the limited size of the VACE corpus (it contains 24,566 words and 3,170 SU labels), we did not construct a hidden event LM using the VACE data. Instead, we used the second training method described in Section 5.2: we trained the LM on the RT04S conversational dialog data set, which was annotated using the EARS MDE annotation specification V6.2 and contains approximately 480,000 words and 63,651 SU labels.

Decision-tree-based models: Using the IND decision tree toolkit [5], we trained decision-tree-based prosody models (PM_CTS, trained on the prosodic features extracted from the RT04S CTS data, and PM_VACE, trained on the prosodic features extracted from the VACE data), a gesture model (GM, trained on the gestural features extracted from the VACE data), and a joint prosody and gesture model (PGM_VACE, trained on the prosodic and gestural features extracted from the VACE data). When training PM_CTS, we used the ensemble bagging approach suggested in [30] to cope with the imbalanced class pattern, i.e., the fact that non-SU boundaries are much more common than SU boundaries. In particular, we created 7 balanced training sets, and for each balanced training set, bagging [4] (T = 50) was applied; therefore, we obtain a total of 350 CART trees trained from the balanced prosodic feature sets. When training the other models using features extracted from the VACE data, we applied a modified bagging learning method, because the size of the VACE corpus is so much smaller than the RT04S data set that

When training the other models, which use features extracted from the VACE data, we applied a modified bagging method: the VACE corpus is so much smaller than the RT04S data set that the full ensemble bagging approach with 350 decision trees would have produced many duplicate trees. For each iteration, we trained 20 balanced decision trees (each involving all minority-class instances together with a roughly equal number of instances randomly sampled from the majority class) and then tested these trees independently on the data from the remaining speaker. The posterior probabilities from the balanced decision trees were averaged and then adjusted to reflect the difference between the class distribution in the (balanced) training data and the (imbalanced) testing data. The posterior probabilities are adjusted by dividing them by the prior probabilities of the training set, multiplying them by the prior probabilities of the test set, and then normalizing the resulting values. This method addresses both the imbalanced class problem and the limited size of the VACE corpus.

Interpolated models using prosodic and gestural features: To provide joint estimates from prosody and gesture features, P(E | F, G), the posterior probabilities from the gesture model (GM) were interpolated with the posterior probabilities from the prosody models (PM_CTS and PM_VACE). A series of interpolation weights (λ from 0.0 to 1.0) was evaluated on the development set, and λ = 0.8 was selected. The resulting interpolated models are PM_CTS_GM and PM_VACE_GM.

HMM integrated models: Following the HMM integration method described in Section 7.2 and using the SRI-LM toolkit [49], we combined the language model (LM) with the prosody models, the interpolated prosody and gesture models, and the joint prosody and gesture model. The models using a combination of lexical and prosodic cues are LM_PM_CTS and LM_PM_VACE; the models using a combination of lexical, prosodic, and gestural cues are LM_PM_CTS_GM, LM_PM_VACE_GM, and LM_PGM_VACE.
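The class-prior correction and the λ-weighted posterior interpolation described above are both one-line computations; the following sketch (with hypothetical variable names, not the authors' code) shows how they might be applied to the averaged tree posteriors.

```python
import numpy as np

def adjust_for_priors(p_su, train_priors, test_priors):
    """Correct posteriors estimated on balanced data for the true class skew.

    p_su         : array of P(SU | features) from trees trained on balanced data
    train_priors : (P(non-SU), P(SU)) in the balanced training sets, e.g. (0.5, 0.5)
    test_priors  : (P(non-SU), P(SU)) in the imbalanced test data
    """
    p_su = np.asarray(p_su, dtype=float)
    p = np.stack([1.0 - p_su, p_su], axis=-1)            # (N, 2) class posteriors
    p = p / np.asarray(train_priors) * np.asarray(test_priors)
    p = p / p.sum(axis=-1, keepdims=True)                # renormalize
    return p[..., 1]                                      # adjusted P(SU | features)

def interpolate_posteriors(p_prosody, p_gesture, lam=0.8):
    """lam-weighted interpolation of prosody and gesture posteriors.

    0.8 is the weight selected on the development set; whether it weights the
    prosody or the gesture model is an assumption here (prosody).
    """
    return lam * np.asarray(p_prosody) + (1.0 - lam) * np.asarray(p_gesture)
```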

8.2 Conditional models

Next, we describe the models built with the conditional model combination approach. Two conditional modeling approaches, ME and CRF, were used in our experiments. For ME, we used the ME toolkit developed by Zhang [54]; for CRF, we used Mallet [35], a Java-based NLP package. Using these two toolkits, a series of models was built from different combinations of lexical, prosodic, and gestural features. Table 5 describes the models constructed with the ME and CRF toolkits; the feature subsets used in these models are described in detail below.

Table 5  Models using the ME and CRF modeling approaches

  Model                    Description of the features used
  ME_word                  Word n-gram features
  ME_word+POS              Word and POS n-gram features
  ME_word+PM_CTS           Word n-gram and prosodic features from PM_CTS
  ME_word+PM_VACE          Word n-gram and prosodic features from PM_VACE
  ME_word+POS+PM_CTS       Lexical and prosodic features from PM_CTS
  ME_word+POS+PM_VACE      Lexical and prosodic features from PM_VACE
  ME_word+POS+PM_CTS+G     Lexical, prosodic (PM_CTS), and gestural features
  ME_word+POS+PM_VACE+G    Lexical, prosodic (PM_VACE), and gestural features
  CRF_word                 Word n-gram features
  CRF_word+POS             Word and POS n-gram features
  CRF_word+PM_CTS          Word n-gram and prosodic features from PM_CTS
  CRF_word+PM_VACE         Word n-gram and prosodic features from PM_VACE
  CRF_word+POS+PM_CTS      Lexical and prosodic features from PM_CTS
  CRF_word+POS+PM_VACE     Lexical and prosodic features from PM_VACE
  CRF_word+POS+PM_CTS+G    Lexical, prosodic (PM_CTS), and gestural features
  CRF_word+POS+PM_VACE+G   Lexical, prosodic (PM_VACE), and gestural features

Word-based lexical features (word): We adopted the word features used in the conditional model described in [30]. Combinations of preceding and following words were used to encode the word context of the event, e.g., w_i, (w_i, w_{i+1}), (w_{i-1}, w_i), (w_{i-1}, w_i, w_{i+1}), (w_{i-2}, w_{i-1}, w_i), and (w_i, w_{i+1}, w_{i+2}), where w_i refers to the word immediately before a word boundary.

POS-based lexical features (POS): As described in Section 7.3, the conditional modeling approach can efficiently handle correlated features. In previous research, features correlated with the word n-grams, e.g., part-of-speech (POS) n-gram and class-based n-gram features, have been used for SU detection [30]. We therefore enhanced our lexical feature set with POS-based n-gram features. The POS tags were obtained using a POS tagger trained on spontaneous speech content [24]. Combinations of preceding and following POS tags were used to encode the POS context of a word boundary, e.g., p_i, (p_i, p_{i+1}), (p_{i-1}, p_i), (p_{i-1}, p_i, p_{i+1}), (p_{i-2}, p_{i-1}, p_i), and (p_i, p_{i+1}, p_{i+2}), where p_i refers to the POS tag of the word w_i.

Prosodic features (PM): Following the methods for utilizing prosodic information in conditional models described in [30], the posterior probabilities estimated by the decision-tree-based prosody model, P(E_i | F_i), were converted to a series of binary features using cumulative bins: p > 0.1, p > 0.2, ..., p > 0.9. The posterior probabilities estimated by the two prosody models described in Section 8.1, PM_CTS and PM_VACE, were used.

Gestural features (G): The gestural feature set contains numeric features, including features describing effort and the distance to the rest area. To use these features in the conditional models (which prefer categorical features), we converted the numeric values to discrete values using Fayyad and Irani's MDL method [19], as implemented in WEKA [52].

As described in Section 5.2, when evaluating models trained on the VACE corpus, we used a leave-one-out evaluation over the 12 speakers. When evaluating models trained on the RT04S CTS data, such as LM, PM_CTS, and LM_PM_CTS, we tested the trained models on all 12 speakers in the VACE data set. To decide on the SU label to assign to each inter-word boundary, we calculated the posterior probabilities and compared them with a predefined threshold. A threshold of 0.5 was used to generate the final SU decisions; that is, an inter-word boundary is classified as an SU boundary if the estimated posterior probability is greater than 0.5, and as a non-SU boundary otherwise. The 0.5 decision threshold minimizes the overall classification error under the assumption that errors associated with each class (SU boundary or non-SU boundary) are equally costly.
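To make the feature encoding above concrete, the following is a minimal sketch (not the toolkit code used in the paper) of how the word- and POS-context n-gram features, the cumulative-bin prosody features, and the 0.5-threshold decision might be produced for a single inter-word boundary; the input names and string formats are hypothetical.

```python
def boundary_features(words, pos, i, p_prosody):
    """Categorical features for the boundary after word i (0-based index)."""
    def ctx(seq, offsets, tag):
        toks = [seq[i + o] if 0 <= i + o < len(seq) else "<pad>" for o in offsets]
        return f"{tag}{offsets}=" + "_".join(toks)

    feats = []
    # word- and POS-context n-grams: (i), (i,i+1), (i-1,i), (i-1,i,i+1), ...
    for offs in [(0,), (0, 1), (-1, 0), (-1, 0, 1), (-2, -1, 0), (0, 1, 2)]:
        feats.append(ctx(words, offs, "w"))
        feats.append(ctx(pos, offs, "p"))
    # cumulative bins of the prosody-model posterior: p > 0.1, ..., p > 0.9
    for t in [x / 10 for x in range(1, 10)]:
        if p_prosody > t:
            feats.append(f"pm>{t:.1f}")
    return feats

def decide_su(posterior, threshold=0.5):
    """Final decision: SU boundary iff the estimated posterior exceeds the threshold."""
    return int(posterior > threshold)
```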

8.3 Experimental results

In this section, we describe the performance of the models based on the HMM model combination approach described in Section 7.2. In the next two subsections, we report on the performance of the models created using HMMs and conditional models.

8.3.1 HMM integration

Table 6 reports the performance of each model with respect to the three NIST metrics (DEL, INS, and ERR) as well as the boundary-based CER. Note that the baseline performance was obtained by always predicting the majority class for each inter-word boundary (i.e., predicting that every inter-word boundary is a non-SU boundary), resulting in an overall NIST error rate of 100% and a CER of 12.94%.

Table 6  Performance of the prosody models, the combined lexical and prosody models, and the models involving gestural cues (i.e., the gesture model GM, the combined gesture and prosody models, and the combined multimodal models)

  Model            DEL (%)  INS (%)  ERR (%)  CER (%)
  always non-SU
  LM
  PM_CTS
  PM_VACE
  LM_PM_CTS
  LM_PM_VACE
  GM
  PM_CTS_GM
  PM_VACE_GM
  PGM_VACE
  LM_PM_CTS_GM
  LM_PM_VACE_GM
  LM_PGM_VACE

As shown in the table, the lexical model (LM) achieves a lower error rate for SU detection than always predicting that there will be no SU boundary, showing the importance of lexical cues for SU detection. The prosody models obtain a significantly lower error rate than the lexical model (sign test, p < 0.05), possibly because prosody is more genre independent than lexical cues. More importantly, the model combining the lexical and prosodic features achieves a further statistically significant improvement over using either lexical or prosodic features alone (sign test, p < 0.05), reconfirming results from other studies indicating that lexical and prosodic information are complementary for SU detection. The prosody model trained on the CTS dialog data (PM_CTS) has a lower NIST error rate than the prosody model trained on the VACE meeting data (PM_VACE). One possible reason is that the CTS dialog data set is larger and contains many more speakers than the VACE meeting data.
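The NIST metrics and the CER reported in Table 6 can be computed directly from per-boundary labels; the following is a minimal, simplified sketch that assumes the reference and hypothesis share the same word stream (so no alignment step is needed), which is not the full NIST scoring pipeline.

```python
import numpy as np

def su_metrics(ref, hyp):
    """Boundary-level SU metrics over aligned reference/hypothesis labels.

    ref, hyp : (N,) arrays of 0/1 labels, one per inter-word boundary (1 = SU).
    Returns (DEL, INS, ERR, CER) as percentages.
    """
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    n_ref_su = ref.sum()
    deletions = np.sum((ref == 1) & (hyp == 0))    # missed SU boundaries
    insertions = np.sum((ref == 0) & (hyp == 1))   # spurious SU boundaries
    DEL = 100.0 * deletions / n_ref_su
    INS = 100.0 * insertions / n_ref_su
    ERR = DEL + INS                                # NIST-style SU error rate
    CER = 100.0 * np.mean(ref != hyp)              # boundary classification error
    return DEL, INS, ERR, CER

# Always predicting non-SU deletes every reference SU, so ERR = 100%, while
# CER equals the fraction of boundaries that are SUs in the reference.
```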

Although PM_VACE has a higher SU detection error rate than PM_CTS, when it is combined with the language model LM, the resulting LM_PM_VACE (37.72% NIST error) has a significantly lower error rate than LM_PM_CTS (37.72% < 38.70%, sign test, p < 0.05). One possible explanation is that the prosody model trained on the meeting data compensates for a genre-mismatched lexical model better than the prosody model trained on conversational telephone speech.

The gesture model (GM) obtains a lower error rate than the baseline of always predicting no SU boundary (NIST error rate of 100%), suggesting that the gestural features G provide helpful information for SU detection. Compared to the prosody models (PM_CTS and PM_VACE), the gesture model produces a higher error rate, suggesting that gestures are noisier indicators of SU boundaries than the prosodic features. When prosodic and gestural features are combined using the interpolation method, we obtain only a slightly lower error rate than when using prosodic features alone. When a joint model is trained on both prosodic and gestural features, the joint model PGM_VACE obtains a significantly lower error rate (50.46%) than the interpolated model PM_VACE_GM (50.46% < 51.97%, sign test, p < 0.05) and a slightly lower error rate than the interpolated model PM_CTS_GM (51.02%). When the gesture model is combined with the language and prosody models, it does not produce an additional error rate reduction compared to the corresponding speech-only model.

Both previous studies [30, 47] and our experiment on the VACE meeting corpus suggest that the HMM modeling approach is an effective way to utilize lexical and prosodic cues for SU detection: by training a decision tree prosody model and combining it with a language model in an HMM framework, we obtain significantly improved SU detection accuracy. However, using a similar method to incorporate gesture features, we were unable to obtain an additional improvement. One possible reason for the unsuccessful use of gesture features in the HMM multimodal model is our strategy of treating gestures like prosody. Although both gesture and prosody are non-lexical, they are quite different. Prosodic features usually co-occur with the word stream, whereas gestures are not always present during speech and are sometimes present when there is no speech at all; gestures are also on a different time scale than speech. Furthermore, the absence of gesture cues is largely agnostic about the presence or absence of an SU boundary. Another possible contributing factor relates to the HMM's drawbacks: the HMM approach has been criticized for limitations related to model training and its handling of correlated features. In the extraction of the gestural features, we used information from the lexical and prosodic channels, and the correlation between gestural and speech features could challenge the HMM's modeling ability.

8.3.2 Conditional model integration

In this section, we report on the experimental results obtained using the two conditional models (i.e., ME and CRF) described in Section 8.2. We first discuss the performance of the speech-only conditional models for SU detection. Experimental results for the ME and CRF models trained on lexical (word and POS n-gram) and prosodic features appear in Table 7.

Table 7  SU detection using lexical features (word and POS n-grams) and the combination of lexical and prosodic features by Maxent and CRF (bold fonts on ERR and CER indicate that the combined lexical and prosody model has a statistically significantly lower error rate than the corresponding lexical model for SU detection, p < 0.05)

  Model                    DEL (%)  INS (%)  ERR (%)  CER (%)
  ME_word
  ME_word+PM_CTS
  ME_word+PM_VACE
  ME_word+POS
  ME_word+POS+PM_CTS
  ME_word+POS+PM_VACE
  CRF_word
  CRF_word+PM_CTS
  CRF_word+PM_VACE
  CRF_word+POS
  CRF_word+POS+PM_CTS
  CRF_word+POS+PM_VACE

Compared to the corresponding HMM speech models, the ME model using word n-gram and prosodic features produces a significantly lower error rate according to the sign test (p < 0.05): ME_word+PM_CTS has a NIST error rate of 37.01% compared to 38.70% for LM_PM_CTS, and ME_word+PM_VACE has a NIST error rate of 35.82% compared to 37.71% for LM_PM_VACE. We also find that the enriched lexical feature set achieves a significantly lower error rate than the word features alone. Adding the POS n-gram features to the lexical feature set reduces the ME lexical model's NIST error rate from 65.66% to 52.22%. When integrated with the prosodic features, the enriched lexical features reduce the error from 37.01% to 34.87% for CTS and from 35.82% to 34.62% for VACE. All of these error reductions are statistically significant according to the sign test (p < 0.05). The CRF modeling approach produces even lower error rates than the ME modeling approach: using identical features, the CRF model always achieves a significantly lower error rate than the corresponding ME model (sign test, p < 0.05).

Next, we describe the performance of the conditional SU models that use a combination of speech and gestural features. Experimental results appear in Table 8, together with the results of the conditional SU models that use only speech features (word and POS n-gram features plus prosodic features).

Table 8  SU detection using speech features, visual features, and the combination of both types of features by Maxent and CRFs (bold fonts on ERR and CER indicate that the multimodal model has a statistically significantly lower error rate than the corresponding speech model for SU detection, p < 0.05)

  Model                    DEL (%)  INS (%)  ERR (%)  CER (%)
  ME_word+POS+PM_CTS
  ME_word+POS+PM_CTS+G
  ME_word+POS+PM_VACE
  ME_word+POS+PM_VACE+G
  CRF_word+POS+PM_CTS
  CRF_word+POS+PM_CTS+G
  CRF_word+POS+PM_VACE
  CRF_word+POS+PM_VACE+G
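The significance claims above and the bolding criterion in Tables 7 and 8 rest on a sign test over paired per-boundary decisions. A minimal sketch of such a test, using SciPy's binomtest rather than whatever implementation the authors used, is shown below.

```python
import numpy as np
from scipy.stats import binomtest

def sign_test(ref, hyp_a, hyp_b):
    """Paired sign test on the per-boundary correctness of two systems.

    ref, hyp_a, hyp_b : (N,) arrays of 0/1 SU labels per inter-word boundary.
    Only boundaries on which exactly one system is correct are informative.
    Returns the two-sided p-value.
    """
    ref, hyp_a, hyp_b = map(np.asarray, (ref, hyp_a, hyp_b))
    correct_a = hyp_a == ref
    correct_b = hyp_b == ref
    a_only = int(np.sum(correct_a & ~correct_b))   # A right, B wrong
    b_only = int(np.sum(~correct_a & correct_b))   # B right, A wrong
    n = a_only + b_only
    if n == 0:
        return 1.0
    return binomtest(a_only, n, p=0.5).pvalue
```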

From Table 8, we find that a model combining speech and gestural features reduces the error rate of the ME models compared to the models using only speech features. For the ME model using the prosodic features computed from the prosody model trained on CTS data (ME_word+POS+PM_CTS), the error reduction brought about by adding gestural features is statistically significant (sign test, p < 0.05). Similarly, the addition of gestural features reduces the error rate of the CRF models compared to the models using only speech features. For the CRF models using prosodic features computed from the prosody model trained on either CTS data (CRF_word+POS+PM_CTS) or VACE data (CRF_word+POS+PM_VACE), the error reductions brought about by the gestural features are statistically significant (sign test, p < 0.05).

These experimental results show that nonverbal cues can be used effectively for more accurate SU detection. Compared with conditional SU models that use only speech features and reflect the state of the art, our multimodal SU models obtain a statistically significant reduction in error. The results also suggest that the conditional modeling approaches are more appropriate than the HMM for combining multimodal features, probably because of the correlations between the gestural and prosodic features.

8.4 Analysis of gesture features

When the conditional modeling approaches are used, gestural features significantly reduce the SU detection error rate. To determine which features are most helpful, we use a leave-one-out evaluation over four subsets of the gestural features: the effort-related (E), location-related (L), hold-related (H), and timing-information-related (T) features described in Section 6.3. Starting from an SU model using all lexical, prosodic, and gestural features (we used ME_word+POS+PM_CTS+G), we built models by leaving out one subset of gesture features at a time and evaluating the error rate. If a subset of gesture features plays an important role in signaling SU boundaries, leaving it out should cause a noticeable error increase; if it does not, leaving it out should cause only a slight error increase (or possibly an error decrease if the subset is noisy). Table 9 shows the experimental results when leaving out each subset of gesture features. We find that the hold-related (H) and timing-information-related (T) gestural features are more important for signaling SUs than the effort-related (E) and location-related (L) features.

Table 9  Results for the investigation of individual subsets of gestural features for SU detection

  Model                           DEL   INS   ERR   CER
  ME_word+POS+PM_CTS
  ME_word+POS+PM_CTS+G
  ME_word+POS+PM_CTS+G (w/o E)
  ME_word+POS+PM_CTS+G (w/o L)
  ME_word+POS+PM_CTS+G (w/o H)
  ME_word+POS+PM_CTS+G (w/o T)
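The leave-one-subset-out analysis behind Table 9 is essentially a loop over feature groups. A minimal sketch is given below, with hypothetical build_features and train_and_score helpers standing in for the ME toolkit and the leave-one-speaker-out protocol.

```python
# Hypothetical helpers: build_features(groups) returns train/test feature sets
# restricted to the named groups; train_and_score(...) returns the NIST ERR.
FEATURE_GROUPS = ["word", "pos", "prosody", "E", "L", "H", "T"]
GESTURE_GROUPS = ["E", "L", "H", "T"]

def ablate_gesture_subsets(build_features, train_and_score):
    """Leave each gesture feature subset out in turn and record the error."""
    results = {}
    results["all"] = train_and_score(build_features(FEATURE_GROUPS))
    for g in GESTURE_GROUPS:
        kept = [f for f in FEATURE_GROUPS if f != g]
        results[f"w/o {g}"] = train_and_score(build_features(kept))
    return results

# A subset whose removal causes a clear increase over results["all"] is
# contributing useful information about SU boundaries.
```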

Such findings are consistent with the gesture features' information gain ratios reported in Table 3: some of the hold-related (H) and timing-information-related (T) features had large information gain ratios, indicating that they should be quite helpful for SU detection.

9 Conclusion

In this paper, we investigated combining speech and gesture cues to more accurately detect SU boundaries using a data-driven approach. Given that there are relatively few standard data resources supporting research in this new area, we collected a multimodal corpus (VACE) in collaboration with researchers from the psychology and computer vision fields. Using this corpus and the features extracted from it, we analyzed the relationships between gesture patterns and SUs. Our analysis suggested that gestures provide helpful information for signaling SU boundaries; for example, word boundaries on which speakers continue to gesture are less likely to be SU boundaries. Armed with the collected multimodal corpus and a deeper understanding of how gesture cues signal SUs in human communication, we then systematically investigated the use of gesture features in a multimodal model for SU detection. We extracted gestural features related to several aspects of gesture behavior (effort-related, location-related, hold-related, and timing-information-related features). We compared different approaches to combining gestural and speech features for SU detection and found that conditional models are more effective than HMMs for incorporating nonverbal features. We demonstrated that systems that utilize speech and gesture features achieve lower detection error rates (statistically significant in most cases) than those that utilize speech features alone.

In future research, we will expand the multimodal corpus to work with larger and more diverse data sets. We plan to investigate whether gesture plays an even more important role in detecting SUs when automatically generated transcriptions are used; it is possible that gesture helps in the face of word errors. Additionally, given advances in computer vision technology, another possible next step would be to evaluate fully automatic visual feature extraction. It would also be beneficial to investigate additional statistical modeling approaches for the automatic detection tasks investigated here. A recent trend in natural language processing and computer vision is to use graphical models with latent states to increase modeling capability; this approach could provide a helpful framework for combining audio and visual cues, as in [16]. We are also interested in utilizing other nonverbal cues (e.g., gaze) to support additional structural event detection tasks (e.g., discourse structure and floor control in meeting conversations).

References

1. Argyle M (1988) Bodily communication, 2nd edn. Methuen, London
2. Beeferman D, Berger A, Lafferty J (1998) Cyberpunc: a lightweight punctuation annotation system for speech. In: Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP)

3. Berger A, Pietra S, Pietra V (1996) A maximum entropy approach to natural language processing. Comput Linguist 22
4. Breiman L (1996) Bagging predictors. Mach Learn 24(2)
5. Buntine W (1992) Learning classification trees. Stat Comput 2
6. Cassell J, Stone M (1999) Living hand to mouth: psychological theories about speech and gesture in interactive dialogue systems. In: Proceedings of the AAAI conference on artificial intelligence
7. Chai J, Hong P, Zhou M (2004) A probabilistic approach to reference resolution in multimodal user interfaces. In: Proceedings of the conference on intelligent user interfaces (IUI). ACM Press
8. Chen C (1999) Speech recognition with automatic punctuation. In: Proceedings of the European conference on speech processing (EuroSpeech)
9. Chen L, Harper M, Huang Z (2006) Using maximum entropy (ME) model to incorporate gesture cues for SU detection. In: Proceedings of the international conference on multimodal interfaces (ICMI), Banff, Canada
10. Chen L, Liu Y, Harper M, Shriberg E (2004) Multimodal model integration for sentence unit detection. In: Proceedings of the international conference on multimodal interfaces (ICMI), University Park, PA
11. Chen L, Rose T, Qiao Y, Kimbara I, Parrill F, Welji H, Xu T, Tu J, Huang Z, Harper M, Quek F, Xiong Y, McNeill D, Tuttle R, Huang TS (2005) VACE multimodal meeting corpus. In: Proceedings of the joint workshop on machine learning and multimodal interaction (MLMI)
12. Chen S, Rosenfeld R (1999) A Gaussian prior for smoothing maximum entropy models. Tech. rep., Carnegie Mellon University
13. EARS (2002) DARPA EARS program
14. Eisenstein J, Davis R (2005) Gestural cues for sentence segmentation. MIT AI Memo
15. Eisenstein J, Davis R (2006) Gesture improves coreference resolution. In: Proceedings of the conference of the North American chapter of the Association for Computational Linguistics (NAACL)
16. Eisenstein J, Davis R (2007) Conditional modality fusion for coreference resolution. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL)
17. Ekman P (1965) Communication through nonverbal behavior: a source of information about an interpersonal relationship. In: Tomkins SS, Izard CE (eds) Affect, cognition and personality. Springer, New York
18. Esposito A, McCullough K, Quek F (2001) Disfluencies in gesture: gestural correlates to speech silent and filled pauses. In: Proceedings of the IEEE workshop on cues in communication, Kauai, Hawaii
19. Fayyad U, Irani K (1992) On the handling of continuous-valued attributes in decision tree generation. Mach Learn 8
20. Garofolo J, Laprum C, Michel M, Stanford V, Tabassi E (2004) The NIST meeting room pilot corpus. In: Proceedings of the conference on language resources and evaluation (LREC)
21. Gotoh Y, Renals S (2000) Sentence boundary detection in broadcast speech transcripts. In: Proceedings of the international speech communication association (ISCA) workshop on automatic speech recognition: challenges for the new millennium (ASR-2000)
22. Huang Z, Harper M (2005) Speech and non-speech detection in meeting audio for transcription. In: Proceedings of the NIST RT-05 workshop
23. Huang Z, Chen L, Harper M (2006) An open source prosodic feature extraction tool. In: Proceedings of the conference on language resources and evaluation (LREC)
24. Huang Z, Harper M, Wang W (2007) Mandarin part-of-speech tagging and discriminative reranking. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), Prague, Czech Republic
25. Kendon A (1974) Movement coordination in social interaction: some examples described. In: Weitz S (ed) Nonverbal communication. Oxford University Press, New York
26. Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the international conference on machine learning (ICML)
27. Linguistic Data Consortium (LDC) (2004) Meeting recording quick transcription guidelines, 1st edn. gov/speech/test_beds/mr_proj/meeting_corpus_1/documents/pdf/meetingdataqtrspec-v1.3.pdf
28. Linguistic Data Consortium (LDC) (2004) Simple MetaData annotation specification version 6.2, 6th edn. projects.ldc.upenn.edu/mde/guidelines/simplemde_v6.2.pdf

29. Lehmann EL (2005) Testing statistical hypotheses, 3rd edn. Springer, New York
30. Liu Y (2004) Structural event detection for rich transcription of speech. Ph.D. thesis, Purdue University
31. Liu Y, Chawla N, Shriberg E, Stolcke A, Harper M (2003) Resampling techniques for sentence boundary detection: a case study in machine learning from imbalanced data for spoken language processing. Tech. rep., International Computer Science Institute
32. Liu Y, Stolcke A, Shriberg E, Harper M (2004) Comparing and combining generative and posterior probability models: some advances in sentence boundary detection in speech. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP)
33. Liu Y, Shriberg E, Stolcke A, Harper M (2005) Comparing HMM, maximum entropy, and conditional random fields for disfluency detection. In: Proceedings of the international conference on speech, Lisbon
34. Liu Y, Shriberg E, Stolcke A, Peskin B, Ang J, Hillard D, Ostendorf M, Tomalin M, Woodland P, Harper M (2005) Structural metadata research in the EARS program. In: Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP)
35. McCallum A (2005) MALLET: a machine learning for language toolkit
36. McNeill D (1992) Hand and mind: what gestures reveal about thought. Univ. Chicago Press, Chicago
37. Mehrabian A (1972) Nonverbal communication. Aldine-Atherton, Chicago
38. Morency LP, Quattoni A, Darrell T (2007) Latent-dynamic discriminative models for continuous gesture recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)
39. Morgan N, Baron D, Bhagat S, Carvey H, Dhillon R, Edwards J, Gelbart D, Janin A, Krupski A, Peskin B, Pfau T, Shriberg E, Stolcke A, Wooters C (2003) Meetings about meetings: research at ICSI on speech in multiparty conversations. In: Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP), vol 4, Hong Kong
40. Qu S, Chai J (2006) Salience modeling based on non-verbal modalities for spoken language understanding. In: Proceedings of the international conference on multimodal interfaces (ICMI), Banff, Canada
41. Quek F, McNeill D, Bryll R, Duncan S, Ma X, Kirbas C, McCullough KE, Ansari R (2002) Multimodal human discourse: gesture and speech. ACM Trans Comput-Hum Interact 9(3)
42. Quek F et al (2002) KDI: cross-modal analysis of signal and sense - data and computational resources for gesture, speech and gaze research
43. Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco
44. Rabiner LR, Juang BH (1986) An introduction to hidden Markov models. IEEE ASSP Mag 3(1)
45. Roark B, Liu Y, Harper M, Stewart R, Lease M, Snover M, Shafran I, Dorr B, Hale J, Krasnyanskaya A, Yung L (2006) Reranking for sentence boundary detection in conversational speech. In: Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP)
46. Rose T, Quek F, Shi Y (2004) MacVisSTA: a system for multimodal analysis. In: Proceedings of the international conference on multimodal interfaces (ICMI)
47. Shriberg E, Stolcke A, Hakkani-Tur D, Tur G (2000) Prosody-based automatic segmentation of speech into sentences and topics. Speech Commun 32(1-2)
48. Stevenson M, Gaizauskas R (2000) Experiments on sentence boundary detection. In: Proceedings of the conference of the North American chapter of the Association for Computational Linguistics (NAACL)
49. Stolcke A (2002) SRILM - an extensible language modeling toolkit. In: Proceedings of the international conference on spoken language processing (ICSLP)
50. Strassel S (2003) Simple metadata annotation specification, 5th edn. Linguistic Data Consortium
51. Sundaram R, Ganapathiraju A, Hamaker J, Picone J (2001) ISIP 2000 conversational speech evaluation system. In: Proceedings of the speech transcription workshop, College Park, Maryland
52. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, San Francisco
53. Xiong Y, Quek F (2005) Meeting room configuration and multiple camera calibration in meeting analysis. In: Proceedings of the international conference on multimodal interfaces (ICMI), Trento, Italy
54. Zhang L (2005) Maximum Entropy Modeling Toolkit for Python and C++. inf.ed.ac.uk/s /maxent_toolkit.html

Lei Chen received his Ph.D. in electrical engineering from Purdue University, West Lafayette. His Ph.D. thesis is about utilizing nonverbal communication to support the understanding of face-to-face conversations. He worked as a summer intern at the Palo Alto Research Center (PARC). He joined Educational Testing Service (ETS) in 2008 as an associate research scientist in the Research & Development Division in Princeton, NJ. His recent work focuses on the automated assessment of spoken language using speech recognition, natural language processing, and machine learning technologies.

Mary P. Harper is a Principal Research Scientist at the Johns Hopkins Center for the Advanced Study of Language and an Affiliate Research Professor in Computer Science and Electrical and Computer Engineering at the University of Maryland. Harper's research focuses on developing methods for incorporating multiple types of knowledge into computer algorithms for modeling human communication. Recent research has focused on the integration of speech and natural language processing systems (in English and Mandarin), the integration of speech, gesture, and gaze, and the utilization of hierarchical structure learned in an unsupervised fashion to improve the classification accuracy of documents and images.


More information

Digital Fabrication and Aunt Sarah: Enabling Quadratic Explorations via Technology. Michael L. Connell University of Houston - Downtown

Digital Fabrication and Aunt Sarah: Enabling Quadratic Explorations via Technology. Michael L. Connell University of Houston - Downtown Digital Fabrication and Aunt Sarah: Enabling Quadratic Explorations via Technology Michael L. Connell University of Houston - Downtown Sergei Abramovich State University of New York at Potsdam Introduction

More information

Implementing a tool to Support KAOS-Beta Process Model Using EPF

Implementing a tool to Support KAOS-Beta Process Model Using EPF Implementing a tool to Support KAOS-Beta Process Model Using EPF Malihe Tabatabaie Malihe.Tabatabaie@cs.york.ac.uk Department of Computer Science The University of York United Kingdom Eclipse Process Framework

More information

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY Chen, Hsin-Hsi Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw Abstract

More information

Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition

Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition Yanzhang He, Eric Fosler-Lussier Department of Computer Science and Engineering The hio

More information

Linking object names and object categories: Words (but not tones) facilitate object categorization in 6- and 12-month-olds

Linking object names and object categories: Words (but not tones) facilitate object categorization in 6- and 12-month-olds Linking object names and object categories: Words (but not tones) facilitate object categorization in 6- and 12-month-olds Anne L. Fulkerson 1, Sandra R. Waxman 2, and Jennifer M. Seymour 1 1 University

More information

Radius STEM Readiness TM

Radius STEM Readiness TM Curriculum Guide Radius STEM Readiness TM While today s teens are surrounded by technology, we face a stark and imminent shortage of graduates pursuing careers in Science, Technology, Engineering, and

More information

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Email Marilyn A. Walker Jeanne C. Fromer Shrikanth Narayanan walker@research.att.com jeannie@ai.mit.edu shri@research.att.com

More information

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH 2009 423 Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition George

More information

The Smart/Empire TIPSTER IR System

The Smart/Empire TIPSTER IR System The Smart/Empire TIPSTER IR System Chris Buckley, Janet Walz Sabir Research, Gaithersburg, MD chrisb,walz@sabir.com Claire Cardie, Scott Mardis, Mandar Mitra, David Pierce, Kiri Wagstaff Department of

More information

A Neural Network GUI Tested on Text-To-Phoneme Mapping

A Neural Network GUI Tested on Text-To-Phoneme Mapping A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis

More information

A Note on Structuring Employability Skills for Accounting Students

A Note on Structuring Employability Skills for Accounting Students A Note on Structuring Employability Skills for Accounting Students Jon Warwick and Anna Howard School of Business, London South Bank University Correspondence Address Jon Warwick, School of Business, London

More information

2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases

2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases POS Tagging Problem Part-of-Speech Tagging L545 Spring 203 Given a sentence W Wn and a tagset of lexical categories, find the most likely tag T..Tn for each word in the sentence Example Secretariat/P is/vbz

More information

On Human Computer Interaction, HCI. Dr. Saif al Zahir Electrical and Computer Engineering Department UBC

On Human Computer Interaction, HCI. Dr. Saif al Zahir Electrical and Computer Engineering Department UBC On Human Computer Interaction, HCI Dr. Saif al Zahir Electrical and Computer Engineering Department UBC Human Computer Interaction HCI HCI is the study of people, computer technology, and the ways these

More information

An Introduction to the Minimalist Program

An Introduction to the Minimalist Program An Introduction to the Minimalist Program Luke Smith University of Arizona Summer 2016 Some findings of traditional syntax Human languages vary greatly, but digging deeper, they all have distinct commonalities:

More information

Physics 270: Experimental Physics

Physics 270: Experimental Physics 2017 edition Lab Manual Physics 270 3 Physics 270: Experimental Physics Lecture: Lab: Instructor: Office: Email: Tuesdays, 2 3:50 PM Thursdays, 2 4:50 PM Dr. Uttam Manna 313C Moulton Hall umanna@ilstu.edu

More information

Course Law Enforcement II. Unit I Careers in Law Enforcement

Course Law Enforcement II. Unit I Careers in Law Enforcement Course Law Enforcement II Unit I Careers in Law Enforcement Essential Question How does communication affect the role of the public safety professional? TEKS 130.294(c) (1)(A)(B)(C) Prior Student Learning

More information

Rule Learning with Negation: Issues Regarding Effectiveness

Rule Learning with Negation: Issues Regarding Effectiveness Rule Learning with Negation: Issues Regarding Effectiveness Stephanie Chua, Frans Coenen, and Grant Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX

More information