Model for Discourse Act Recognition in Dialogue Interactions

From: AAAI Technical Report SS-98-01. Compilation copyright 1998, AAAI (www.aaai.org). All rights reserved.

A Statistical Model for Discourse Act Recognition in Dialogue Interactions

Jennifer Chu-Carroll
Bell Laboratories, Lucent Technologies
600 Mountain Avenue
Murray Hill, NJ 07974, U.S.A.
E-mail: jecc@bell-labs.com

Abstract

This paper discusses a statistical model for recognizing discourse intentions of utterances during dialogue interactions. We argue that this recognition process should be based on features of the current utterance as well as on discourse history, and show that taking into account utterance features such as speaker information and syntactic forms of utterances dramatically improves the system's performance as compared with a simple trigram model of discourse acts. In addition, we propose that taking into account information about discourse structure may allow the system to construct a more accurate discourse act model and thus improve recognition results. Experiments show this proposal to be promising.

Introduction

It is widely accepted that in order for a dialogue system to interact with its user in a natural and cooperative fashion, it must be able to recognize the intentions of the user's utterances. Until very recently, this problem of intention recognition has mainly been addressed using traditional knowledge-based approaches, which suffer from many shortcomings, including the large amount of effort involved in hand-coding rules for each new application domain and difficulties in scaling the system up for real-world applications. Following success in applying statistical methods in areas such as speech recognition and part-of-speech tagging, researchers have begun exploring the possibilities of employing such techniques in the recognition of discourse intentions.

Previous and current work on statistical approaches to dialogue act modeling focuses on two main problems. The first problem is dialogue act prediction, where the dialogue system predicts the most probable next dialogue act based on the current discourse.
Nagata and Morimoto (1994a; 1994b) and Reithinger et al. (1996) performed the prediction of dialogue acts based mainly on dialogue act history (before the next utterance is spoken) so that the predicted act may provide top-down information for other components in the dialogue system, such as the speech recognizer. Core (1998), on the other hand, uses previous utterance tags and existing tags for the current utterance to predict possible additional tags for the current utterance using the DAMSL annotation scheme (Core & Allen 1997).

The second problem is dialogue act recognition, where the dialogue system determines the dialogue act that best fits the intention of an utterance given the utterance and previous discourse (Mast et al. 1995; Reithinger & Klesen 1997; Samuel, Carberry, & Vijay-Shanker 1998; Stolcke et al. 1998). Mast et al. (1995) determined the dialogue act of an utterance based on the word sequence that comprises the utterance alone, without any additional discourse information. Reithinger and Klesen's model performs dialogue act classification based on the word sequence of the current utterance as well as the dialogue history (Reithinger & Klesen 1997). Samuel, Carberry, and Vijay-Shanker (1998) utilize transformation-based learning to produce rules for determining dialogue acts based on features such as cue phrases, change of speaker, and previous and current dialogue acts. Finally, Stolcke et al. (1998) determine the most probable sequence of dialogue acts for a set of utterances, based on an n-gram discourse model and features such as words comprising the utterance (both transcribed and recognized) and prosody.

In this paper, we describe a statistical model for discourse act recognition based on the current utterance and discourse history. However, instead of basing the recognition process on the word sequence of an utterance as in most existing models, we argue for a model that is less sensitive to unseen word sequences.
We extract features from the utterances and from the discourse that will provide more pertinent information to the recognition of discourse acts than the actual words that comprise the utterances themselves. Thus, we propose a model that performs the recognition of discourse acts based on utterance features, discourse structure, and discourse history. Preliminary results of our experiments show that utterance features such as speaker change and syntactic forms of utterances provide a great deal of information in determining the most appropriate discourse act, while taking into account discourse structure appears to be a promising avenue for further exploration.

A Statistical Model for Discourse Act Recognition

Formalism

The goal in discourse act recognition is to determine, given a set of utterances $u_{1,n}$, the sequence of discourse acts $d_{1,n}$ that best describes their intentions. For this task, we utilize a hidden Markov model as described in (Rabiner 1990). Using this formalism, we want to find $\arg\max_{d_{1,n}} P(d_{1,n} \mid u_{1,n})$, i.e., the most probable $d_{1,n}$ given $u_{1,n}$. By applying Bayes' rule, this formula can be rewritten as follows, assuming that the current discourse act is dependent only on the current utterance and the previous two discourse acts:

\[
\begin{aligned}
\arg\max_{d_{1,n}} P(d_{1,n} \mid u_{1,n})
&= \arg\max_{d_{1,n}} \frac{P(d_{1,n}, u_{1,n})}{P(u_{1,n})} \\
&= \arg\max_{d_{1,n}} P(u_{1,n} \mid d_{1,n})\, P(d_{1,n}) \\
&\approx \arg\max_{d_{1,n}} \prod_{i=1}^{n} P(u_i \mid d_i)\, P(d_i \mid d_{i-1} d_{i-2})
\end{aligned}
\tag{1}
\]

Equation (1) shows that finding the best discourse acts to describe a set of utterances is dependent on an n-gram (in this case, trigram) model of discourse acts, as well as on a set of conditional probabilities that indicate the probability of seeing an utterance given a particular discourse act. This set of conditional probabilities is very difficult to obtain in any manner that will be useful for predicting discourse acts on unseen data, whether $u_i$ is the actual word sequence that makes up the utterance or a logical form that represents the semantic meaning of the utterance. Thus, instead of attempting to infer discourse acts based on the utterances $u_{1,n}$ themselves, we attempt to recognize discourse acts based on features that can be extracted from these utterances, in order to more easily generalize our model. Assuming that each utterance $u_i$ can be represented by a set of features $f_{i1}, \ldots, f_{im}$, equation (1) can again be rewritten as follows, under the hypothesis that all features are independent:

\[
\begin{aligned}
\arg\max_{d_{1,n}} P(d_{1,n} \mid f_{11}, \ldots, f_{1m}, \ldots, f_{n1}, \ldots, f_{nm})
&\approx \arg\max_{d_{1,n}} \prod_{i=1}^{n} P(f_{i1}, \ldots, f_{im} \mid d_i)\, P(d_i \mid d_{i-1} d_{i-2}) \\
&\approx \arg\max_{d_{1,n}} \prod_{i=1}^{n} \Big[ P(d_i \mid d_{i-1} d_{i-2}) \prod_{j=1}^{m} P(f_{ij} \mid d_i) \Big]
\end{aligned}
\tag{2}
\]

By extracting features from utterances, instead of having to obtain a set of conditional probabilities related to full utterances, the conditional probabilities needed in equation (2) are based on the probability of seeing a particular feature given a discourse act. Ideally, we want to select a set of features that will allow us to uniquely identify the utterances given their feature values. However, so far we have only identified a small number of features that we believe have an impact on determining discourse intentions. These features are discussed in the next section.

Utterance Features and Discourse Structure

We have identified two types of features that provide information on determining the intention of an utterance: features related to individual utterances, and features describing how utterances relate to one another.

Utterance Features

When inferring the intention of an utterance, a great deal of information can be extracted from the utterance itself. One such feature is the speaker of the utterance which, when taken into account in conjunction with the previous discourse act, provides information about what the current discourse act may be.[1] For instance, consider the following two dialogue segments:

Example 1:
A: What is your telephone number?
B: It's 908-555-1212.

Example 2:
A: What is your telephone number?
   Your home number I mean.

In both examples, the first utterance is intended as a Request-Referent discourse action by speaker A. In example 1, the second utterance is intended as a response to the request for information, while in example 2, the utterance is intended as an elaboration of the request. The recognition of the intention of the second utterance in each example can be assisted by recognizing that a speaker change occurred in the first example but not in the second.
More explicitly, based on equation (2), this phenomenon can be modeled by the following relationship, given appropriate probability assignments:

\[
\begin{aligned}
& P(d_i = \text{AnswerRef} \mid d_{i-1} = \text{RequestRef}) \cdot P(\mathit{spkchg} = y \mid d_i = \text{AnswerRef}) \\
& \quad > P(d_i = \text{Elaborate} \mid d_{i-1} = \text{RequestRef}) \cdot P(\mathit{spkchg} = y \mid d_i = \text{Elaborate})
\end{aligned}
\]

[1] Reithinger et al. (1996) developed a model for dialogue act prediction, and noted that speaker information is relevant to dialogue act prediction. However, their model predicts the speaker of the next utterance along with the most probable dialogue act, instead of treating speaker information as given, as in our model for discourse act recognition.
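The relationship above can be made concrete with a small numeric sketch. The probability values below are hypothetical, chosen only to illustrate how the speaker-change factor from equation (2) separates the two candidate acts; they are not estimates from the corpus used in this paper. The names `BIGRAM`, `SPKCHG`, `local_score`, and `best_act` are likewise illustrative, and the context is reduced to the single previous act, as in the relationship itself.

```python
# Hypothetical probabilities for illustration only (not taken from the paper).
# P(d_i | d_{i-1}): probability of the current act given the previous act.
BIGRAM = {
    ("RequestRef", "AnswerRef"): 0.5,
    ("RequestRef", "Elaborate"): 0.3,
}

# P(spkchg = value | d_i): probability of observing a speaker change
# ("y") or no speaker change ("n") given the current act.
SPKCHG = {
    ("AnswerRef", "y"): 0.9,
    ("AnswerRef", "n"): 0.1,
    ("Elaborate", "y"): 0.2,
    ("Elaborate", "n"): 0.8,
}


def local_score(prev_act, act, spkchg):
    """One factor of equation (2) with a single feature (speaker change)."""
    return BIGRAM[(prev_act, act)] * SPKCHG[(act, spkchg)]


def best_act(prev_act, spkchg, candidates=("AnswerRef", "Elaborate")):
    """Return the candidate act with the highest local score."""
    return max(candidates, key=lambda act: local_score(prev_act, act, spkchg))


if __name__ == "__main__":
    # Example 1: speaker change after the request.
    print(best_act("RequestRef", "y"))
    # Example 2: same speaker continues after the request.
    print(best_act("RequestRef", "n"))
```

With these assumed values, a speaker change favors AnswerRef and the absence of one favors Elaborate, which is the behavior the two examples call for.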

The second feature we have identified is the syntactic form of the current utterance, i.e., whether the utterance is a statement, a yes-no question, a wh-question, or an acknowledgment such as "okay" or "uh-huh". We believe that the syntactic form of an utterance will have substantial impact on recognizing the intention of the utterance. For instance, given that the syntactic form of the current utterance is a statement, it is likely that the speaker intended to inform the hearer of a particular proposition or to accept or reject a previous proposal made by the hearer. On the other hand, yes-no questions and wh-questions are more closely related to discourse actions in which the speaker intends to solicit information from the hearer or to confirm or clarify a previous utterance by the hearer.

Discourse Structure

Previous work on constructing n-gram models of discourse acts has treated dialogues as linear sequences of actions, when in fact we know that discourse is highly structured (see, for example, (Grosz & Sidner 1986)). We argue that by taking into account the discourse structure,[2] we can better recognize the discourse act for an utterance, as illustrated by the following example:

Example 3:
(1) A: I want to find out what my credit card balance is.
(2) B: What is your account number?
(3) A: Is that the number on my card?
(4) B: Yes.
(5) A: It's 5210 39274900 3911.

In this example, utterances (3) and (4) constitute a clarification subdialogue. Using a trigram model of discourse acts that treats the dialogue as a linear sequence of actions, the recognition of the intention of utterance (5) will be most closely related to the intentions of utterances (3) and (4). However, a better model will attempt to interpret utterance (5) with respect to utterance (2), the antecedent utterance in the dialogue that it is intended to address.
This can be accomplished by taking into account the structure of the discourse, i.e., recognizing that utterances (3) and (4) constitute a subdialogue and that the utterance following the subdialogue (utterance (5)) will be most closely related to the utterance preceding the subdialogue (utterance (2)), instead of the utterances immediately preceding it in the dialogue.

Information about discourse structure may affect the system's performance in recognizing discourse acts in two ways: during the training process and during the recognition process. When taken into account during the training process, the discourse structure will help the system relate utterances that are close in terms of discourse content, instead of those that are merely uttered in sequence, allowing the system to construct a more accurate n-gram model of discourse acts. Thus, if the dialogue segment in example 3 is used in training, processing of utterance (5) will contribute to determining $P(\text{AnswerRef} \mid \text{RequestRef}, \text{RequestRef})$, i.e., $P(d_5 \mid d_2 d_1)$, instead of $P(\text{AnswerRef} \mid \text{AnswerIf}, \text{RequestIf})$, i.e., $P(d_5 \mid d_4 d_3)$.

Utilizing discourse structure during the recognition process will help the system determine the appropriate previous discourse acts to be used in n-gram modeling, thus allowing it to more accurately determine the current discourse act. For instance, if the dialogue segment in example 3 is used in testing, recognition of the clarification subdialogue structure in utterances (3) and (4) will lead the system to base the recognition of the discourse act for utterance (5) on $P(d_5 \mid \text{RequestRef}, \text{RequestRef})$, i.e., $P(d_5 \mid d_2 d_1)$, instead of $P(d_5 \mid \text{AnswerIf}, \text{RequestIf})$, i.e., $P(d_5 \mid d_4 d_3)$.

[2] We are interested in the discourse structure that represents how the participants' intentions relate to one another, as in the intentional level discussed in (Moore & Pollack 1992).

However, utilizing knowledge about discourse structure during the recognition process requires that the system first be able to recognize the discourse structure.
In other words, in addition to recognizing discourse acts from utterances, the system must also recognize how these actions relate to one another. As we have not investigated the problem of recognizing discourse structure, in our experiments we have limited the use of discourse structure information to the training process.

Experiments and Results

Based on the statistical model and the features discussed in the previous section, we labeled a limited set of dialogues and ran a set of experiments as a preliminary assessment of how these proposed features affect the recognition of discourse acts. This section discusses the preparation of the training data, the training process, and the test results.

Discourse Act Tagging

We randomly selected 8 dialogues, with a total of 915 utterances, from a corpus of naturally occurring airline reservation dialogues between two human agents (SRI Transcripts 1992). The utterances in these dialogues are annotated with the following two sets of information:

Features, which include 1) speaker information, which indicates whether or not a change in speaker has occurred between the last and current utterances, 2) syntactic form of utterance, which may be one of statement, wh-question, yn-question, or acknowledgment, and 3) an integer value which represents the change in level of nestedness in the discourse structure between the current and the previous utterances. For instance, a value of 1 indicates that the current utterance initiates a subdialogue, while a value of 0 indicates that the current utterance is captured at the same level in the discourse structure as the previous utterance.

Discourse act, which represents the intention of the utterance. The 15 discourse acts that we used to label our utterances, as well as their distribution in our dataset, are shown in Table 1. In situations where multiple discourse acts apply to an utterance, we select the most specific discourse act to represent the intention of that utterance. For instance, in example 3, utterance (5) could be tagged as either an Inform act or an Answer-Ref act. In this case, the Answer-Ref discourse act is selected since it more accurately describes the intention of the utterance.

Discourse Act          # Occurrences   Percentage
Inform                 269             29.4%
Request-Referent       50              5.5%
Answer-Referent        48              5.2%
Request-If             41              4.5%
Answer-If              37              4.0%
Confirm                24              2.6%
Clarify                19              2.1%
Elaborate              107             11.7%
Request-Explanation    2               0.2%
Request-Repeat         1               0.1%
Express-Surprise       1               0.1%
Accept                 162             17.7%
Reject                 6               0.6%
Prompt                 113             12.3%
Greetings              35              3.8%

Table 1: List of Discourse Acts and Their Distribution

Training

As discussed earlier, in order to construct our statistical model for recognizing discourse intentions, we need to first compute two sets of probabilities: a trigram model of discourse acts, and a set of conditional probabilities $P(f_j \mid d_i)$ for each feature and each discourse act. We selected 6 out of the 8 dialogues, consisting of 671 utterances, as our training data. Because of the limited size of our training set, it is necessary to obtain smoothed trigram probabilities by interpolating trigram, bigram, and unigram frequencies as follows (Jelinek 1990):

\[
P(d_3 \mid d_1, d_2) = \lambda_3 f(d_3 \mid d_1, d_2) + \lambda_2 f(d_3 \mid d_2) + \lambda_1 f(d_3)
\]

In the above equation, $\lambda_1$, $\lambda_2$, and $\lambda_3$ must be non-negative and must sum to 1. Our training process iterates through the training data in order to select these coefficients to satisfy the maximum-likelihood criterion (Jelinek 1990).

To examine the effect of discourse structure information on discourse act recognition, we constructed two trigram models of discourse acts.
The first model does not take into account information about discourse structure, and thus treats dialogues as linear sequences of utterances. The second model computes the trigram conditional probabilities by taking discourse structure into consideration. During the training process, the system maintains a stack of previous discourse contexts. If the change in level of nestedness for the current utterance is 1, then the previous two discourse acts are pushed onto the stack. On the other hand, if the change in discourse level for the current utterance is -1, then the discourse acts at the top of the stack are popped off the stack and are taken to be the previous discourse acts for the current utterance, i.e., the current utterance will be considered a continuation of the discourse before the subdialogue was initiated. Computing the set of conditional probabilities $P(f_j \mid d_i)$ is a much simpler task, and requires only one pass over the training data since no smoothing is required.[3]

Results and Discussion

We ran a series of experiments with both of our trained trigram models on the remaining 2 dialogues, consisting of 244 utterances, to provide a preliminary assessment of the effect of the identified features on the recognition of discourse acts. The results of these experiments are summarized in Table 2(a), which shows the system's performance using the model trained without discourse structure information, and Table 2(b), which shows the system's performance using the model trained with information about discourse structure.

The first rows in Tables 2(a) and 2(b) show the n-best results using the a priori probabilities in Table 1. The second rows in the tables show the results obtained by determining the current discourse act based solely on the trigram probabilistic model of discourse acts. Given the distribution of discourse acts in our training set, in most cases the trigram model performs worse than the simple a priori n-best results. The rest of the tables show the recognition results of incorporating different features into the trigram model.
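The training procedure described in the Training subsection can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the `<s>` padding symbol, and the decision to let a subdialogue-initiating utterance keep its linear context are assumptions, as is the restriction to nestedness changes of 1, 0, and -1. The interpolation weights are passed in as given, whereas the paper selects them by a maximum-likelihood criterion.

```python
from collections import Counter


def structured_contexts(acts, level_changes, pad="<s>"):
    """Yield (d_{i-2}, d_{i-1}, d_i) triples using a stack of saved contexts.

    level_changes[i] is +1 when utterance i opens a subdialogue, -1 when it
    returns to the enclosing level, and 0 otherwise (assumed value range).
    """
    stack = []
    prev2, prev1 = pad, pad
    for act, change in zip(acts, level_changes):
        if change == 1:
            # Save the context that precedes the subdialogue.
            stack.append((prev2, prev1))
        elif change == -1 and stack:
            # Resume the context saved before the subdialogue began.
            prev2, prev1 = stack.pop()
        yield (prev2, prev1, act)
        prev2, prev1 = prev1, act


def interpolated_trigram(trigrams, lambdas):
    """Build P(d3 | d1, d2) = l3*f(d3|d1,d2) + l2*f(d3|d2) + l1*f(d3)."""
    l1, l2, l3 = lambdas
    if min(lambdas) < 0 or abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    trigrams = list(trigrams)
    tri = Counter(trigrams)
    tri_ctx = Counter((a, b) for a, b, _ in trigrams)
    bi = Counter((b, c) for _, b, c in trigrams)
    bi_ctx = Counter(b for _, b, _ in trigrams)
    uni = Counter(c for _, _, c in trigrams)
    total = len(trigrams)

    def prob(d1, d2, d3):
        f3 = tri[(d1, d2, d3)] / tri_ctx[(d1, d2)] if tri_ctx[(d1, d2)] else 0.0
        f2 = bi[(d2, d3)] / bi_ctx[d2] if bi_ctx[d2] else 0.0
        f1 = uni[d3] / total if total else 0.0
        return l3 * f3 + l2 * f2 + l1 * f1

    return prob


if __name__ == "__main__":
    # Example 3 from the text, with the act labels used in the paper.
    acts = ["RequestRef", "RequestRef", "RequestIf", "AnswerIf", "AnswerRef"]
    changes = [0, 0, 1, 0, -1]
    for triple in structured_contexts(acts, changes):
        print(triple)
```

On Example 3, the last triple produced is (RequestRef, RequestRef, AnswerRef), so utterance (5) contributes to $P(d_5 \mid d_2 d_1)$ as described in the text; passing all-zero level changes reproduces the linear model instead.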
These results show that, although the simple trigram model performs worse than the a priori probabilistic model, adding either of the identified features to the trigram model allows the system to obtain better recognition results than simply assigning the same discourse acts to every utterance. Furthermore, although including speaker information provides some improvement to the system's performance, taking into account the syntactic forms of utterances drastically reduces the error rate in the recognition of discourse acts. However, adding speaker information on top of the syntactic form of utterances does not provide further improvement in most cases.

[3] If the set of dialogue acts is large and not all dialogue acts appear in the training data, then it may be beneficial to smooth the conditional probabilities as well. However, this is not the case in our training set.

(a) Without Discourse Structure Information

                          3-best    2-best    1-best
a priori probabilities    59.4%     47.1%     29.4%
trigram                   53.32%    46.09%    27.04%
speaker change            61.12%    48.88%    33.41%
syntactic form            83.03%    73.65%    51.00%
both                      83.60%    71.27%    49.16%

(b) With Discourse Structure Information

                          3-best    2-best    1-best
a priori probabilities    59.4%     47.1%     29.4%
trigram                   54.39%    47.21%    28.16%
speaker change            62.13%    50.63%    33.04%
syntactic form            84.69%    75.48%    50.63%
both                      82.69%    70.07%    49.71%

Table 2: Summary of Recognition Results

A comparison between corresponding rows in the two tables shows that the system performs slightly better in most cases when information about discourse structure is taken into account. However, the difference is very slight and further experiments are needed to yield more conclusive results. Note that in our current experiments, we only utilized discourse structure information during the training process, i.e., for obtaining a more accurate trigram model of discourse acts. We believe that incorporating discourse structure information into the recognition phase will further improve the recognition results. We intend to address this issue in future work.

Future Work

Although the preliminary results of our system appear to be encouraging, in the best case scenario the system's top recognition result is only correct around 50% of the time. Much improvement is needed before such a recognition component can be of actual use in a dialogue system. In our future work, we plan to pursue issues along three main paths.

First, we intend to identify additional utterance features that may provide further information on discourse act recognition. As discussed earlier, ideally, we should select a feature set that allows us to uniquely identify an utterance given its feature values.
Clearly, the features we currently employ, speaker change and syntactic form of utterance, do not satisfy this criterion. One feature that we are considering using is the predicate type of the semantic representation of the utterance, which may provide information about the intention of the utterance or how the utterance is intended to relate to previous discourse. Another potentially useful feature is the intonation of the utterance, as in (Taylor et al. 1996), which may provide additional information especially when the syntactic form of the utterance does not accurately convey its intention, such as when a confirmation is phrased syntactically as a statement, but with a rising intonation.

Second, the results in Tables 2(a) and 2(b) indicate that no improvement is gained by using both speaker information and syntactic form of utterance, versus using syntactic information alone. This may be because the independence assumption that we used in deriving our formula is too strong. We plan on exploring other models that may better fit the set of features we identified. For instance, the following model based on linear interpolation does not make the independence assumption, but instead computes the conditional probability of a set of features given a dialogue act using all subsets of the feature set:

\[
\begin{aligned}
\arg\max_{d_{1,n}} P(d_{1,n} \mid f_{11}, \ldots, f_{1m}, \ldots, f_{n1}, \ldots, f_{nm})
&\approx \arg\max_{d_{1,n}} \prod_{i=1}^{n} P(f_{i1}, \ldots, f_{im} \mid d_i)\, P(d_i \mid d_{i-1} d_{i-2}) \\
&\approx \arg\max_{d_{1,n}} \prod_{i=1}^{n} \Big[ P(d_i \mid d_{i-1} d_{i-2}) \sum_{y \in 2^{Y_i}} \lambda_y P(y \mid d_i) \Big],
\quad \text{where } Y_i = \{ f_{i1}, \ldots, f_{im} \}
\end{aligned}
\]

Third, in our current model, we trained and tested our system on a limited set of data. We intend to perform larger-scale experiments to obtain more conclusive results with respect to the effects of utterance features and discourse structure on discourse act recognition. Reithinger experimented with varying the amount of training data and reported that performance improves up to around 35 dialogues in the training set (about 2000 utterances) (Reithinger 1995). This suggests that by increasing the amount of training data, we may be able to further improve our system's performance.
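The subset-based interpolation term proposed under the second path, $\sum_{y \in 2^{Y_i}} \lambda_y P(y \mid d_i)$, can be sketched as below. This is an illustrative reading of the formula, not an implemented part of the system: the function names, the tiny sample set, and the weight values are all hypothetical, joint probabilities are estimated as plain relative frequencies, and subsets without an explicit weight are given weight 0.

```python
from itertools import combinations


def subset_prob(samples, act, subset):
    """Relative frequency of the joint feature values in `subset` given `act`.

    samples: list of (act, feature_dict) pairs.
    subset:  list of (feature_name, value) pairs; the empty subset has
             probability 1 whenever the act was observed.
    """
    with_act = [feats for a, feats in samples if a == act]
    if not with_act:
        return 0.0
    hits = sum(all(feats.get(k) == v for k, v in subset) for feats in with_act)
    return hits / len(with_act)


def interpolated_feature_prob(samples, act, features, weights):
    """Sum of lambda_y * P(y | act) over all subsets y of the feature set.

    weights maps frozenset(feature names) -> lambda_y (hypothetical values).
    """
    names = sorted(features)
    total = 0.0
    for size in range(len(names) + 1):
        for combo in combinations(names, size):
            lam = weights.get(frozenset(combo), 0.0)
            subset = [(name, features[name]) for name in combo]
            total += lam * subset_prob(samples, act, subset)
    return total


if __name__ == "__main__":
    # Hypothetical annotated utterances, for illustration only.
    samples = [
        ("AnswerRef", {"spkchg": "y", "form": "statement"}),
        ("AnswerRef", {"spkchg": "y", "form": "statement"}),
        ("AnswerRef", {"spkchg": "n", "form": "statement"}),
        ("AnswerRef", {"spkchg": "y", "form": "ack"}),
    ]
    weights = {
        frozenset(["spkchg"]): 0.25,
        frozenset(["form"]): 0.25,
        frozenset(["spkchg", "form"]): 0.5,
    }
    observed = {"spkchg": "y", "form": "statement"}
    print(interpolated_feature_prob(samples, "AnswerRef", observed, weights))
```

Because the joint term for the full feature set is estimated directly from co-occurrence counts, the score can differ from the product of the single-feature probabilities that equation (2) would use, which is the point of dropping the independence assumption.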
Conclusions

In this paper, we have discussed our statistical model for recognizing discourse acts during dialogue interaction based on 1) utterance features, such as speaker change and syntactic forms of utterances, 2) discourse structure, namely change in the level of nestedness, and 3) discourse history. Utterance features and discourse history are taken into account during the recognition process using a hidden Markov model in order to find the set of discourse acts that best describe the intentions of a given set of utterances. Our preliminary results show that speaker information provides some improvement to the system's recognition of discourse acts, while including the syntactic form of utterances greatly enhances the system's performance. Furthermore, our experiments show that taking into account information about discourse structure for discourse act recognition appears to be a promising avenue for further research.

Acknowledgments

The author would like to thank Christer Samuelsson, Bob Carpenter, and Jim Hieronymus for helpful discussions, as well as Christer Samuelsson and the two anonymous reviewers for their comments on earlier drafts of this paper.

References

Core, M. G., and Allen, J. F. 1997. Coding dialogs with the DAMSL annotation scheme. In Working Notes of the AAAI Fall Symposium on Communicative Action in Humans and Machines.

Core, M. G. 1998. Predicting DAMSL utterance tags. In Proceedings of the AAAI-98 Spring Symposium on Applying Machine Learning to Discourse Processing.

Grosz, B. J., and Sidner, C. L. 1986. Attention, intentions and the structure of discourse. Computational Linguistics 12(3):175-204.

Jelinek, F. 1990. Self-organized language modeling for speech recognition. In Waibel, A., and Lee, K., eds., Readings in Speech Recognition. Morgan Kaufmann Publishers, Inc. 450-506.

Mast, M.; Niemann, H.; Noth, E.; and Schukat-Talamazzini, E. G. 1995. Automatic classification of dialog acts with semantic classification trees and polygrams. In IJCAI-95 Workshop on New Approaches to Learning for Natural Language Processing, 71-78.

Moore, J. D., and Pollack, M. E. 1992. A problem for RST: The need for multi-level discourse analysis. Computational Linguistics 18(4):537-544.

Nagata, M., and Morimoto, T. 1994a. First steps towards statistical modeling of dialogue to predict the speech act type of the next utterance. Speech Communication 15:193-203.

Nagata, M., and Morimoto, T. 1994b. An information-theoretic model of discourse for next utterance type prediction. Transactions of Information Processing Society of Japan 35(6):1050-1061.
Rabiner, L. R. 1990. A tutorial on hidden Markov models and selected applications in speech recognition. In Waibel, A., and Lee, K., eds., Readings in Speech Recognition. Morgan Kaufmann. 267-296.

Reithinger, N., and Klesen, M. 1997. Dialogue act classification using language models. In Proceedings of the 5th European Conference on Speech Communication and Technology, 2235-2238.

Reithinger, N.; Engel, R.; Kipp, M.; and Klesen, M. 1996. Predicting dialogue acts for a speech-to-speech translation system. In Proceedings of the International Conference on Spoken Language Processing.

Reithinger, N. 1995. Some experiments in speech act prediction. In Proceedings of the AAAI Spring Symposium on Empirical Methods in Discourse.

Samuel, K.; Carberry, S.; and Vijay-Shanker, K. 1998. Computing dialogue acts from features with transformation-based learning. In Proceedings of the AAAI-98 Spring Symposium on Applying Machine Learning to Discourse Processing.

SRI Transcripts. 1992. Transcripts derived from audiotape conversations made at SRI International, Menlo Park, CA. Prepared by Jacqueline Kowtko under the direction of Patti Price.

Stolcke, A.; Shriberg, E.; Bates, R.; Coccaro, N.; Jurafsky, D.; Martin, R.; Meteer, M.; Ries, K.; Taylor, P.; and Van Ess-Dykema, C. 1998. Dialog act modeling for conversational speech. In Proceedings of the AAAI-98 Spring Symposium on Applying Machine Learning to Discourse Processing.

Taylor, P.; Shimodaira, H.; Isard, S.; King, S.; and Kowtko, J. 1996. Using prosodic information to constrain language models for spoken dialogue. In Proceedings of the International Conference on Spoken Language Processing, 216-219.