Using dialogue context to improve parsing performance in dialogue systems

Ivan Meza-Ruiz and Oliver Lemon
School of Informatics, Edinburgh University
2 Buccleuch Place, Edinburgh
I.V.Meza-Ruiz@sms.ed.ac.uk, olemon@inf.ed.ac.uk

Abstract

We explore how to incorporate information from the dialogue context to improve the selection of logical forms in the parsing components of dialogue systems. We present a machine learning approach which allows us to identify the most informative elements of the dialogue context for this task, and to improve the performance of a parser in a dialogue system by using a classifier. Features for the classifier are extracted from a dialogue manager [Lemon and Gruenstein, 2004] which implements the Information State Update approach to dialogue management. Our best result to date is a 54.5% reduction in parse error rate, compared to the baseline system of Lemon and Gruenstein [2004].

1 Introduction

Dialogue managers are responsible for controlling the overall behaviour of dialogue systems. One of their main functions is to keep track of the information which constitutes the dialogue context. Another critical component of a dialogue system is its parser, which is responsible for producing semantic representations of user inputs for later integration into the dialogue context.

A common problem is that there may be multiple outputs from the parser, each representing the semantics of a different reading of the user input, or representing parses of different speech recognition hypotheses. Figure 1 shows an example, an extract from our corpus: the n different logical forms produced by the parsing component (Gemini [Dowding et al., 1993]) of a dialogue system (WITAS, Lemon and Gruenstein [2004]) for a single user utterance. The example corresponds to the user utterance "go to the tower"; this has a manual transcription, but the utterance was also passed through the speech recogniser, which generated n hypotheses, each with a confidence score and a corresponding logical form.

In this work we aim to choose one logical form from the available hypotheses (or to reject the interaction) using information from the hypothesis and the dialogue context. In the example, the baseline system's behaviour is to choose the first hypothesis, but the desired behaviour is to choose the second, since it corresponds to the transcription of the user utterance. We also have the goal of identifying the informative elements of the dialogue context for this task. To accomplish this, we use a machine learning approach to build a classifier which is able to identify correct hypotheses from the parser.

Transcription : go to the tower
Logical form  : [command([go],[param_list([[pp_loc(to,arg([np([det([def],the),
                [n(tower,sg)]])]))]])])]

Hypothesis 1  : go to the towers
Confidence    : 70
Logical form  : [command([go],[param_list([[pp_loc(to,arg([np([det([def],the),
                [n(tower,pl)]])]))]])])]

Hypothesis 2  : go to the tower
Confidence    : 70
Logical form  : [command([go],[param_list([[pp_loc(to,arg([np([det([def],the),
                [n(tower,sg)]])]))]])])]
...
Hypothesis 5  : to the tower
Confidence    : 65
Logical form  : wh_answer([param_list([[pp_loc(to,arg([np([det([def],the),
                [n(tower,sg)]])]))]])])
...

Figure 1: Example corpus excerpt

The paper is structured as follows: in section 2, the main characteristics of the Information State Update approach and the dialogue manager are presented. In section 3, we present our chosen technique, the Maximum Entropy machine learner. Section 4 explains the experimental setup, and section 5 sets out our results.

2 The Information State Approach

The Information State (IS) approach is a theoretical framework for dialogue management. Information States are informally defined in Larsson and Traum [2000] as follows:

    The term Information State of a dialogue represents the information
    necessary to distinguish it from other dialogues, representing the
    cumulative additions from previous actions in the dialogue, and
    motivating future action.

In the present work, the WITAS dialogue system [Lemon et al., 2001, 2002; Lemon and Gruenstein, 2004] was used to collect data. This is a multimodal command-and-control dialogue system that allows a human operator to interact with a simulated unmanned aerial vehicle: a small robotic helicopter.

The WITAS system uses the multi-threaded dialogue manager presented in Lemon et al. [2002] and Lemon and Gruenstein [2004]. This dialogue manager implements the Information State Update approach to dialogue management (see e.g. Larsson and Traum [2000]). Its main characteristic is that it keeps the history of dialogue moves in a tree structure, whereas other dialogue systems usually use a stack. Each branch of the tree represents a thread of the conversation, and an attachment algorithm relates incoming moves to the branches of the tree.

Much of the information in the dialogue context is derived from the logical forms (LFs) which encode the meanings of utterances produced by the user or the system. Various data structures store this information in a systematic way. For instance, Figure 2 presents the logical form corresponding to the utterance "go to the tower". The head of the logical form represents the class of dialogue move (e.g. command, question, answer). The rest of the LF encodes the structure of the sentence and its meaning. In this case, the main verb is "go", which takes one destination as a parameter. The destination is marked by the preposition "to", which has a noun phrase as its argument; the noun phrase represents the destination, "the tower".

command([go],[param_list([pp_loc(to,arg([np(det([def],the),
[n(tower,sg)])]))])])

Figure 2: Logical form for "go to the tower"

The multi-threaded dialogue manager uses the following data structures: Dialogue Move Tree (DMT); Active Node List (ANL); Activity Tree (AT); System Agenda (SA); Pending List (PL); Salience List (SL); Salient Task (ST); Modality Buffer (MB); see Lemon and Gruenstein [2004] for full details.

The DMT is used as a message board to keep a record of the history of the dialogue contributions (the moves made by both the user and the system). Each branch in the tree represents a thread in the conversation. The ANL marks the active nodes on the DMT; an active node indicates a conversational contribution that is relevant to the current discourse (e.g. an open question). The AT (Activity Tree) stores the current, past, and planned activities of the back-end system. The SA (System Agenda) collects all the utterances that the system intends to produce. The SL (Salience List) is a list of noun phrases (NPs) introduced in the current dialogue, ordered by recency. The ST (Salient Task) structure is a list of the tasks which have been recently introduced in the dialogue. Finally, the MB buffers click events on the GUI.
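
To make the role of these structures concrete, here is a minimal Python sketch of a Dialogue Move Tree and a Salience List. It is not the WITAS implementation: the class names, and the simplistic rule of attaching each move to the most recent active node, are our own illustrative assumptions.

# A minimal, illustrative sketch of two of the Information State
# structures described above: the Dialogue Move Tree (DMT) and the
# Salience List. All names are hypothetical simplifications.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MoveNode:
    """One dialogue contribution: a move class plus its logical form."""
    dm: str                     # dialogue move class, e.g. 'command'
    lf: str                     # the logical form string
    speaker: str                # 'user' or 'system'
    children: List["MoveNode"] = field(default_factory=list)


class DialogueMoveTree:
    """Records the history of moves; each branch is one conversational thread."""

    def __init__(self) -> None:
        self.root = MoveNode(dm="root", lf="", speaker="system")
        self.active_nodes: List[MoveNode] = [self.root]   # the ANL

    def attach(self, move: MoveNode, parent: Optional[MoveNode] = None) -> None:
        # A real attachment algorithm would select the branch the move
        # continues (e.g. the open question it answers); here we simply
        # attach to the most recent active node.
        (parent or self.active_nodes[-1]).children.append(move)
        self.active_nodes.append(move)


class SalienceList:
    """Noun phrases introduced in the dialogue, ordered by recency."""

    def __init__(self) -> None:
        self.nps: List[Tuple[str, str]] = []   # (noun, number) pairs

    def push(self, noun: str, num: str) -> None:
        self.nps.insert(0, (noun, num))        # most recent first

    def most_recent(self, n: int) -> List[Tuple[str, str]]:
        return self.nps[:n]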

3 Maximum Entropy Learning

Maximum Entropy (ME) is a machine learning technique which is popular in Natural Language Processing, since it has been shown to perform well on a range of classification tasks (e.g. CoNLL-2003 [Tjong Kim Sang and De Meulder, 2003]). Much of this success derives from the fact that the generated model is based on as few assumptions as possible, which allows a lot of flexibility during classification. The resulting model, which represents the learnt behaviour, relies on a set of features f_i and a set of weights λ_i which constrain these features. During experimentation, several sets of features are proposed to generate different models until a good one is found. A complete list of the features used to generate the different models in our work can be found in subsection 4.3. We use the implementation of ME introduced by Lee [2004], given our previous experience with the tool.

For each user utterance, our Maximum Entropy classifier takes as input the features of the current dialogue context (see section 4.3) and of the logical forms of the hypotheses to be classified, and returns a decision whether to accept or reject each logical form.
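
Lee's [2004] toolkit itself is not shown here, but since a maximum entropy model over indicator features is equivalent to logistic regression, an accept/reject classifier of this kind can be sketched with scikit-learn. The feature names follow Table 3 below; the two training examples are invented for illustration.

# Sketch of an accept/reject maximum entropy classifier over context
# features. ME with indicator features is equivalent to logistic
# regression, so scikit-learn stands in for the toolkit of Lee [2004].

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_feats = [
    {"conf_bin": "7", "pos": 1, "dm": "command", "verb": "go",
     "noun": "tower", "num": "sg", "turn": "system", "last_dm": "report"},
    {"conf_bin": "6", "pos": 5, "dm": "wh_answer",
     "noun": "tower", "num": "sg", "turn": "system", "last_dm": "report"},
]
train_labels = ["accept", "reject"]

# DictVectorizer one-hot encodes string-valued features; the logistic
# regression then learns one weight per (feature, value) pair, playing
# the role of the lambda_i weights above.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_feats, train_labels)

# For a new hypothesis we get both a label and a confidence score, which
# the selection procedure of section 4.4 can use.
probs = model.predict_proba(train_feats[:1])[0]
print(dict(zip(model.classes_, probs)))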

4 Experimental setup

We present our experiments as follows: in subsection 4.1, the corpus; in subsection 4.2, the baseline; in subsection 4.3, the extracted features.

4.1 The corpus

In our experiments we used the corpus developed in Lemon and Gruenstein [2004]; Lemon [2004]; Gabsdil and Lemon [2004]. The corpus corresponds to the interactions of six subjects from Edinburgh University (4 male, 2 female), each performing five simple tasks with the WITAS system, resulting in 30 complete dialogues. The corpus is made up of the manual transcription of each utterance, the 10-best hypotheses from the speech recogniser (Nuance 8.0), the logical forms of the transcriptions and hypotheses from the parser (Gemini [Dowding et al., 1993]), and the Information States of the system [Lemon and Gruenstein, 2004].

The corpus consists of 339 utterances. Only 303 utterances were intelligible to the system; the remaining 36 were identified as noise. The corpus contains 188 utterance types, since many utterances occur several times (e.g. "yes", "go here"). Table 1 shows the 10 most frequent utterances in the corpus.

Utterance              Occurrences   Utterance               Occurrences
fly to the church      17            now                     10
fly to the tower       16            fight the fire          9
fly to the warehouse   15            yes                     9
zoom in                12            show me the buildings   9
land                   10            take off                8

Table 1: Most frequent user utterances

4.2 The baseline system

Our baseline is the behaviour of the WITAS Dialogue System described in Lemon and Gruenstein [2004]; Lemon [2004]; Gabsdil and Lemon [2004]. Note that this system had good task completion rates (see Hockey et al. [2003]) and thus constitutes a sound baseline.

The baseline performance was evaluated by analysing the logs from the user study. Each hypothesis was labelled accept or reject depending on the reaction of the system. At this point, there are four cases:

1. One hypothesis is accepted, and it corresponds with the manual transcription. This is a true positive (tp) event.

2. One hypothesis is accepted, but it does not correspond with the manual transcription. This is a false positive (fp) event (e.g. the user said "Now" but the system accepted a logical form corresponding to "no").

3. None of the hypotheses was accepted, but there was a manual transcription for this interaction. This means that the system failed to recognise an input that should have been accepted. This is a false negative (fn) event.

4. Finally, none of the hypotheses was accepted, and there was no logical form for the manual transcription. This means that the interaction should have been rejected by the system (e.g. user self-talk). This is a true negative (tn) event.

For the baseline, the precision, recall and F1 figures were calculated; the results are presented in Table 2.

Precision   60.08%
Recall      98.65%
F1          74.78%

Table 2: The baseline system evaluation

These results allow us to evaluate the performance of the baseline system. The strategy used by the system is good at avoiding false negatives (only 2 cases), which yields an impressive recall figure; however, the precision is not good (there are 97 fps). From Table 2 we can draw our first conclusion: the strategy used by the baseline to choose the logical form is not adequate, since it produces many false positives. An improvement in the behaviour of the parser would imply a reduction in the number of fps.

A key challenge for the classifier we will construct is the case where the logical form to be accepted is not the first hypothesis but appears later in the list. In total, there are 22 cases where the correct logical form is in the set of LF hypotheses but is not the first one.
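
For concreteness, the evaluation metrics can be computed directly from these event counts (tp = 146, fp = 97, tn = 58, fn = 2; see also Table 4):

# Precision, recall, F1 and parse correctness from the baseline counts.

def evaluate(tp: int, fp: int, tn: int, fn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    correctness = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "parse_correctness": correctness}

print(evaluate(tp=146, fp=97, tn=58, fn=2))
# -> precision 0.6008, recall 0.9865, parse correctness 0.6733, as in
#    Table 2; f1 evaluates to 0.7468 against the 74.78% reported there.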

4.3 The Context Features

We divide the features into three groups (speech processing, logical form, and information state), depending on the type of information they represent:

4.3.1 Speech processing features

This group contains just two features: the confidence of the speech recogniser in the hypothesis, and the position of the hypothesis relative to the others. The confidence is a score from 0 to 100%; we discretised it into 10 groups, each representing 10 units of confidence.

4.3.2 Logical Form features

We extract information about the dialogue move class, the main verb, and the complement of the logical form. The following features are extracted in this group:

Dialogue move class (dm): this is codified in the head of the logical form (e.g. command and wh_query).

Main verb (verb): the main verb involved in the logical form.

Complements (comps): the complements of the main verb. We extract the noun (noun), the number (num), and the preposition (prep) (if the noun phrase is headed by a preposition).

4.3.3 Information State Features

We extract IS features from the hypothesis, the current and past moves, and the information state data structures. In particular, we extracted dialogue context features from the following data structures of the Information State:

Dialogue Move Tree (DMT): we extracted the dm and verb features from the n most active nodes (for 2 ≤ n ≤ 6).

Salience List: we extracted the noun and num features from the n most recent noun phrases (for 2 ≤ n ≤ 4).

Last n moves: the dm and verb of the last moves (for 1 ≤ n ≤ 2). This can be considered a uni-gram or bi-gram of the dialogue moves.

We also extracted the turn feature, which identifies the speaker who had the previous turn (user or system). No further information was used in this case.
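
As an illustration, the speech-processing and logical-form features above might be read off a hypothesis as follows. The regular expressions are our own approximation to the Gemini LF syntax of Figures 1 and 2, not the extraction code used in the experiments.

# Illustrative extraction of speech-processing and logical-form features
# from one hypothesis.

import re

def hypothesis_features(lf: str, confidence: int, position: int) -> dict:
    feats = {
        # confidence binned into 10 groups of 10 units, kept as a string
        # so it one-hot encodes as a categorical feature
        "conf_bin": str(min(confidence // 10, 9)),
        "pos": position,                           # rank in the n-best list
    }
    if m := re.match(r"\[?(\w+)\(", lf):           # head = dialogue move class
        feats["dm"] = m.group(1)
    if m := re.search(r"\(\[(\w+)\],", lf):        # main verb, e.g. [go]
        feats["verb"] = m.group(1)
    if m := re.search(r"pp_loc\((\w+),", lf):      # preposition of complement
        feats["prep"] = m.group(1)
    if m := re.search(r"\bn\((\w+),(\w+)\)", lf):  # complement noun and number
        feats["noun"], feats["num"] = m.group(1), m.group(2)
    return feats

lf = ("[command([go],[param_list([[pp_loc(to,arg([np([det([def],the),"
      "[n(tower,sg)]])]))]])])]")
print(hypothesis_features(lf, confidence=70, position=1))
# {'conf_bin': '7', 'pos': 1, 'dm': 'command', 'verb': 'go',
#  'prep': 'to', 'noun': 'tower', 'num': 'sg'}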

Table 3 summarises the context features.

Feature                            Source
Confidence (conf)                  speech recogniser
Position (pos)                     speech recogniser
Turn (turn)                        Information State
Dialogue Move (dm)                 hypothesis, current move, previous moves, DMT
Main verb (verb)                   hypothesis, current move, previous moves, DMT
Complements (comps)                hypothesis, current move
Noun phrase (noun), number (num)   Salience List (SL)

Table 3: Context Features for the Classifier

4.4 Selection procedure

The last important point of the experimental setup is the procedure for selecting a hypothesis. For each user utterance, the classifier receives a set of hypotheses and labels each as accept or reject with a certain confidence score. To choose one hypothesis from this list, or else reject the utterance, we use the following procedure (sketched in code below):

1. Scan the list of classified n-best logical forms top-down. Return the highest-confidence hypothesis which was classified as accept, and pass it to the dialogue manager as the LF chosen for the current move.

2. If step 1 fails, classify the interaction as rejected. (This means that all the hypotheses were classified as reject.)
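
This procedure translates directly into code. In the sketch below, accept_prob is assumed to return the classifier's accept probability for one hypothesis (e.g. derived from predict_proba in the earlier scikit-learn sketch), and the 0.5 threshold is our own illustrative choice.

# Direct rendering of the two-step selection procedure.

from typing import Callable, List, Optional

def select_hypothesis(
    nbest_feats: List[dict],
    accept_prob: Callable[[dict], float],
    threshold: float = 0.5,
) -> Optional[int]:
    """nbest_feats is in recogniser order (most confident hypothesis first)."""
    for i, feats in enumerate(nbest_feats):
        if accept_prob(feats) >= threshold:  # step 1: first 'accept' top-down
            return i                         # this LF goes to the dialogue manager
    return None                              # step 2: reject the interaction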

5 Results and Analysis

We used the standard metrics of precision, recall and F1 to measure the performance of the system. One important limitation of the current study is the small size of the corpus; for this reason, a leave-one-out cross validation method was chosen to evaluate performance (a sketch of this protocol follows the feature list below). Our current results, after experimenting with several combinations of feature groups, are shown in Table 4. Note that the raw parse correctness of the system has risen by 17.8 percentage points, a 54.5% reduction in parse error rate. We expect that better results could be achieved with further experimentation and parameter optimisation.

                     Baseline   ME Classifier
True positives       146        141
False positives      97         21
True negatives       58         117
False negatives      2          23
Precision            60.08%     87.04%
Recall               98.65%     85.98%
F1                   74.78%     86.50%
Parse correctness    67.33%     85.15%

Table 4: Results: Baseline versus Maximum Entropy Classifier

In qualitative terms, the strategy learnt by the classifier is also better than the baseline. First, there is no longer a large difference between the precision and recall figures. Second, the classifier was able to identify cases where the first hypothesis is not the correct one.

During experimentation with different feature groups, we found the following contextual features to be the most informative for this task:

1. Speech processing (both confidence and position features)
2. Dialogue move, main verb, and complement of the hypothesis
3. Dialogue move, main verb, and complement of the previous move
4. The turn feature
5. The dialogue move of the four most active nodes on the DMT
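
As noted above, performance was measured with leave-one-out cross validation. A minimal sketch of the protocol, assuming each held-out unit is one user utterance together with all of its hypotheses; utterances and make_model are hypothetical names, not part of the original experimental code.

# Leave-one-utterance-out evaluation: each fold holds out one utterance
# with all of its n-best hypotheses.

def leave_one_out(utterances, make_model):
    """utterances: list of (feature_dicts, labels) pairs, one per user
    utterance; make_model() builds a fresh, unfitted classifier."""
    correct = 0
    for held_out in range(len(utterances)):
        train = [u for i, u in enumerate(utterances) if i != held_out]
        X = [f for feats, _ in train for f in feats]
        y = [l for _, labels in train for l in labels]
        model = make_model().fit(X, y)
        feats, labels = utterances[held_out]
        # count the utterance as correct if every hypothesis is classified
        # correctly (a proxy for choosing the right LF for this utterance)
        correct += all(p == l for p, l in zip(model.predict(feats), labels))
    return correct / len(utterances)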

5.1 Error Analysis

Even though the improvement over the baseline is promising (11.75% on the F1 figure), there is still a lot of work to do.

The first type of error corresponds to the false positives. There are three causes of these errors: plural and singular nouns, adverbs and adjectives, and speech processing. It was difficult for the system to distinguish between plural and singular nouns: in many cases the plural/singular and position features were the only difference between the available hypotheses, and the classifier showed a preference for the singular cases, since they obtained a better position on the speech recogniser's n-best list. Finally, some of the errors were caused by the speech recogniser introducing noise into the hypotheses in the form of extra information. These cases were harder to classify, since the hypotheses were too different from the actual utterance.

A second group of errors corresponds to the false negatives. In this case, the only pattern we found is that the confidence score is relatively low compared with the standard cases (in the range of 50-65). This makes it impossible for the classifier to choose one of the hypotheses, so it rejects all of them, generating a false negative.

The third type of error is related to the cases where the correct hypothesis is not the first one. The system was able to identify 8 of the 22 such cases. Most of the cases which were not identified are related to the use of plurals, singulars, adjectives and adverbs.

5.2 Comparison with previous work

The closest related work is that of Gabsdil and Lemon [2004]. However, that research focuses on speech processing: it explores a combination of acoustic confidence and pragmatic features to predict the quality of incoming speech recognition hypotheses. Gabsdil and Lemon [2004] also used machine learning (TiMBL and RIPPER) with parameter optimisation to identify the most relevant features; their results are presented in Table 5.

Num interactions   302
Classes            accept, reject, clarify, ignore
Baseline           61.81% weighted f-score
Best result        86.39% weighted f-score

Table 5: Characteristics and results of Gabsdil and Lemon [2004]

As can be seen from the table, these experiments include more classes: in addition to accept and reject, the clarify and ignore classes were used. This makes it difficult to compare the results directly. However, a key point is that in both cases information contained in the dialogue context was used, and in both cases its inclusion improved the performance of the components.

6 Further work

There are several directions for further work. One of the most important is to apply this approach to larger corpora. A corpus used in future work has to be larger in two respects: in the actual size of the corpus, and in the length of the dialogues involved. We plan to run similar experiments on the Communicator corpus [Walker et al., 2001], which is being annotated with Information States in the TALK Project.

Another important direction to explore is to measure the actual impact of the improvement in parsing on the dialogue system overall (e.g. on user task completion metrics). Up to now, we have assumed that the improvement shown on the parsing task implies an improvement of the dialogue system as a whole; however, we do not currently have an estimate of how large that overall improvement is.

Since this work is related to Gabsdil and Lemon [2004], a useful direction to explore is the integration of the two approaches. There are two ways to integrate them: first, we can put both filters together in a pipeline architecture; second, we can incorporate the features of Gabsdil and Lemon [2004] with our features and use only one filter. In neither case do we expect the gain in performance to be the sum of both approaches.

Another direction to explore is the use of more robust parsers (e.g. the TRIPS parser [Allen, 1995]).

7 Conclusion

The results presented above confirm that information from the dialogue context can be used to improve the performance of the parsing component of a dialogue system. Our best result, using a Maximum Entropy learner, is a 54.5% reduction in parse error rate, compared to the baseline system of Lemon and Gruenstein [2004]. Further, we identified the most informative elements of the dialogue context for this task. We used techniques and features which we can expect to generalise across dialogue systems which implement the Information State Update approach to dialogue management.

Acknowledgements

This research was supported by a CONACYT 178316/192613 scholarship and by the TALK Project (Sixth Framework Programme of the European Community, contract no. IST-507802, www.talk-project.org). The authors are solely responsible for the content of this document. It does not represent the opinion of the European Community, and the Community is not responsible for any use that might be made of the information contained herein.

References

J. F. Allen. Natural Language Understanding. Benjamin Cummings, 1995.

J. Dowding, J. M. Gawron, D. E. Appelt, J. Bear, L. Cherny, R. Moore, and D. B. Moran. GEMINI: A natural language system for spoken-language understanding. In Meeting of the Association for Computational Linguistics, pages 54-61, 1993.

M. Gabsdil and O. Lemon. Combining acoustic and pragmatic features to predict recognition performance in spoken dialogue systems. In Proc. Association for Computational Linguistics (ACL-04), 2004.

B.-A. Hockey, O. Lemon, E. Campana, L. Hiatt, G. Aist, J. Hieronymus, A. Gruenstein, and J. Dowding. Targeted help for spoken dialogue systems: intelligent feedback improves naive users' performance. In Proceedings of the European Association for Computational Linguistics (EACL '03), pages 147-154, 2003.

S. Larsson and D. Traum. Information state and dialogue management in the TRINDI Dialogue Move Engine Toolkit. Natural Language Engineering, 6(3-4):323-340, 2000.

Z. Lee. Maximum Entropy Modeling Toolkit for Python and C++, 2004. URL www.nlplab.cn/zhangle/maxent_toolkit.html.

O. Lemon. Context-sensitive speech recognition in Information-State Update dialogue systems: results for the grammar switching approach. In Proc. 8th Workshop on the Semantics and Pragmatics of Dialogue, CATALOG '04, 2004.

O. Lemon, A. Bracy, A. Gruenstein, and S. Peters. The WITAS Multi-Modal Dialogue System I. In EuroSpeech, 2001.

O. Lemon and A. Gruenstein. Multithreaded context for robust conversational interfaces: context-sensitive speech recognition and interpretation of corrective fragments. ACM Transactions on Computer-Human Interaction (ACM TOCHI), 11(3):241-267, 2004.

O. Lemon, A. Gruenstein, A. Battle, and S. Peters. Collaborative activities and multi-tasking in dialogue systems. In Proceedings of SIGdial, 2002.

E. F. Tjong Kim Sang and F. De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of CoNLL-2003, 2003.

M. A. Walker, R. J. Passonneau, and J. E. Boland. Quantitative and Qualitative Evaluation of DARPA Communicator Spoken Dialogue Systems. In Meeting of the Association for Computational Linguistics, pages 515-522, 2001.