2009 IEEE International Conference on Semantic Computing

A metric for automatically evaluating coherent summaries via context chains

Frank Schilder and Ravi Kondadadi
Thomson Reuters Corporation, Research & Development
610 Opperman Drive, St. Paul, MN 55104, USA
frank.schilder,ravikumar.kondadadi@thomsonreuters.com

Abstract

This paper introduces a new metric for automatically evaluating summaries called ContextChain. Based on an in-depth analysis of the TAC 2008 update summarization results, we show that previous automatic metrics such as ROUGE-2 and BE cannot reliably predict strong-performing systems. We introduce two new terms, Correlation Recall and Correlation Precision, and discuss how they cast more light on the coverage and the correctness of the respective metric. Our newly proposed metric, ContextChain, incorporates findings from Giannakopoulos et al. (2008) [4] and Barzilay and Lapata (2008) [2]. We show that our metric correlates with Responsiveness scores even for the top n systems that participated in the TAC 2008 update summarization task, whereas ROUGE-2 and BE show no correlation for the top 25 systems.

1. Introduction

NIST has been organizing summarization competitions for the last several years and has produced manual evaluations based on a metric called Responsiveness. Responsiveness is defined as a measure of how well a summary meets the information need of a user asking a complex question. Creating such an evaluation is very labor-intensive, because every summary has to be judged by a human. Consequently, much effort has been put into developing an automatic metric for evaluating summarization systems in order to advance the state of the art in automatic summarization more quickly. NIST evaluated automatically generated summaries with the ROUGE metric for the recent Document Understanding Conferences (DUC) [9] and last year's Text Analysis Conference (TAC) [3].
ROUGE relies on a statistical analysis of co-occurring word n-grams between the peer and reference summaries. However, the last two summarization tasks defined by NIST, for 2007 and 2008, showed that ROUGE has two shortcomings. First, the best systems in the DUC 2007 competition received ROUGE-2 values close or equivalent to those of some human-written summaries. Given this situation, it becomes more and more difficult to measure progress via an automatic metric. Second, a closer analysis of the top systems showed that there was no or very little correlation between the automatic metric and the Responsiveness score, although the overall correlation between these automatic metrics and Responsiveness was still high for the full set of 58 evaluated systems [10]. We conclude that low ROUGE-2 (and BE) scores can be seen as a reliable indication of low summarization performance, but high ROUGE-2 (and BE) scores are not a sufficient differentiator between good and very good performing systems.

The two main contributions of our paper are the following:

1. A more detailed analysis than that described in [10] of how well automatic metrics indicate top-performing systems. We present a comparison of the top 10-35 systems sorted according to (a) Responsiveness and (b) the automatic metric. We note that an analysis according to sorting criterion (a) indicates whether all good systems are reliably found by the automatic metric, whereas an analysis according to sorting criterion (b) describes to what degree the metric delivers correct results. These two views can be seen as recall and precision, respectively.

2. A new evaluation metric called ContextChain that focuses on the local coherence of automatic summaries. The new evaluation metric we propose relies on an n-gram graph and puts more emphasis on the linear coherence of the written summaries.
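To make the discussion concrete, the bigram-overlap idea behind ROUGE-2 can be sketched as follows. This is a minimal illustration of clipped bigram recall against a single reference, not the official ROUGE implementation (which adds stemming, multiple references, and other options):

```python
from collections import Counter

def bigrams(text):
    """Lowercased word bigrams of a text, with multiplicities."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2_recall(peer, reference):
    """Clipped bigram overlap divided by the reference bigram count."""
    peer_bg, ref_bg = bigrams(peer), bigrams(reference)
    overlap = sum(min(count, peer_bg[bg]) for bg, count in ref_bg.items())
    total = sum(ref_bg.values())
    return overlap / total if total else 0.0
```

A peer that reproduces three of a reference's five bigrams scores 0.6, regardless of the order in which those bigrams appear — which is precisely the coherence blind spot discussed below.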
An automatically generated summary may receive a high ROUGE score, because it contains many relevant n-grams, but may be
badly structured because the system did not consider local coherence constraints. Our automatic metric tries to capture local coherence by extracting the local context of named entities and keeping the typical sequence in which entities and concepts are introduced. Our analysis shows that ContextChain correlates significantly better with Responsiveness than previously used automatic metrics for the top 10-35 performing systems, and shows comparable and partly better performance than the recently proposed AutoSumm metric [4].

The remainder of this paper is organized as follows. First, we discuss related work before describing our new approach in more detail. Section 3 provides an overview of our new evaluation metric. Section 4 gives the main task definition for TAC 2008. Section 5 discusses the evaluation for the TAC 2008 update task, while a more in-depth analysis of this evaluation and of further automatic metrics is presented in Section 6. Section 7 concludes the paper and discusses next research steps.

2. Related Work

ROUGE [6, 7] was one of the first automatic summarization evaluation metrics proposed. ROUGE uses lexical n-grams to compare human-written model summaries with automatically generated summaries. Later, Hovy et al. [5] proposed an approach to automatic evaluation based on the concept of Basic Elements. A Basic Element (BE) is a semantic unit extracted from a sentence, such as a subject-object or modifier-object relation. Systems with a higher overlap between system-summary BEs and human-summary BEs get higher BE scores. Recently, AutoSummENG was introduced as a summarization evaluation method that evaluates summaries by extracting and comparing graphs of character and word n-grams [4]. Both the model and system summaries are represented as graphs. Edges in the graph are created based on the adjacency relation between n-grams.
The edges are weighted according to the distance between the neighbors or the number of co-occurrences within the text. Similarity between two graphs is computed as the number of common edges; the similarity can also take the weights of the common edges into account. In Section 6, we evaluate these three automatic metrics on the TAC 2008 evaluation results.

Two other proposals for new evaluation metrics address the question of improving evaluation metrics in general, but they do not address the problem of low correlations for the top n systems discussed in this paper. Tratz and Hovy (2008) [11] describe a new implementation of the BE method, called BE with Transformations for Evaluation (BEwTE), that includes a significantly improved matching capability using a variety of operations to transform and match BEs in various ways. Louis and Nenkova (2008) [1] use features based on the distribution of terms in the input and the model summary. They use KL divergence, JS divergence, and cosine similarity to compute the similarity of the term distributions of the input and the model summary.

3. Context Chains and n-gram graphs

Giannakopoulos et al. [4] propose a method called AutoSummENG that generates n-gram graphs for the model summaries and the automatically generated summaries. The AutoSummENG summarization evaluation metric is based on the similarity between the n-gram graph representations of the generated system summaries and the model summaries. An n-gram graph can be generated over word or character windows. A 2-gram graph (n = 2) for the following sentence can be constructed by first generating all 2-grams:

A quick brown fox jumps over the lazy dog.

Figure 1 shows the complete graph generated from this sentence. In addition, weights on the edges can indicate the distance between the neighbors or the number of occurrences in the text.
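A simplified sketch of this construction is given below. It builds a word 2-gram graph with edges between adjacent n-grams and co-occurrence counts as weights, plus a plain shared-edge similarity; AutoSummENG's actual weighting scheme and its Value, Size, and Co-occurrence similarities are defined in [4], so the function names and the `window` parameter here are illustrative assumptions:

```python
from collections import Counter

def word_ngram_graph(text, n=2, window=1):
    """Build an n-gram graph: nodes are word n-grams, and an edge links
    two n-grams occurring within `window` positions of each other.
    Edge weights count how often each pair co-occurs."""
    tokens = text.lower().replace(".", "").split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[tuple(sorted((g, grams[j])))] += 1  # undirected edge
    return edges

def graph_similarity(g1, g2):
    """Simplified co-occurrence similarity: fraction of shared edges."""
    if not g1 and not g2:
        return 1.0
    common = g1.keys() & g2.keys()
    return len(common) / max(len(g1), len(g2))
```

Comparing the graph of a system summary against the graph of a model summary then rewards not just shared n-grams but shared adjacencies between them.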
By creating edges between adjacent n-grams, this approach takes contextual information into consideration, as opposed to approaches that only use the n-gram overlap between the system and model summaries. Similarity between the graphs is computed via the Value Similarity, the Size Similarity, and the Co-occurrence Similarity.1 Giannakopoulos et al. show that their approach is superior to past automatic metrics such as ROUGE and BE for the DUC 2005, 2006, and 2007 summarization tasks.

Our approach is an extension of AutoSummENG that generates n-gram graphs based on co-reference chains. It models local coherence by establishing chains of potentially co-referent named entities and definite descriptions. The n-gram graph is then generated from the contexts of these referents. Consider the beginning of a news story shown in Figure 2. These n-grams can be seen as the events the entities mentioned in the summaries are involved in, and the links determine the sequence in which the events should be mentioned. The links therefore capture the local coherence, as found in the model summaries. Note that this is a main difference between our approach and the other, purely n-gram-based approaches. An automatically generated summary may share many n-grams with the model summaries, but the sequence in which the events are presented may be incoherent and hence decrease the readability of the summary.

1 See [4] on how to compute these scores.
Figure 1. An n-gram graph

The Justice Department is conducting an anti-trust trial against Microsoft Corp with evidence that the company is increasingly attempting to crush competitors. Microsoft is accused of trying to forcefully buy into markets...

All context 4-grams (minus stop words) for the named entity Microsoft:

Department conducting anti-trust trial
evidence company increasingly attempting
accused trying forcefully buy

Two context chains are generated:

Department conducting anti-trust trial accused trying forcefully buy
evidence company increasingly attempting accused trying forcefully buy

Figure 2. Example text and two example context chains generated for one named entity

We implemented our approach within the AutoSumm GUI, which is freely available. For named entity extraction and chunking, we used LingPipe's named entity tagger and chunker.2

4. TAC 2008 main task description

The main task in 2008 addressed the challenge of providing an update summary for a cluster of documents, given that the user has already read documents on this topic. Consequently, the update summary should not contain information that the user is already aware of. More precisely, the task is divided into two sub-tasks. The goal of the first summarization sub-task is to produce a normal query-based multi-document summary of a cluster of news documents. The second sub-task assumes that the information described in the first cluster is already known to a user who would like to receive a summary for a second cluster. The first cluster of documents is summarized as a multi-document summary, whereas the second cluster is summarized taking into account the knowledge present in the first cluster. The input for the entire update task is a list of topics, each of which contains a title, a sequence of questions, and two clusters of 10 documents each.

2 Baldwin, B. and B. Carpenter. LingPipe. http://www.aliasi.com/lingpipe/.
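The context extraction illustrated in Figure 2 can be approximated as below. This is a simplified reconstruction: the paper uses LingPipe for entity detection, whereas this sketch matches entity mentions by plain string comparison, uses a hypothetical stop-word list, and links every context gram of one mention to every context gram of the next (the figure shows only a selection of the resulting chains):

```python
# Hypothetical stop-word list for this example; the actual list may differ.
STOP = {"the", "a", "an", "is", "of", "to", "that", "with", "against", "into"}

def entity_contexts(tokens, entity, k=4):
    """For each mention of `entity`, collect its left and right context
    k-grams after removing stop words (cf. Figure 2)."""
    grams = []
    for i, tok in enumerate(tokens):
        if tok.lower() == entity.lower():
            left = [t for t in tokens[:i] if t.lower() not in STOP][-k:]
            right = [t for t in tokens[i + 1:] if t.lower() not in STOP][:k]
            grams.append([g for g in (left, right) if g])
    return grams

def context_chains(grams):
    """Link the context grams of consecutive mentions, yielding chains
    that record the sequence in which events are presented."""
    chains = []
    for cur, nxt in zip(grams, grams[1:]):
        for a in cur:
            for b in nxt:
                chains.append(a + b)
    return chains
```

Run on the (punctuation-stripped) Figure 2 text, this recovers the chain "Department conducting anti-trust trial accused trying forcefully buy" for the entity Microsoft.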
An example topic with its question is the following:

<title> Kyoto Protocol Implementation </title>
<narrative> Track the implementation of key elements of the Kyoto Protocol by the 30 signatory countries. Describe what specific measures will actually be taken or not taken by various countries in response to the key climate change mitigation elements of the Kyoto Protocol. </narrative>

5. TAC 2008 Evaluations

NIST carried out a manual evaluation and an automatic evaluation. This section presents the overall results with respect to how the different metrics correlate with each other.3

3 For more detailed information on the systems' approaches and performances, see [3].
5.1. Manual evaluation

Summaries were manually evaluated for linguistic quality, Responsiveness, and Pyramid score. The overall Responsiveness score is an integer between 1 (very poor) and 5 (very good) and is based on both the linguistic quality of the summary and the amount of information in the summary that helps to satisfy the information need expressed in the topic narrative. The Pyramid scores were created by NIST assessors from the four model summaries for each document set and the peer summaries, using the pyramid guidelines provided by Columbia University. Responsiveness and Pyramid scores correlate highly with each other, as shown in Figure 3. The Pearson coefficient for these two manual metrics was 0.950684.

Figure 3. Responsiveness/Pyramid (update task: manual evaluations)

5.2. Automatic evaluation

At TAC 2008, two automatic metrics were used: ROUGE and BE. Analyzing the correlation between the ROUGE-2 and Responsiveness scores, two observations can be made. The Pearson coefficient for all systems is still high, but not as high as for Responsiveness vs. Pyramid score (i.e., 0.8941). However, focusing only on the top systems, there is no correlation between Responsiveness and ROUGE-2. For example, Figure 4 indicates that the top 22 systems according to ROUGE-2 (i.e., ROUGE-2 scores > 0.08) show no correlation between ROUGE-2 and Responsiveness. The Pearson coefficient for these systems is 0.0687. Given this first observation, we may conclude that only low ROUGE-2 scores (i.e., < 0.08) can be seen as an indication of summarization performance, whereas high ROUGE-2 scores cannot differentiate between good and very good performing systems.

Figure 4. Responsiveness/ROUGE-2 (update task: manual vs. automatic metrics)

For the BE evaluation, a similar picture emerged.
The correlation between Responsiveness and BE for all systems was relatively high: Pearson's r was 0.9106. Considering only the top 22 systems, however, the Pearson coefficient for BE and Responsiveness shows only a weak correlation between the automatic and the manual evaluation metric. The Pearson coefficient is 0.4409, and the confidence intervals show that this correlation is not significant.

6. More automatic metrics

Given the low correlations of ROUGE-2 and BE with Responsiveness for the top 22 systems, we investigated other metrics that may show a higher correlation. We tried the AutoSumm metric and developed our own metric utilizing the AutoSumm software. In our experiments, we ran the TAC 2008 system outputs through AutoSumm and our new system, called ContextChain, for different numbers of top systems (n = 10, 15, 20, 25, 30, 35). For the obtained results, we computed Pearson coefficients in two ways:

Responsiveness-sorted: The two vectors of results were sorted according to Responsiveness scores. This could mean that systems that obtained high scores from the automatic metric, but low Responsiveness
scores, were not considered for the correlation evaluation.

Automatic-metric-sorted: The two vectors of results were sorted according to the automatic metric. This could mean that systems that obtained high Responsiveness scores, but low automatic metric scores, were not considered for the correlation evaluation.

Figure 5. Correlation Recall: Pearson's r for the n top systems sorted by Responsiveness (metrics: ContextChain, AutoSummENG (words), BE, AutoSummENG (characters), ROUGE-2)

Figure 5 shows the Pearson coefficients for the top 10-35 systems, respectively, when sorted according to Responsiveness. This set-up of the experiment focuses on the top n systems determined by the manual evaluation metric. An automatic metric that shows high coefficients throughout the different numbers of top systems shows high coverage (or recall) of the top-performing systems. We define this set-up as Correlation Recall. Conversely, an automatic metric that shows a consistently high coefficient for systems sorted according to the automatic metric is reliable in terms of its precision. In other words, a high automatic score is likely to indicate a high-performing system in terms of Responsiveness. We define this set-up as Correlation Precision.

Figure 5 contains the Pearson coefficients for the top n systems for all metrics discussed. The values for ROUGE-2 and BE are generally low, and only high n allow the conclusion that the metric correlates with the human evaluation metric (cf. the tables in the appendix). Note that AutoSummENG using characters for the n-gram graphs does not perform very well either.

Figure 6. Correlation Precision: Pearson's r for the n top systems sorted by the evaluation metric (metrics: ContextChain, AutoSummENG (words), BE, AutoSummENG (characters), ROUGE-2)
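The two correlation set-ups above can be sketched as follows; the function names and the pair encoding are illustrative assumptions, not the paper's actual code:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's r between two equal-length score vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)  # assumes neither vector is constant

def topn_correlation(scores, n, by):
    """scores: one (responsiveness, metric) pair per system.
    by=0 sorts by Responsiveness (Correlation Recall set-up);
    by=1 sorts by the automatic metric (Correlation Precision set-up)."""
    top = sorted(scores, key=lambda s: s[by], reverse=True)[:n]
    return pearson([s[0] for s in top], [s[1] for s in top])
```

The two set-ups differ only in which column decides membership in the top-n slice; for an imperfect metric they select different system subsets, which is why Figures 5 and 6 can diverge.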
We therefore tried AutoSummENG with words instead, which resulted in a better overall performance, similar to that of our proposed ContextChain metric.

The Correlation Precision, on the other hand, seems to be a better indicator of how well a metric can predict strong-performing systems. Figure 6 indicates that the previously used automatic metrics ROUGE-2 and BE show low or no correlations for up to the top 25 systems. In fact, for the top 10 or 15 systems the relation between Responsiveness and ROUGE-2 may even be inverse.4 BE performs better than ROUGE in that respect, but its Pearson coefficients are still lower than those of the other two metrics. AutoSummENG based on characters again does not perform as well as when words are used for generating the graph. ContextChain is very similar to AutoSummENG with words, while showing a higher coefficient when only the top 10 systems are considered. This difference, however, is not significant.

We also analyzed the results of these experiments regarding their significance. The appendix contains tables showing the confidence intervals computed via the Fisher r-to-z transformation [8] for Correlation Recall and Correlation Precision, respectively.

4 Bear in mind that the confidence intervals for these small samples are very large.
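The Fisher r-to-z confidence intervals reported in the appendix can be computed as sketched below, assuming a standard two-sided 95% interval (critical value 1.96) and standard error 1/sqrt(n - 3):

```python
from math import atanh, tanh, sqrt

def fisher_ci(r, n, z_crit=1.96):
    """Confidence interval for Pearson's r via the Fisher r-to-z
    transformation: z = atanh(r), SE = 1/sqrt(n - 3), interval
    back-transformed with tanh."""
    z = atanh(r)
    se = 1.0 / sqrt(n - 3)
    return tanh(z - z_crit * se), tanh(z + z_crit * se)
```

For r = 0.548 with n = 10 systems this gives an interval of roughly (-0.125, 0.875), matching the ContextChain n = 10 row of the appendix; the interval straddles zero, so that correlation is not significant.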
7. Conclusions

This paper introduced a new metric for automatically evaluating summaries. Based on an in-depth analysis of the recent TAC 2008 update summarization results, we showed that previous automatic metrics such as ROUGE-2 and BE cannot reliably predict strong-performing systems. We introduced two new terms, Correlation Recall and Correlation Precision, and discussed how they cast more light on the coverage and the correctness of the respective metric. Our newly introduced metric uses only the contexts of named entities and definite descriptions in a summary. Linking these contexts turned out to be a useful tool for predicting the quality of a summary. We hypothesize that these tuples of n-grams capture important semantic and discourse-level links between the entities described by the text. Hence, context links may also be useful for other applications such as information extraction or discourse parsing. Whether such applications can benefit from context chains is left to future research.

A.
Confidence intervals

Correlation Recall (sorted by Responsiveness)

ContextChain    n=10     n=15     n=20     n=25     n=30
Pearson's r     0.548    0.717**  0.741**  0.775**  0.867**
upper range     0.875    0.899    0.891    0.896    0.935
lower range    -0.125    0.323    0.445    0.548    0.736

AutoSumm        n=10     n=15     n=20     n=25     n=30
Pearson's r     0.822**  0.837**  0.565**  0.686**  0.828**
upper range     0.957    0.944    0.806    0.850    0.915
lower range     0.399    0.569    0.164    0.399    0.666

ROUGE-2         n=10     n=15     n=20     n=25     n=30
Pearson's r     0.405    0.272    0.278    0.218    0.496**
upper range     0.824    0.688    0.642    0.565    0.726
lower range    -0.301   -0.279   -0.188   -0.193    0.165

BE              n=10     n=15     n=20     n=25     n=30
Pearson's r     0.494    0.409    0.532**  0.448**  0.641**
upper range     0.857    0.762    0.789    0.716    0.813
lower range    -0.197   -0.130    0.117    0.064    0.365

Correlation Precision (sorted by automatic metric)

ContextChain    n=10     n=15     n=20     n=25     n=30
Pearson's r     0.546    0.626**  0.544**  0.627**  0.681**
upper range     0.875    0.862    0.795    0.819    0.836
lower range    -0.127    0.167    0.134    0.309    0.425

AutoSumm        n=10     n=15     n=20     n=25     n=30
Pearson's r     0.296    0.578**  0.553**  0.695**  0.747**
upper range     0.780    0.841    0.800    0.855    0.872
lower range    -0.410    0.094    0.147    0.413    0.529

ROUGE-2         n=10     n=15     n=20     n=25     n=30
Pearson's r    -0.662   -0.213   -0.008    0.244    0.478**
upper range    -0.055    0.336    0.436    0.583    0.715
lower range    -0.912   -0.654   -0.449   -0.167    0.143

BE              n=10     n=15     n=20     n=25     n=30
Pearson's r     0.233    0.125    0.377    0.438**  0.624**
upper range     0.752    0.599    0.702    0.710    0.804
lower range    -0.465   -0.413   -0.078    0.052    0.341

References

[1] A. Louis and A. Nenkova. Automatic summary evaluation without human models. In Proceedings of the First Text Analysis Conference (TAC 2008), Gaithersburg, MD, 2008.
[2] R. Barzilay and M. Lapata. Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1-34, 2008.
[3] H. T. Dang. Update summarization task and opinion summarization pilot task. In Proceedings of the First Text Analysis Conference (TAC 2008), Gaithersburg, MD, Nov. 2008. National Institute of Standards and Technology.
[4] G. Giannakopoulos, V. Karkaletsis, G. Vouros, and P. Stamatopoulos.
Summarization system evaluation revisited: N-gram graphs. ACM Transactions on Speech and Language Processing, 5(3):1-39, 2008.
[5] E. Hovy, C.-Y. Lin, and L. Zhou. Evaluating DUC 2005 using Basic Elements. In Proceedings of the Document Understanding Conference (DUC), Vancouver, B.C., Canada, 2005.
[6] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop on Text Summarization Branches Out, page 10, 2004.
[7] C.-Y. Lin and E. Hovy. Manual and automatic evaluation of summaries. In Proceedings of the ACL-02 Workshop on Automatic Summarization, pages 45-51, Morristown, NJ, USA, 2002. Association for Computational Linguistics.
[8] G. Loftus and E. Loftus. Essence of Statistics. McGraw-Hill, 2nd edition, 1988.
[9] P. Over, H. Dang, and D. Harman. DUC in context. Information Processing and Management, 43(6):1506-1520, 2007.
[10] F. Schilder, R. Kondadadi, J. L. Leidner, and J. G. Conrad. Thomson Reuters at TAC 2008: Aggressive filtering with FastSum for update and opinion summarization. In Proceedings of the First Text Analysis Conference (TAC 2008), Gaithersburg, MD, 2008. NIST.
[11] S. Tratz and E. Hovy. Summarization evaluation using transformed Basic Elements. In Proceedings of the First Text Analysis Conference (TAC 2008), 2008.