A metric for automatically evaluating coherent summaries via context chains

2009 IEEE International Conference on Semantic Computing

Frank Schilder and Ravi Kondadadi
Thomson Reuters Corporation, Research & Development
610 Opperman Drive, St. Paul, MN 55104, USA
{frank.schilder, ravikumar.kondadadi}@thomsonreuters.com

Abstract

This paper introduces a new metric for automatically evaluating summaries called ContextChain. Based on an in-depth analysis of the TAC 2008 update summarization results, we show that previous automatic metrics such as ROUGE-2 and BE cannot reliably predict strong performing systems. We introduce two new terms, Correlation Recall and Correlation Precision, and discuss how they cast more light on the coverage and the correctness of the respective metric. Our newly proposed metric, ContextChain, incorporates findings from Giannakopoulos et al. (2008) [4] and Barzilay and Lapata (2008) [2]. We show that our metric correlates with Responsiveness scores even for the top n systems that participated in the TAC 2008 update summarization task, whereas ROUGE-2 and BE do not show a correlation for the top 25 systems.

1. Introduction

NIST has been organizing summarization competitions for the last several years and has produced manual evaluations based on a metric called Responsiveness. Responsiveness is defined as a metric for how well a summary meets the information need of a user asking a complex question. Creating such an evaluation is very labor-intensive, because every summary has to be judged by a human. Consequently, much effort has been put into the development of an automatic metric for evaluating summarization systems in order to advance the state of the art for automatic summarization more quickly.

NIST evaluated automatically generated summaries by utilizing the ROUGE metric for the recent Document Understanding Conferences (DUC) [9] and last year's Text Analysis Conference (TAC) [3]. ROUGE relies on the statistical analysis of co-occurring word n-grams between the peer and reference summary. However, the last two summarization tasks defined by NIST, for 2007 and 2008, showed that ROUGE has two shortcomings. First, the best systems in the DUC 2007 competition received ROUGE-2 values close or equivalent to those of some human-written summaries. Given this situation, it becomes more and more difficult to measure progress via an automatic metric. Second, a closer analysis of the top systems showed that there was no or very little correlation between the automatic metric and the Responsiveness score, although the overall correlation between these automatic metrics and Responsiveness was still high for the full set of 58 evaluated systems [10]. We conclude that low ROUGE-2 (and BE) scores can be seen as a reliable indication of low summarization performance, but high ROUGE-2 (and BE) scores are not a sufficient differentiator between good and very good performing systems.

The two main contributions of our paper are the following:

1. A more detailed analysis of how automatic metrics indicate top performing systems than described in [10]. We present a comparison of the top 10-35 systems sorted according to (a) Responsiveness and (b) the automatic metric. We note that an analysis according to sorting criterion (a) indicates whether all good systems are reliably found by the automatic metric, whereas an analysis according to sorting criterion (b) describes to what degree the metric delivers correct results. These two views can be seen as recall and precision, respectively.

2. A new evaluation metric called ContextChain focusing on the local coherence of the automatic summaries. The new evaluation metric we propose relies on an n-gram graph and puts more emphasis on the linear coherence of the written summaries.

An automatically generated summary may receive a high ROUGE score because it contains many relevant n-grams, but it may be badly structured because the system did not consider local coherence constraints. Our automatic metric tries to capture local coherence by extracting the local context of named entities and keeping the typical sequence in which entities and concepts are introduced. Our analysis shows that ContextChain correlates significantly better with Responsiveness than previously used automatic metrics for the top 10-35 performing systems, and that it performs comparably to, and partly better than, the recently proposed AutoSummENG metric [4].

The remainder of this paper is organized as follows. First, we discuss related work before describing our new approach in more detail. Section 3 provides an overview of our new evaluation metric. Section 4 gives the main task definition for TAC 2008. Section 5 discusses the evaluation for the TAC 2008 update task, while a more in-depth analysis of this evaluation and further automatic metrics is presented in Section 6. Section 7 concludes the paper and discusses next research steps.

2. Related Work

ROUGE [6, 7] is one of the first automatic summarization evaluation metrics proposed. ROUGE uses lexical n-grams to compare human-written model summaries with automatically generated summaries. Later, Hovy et al. [5] proposed an approach to automatic evaluation based on the concept of Basic Elements. A Basic Element (BE) is a semantic unit extracted from a sentence, such as a subject-object relation or a modifier-object relation. Systems with a higher overlap between system-summary BEs and human-summary BEs get higher BE scores.

Recently, AutoSummENG was introduced as a summarization evaluation method that evaluates summaries by extracting and comparing graphs of character and word n-grams [4]. Both the model and system summaries are represented as graphs. Edges in the graph are created based on the adjacency relation between n-grams. The edges are weighted according to the distance between the neighbors or the number of co-occurrences within the text. Similarity between two graphs is computed as the number of common edges; it can also take the weights of the common edges into account. In Section 6, we evaluate these three automatic metrics on the TAC 2008 evaluation results.

Two other proposals for new evaluation metrics address the question of improving the evaluation metric in general, but they do not address the problem of low correlations for the top n systems discussed in this paper. Tratz and Hovy (2008) [11] describe a new implementation of the BE method, called BE with Transformations for Evaluation (BEwTE), that includes a significantly improved matching capability using a variety of operations to transform and match BEs in various ways. Louis and Nenkova (2008) [1] use features based on the distribution of terms in the input and the model summary. They use KL divergence, JS divergence and cosine similarity to compute the similarity of the term distributions of the input and the model summary.

3. Context Chains and n-gram graphs

Giannakopoulos et al. [4] propose a method called AutoSummENG that generates n-gram graphs for the model summaries and the automatically generated summaries. The AutoSummENG summarization evaluation metric is based on the similarity between the n-gram graph representations of the generated system summaries and the model summaries. An n-gram graph can be generated for word or character windows. A 2-gram graph (n = 2) for the following sentence can be constructed by first generating all 2-grams: "A quick brown fox jumps over the lazy dog."

Figure 1 shows the complete graph generated from this sentence. In addition, weights on the edges can indicate the distance between the neighbors or the number of occurrences in the text. By creating edges between adjacent n-grams, this approach takes contextual information into consideration, as opposed to approaches that only use the n-gram overlap between the system and model summaries. Similarity between the graphs is computed via the Value Similarity, the Size Similarity, and the Co-occurrence Similarity (see [4] on how to compute these scores). Giannakopoulos et al. show that their approach is superior to past automatic metrics such as ROUGE and BE for the DUC 2005, 2006 and 2007 summarization tasks.

Our approach is an extension of AutoSummENG that generates n-gram graphs based on co-reference chains. It models local coherence by establishing chains of potentially co-referent named entities and definite descriptions. The n-gram graph is then generated from the context of these referents. Consider the beginning of a news story shown in Figure 2. These n-grams can be seen as the events the entities mentioned in the summaries are involved in, and the links determine the sequence in which the events should be mentioned. The links therefore capture the local coherence found in the model summaries. Note that this is a main difference between our approach and the other, purely n-gram-based approaches: an automatically generated summary may share many n-grams with the model summaries, but the sequence in which the events are presented may be incoherent and hence decrease the readability of the summary.
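The paper gives no implementation details beyond the description above, but the core idea can be sketched in a few lines of Python. The sketch below is illustrative only: it uses a naive substring match in place of LingPipe's named-entity tagging, treats the content words of each sentence mentioning an entity as that entity's "context", and scores two summaries by an unweighted overlap of their chain edges rather than AutoSummENG's value similarity. It uses the news-story fragment shown in Figure 2 below as input.

```python
import re

# words ignored when collecting context n-grams (illustrative stop list)
STOP = {"the", "a", "an", "is", "of", "to", "that", "with", "and", "in", "into"}

def context_ngrams(text, entity, n=4):
    """Collect word n-grams (stop words and the entity itself removed) from the
    sentences in which the entity is mentioned.  A crude stand-in for the
    LingPipe-based named-entity tagging and chunking used in the paper."""
    grams = []
    for sent in re.split(r"(?<=[.!?])\s+", text):
        if entity.lower() not in sent.lower():
            continue
        words = [w for w in re.findall(r"[A-Za-z][A-Za-z-]*", sent)
                 if w.lower() not in STOP and w.lower() != entity.lower()]
        grams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams

def context_chain_edges(text, entities, n=4):
    """Link consecutive context n-grams of the same entity: the edges record
    the order in which events around an entity are mentioned."""
    edges = set()
    for ent in entities:
        grams = context_ngrams(text, ent, n)
        edges.update(zip(grams, grams[1:]))
    return edges

def edge_overlap(model_edges, system_edges):
    """Rough, unweighted similarity: Jaccard overlap of the two edge sets."""
    if not model_edges or not system_edges:
        return 0.0
    return len(model_edges & system_edges) / len(model_edges | system_edges)

if __name__ == "__main__":
    model = ("The Justice Department is conducting an anti-trust trial against "
             "Microsoft Corp with evidence that the company is increasingly "
             "attempting to crush competitors. Microsoft is accused of trying "
             "to forcefully buy into markets.")
    system = ("Microsoft is accused of trying to forcefully buy into markets. "
              "The Justice Department is conducting an anti-trust trial "
              "against Microsoft Corp.")
    print(edge_overlap(context_chain_edges(model, ["Microsoft"]),
                       context_chain_edges(system, ["Microsoft"])))
```

A summary that reorders the events around an entity produces different chain edges even when its n-gram inventory is largely unchanged, which is exactly the property the metric is after.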

[Figure 1. An n-gram graph (the complete 2-gram graph for the example sentence above).]

Figure 2 (example text and the two context chains generated for one named entity):

The Justice Department is conducting an anti-trust trial against Microsoft Corp with evidence that the company is increasingly attempting to crush competitors. Microsoft is accused of trying to forcefully buy into markets...

All context 4-grams (minus stop words) for the named entity Microsoft:

    Department conducting anti-trust trial
    evidence company increasingly attempting
    accused trying forcefully buy

Two context chains are generated:

    Department conducting anti-trust trial -> accused trying forcefully buy
    evidence company increasingly attempting -> accused trying forcefully buy

We implemented our approach within the AutoSummENG GUI, which is freely available. For named entity extraction and chunking, we used LingPipe's named entity tagger and chunker (Baldwin, B. and B. Carpenter. LingPipe. http://www.aliasi.com/lingpipe/).

4. TAC 2008 main task description

The main task in 2008 addressed the challenge of providing an update summary for a cluster of documents, given that the user has already read documents on the topic. Consequently, the update summary should not contain information that the user is already aware of. More precisely, the task is divided into two sub-tasks. The goal of the first summarization sub-task is to produce a normal query-based multi-document summary of a cluster of news documents. The second sub-task assumes that the information described in the first cluster is already known to a user who would like to receive a summary of a second cluster. The first cluster of documents is summarized as a multi-document summary, whereas the second cluster is summarized taking into account the knowledge present in the first cluster.

The input for this update task is a list of topics, each of which contains a title, a sequence of questions and two clusters of 10 documents each. An example topic with its narrative is the following:

<title> Kyoto Protocol Implementation </title>
<narrative> Track the implementation of key elements of the Kyoto Protocol by the 30 signatory countries. Describe what specific measures will actually be taken or not taken by various countries in response to the key climate change mitigation elements of the Kyoto Protocol. </narrative>

5. TAC 2008 Evaluations

NIST carried out a manual evaluation and an automatic evaluation. This section presents the overall results with respect to how the different metrics correlate with each other (for more detailed information on the systems' approaches and performance, see [3]).
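All correlations reported in the remainder of the paper are Pearson coefficients computed over per-system scores. As a reminder of what that computation involves, here is a minimal, self-contained sketch with invented scores (not the actual TAC 2008 numbers):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# invented per-system scores, one entry per summarizer (not the TAC 2008 data)
responsiveness = [2.6, 2.4, 2.3, 2.1, 1.9, 1.6]
pyramid        = [0.30, 0.27, 0.26, 0.22, 0.18, 0.12]
print(round(pearson_r(responsiveness, pyramid), 3))
```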

5.1. Manual evaluation

Summaries were manually evaluated for linguistic quality, Responsiveness and Pyramid score. The overall Responsiveness score is an integer between 1 (very poor) and 5 (very good) and is based on both the linguistic quality of the summary and the amount of information in the summary that helps to satisfy the information need expressed in the topic narrative. The Pyramid scores were created by NIST assessors from the four model summaries for each document set and the peer summaries, using the pyramid guidelines provided by Columbia University. Responsiveness and Pyramid scores correlate highly with each other, as shown in Figure 3. The Pearson coefficient for these two manual metrics was 0.950684.

[Figure 3. Responsiveness vs. Pyramid score for the update task (manual evaluations).]

5.2. Automatic evaluation

At TAC 2008, two automatic metrics were used: ROUGE and BE. Analyzing the correlation between the ROUGE-2 and Responsiveness scores, two observations can be made. The Pearson coefficient for all systems is still high, but not as high as for Responsiveness vs. Pyramid score (i.e., 0.8941). However, focusing only on the top systems, there is no correlation between Responsiveness and ROUGE-2. For example, Figure 4 indicates that the top 22 systems according to ROUGE-2 (i.e., ROUGE-2 scores > 0.08) show no correlation between ROUGE-2 and Responsiveness; the Pearson coefficient for these systems is 0.0687. Given this first observation, we may conclude that only low ROUGE-2 scores (i.e., < 0.08) can be seen as an indication of summarization performance, whereas high ROUGE-2 scores cannot differentiate between good and very good performing systems.

[Figure 4. Responsiveness vs. ROUGE-2 for the update task (manual vs. automatic metrics); the top systems and the rest are marked separately.]

For the BE evaluation, a similar picture emerged. The correlation between Responsiveness and BE for all systems was relatively high (Pearson's r was 0.9106). Considering only the top 22 systems, the Pearson coefficient for BE and Responsiveness showed only a weak correlation between the automatic and the manual evaluation metric: the Pearson coefficient is 0.4409, and the confidence intervals show that this correlation is not significant.

6. More automatic metrics

Given the low correlations between ROUGE-2 and BE for the top 22 systems, we investigated other metrics that may show a higher correlation. We tried the AutoSummENG metric and developed our own metric utilizing the AutoSummENG software. In our experiments, we ran the TAC 2008 system summaries through AutoSummENG and our new metric, ContextChain, for different numbers of top systems (n = 10, 15, 20, 25, 30, 35). For the obtained results, we computed Pearson coefficients in two ways:

Responsiveness-sorted: The two vectors of results were sorted according to Responsiveness scores. This could mean that systems that obtained high scores from the automatic metric, but low Responsiveness scores, were not considered for the correlation evaluation.

Automatic-metric-sorted: The two vectors of results were sorted according to the automatic metric. This could mean that systems that obtained high Responsiveness scores, but low automatic metric scores, were not considered for the correlation evaluation.

[Figure 5. Pearson's r coefficients (Correlation Recall) for the n top systems sorted by Responsiveness (n = 10, 15, 20, 25, 30, 35), for ContextChain, AutoSummENG (words), BE, AutoSummENG (characters) and ROUGE-2.]

Figure 5 shows the Pearson coefficients for the top 10-35 systems, respectively, when sorted according to Responsiveness. This set-up of the experiment focuses on the top n systems as determined by the manual evaluation metric. An automatic metric that shows high coefficients across the different numbers of top systems shows high coverage (or recall) of the top performing systems. We define this set-up as Correlation Recall. Conversely, an automatic metric that shows a consistently high coefficient for systems sorted according to the automatic metric is reliable in terms of its precision: a high automatic score is then likely to indicate a high performing system in terms of Responsiveness. We define this set-up as Correlation Precision.

Figure 5 contains the Pearson coefficients for the top n systems for all metrics discussed. The values for ROUGE-2 and BE are generally low, and only for high n can we conclude that these metrics correlate with the human evaluation metric (cf. the tables in the appendix). Note that AutoSummENG using characters for the n-gram graphs does not perform very well either. We tried AutoSummENG with words instead, which resulted in a better overall performance, similar to our proposed ContextChain metric.

[Figure 6. Pearson's r coefficients (Correlation Precision) for the n top systems sorted by the automatic evaluation metric (n = 10, 15, 20, 25, 30, 35), for the same five metrics.]

Correlation Precision, on the other hand, seems to be a better indicator of how well a metric can predict strong performing systems. Figure 6 indicates that the previously used automatic metrics ROUGE-2 and BE show low or no correlations for up to the 25 top systems. In fact, for the top 10 or 15 systems the relation between Responsiveness and ROUGE-2 may even be inverse (bear in mind that the confidence intervals for these small samples are very large). BE shows a better performance than ROUGE-2 in this respect, but its Pearson coefficients are still lower than those of the other two metrics. AutoSummENG based on characters again does not perform as well as when words are used for generating the graph. ContextChain is very similar to AutoSummENG with words, while showing a higher coefficient when only the top 10 systems are considered; this difference, however, is not significant.

We also analyzed the results of these experiments regarding their significance. The appendix contains tables showing the confidence intervals, computed via the Fisher r-to-z transformation [8], for Correlation Recall and Correlation Precision, respectively.
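For illustration, the two set-ups and the Fisher r-to-z intervals can be sketched as follows. The scores are invented, scipy's pearsonr merely stands in for whatever correlation implementation was actually used, and the 95% level (z_crit = 1.96) is an assumption, since the paper does not state the confidence level of the reported intervals.

```python
import math
from scipy.stats import pearsonr  # any Pearson implementation would do

def top_n_r(responsiveness, metric, n, sort_by_metric=False):
    """Pearson's r over the top-n systems.
    sort_by_metric=False -> Correlation Recall    (top n by Responsiveness)
    sort_by_metric=True  -> Correlation Precision (top n by the automatic metric)
    """
    key = (lambda p: p[1]) if sort_by_metric else (lambda p: p[0])
    pairs = sorted(zip(responsiveness, metric), key=key, reverse=True)[:n]
    r, _ = pearsonr([p[0] for p in pairs], [p[1] for p in pairs])
    return r

def fisher_ci(r, n, z_crit=1.96):
    """Confidence interval for r via Fisher's r-to-z transformation
    (z_crit = 1.96 assumes a two-sided 95% interval)."""
    z, se = math.atanh(r), 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# invented scores for 8 systems (not the TAC 2008 results)
resp   = [2.7, 2.6, 2.5, 2.3, 2.2, 2.0, 1.8, 1.5]
metric = [0.41, 0.35, 0.44, 0.30, 0.33, 0.25, 0.20, 0.15]

for n in (5, 8):
    recall_r    = top_n_r(resp, metric, n)                        # recall view
    precision_r = top_n_r(resp, metric, n, sort_by_metric=True)   # precision view
    print(n, round(recall_r, 3), round(precision_r, 3),
          tuple(round(v, 3) for v in fisher_ci(recall_r, n)))
```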

7. Conclusions

This paper introduced a new metric for automatically evaluating summaries. Based on an in-depth analysis of the recent TAC 2008 update summarization results, we showed that previous automatic metrics such as ROUGE-2 and BE cannot reliably predict strong performing systems. We introduced two new terms, Correlation Recall and Correlation Precision, and discussed how they cast more light on the coverage and the correctness of the respective metric.

Our newly introduced metric uses only the context of named entities and definite descriptions in a summary. Linking the contexts of named entities and definite descriptions turned out to be a useful tool for predicting the quality of a summary. We hypothesize that these tuples of n-grams capture important semantic and discourse-level links between the entities described by the text. Hence, these context links may also be useful for other applications such as information extraction or discourse parsing. Whether such semantic applications can benefit from context chains is left to future research.

A. Confidence intervals

Correlation Recall (sorted by Responsiveness)

ContextChain     n=10      n=15      n=20      n=25      n=30
Pearson's r      0.548     0.717**   0.741**   0.775**   0.867**
upper range      0.875     0.899     0.891     0.896     0.935
lower range     -0.125     0.323     0.445     0.548     0.736

AutoSumm         n=10      n=15      n=20      n=25      n=30
Pearson's r      0.822**   0.837**   0.565**   0.686**   0.828**
upper range      0.957     0.944     0.806     0.85      0.915
lower range      0.399     0.569     0.164     0.399     0.666

ROUGE-2          n=10      n=15      n=20      n=25      n=30
Pearson's r      0.405     0.272     0.278     0.218     0.496**
upper range      0.824     0.688     0.642     0.565     0.726
lower range     -0.301    -0.279    -0.188    -0.193     0.165

BE               n=10      n=15      n=20      n=25      n=30
Pearson's r      0.494     0.409     0.532**   0.448**   0.641**
upper range      0.857     0.762     0.789     0.716     0.813
lower range     -0.197    -0.13      0.117     0.064     0.365

Correlation Precision (sorted by automatic metric)

ContextChain     n=10      n=15      n=20      n=25      n=30
Pearson's r      0.546     0.626**   0.544**   0.627**   0.681**
upper range      0.875     0.862     0.795     0.819     0.836
lower range     -0.127     0.167     0.134     0.309     0.425

AutoSumm         n=10      n=15      n=20      n=25      n=30
Pearson's r      0.296     0.578**   0.553**   0.695**   0.747**
upper range      0.78      0.841     0.8       0.855     0.872
lower range     -0.41      0.094     0.147     0.413     0.529

ROUGE-2          n=10      n=15      n=20      n=25      n=30
Pearson's r     -0.662    -0.213    -0.008     0.244     0.478**
upper range     -0.055     0.336     0.436     0.583     0.715
lower range     -0.912    -0.654    -0.449    -0.167     0.143

BE               n=10      n=15      n=20      n=25      n=30
Pearson's r      0.233     0.125     0.377     0.438**   0.624**
upper range      0.752     0.599     0.702     0.71      0.804
lower range     -0.465    -0.413    -0.078     0.052     0.341

References

[1] A. Louis and A. Nenkova. Automatic summary evaluation without human models. In Proceedings of the First Text Analysis Conference (TAC 2008), 2008.
[2] R. Barzilay and M. Lapata. Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1-34, 2008.
[3] H. T. Dang. Update summarization task and opinion summarization pilot task. In Proceedings of the First Text Analysis Conference (TAC 2008), Gaithersburg, MD, Nov. 2008. National Institute of Standards and Technology.
[4] G. Giannakopoulos, V. Karkaletsis, G. Vouros, and P. Stamatopoulos. Summarization system evaluation revisited: N-gram graphs. ACM Transactions on Speech and Language Processing, 5(3):1-39, 2008.
[5] E. Hovy, C.-Y. Lin, and L. Zhou. Evaluating DUC 2005 using Basic Elements. In Proceedings of the Document Understanding Conference (DUC), Vancouver, B.C., Canada, 2005.
[6] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop on Text Summarization Branches Out, 2004.
[7] C.-Y. Lin and E. Hovy. Manual and automatic evaluation of summaries. In Proceedings of the ACL-02 Workshop on Automatic Summarization, pages 45-51, Morristown, NJ, USA, 2002. Association for Computational Linguistics.
[8] G. Loftus and E. Loftus. Essence of Statistics. McGraw-Hill, 2nd edition, 1988.
[9] P. Over, H. Dang, and D. Harman. DUC in context. Information Processing and Management, 43(6):1506-1520, 2007.
[10] F. Schilder, R. Kondadadi, J. L. Leidner, and J. G. Conrad. Thomson Reuters at TAC 2008: Aggressive Filtering with FastSum for Update and Opinion Summarization. In Proceedings of the First Text Analysis Conference (TAC 2008), Gaithersburg, MD, 2008. NIST.
[11] S. Tratz and E. Hovy. Summarization evaluation using transformed basic elements. In Proceedings of the First Text Analysis Conference (TAC 2008), 2008.