Automatic Text Summarization of Newswire: Lessons Learned from the Document Understanding Conference

Ani Nenkova
Columbia University
1214 Amsterdam Ave
New York, NY 10027
ani@cs.columbia.edu

Abstract

Since 2001, the Document Understanding Conferences have been the forum for researchers in automatic text summarization to compare methods and results on common test sets. Over the years, several types of summarization tasks have been addressed: single-document summarization, multi-document summarization, summarization focused by question, and headline generation. This paper is an overview of the results achieved in the different types of summarization tasks. We compare both the broader classes of baselines, systems, and humans, as well as individual pairs of summarizers (both human and automatic). An analysis of variance model is fitted, with summarizer and input set as independent variables and the coverage score as the dependent variable, and simulation-based multiple comparisons are performed. The results document the progress of the field as a whole, rather than focusing on a single system, and thus can serve as a reference on the work done to date, as well as a starting point in the formulation of future tasks. The results also indicate that most progress in the field has been achieved in generic multi-document summarization and that the most challenging task is that of producing a focused summary in answer to a question or topic.

Introduction

The Document Understanding Conference (DUC) has been run annually since 2001 and has been the major forum for comparing summarization systems on a shared test set. Four main groups of tasks have been addressed and manually evaluated for coverage (the overlap between the system-produced summary and a single human model): generic single-document summarization, generic multi-document summarization, headline generation, and summarization focused by question or topic. Many summarization systems have been developed and evaluated in DUC, and an overview of the results has been provided each year by the DUC organizers (the online proceedings of the conferences are available at http://duc.nist.gov). But no comprehensive overview of the results for specific tasks across the years exists to document the progress, or lack thereof, on particular tasks.

DUC coverage evaluation procedure

The most important evaluation measure used in DUC 2001 to 2004 is coverage, intended to capture how well summarizers perform content selection; it does not address readability and other text qualities of the summaries. The coverage scores are computed as the overlap between a human-authored summary (the model) and a target summary to be evaluated (the peer, produced either by a system or by a human). The model summary is split into clauses, and a human compares the model and peer summaries, answering for each clause in the model summary the question "To what extent is the information conveyed in the model clause expressed in the peer summary?". In 2001 the answer was given on a five-point scale, where 1 means "The model clause is not covered at all by the peer" and 5 means "The meaning of the model clause is fully expressed in the peer summary". In later years, a six-point scale was adopted, ranging from 0 to 1 in 0.20 increments; a choice X was interpreted to mean that X percent of the content of the model clause is covered by the peer. The coverage score for the entire summary is the average coverage of the model clauses. In this paper, we compare results across years on the coverage measure.
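To make the scoring concrete, the small Python sketch below averages hypothetical per-clause judgments (on the 2002-2004 scale) into a summary-level coverage score; it is an illustration only, not the scoring code used at DUC.

```python
# Sketch of the coverage computation described above (2002-2004 scale).
# The judgments are hypothetical; at DUC they come from a human assessor
# rating each model clause against the peer summary in 0.2 increments.

def coverage_score(clause_judgments):
    """Average per-clause coverage judgments into a summary-level score."""
    if not clause_judgments:
        raise ValueError("the model summary must contain at least one clause")
    for judgment in clause_judgments:
        if not 0.0 <= judgment <= 1.0:
            raise ValueError("judgments are on the 0-1 scale (0.2 increments)")
    return sum(clause_judgments) / len(clause_judgments)

# Example: a model summary split into five clauses, with the assessor judging
# how much of each clause is expressed in the peer summary.
print(coverage_score([1.0, 0.6, 0.0, 0.4, 0.2]))  # -> 0.44
```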
Methodology and analysis

In order to study the differences between summarizers, we fit a two-way analysis of variance model with the summarizer and the input as factors and the coverage score as the dependent variable. The simple main effects model we use is

Y_ij = µ + α_i + β_j + ε_ij    (1)

where Y_ij is the coverage score of summarizer i on input j: it equals a grand mean coverage µ for any summary, adjusted for the effect of the summarizer (α_i) and the effect of the input (β_j), plus random noise ε_ij. Other possible sources of variation, such as the evaluator or system/evaluator interactions, are not included in the model, since they were controlled for in the evaluation set-up, as discussed in (Harman & Over 2004). For all tasks, both main effects are significant with p = 0, which indicates that significant differences between summarizers exist, and also that some sets or documents are easier to summarize than others (summarizers get higher coverage scores for them).

It is of interest to compare each pair of summarizers against each other, a total of N(N-1)/2 comparisons when N is the number of summarizers. The number of pairwise comparisons to be made is thus quite large, since each year more than 10 automatic summarizers were tested and several human summarizers wrote summaries for comparison purposes. In order to control the probability of declaring a difference between two summarizers significant when it is actually not (a Type I error), we use a simulation-based multiple comparisons procedure that controls the overall experiment-wise probability of error. 95% confidence intervals for the differences between summarizer means are computed with a simulation size of 12616. Pairs of summarizers for which the confidence interval for the difference between the two means excludes zero can be declared significantly different.
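A minimal sketch of this kind of analysis is given below. The long-format table with columns summarizer, input_set, and coverage is an assumed data layout, and Tukey's honestly-significant-difference procedure is used as a readily available stand-in for the simulation-based adjustment described above (it is applied to the raw summarizer means rather than to means adjusted for the input factor).

```python
# Sketch of the two-way ANOVA and a multiple-comparisons step.
# Assumes a long-format table with one row per (summarizer, input set) pair:
# columns "summarizer", "input_set", "coverage". Tukey's HSD stands in for
# the simulation-based adjustment used in the paper.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = pd.read_csv("coverage_scores.csv")  # hypothetical file name

# Y_ij = mu + alpha_i (summarizer) + beta_j (input set) + noise
model = smf.ols("coverage ~ C(summarizer) + C(input_set)", data=scores).fit()
print(anova_lm(model, typ=2))  # significance of the two main effects

# All pairwise comparisons between summarizer means, controlling the
# experiment-wise error rate at the 5% level.
tukey = pairwise_tukeyhsd(scores["coverage"], scores["summarizer"], alpha=0.05)
print(tukey.summary())
```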
Summarizer code conventions

In all years and tasks, human summarizers have letter codes A to K. Baselines have codes 1 to 5. The remaining codes are those of participating systems. Each year, non-model human summaries were available for each set or document, but no human wrote summaries for all the sets. This means that the average scores for humans are estimated from a smaller sample than the system averages.

Generic single-document summarization

The generic single-document summarization task was addressed in DUC 2001 and 2002, with a target summary length of 100 words. The baseline in both years, with summarizer code 1, was the same: taking the first 100 words of the input document.

Space constraints do not permit listing all 507 pairwise comparisons and the respective 95% confidence intervals, so table 1 lists only the pairs of summarizers for which a significant difference could be declared in 2002. The trends are similar in 2001, with humans performing better than the systems and the baseline, and systems not outperforming the baseline. The peer averages for 2001 are not listed, since the coverage evaluation was done on a different scale from the one used in the following years and thus results cannot be compared across years to track progress. The peer averages for 2002 are given in table 2.

sys  better than
1    15 16 17 18 23 25 30
A    1 15 16 17 18 21 23 25 31
B    1 15 16 17 18 19 21 23 25 27 28 29 31
C    1 15 16 17 18 19 21 23 25 27 28 29 31
D    1 15 16 17 18 19 21 23 25 27 28 29 31
E    1 15 16 17 18 19 21 23 25 27 28 29 31
F    15 16 17 18 23 25 29 31
G    1 15 16 17 18 19 21 23 25 27 28 29 31
H    1 15 16 17 18 19 21 23 25 27 28 29 31
I    1 15 16 17 18 19 21 23 25 27 28 29 31
J    15 16 17 18 23 25 29 31
15   17 25 30
16   17 30
18   17 30
19   15 16 17 18 23 25 30
21   15 16 17 18 25 30
23   17 25 30
25   17 30
27   15 16 17 18 23 25 30
28   15 16 17 18 23 25 30
29   16 17 18 25 30
31   16 17 18 25 30

Table 1: Significant differences based on 95% confidence intervals for single-document summarizers in 2002. The second column lists the summarizers that the summarizer in the first column is significantly better than.

In both years, none of the systems outperforms the baseline (and the systems as a group do not outperform the baseline); in fact the baseline has better coverage than most of the automatic systems (see the first row of table 1). It has often been noted that this baseline is indeed quite strong, due to the journalistic convention of putting the most important part of an article in the initial paragraphs. But the fact that the human summarizers (with the exception of F and J) significantly outperform the baseline shows that the task is meaningful and that better-than-baseline performance is possible. The difference between the worst human and the baseline is 0.1, while the difference between the best system and the baseline is 0.01. Seven of the human summarizers are significantly better than all the systems, while humans F, J, and A are not. This suggests that while humans as a group are much better than automatic summarizers, specific human summarization strategies are not as good and do not lead to a significant advantage over some automatic systems. These facts indicate that, although the single-document task has not been kept in later years, it remains an open problem.

Generic multi-document summarization

Generic multi-document summaries were produced in 2001, 2002, and 2004, with target lengths of 50, 100, 200, and 400 words in 2001, of 50, 100, and 200 words in 2002, and of 100 words only in 2004. The two baselines used are 1) the first n words of the most recent document (system code 1 in 2001 and code 2 in 2002) and 2) the first sentence from each document in chronological order until the length requirement is reached (codes 2 and 3 in 2001 and 2002 respectively); a rough sketch of both baselines is given below. Note that the first baseline is equivalent to what is currently used in news aggregation systems such as Google News. Automatic summarizers in general perform better than this baseline, as can be seen in tables 3 and 4. The peer averages for 2002 and 2004 are given in tables 5 and 6 respectively.

When the target summary size is added as a factor in the analysis of variance model, it is highly significant with p < 0.0001, and multiple comparisons between the four lengths in 2001 show that scores significantly increase as the summary size increases. This indicates that summary size impacts coverage scores, and thus one should control for the size of the summaries being compared when performing an evaluation. If size is not controlled for, it can confound comparisons of coverage scores.
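The sketch below illustrates the two multi-document baselines (and, applied to a single article, the single-document lead baseline). The function names, the whitespace tokenization, and the naive sentence splitting are my own simplifications, not the DUC implementations.

```python
# Rough sketch of the two multi-document baselines described above.
# Tokenization and sentence splitting are deliberately naive.

def lead_words(document, n_words=100):
    """First n words of a document: the single-document baseline and, applied
    to the most recent article of a cluster, multi-document baseline 1."""
    return " ".join(document.split()[:n_words])

def first_sentences(documents, n_words=100):
    """Multi-document baseline 2: the first sentence of each document in
    chronological order, until the length requirement is reached."""
    words = []
    for doc in documents:
        first_sentence = doc.split(". ")[0]  # naive sentence boundary
        words.extend(first_sentence.split())
        if len(words) >= n_words:
            break
    return " ".join(words[:n_words])

# Usage, assuming `articles` is a list of article texts sorted oldest to newest:
# baseline_1 = lead_words(articles[-1], 100)
# baseline_2 = first_sentences(articles, 100)
```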
An interesting question arises: would the results differ if systems were compared separately at each target summary length? For the 2001 data, we compared the groups of systems, humans, and baselines. At all lengths the humans performed significantly better than both the systems and the baselines. But the difference between the baselines and the automatic systems is significant only for summary lengths of 200 and 400 words, and is not significant for target lengths of 50 or 100 words. The same trend can be seen in the 2002 data as well (table 4). The inability to show a significant difference between systems and baselines at shorter summary lengths might mean two things. The evaluation method might be too restrictive, measuring overlap with a single human model. The work of (Nenkova & Passonneau 2004) discusses an alternative evaluation approach that looks for overlap between the summary to be evaluated and multiple human models, which is less restrictive and allows for the possibility that there may be more than one different, but equally acceptable, summary for the same input. It will be interesting to see whether in upcoming DUC competitions the new pyramid scoring procedure will lead to a greater ability to show significance. Another possible explanation is that as a shorter summary length is required, the task of choosing the most important information becomes more difficult and no approach works well consistently.

Several interesting observations can be made from the data in table 3 (the results from 2001).

1. Only one system significantly outperforms the baseline of selecting first sentences from the input articles (code 2).

2. Not all humans significantly outperform all automatic systems. This is very positive and indicates that, on a set of 30 input clusters, humans do not always perform significantly better (though given more data, the observed differences could become significant).

3. Eight out of the 12 systems outperform baseline 1 (selecting the first N words of one of the articles in the cluster). This approach is used by news aggregation sites such as Google News. The DUC results show that, overall, specifically developed multi-document summarization systems perform better than such a baseline.

4. In the comparisons between different humans, only two are significant (E and F are both significantly better than H). The fact that there are significant differences between human summarizers suggests that a cognitive study of human summarization techniques could shed light on the development of automatic summarizers. Figure 1 shows a pair of summaries written for the same input by humans F and H. The examples give an idea of why the unusual summarization strategy of H can lead to worse summaries.

summarizer  coverage  summaries
1           .37       291
15          .33       295
16          .30       295
17          .08       292
18          .32       294
19          .39       295
21          .37       295
23          .34       289
25          .29       294
27          .38       292
28          .38       295
29          .36       294
30          .06       294
31          .36       292
A           .47       30
B           .51       30
C           .46       24
D           .52       30
E           .49       30
F           .47       25
G           .54       25
H           .57       30
I           .53       29
J           .47       30

Table 2: Summarizer code, coverage, and number of summaries for the single-document summarization task at DUC 2002.

sys  better than
2    1 U W
A    1 2 L M O P R S U W Y Z
B    1 2 L M N O P R S U W Y Z
C    1 2 L M N O P R S U W Y Z
D    1 2 M O R S U W Z
E    1 2 H L M N O P R S T U W Y Z
F    1 2 H L M N O P R S T U W Y Z
G    1 2 L M N O P R S U W Y Z
H    1 U W Z
I    1 2 L M N O P R S T U W Y Z
K    1 2 L M N O P R S T U W Y Z
L    1 U W Z
M    1 U
N    1 M O R S U W Z
P    1 U W Z
R    1 U Y
S    1 U W
T    1 2 M O R S U W Z
Y    1 M O U W Z

Table 3: Significant differences for the 2001 multi-document task. The summary length has been added as a factor in the analysis of variance model.
Set 34; summary by H:
Category 5 hurricanes include the 1935 storm striking the Florida Keys, 1969's Camille striking Mississippi and Louisiana, 1988's Gilbert striking from the Lesser Antilles to Mexico, and 1989's Hugo striking the Lesser Antilles and South Carolina. Other killer hurricanes struck Galveston in 1900, the Florida Keys in 1965 (Betsy), the Dominican Republic in 1987 (Emily), Central America in 1988 (Joan), the Gulf of Mexico in 1989 (Chantal), and Mexico in 1990 (Diana). Hurricanes also hit Miami in 1926, 1947, 1949, 1950 and 1964 (Cleo); the Florida Keys in 1960 (Donna) and 1987 (Floyd); Mississippi, Alabama and Florida in 1985 (Elena); and Mexico in 1985 (Debby).

Set 34; summary by F:
Hurricanes are spawned in the Atlantic off the coast of Africa. Average or heavy rains in the Sahel result in more, and more intense, hurricanes and dry seasons lead to fewer hurricanes according to Dr. William Gray a Colorado meteorologist. Since accurate records have been kept, three category five hurricanes have hit the Atlantic or Gulf coasts and caused great loss of life and billions of dollars of damages. In 1900, the Galveston hurricane killed 6,000. More and better satellites will improve hurricane tracking. However, the potential for loss of life and extensive damages are higher because of increased population and construction in coastal areas.

Figure 1: Summaries by humans F and H for the same input.

Summarizer H also happens to be the human who does not significantly outperform most of the automatic systems.

Table 4 shows the significant pairwise comparisons between multi-document summarizers in 2002. These data also indicate that more pairs are significant when longer summaries are compared. It is again encouraging that several automatic systems are not significantly worse than humans. Table 7 gives the pairs of significantly different summarizers from DUC 2004.

The generic multi-document summarization task is the only one evaluated in the same way across several years. For the 100-word summaries, the best system coverage in 2002 is 0.24, while in 2004 it is 0.30, and the average system coverage is 0.17 and 0.21 respectively. The difference is significant according to a Wilcoxon rank-sum test (p-value = 0.02). This could be an indication that systems have improved across the years. Unfortunately, such a conclusion cannot be readily drawn because, as the analysis of variance model shows, the input to a summarizer has a significant impact on the final coverage score, and it is not clear whether the sets in 2002 were markedly different (i.e., harder) than those in 2004.
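A minimal sketch of such a cross-year comparison is given below, assuming the coverage scores for the 100-word task in each year have been collected into two arrays; the file names are hypothetical.

```python
# Sketch of the across-year comparison with a Wilcoxon rank-sum test.
# scores_2002 and scores_2004 are assumed to hold the coverage scores
# (e.g., per-system averages) for the 100-word multi-document task.
import numpy as np
from scipy.stats import ranksums

scores_2002 = np.loadtxt("coverage_2002_100w.txt")  # hypothetical files
scores_2004 = np.loadtxt("coverage_2004_100w.txt")

statistic, p_value = ranksums(scores_2002, scores_2004)
print(f"Wilcoxon rank-sum statistic = {statistic:.3f}, p = {p_value:.3f}")
```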
In order to gain some preliminary insight into what factors might make a set easier or harder to summarize, we looked at the two 2004 sets for which the average peer coverage was highest (coverage = 0.49, set D30010) and lowest (coverage = 0.10, set D30049). Set D30010, for which the highest peer coverage was achieved, consists of articles reporting on a suicide bombing at a Jerusalem market. The set with the lowest average coverage, D30049, consists of articles dealing with different aspects of the controversy surrounding the revival of North Korea's nuclear program. Thus, the set where high coverage was obtained was more scenario oriented: the pieces of information that need to be supplied in order to summarize the event are easily identified (where the attack took place, who the perpetrator was, what the casualties were).

The set with low coverage, on the other hand, does not evoke a readily available scenario. A deeper analysis of input factors is needed in order to understand in which subareas summarization is most successful. Previous research (Barzilay & Lee 2004) has shown that content selection and ordering can be approached in a very principled way for scenario-type inputs.

sys  better than
A    all automatic summarizers
B    all automatic summarizers
C    all automatic summarizers but 65
D    all automatic summarizers
E    all automatic summarizers
F    all automatic summarizers
G    all automatic summarizers
H    all automatic summarizers
2    111 117
19   111 117
27   111
34   111 117
44   27 111 117 123 138
55   27 111 117 138
65   2 11 27 34 111 117 123 138
81   27 111 117 123 138
93   27 111 117 123 138
102  27 111 117 138
120  27 111 117 138
123  111
124  27 111 117 138
138  111

Table 7: Significant differences for 100-word multi-document summarizers in 2004. The second column lists the summarizers that the summarizer in the first column is significantly better than.

Generation of headlines

The headline generation task is to produce a 10-word summary, either as a sentence or as a set of keywords. In 2002, the headlines were generated for multiple documents on the same topic, while in 2003 they were generated for single documents. The average coverage per peer in the two years can be seen in tables 8 and 9 respectively. In 2002, one of the systems, 19, was not significantly worse than any of the humans and outperforms all other automatic systems. For headline generation for single documents, the human-written headline (available in the HEADLINE tag of the documents) was taken as the baseline. The baseline was thus human written and very strong, and in fact outperformed all automatic systems. Five out of ten human summarizers outperform the baseline.

Focused multi-document summaries

In 2003 and 2004, the multi-document task was modified so that the produced 100-word summary is not generic, but responds either to a topic description or to a question (in 2004 the question was "Who is X?", where X is a person). The same baselines as for generic multi-document summarization are used and have the smallest-numbered system codes. For the question task in 2003, systems were also given a list of all the sentences that a human had marked as answering the question. Thus the task was actually to find the most important among the potential answers, while in 2004 the systems had only the articles and had to identify and rank the answers themselves.

Humans overwhelmingly outperform all of the systems (with a few exceptions for the modified question task in 2003). At the same time, few systems perform better than the baselines: in 2003 (for the summaries focused by topic), systems 6, 13, 16, and 20 are better than baseline 2 (the first 100 words of the first document), and only system 13 outperforms the baseline that takes first sentences from the input articles until the requested length is reached (code 3). In 2004, no system significantly outperforms the baseline, as can be seen in table 10.
Table 4: Significant differences for multi-document summarizers in 2002, for the different summary lengths (50, 100, and 200 words). The summarizers with at least one significant difference in their favor are the ten humans (A-J) and systems 3, 19, 20, 24, 26, and 28.

summarizer  50 words  100 words  200 words  summaries
16          .12       .13        .15        59
19          .23       .24        .25        59
2           .14       .13        .13        59
20          .12       .16        .21        59
24          .13       .19        .22        59
25          .10       .12        .16        59
26          .21       .19        .24        59
28          .21       .23        .23        59
29          .16       .14        .17        59
3           .14       .20        .22        59
A           .37       .37        .37        6
B           .44       .43        .42        6
C           .25       .38        .45        5
D           .40       .28        .39        6
E           .46       .25        .27        6
F           .33       .28        .30        5
G           .44       .36        .29        5
H           .32       .31        .37        6
I           .26       .39        .41        6
J           .22       .26        .26        6

Table 5: Coverage for summarizers at the multi-document task at DUC 2002, for the different target summary lengths.

summarizer  coverage  summaries
102         .24       50
11          .22       50
111         .05       50
117         .16       50
120         .24       50
123         .17       50
124         .26       50
138         .17       50
19          .23       49
2           .20       50
27          .17       50
34          .22       50
44          .26       50
55          .24       50
65          .30       50
81          .25       50
93          .26       50
A           .48       18
B           .44       18
C           .38       25
D           .50       18
E           .42       18
F           .46       17
G           .45       18
H           .47       18

Table 6: Coverage for the multi-document summarization task in DUC 2004.

summarizer  coverage  summaries
16          .12       59
19          .39       59
20          .13       59
25          .14       59
26          .26       59
29          .09       59
A           .40       6
B           .65       6
C           .55       5
D           .45       6
E           .57       6
F           .48       6
G           .32       5
H           .62       6
I           .30       6
J           .50       6

Table 8: Average coverage for headlines for multiple documents, DUC 2002. The columns give the summarizer code, the average coverage, and the number of summaries produced by the summarizer.
The coverage of the human summarizers for the question-focused task in 2004 seems higher than the human coverage for generic summarization in the same year: three of the human summarizers in the focused task have coverage scores above 0.5 (the highest score achieved in the generic task). But the overall human average is 0.45 for the generic task and 0.47 for the focused task, a difference that is not significant (the Wilcoxon rank-sum test p-value is 0.23). This suggests that the intuition that non-generic summarization can lead to better inter-human agreement (Sparck-Jones 1998) is not supported by the 2004 data. At the same time, the average machine scores for the focused summaries are lower than those for the generic summaries (p = 0.075); that is, the focused task is harder for machines than the generic one.

summarizer  coverage  summaries
1           .48       624
10          .15       622
13          .25       624
15          .17       624
17          .40       624
18          .36       624
21          .32       624
22          .31       624
24          .24       564
25          .29       624
26          .38       624
7           .30       624
8           .51       623
9           .27       624
A           .49       184
B           .57       192
C           .56       189
D           .53       187
E           .61       183
F           .60       182
G           .57       185
H           .52       192
I           .52       188
J           .49       190

Table 9: Summarizer codes, average coverage, and number of produced summaries for single-document headlines in DUC 2003.

summarizer  coverage  summaries
109         .24       50
116         .17       50
122         .18       50
125         .19       50
147         .22       50
16          .16       50
24          .21       50
30          .20       50
43          .20       50
49          .21       50
5           .19       50
62          .20       50
71          .21       50
86          .15       50
96          .22       50
A           .48       18
B           .55       18
C           .35       25
D           .47       18
E           .47       18
F           .52       17
G           .54       18
H           .44       18

Table 10: Coverage for the 100-word "Who is X?" multi-document summaries at DUC 2004.

sys  better than
A    all automatic summarizers
B    all automatic summarizers and human C
C    all automatic summarizers
D    all automatic summarizers and human C
E    all automatic summarizers and human C
F    all automatic summarizers and human C
G    all automatic summarizers and humans C and H
H    all automatic summarizers
109  16 86

Table 11: Significant differences for 100-word multi-document summarizers (focused by question) in 2004.

Discussion and conclusions

Several interesting conclusions that should be kept in mind when developing future systems and defining new tasks follow from the presented results.

In generic multi-document summarization, we observed that differences between systems become more significant when longer (100- or 200-word) summaries are produced. But for summaries of such lengths, text qualities such as coherence and cohesion become important. While some quality aspects of summaries have been evaluated in DUC, little attention has been paid to them, the main focus always being on coverage. Thus, in order to develop practically useful summarization systems, quality issues need to be addressed and assessed as well. The need to concentrate more on text quality is also supported by the fact that one of the systems for multi-document headlines was not significantly worse in terms of coverage than any human. But the ability of a system to be as useful as a human summarizer would also depend on the overall readability of the automatic headlines.

Another somewhat surprising result is that systems are further from human performance on the single-document summarization task than on the multi-document task. This is to a certain extent counter-intuitive, given the general feeling that multi-document summarization is the more difficult of the two tasks.
The result can be partly explained by the fact that repetition across the input documents can be used as an indicator of importance in multi-document summarization, while such cues are not available in single-document summarization. At the same time, the coverage scores of humans for the single-document summaries are significantly higher than the human coverage scores for multi-document summarization (cf. tables 2 and 6), indicating that there is better agreement between humans on content selection decisions in single-document summarization than in multi-document summarization, and thus that the single-document task is easier and better defined for humans.

Future DUC tasks will continue the emphasis on focused summarization. Past experience indicates that this task is difficult, and it might make sense to break it down into simpler, more manageable subtasks.

In the analysis of variance model, the input set for multi-document summarization and the document for single-document summarization were significant factors. Thus, some inputs are more amenable to summarization (or at least lead to better coverage scores) than others. No systematic study has been done to identify types of "easy" or "hard" inputs. Such a study could bring insight for the development of more specialized summarizers.

As a final note, we want to mention again that all the results reported here concern the summarization of newswire and that the multi-document summaries were produced for inputs of about 10 articles. The results might not carry over to different domains or to sets containing dramatically more input articles.

References

Barzilay, R., and Lee, L. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. In Proceedings of HLT-NAACL 2004.

Harman, D., and Over, P. 2004. The effects of human variation in DUC summarization evaluation. In Text Summarization Branches Out: Workshop at ACL 2004.

Nenkova, A., and Passonneau, R. 2004. Evaluating content selection in summarization: The pyramid method. In Proceedings of HLT-NAACL 2004.

Sparck-Jones, K. 1998. Automatic summarizing: Factors and directions. In Mani, I., and Maybury, M., eds., Advances in Automatic Text Summarization.