Multi-Document Summary Space: What do People Agree is Important?

John M. Conroy
Institute for Defense Analyses, Center for Computing Sciences, Bowie, MD

Outline
• Problem statement.
• Human summaries.
• Oracle estimates.
• Algorithms.

Query-Based Multi-document Summarization
• User types a query.
• Relevant documents are retrieved.
• Retrieved documents are clustered.
• Summaries for each cluster are displayed.

Example Query: "hurricane earthquake"
(Screenshots of the Columbia and Michigan systems.)

Recent Evaluation and Problem Definition Efforts
• Document Understanding Conferences (DUC)
  – 2001-2004: 100-word generic summaries.
  – 2005-2006: 250-word "focused" summaries.
  – http://duc.nist.gov/
• Multi-lingual Summarization Evaluation 2005-2006 (MSE)
  – Given a cluster of translated documents and English documents, produce a 100-word summary.
  – http://www.isi.edu/~cyl/MTSE2005/

Overview of Techniques
• Linguistic tools (find sentence boundaries, shorten sentences, extract features).
  – Part of speech.
  – Parsing.
  – Entity extraction.
  – Bag of words, position in document.
• Statistical classifier.
  – Linear classifiers.
  – Bayesian methods, HMM, SVM, etc.
• Redundancy removal.
  – Maximum marginal relevance (MMR).
  – QR.

Sample Data: DUC 2005
• 50 topics.
• 25 to 50 relevant documents per topic.
• 4 or 9 human summaries per topic.

Linguistic Processing
• Use heuristic patterns to find phrases/clauses/words to eliminate.
  – Shallow processing.
  – What is the value of full sentence elimination?

Linguistic Processing: Phrase Elimination
• Gerund phrases. Example: "Suicide bombers targeted a crowded open-air market Friday, setting off blasts that killed the two assailants, injured 21 shoppers and passersby and prompted the Israeli Cabinet to put off action on …"
  (A rough code sketch of this kind of trimming appears after the example summaries below.)

Example Topic Description
• Title: Reasons for Train Wrecks
• Narrative: What causes train wrecks and what can be done to prevent them? Train wrecks are those events that result in actual damage to the trains themselves, not just accidents where people are killed or injured.
• Type: General

Example Human Summary
Train wrecks are caused by a number of factors: human, mechanical and equipment errors, spotty maintenance, insufficient training, load shifting, vandalism, and natural phenomena. The most common types of mechanical and equipment errors are: brake failures, signal light and gate failures, track defects, and rail bed collapses. Spotty maintenance is characterized by failure to consistently inspect and repair equipment. Lack of electricians and mechanics results in letting equipment run down until someone complains. Engineers are often unprepared to …

Another Example Topic
• Title: Human Toll of Tropical Storms
• What has been the human toll in death or injury of tropical storms in recent years? Where and when have each of the storms caused human casualties? What is the approximate total number of casualties attributed to each of the storms?
• Granularity: Specific

Example Human Summary
• January 1989 through October 1994 tolled 641,257 tropical storm deaths and 5,277 injuries world-wide.
• In May 1991, Bangladesh suffered 500,000 deaths; 140,000 in March 1993; and 110 deaths and 5,000 injuries in May 1994.
• The Philippines had 29 deaths in July 1989 and 149 in October; 30 in June 1990, 13 in August and 14 in November.
• South Carolina had 18 deaths and two injuries in October 1989; 29 deaths in April 1990 and three in October.
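As a concrete illustration of the shallow phrase-elimination idea above, here is a minimal Python sketch using NLTK's off-the-shelf tokenizer and part-of-speech tagger. The clause-splitting heuristic and the function name are illustrative assumptions, not the actual rules used in the system.

    import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data packages

    def trim_gerund_clauses(sentence):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

        # Split the tagged tokens into comma-delimited chunks.
        chunks, current = [], []
        for tok, tag in tagged:
            if tok == ",":
                chunks.append(current)
                current = []
            else:
                current.append((tok, tag))
        chunks.append(current)

        # Keep the leading chunk; drop any later chunk that opens with a gerund (VBG).
        kept = [chunks[0]] + [c for c in chunks[1:] if c and c[0][1] != "VBG"]
        return ", ".join(" ".join(tok for tok, _ in c) for c in kept)

    print(trim_gerund_clauses(
        "Suicide bombers targeted a crowded open-air market Friday, "
        "setting off blasts that killed the two assailants."))

A real system would need more care with nested clauses and punctuation, but the point is that a shallow tag-based pattern can trim trailing gerund material without full parsing.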
Inter-Human Word Agreement
(Figure.)

Evaluation of Summaries
• Ideally each machine summary would be judged by multiple humans for
  1. Responsiveness to the query.
  2. Cohesiveness, grammar, etc.
• Reality: this would take too much time!
• Plan: use a metric that correlates at 90-97% with human responsiveness judgments: ROUGE (Recall-Oriented Understudy for Gisting Evaluation).

ROUGE-1 Scores
(Figure.)

ROUGE-2 Scores
(Figure.)

Frequency and Summarization
• Ani Nenkova (Columbia) and Lucy Vanderwende (Microsoft) report:
  – High-frequency content words in the documents correlate with the high-frequency words chosen by humans.
  – SumBasic, a simple method based on this principle, produces "state of the art" generic summaries, e.g., at DUC 04 and MSE 05.
• See also Van Halteren and Teufel 2003, Radev et al. 2003, Copeck and Szpakowicz 2004.

What is Summary Space?
• Is there enough information in the documents to approach human performance as measured by ROUGE?
• Do humans abstract so much that extracts don't suffice?
• Is a unigram distribution enough?

A Candidate
• Suppose an oracle gave us Pr(t) = the probability that a human will choose term t to be included in a summary.
  – t is a non-stop-word term.
• Estimate Pr(t) from our data: e.g., 0, 1/4, 1/2, 3/4, or 1 if 4 human summaries are provided.

A Simple Oracle Score
• Generate extracts:
  – Score sentences by the expected percentage of abstract terms they contain.
  – Discard sentences that are too short or too long.
  – Use pivoted QR to remove redundancy.

The Oracle Pleases Everyone!
(Figure.)

Approximate Pr(t)
• Two bits of information are available:
• Topic description.
  – Extract query phrases.
• Documents retrieved.
  – Extract terms which are indicative of, or give the "signature" of, the documents.

Query Terms
• Given the topic description, tag it for part of speech.
  – Take any NN (noun), VB (verb), JJ (adjective), RB (adverb), and multi-word groupings of NNP.
  – E.g., train, wrecks, train wrecks, causes, prevent, events, result, actual, actual damage, trains, accidents, killed, injured.

Signature Terms
• Term: a space-delimited string of characters from {a, b, c, …, z}, after the text is lower-cased and all other characters and stop words are removed.
• We need to restrict our attention to indicative terms (signature terms).
  – Terms that occur more often than expected.

Signature Terms: Terms that Occur More Often than Expected
• Based on a 2x2 contingency table of relevance counts.
• Log-likelihood test; equivalent to mutual information.
• Dunning 1993; Hovy and Lin 2000.

Hypothesis Testing
• H0: P(C | t_i) = p = P(C | ~t_i)
• H1: P(C | t_i) = p1 ≠ p2 = P(C | ~t_i)
• Contingency table of counts, where C is the relevant (topic) set and ~C its complement:

             C      ~C
    t_i     O11     O12
    ~t_i    O21     O22

• Maximum-likelihood estimates:
    p  = (O11 + O21) / (O11 + O12 + O21 + O22)
    p1 = O11 / (O11 + O12)
    p2 = O21 / (O21 + O22)

Likelihood of H0 vs. H1 and Mutual Information
• L(H0) = b(p;  O11, O11 + O12) · b(p;  O21, O21 + O22)
• L(H1) = b(p1; O11, O11 + O12) · b(p2; O21, O21 + O22)
• -2 log [L(H0) / L(H1)] = 2N · I(C | t), where I(C | t) is the mutual information statistic, when b is the binomial likelihood.
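To make the likelihood-ratio test concrete, here is a minimal Python sketch of the statistic computed from the 2x2 table above. The count conventions in the comments and the significance cutoff are illustrative assumptions; the slides do not state the exact threshold used.

    from math import log

    # Counts from the 2x2 table: o11 = occurrences of term t in the relevant (topic)
    # documents C, o12 = occurrences of t in the background, and o21 / o22 = all other
    # tokens in C and in the background, respectively.
    def log_likelihood_ratio(o11, o12, o21, o22):
        def ll(k, n, p):
            # Log binomial likelihood log b(p; k, n), dropping the constant C(n, k),
            # which cancels between H0 and H1; degenerate ML estimates contribute 0.
            if p <= 0.0 or p >= 1.0:
                return 0.0
            return k * log(p) + (n - k) * log(1.0 - p)

        n = o11 + o12 + o21 + o22
        p = (o11 + o21) / n            # P(C) under H0: same whether or not t appears
        p1 = o11 / (o11 + o12)         # P(C | t)     under H1
        p2 = o21 / (o21 + o22)         # P(C | not t) under H1
        return -2.0 * (ll(o11, o11 + o12, p) + ll(o21, o21 + o22, p)
                       - ll(o11, o11 + o12, p1) - ll(o21, o21 + o22, p2))

    # Keep t as a signature term when -2 log(lambda) exceeds a chi-square cutoff with
    # one degree of freedom, e.g. about 10.83 at the 0.001 level (cutoff illustrative).
    print(log_likelihood_ratio(o11=120, o12=40, o21=8000, o22=24000))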
Example Signature Terms
accident accidents ammunition angeles avenue beach bernardino blamed board boulevard boxcars brake brakes braking cab car cargo cars caused cc cd collided collision column conductor coroner crash crew crews crossing curve derail derailed desk driver edition emergency engineer engineers equipment failures fe fog freight ft grade holland injured injuries investigators killed line loaded locomotives los maintenance mechanical metro miles nn ntsb occurred pacific page part passenger path photo pipeline rail railroad railroads railway runaway safety san santa scene seal shells sheriff signals southern speed staff station switch track tracks train trains transportation truck weight westminster words

An Approximation of Pr(t)
• For a given document set and topic description:
  – Let Q be the set of query terms.
  – Let S be the set of signature terms.
• Estimate Pr(t) = (χ_Q(t) + χ_S(t)) / 2, where χ_A(t) = 1 if t ∈ A and 0 otherwise.

Our Approach
• Use the expected abstract word score to select candidate sentences (~2w words, where w is the summary length).
• Terms as sentence features:
  – Terms {t1, …, tm} and sentences {s1, …, sn}.
  – Form the m × n term-by-sentence matrix A = [a_ij], with rows indexed by the terms t1, …, tm and columns by the sentences s1, …, sn.
  – Scaling: each column is scaled to its sentence's score.
• Use pivoted QR to select sentences.

Redundancy Removal
• Pivoted QR:
  – Choose the column with maximum norm (call it a_j).
  – Subtract the components along a_j from the remaining columns, i.e., the remaining columns become orthogonal to the chosen column.
  – Stopping criterion: the chosen sentences (columns) total ~w (~2w) words.
• This removes semantic redundancy. (A sketch of this selection loop follows the last slide.)

Results
(Figure.)

Conclusions
• Pr(t), the oracle score, produces summaries which "please everyone."
• A simple estimate of Pr(t), induced by query and signature terms, gives rise to a top-scoring system.

Future Work
• Better estimates for Pr(t).
  – Pseudo-relevance feedback.
  – LSI or similar dimension-reduction tricks?
• Ordering of sentences for readability is important (with Dianne O'Leary).
  – A 250-word summary has approximately 12 sentences.
• Two directions in linguistic preprocessing:
  – Eugene Charniak's parser (with Bonnie Dorr and David Zajic).
  – Simple rule-based, "POS lite" (Judith Schlesinger).

On Brevity
"I will be brief. Not nearly so brief as Salvador Dali, who gave the world's shortest speech. He said, 'I will be so brief I have already finished,' and he sat down."
- Edward O. Wilson
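As referenced in the Redundancy Removal slide, here is a rough Python sketch of the scoring-plus-pivoted-QR selection loop. All names (select_sentences, pr_t, word_budget) are illustrative, and the scaling and stopping details are simplified assumptions rather than the authors' exact implementation.

    import numpy as np

    def select_sentences(sentences, pr_t, word_budget=250):
        """sentences: list of token lists; pr_t: dict mapping a term to its estimated Pr(t)."""
        terms = sorted(pr_t)
        row = {t: i for i, t in enumerate(terms)}

        # m x n term-by-sentence indicator matrix A: a_ij = 1 if term t_i occurs in s_j.
        A = np.zeros((len(terms), len(sentences)))
        for j, toks in enumerate(sentences):
            for t in set(toks):
                if t in row:
                    A[row[t], j] = 1.0

        # Sentence score: expected fraction of its terms that a human abstract would use.
        scores = np.array([sum(pr_t.get(t, 0.0) for t in toks) / max(len(toks), 1)
                           for toks in sentences])

        # Scale each column so its Euclidean norm equals the sentence's score.
        norms = np.linalg.norm(A, axis=0)
        A *= scores / np.maximum(norms, 1e-12)

        # Pivoted QR via modified Gram-Schmidt: greedily take the column of largest
        # norm, then remove its direction from the remaining columns.
        chosen, words = [], 0
        remaining = set(range(len(sentences)))
        while remaining and words < word_budget:
            col_norms = np.linalg.norm(A, axis=0)
            j = max(remaining, key=lambda k: col_norms[k])
            if col_norms[j] < 1e-12:
                break
            q = A[:, j] / col_norms[j]
            A -= np.outer(q, q @ A)   # remaining columns become orthogonal to column j
            remaining.discard(j)
            chosen.append(j)
            words += len(sentences[j])
        return [sentences[j] for j in sorted(chosen)]

    demo = [["train", "derailed", "after", "brake", "failure"],
            ["the", "brake", "failure", "caused", "the", "derailment"],
            ["officials", "blamed", "poor", "track", "maintenance"]]
    print(select_sentences(demo, {"train": 0.75, "brake": 0.5, "derailed": 0.5,
                                  "track": 0.25, "maintenance": 0.5}, word_budget=10))

Like MMR, the pivoted QR pass penalizes candidates for overlap with sentences already chosen: once a column is selected, the other columns lose any component lying in its direction, so near-duplicates drop in norm and are unlikely to be picked next.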