Multi-Document Summary Space: What do People Agree is Important?
John M. Conroy
Institute for Defense Analyses
Center for Computing Sciences
Bowie, MD
Outline
• Problem statement.
• Human Summaries.
• Oracle Estimates.
• Algorithms.
Query-Based Multi-document Summarization
• User types query.
• Relevant documents are retrieved.
• Retrieved documents are clustered.
• Summaries for each cluster are displayed.
Example Query:
“hurricane earthquake”
[Screenshots: retrieval and clustering results from the Columbia and Michigan systems]
Recent Evaluation and
Problem Definition Efforts
• Document Understanding Conferences
– 2001-2004: 100 word generic summaries.
– 2005-2006: 250 word “focused” summaries.
– http://duc.nist.gov/
• Multi-lingual Summarization Evaluation 2005-2006 (MSE).
– Given a cluster of translated documents and English documents, produce a 100-word summary.
– http://www.isi.edu/~cyl/MTSE2005/
Overview of Techniques
• Linguistic Tools (find sentence boundaries, shorten sentences, extract features).
– Part of speech.
– Parsing.
– Entity Extraction.
– Bag of words, position in document.
• Statistical Classifier.
– Linear classifiers.
– Bayesian methods, HMM, SVM, etc.
• Redundancy Removal.
– Maximum marginal relevance (MMR).
– QR.
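As a concrete illustration of the redundancy-removal step listed above, here is a minimal Maximum Marginal Relevance (MMR) sketch. The pre-computed sentence and query vectors, the cosine similarity, and the lambda = 0.7 trade-off are illustrative assumptions, not settings used in this talk.

```python
# Minimal MMR sketch: greedily balance query relevance against
# similarity to already-selected sentences.
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def mmr_select(sent_vecs, query_vec, k, lam=0.7):
    """Pick k sentence indices, trading off relevance and novelty."""
    selected, remaining = [], list(range(len(sent_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            rel = cosine(sent_vecs[i], query_vec)
            red = max((cosine(sent_vecs[i], sent_vecs[j]) for j in selected),
                      default=0.0)
            return lam * rel - (1.0 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```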
Sample Data
DUC 2005.
– 50 topics.
– 25 to 50 relevant documents per topic.
– 4 or 9 human summaries.
Linguistic Processing
• Use heuristic patterns to find phrases/clauses/words to eliminate.
– Shallow processing.
– Value of full sentence elimination?
Linguistic Processing
• Phrase elimination
– Gerund phrases
Example:
“Suicide bombers targeted a crowded
open-air market Friday, setting off blasts
that killed the two assailants, injured 21
shoppers and passersby and prompted the
Israeli Cabinet to put off action on ….”
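A crude sketch of the kind of shallow heuristic described above: trim a trailing gerund clause such as ", setting off blasts ...". The comma-splitting and the NLTK tagger are assumptions made for illustration; the rule-based trimming used in practice is more careful.

```python
# Drop everything from a comma-introduced gerund (VBG) clause onward.
# Requires NLTK's 'punkt' and 'averaged_perceptron_tagger' resources.
import nltk

def drop_trailing_gerund_clause(sentence):
    clauses = sentence.split(', ')
    kept = [clauses[0]]
    for clause in clauses[1:]:
        tagged = nltk.pos_tag(nltk.word_tokenize(clause))
        if tagged and tagged[0][1] == 'VBG':
            break                        # trim from the gerund clause onward
        kept.append(clause)
    return ', '.join(kept)
```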
Example Topic Description
Title: Reasons for Train Wrecks
Narrative: What causes train wrecks and what
can be done to prevent them? Train wrecks
are those events that result in actual damage
to the trains themselves not just accidents
where people are killed or injured.
Type: General
Example Human Summary
Train wrecks are caused by a number of
factors: human, mechanical and equipment
errors, spotty maintenance, insufficient
training, load shifting, vandalism, and natural
phenomenon. The most common types of
mechanical and equipment errors are: brake
failures, signal light and gate failures, track
defects, and rail bed collapses. Spotty
maintenance is characterized by failure to
consistently inspect and repair equipment.
Lack of electricians and mechanics results in
letting equipment run down until someone
complains. Engineers are often unprepared to
Another Example Topic
Title: Human Toll of Tropical Storms
• What has been the human toll in death or
injury of tropical storms in recent years?
Where and when have each of the storms
caused human casualties? What are the
approximate total number of casualties
attributed to each of the storms?
• Granularity: Specific
Example Human Summary
• January 1989 through October 1994 tolled
641,257 tropical storm deaths and 5,277
injuries world-wide.
• In May 1991, Bangladesh suffered 500,000
deaths; 140,000 in March 1993; and 110
deaths and 5,000 injuries in May 1994.
• The Philippines had 29 deaths in July 1989
and 149 in October; 30 in June 1990, 13 in
August and 14 in November.
• South Carolina had 18 deaths and two
injuries in October 1989; 29 deaths in April
1990 and three in October.
Inter-Human Word Agreement
Evaluation of Summaries
Ideally each machine summary would be
judged by multiple humans for
1. Responsiveness to query.
2. Cohesiveness, grammar, etc.
Reality: This would take too much time!
Plan: Use a metric that correlates at 90-97%
with human responsiveness judgments.
Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
ROUGE-1 Scores
ROUGE-2 Scores
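For reference, a minimal sketch of how ROUGE-n recall is computed from n-gram overlap with a reference summary. Official evaluations use the ROUGE toolkit, with options such as stemming and stopword removal, so this simplified version is illustrative only.

```python
# ROUGE-n recall: clipped n-gram overlap divided by the number of
# n-grams in the reference summary.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate_tokens, reference_tokens, n=2):
    cand, ref = ngrams(candidate_tokens, n), ngrams(reference_tokens, n)
    if not ref:
        return 0.0
    overlap = sum(min(cnt, cand[g]) for g, cnt in ref.items())
    return overlap / sum(ref.values())
```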
Frequency and
Summarization
• Ani Nenkova (Columbia) and Lucy Vanderwende (Microsoft) report:
– High frequency content words correlate with high
frequency words chosen by humans.
– SumBasic, a simple method based on this
principle, produces “state of the art” generic
summaries,
e.g., DUC 04 and MSE 05.
• Van Halteren and Teufel 2003, Radev et al. 2003, Copeck and Szpakowicz 2004.
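A hedged sketch of a SumBasic-style frequency summarizer, following the published description: score sentences by average content-word probability, pick the best sentence, then square the probabilities of the words just used. The tokenization, stopword handling, and word budget below are simplifying assumptions.

```python
# SumBasic-style sketch: frequency-driven sentence selection with
# down-weighting of already-used words.
from collections import Counter

def sumbasic(sentences, word_budget=100, stopwords=frozenset()):
    toks = [[w for w in s.lower().split() if w.isalpha() and w not in stopwords]
            for s in sentences]
    counts = Counter(w for sent in toks for w in sent)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}
    chosen, length = [], 0
    while length < word_budget and len(chosen) < len(sentences):
        def score(i):
            return sum(prob[w] for w in toks[i]) / len(toks[i]) if toks[i] else 0.0
        best = max((i for i in range(len(sentences)) if i not in chosen), key=score)
        chosen.append(best)
        length += len(sentences[best].split())
        for w in toks[best]:             # down-weight words already used
            prob[w] = prob[w] ** 2
    return [sentences[i] for i in sorted(chosen)]
```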
What is Summary Space?
• Is there enough information in the
documents to approach human
performance as measured by ROUGE?
• Do humans abstract so much that
extracts don’t suffice?
• Is a unigram distribution enough?
A Candidate
• Suppose an oracle gave us:
• Pr(t) = probability that a human will choose term t to be included in a summary.
– t is a non-stop word term.
• Estimate based on our data.
– E.g., 0, 1/4, 1/2, 3/4, or 1 if 4 human
summaries are provided.
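A small sketch of how this oracle Pr(t) can be estimated from the available human summaries (with four abstracts the values are 0, 1/4, 1/2, 3/4, or 1). The whitespace tokenization and stopword filtering are simplifying assumptions.

```python
# Oracle Pr(t): fraction of human summaries that contain term t.
def oracle_pr(human_summaries, stopwords=frozenset()):
    term_sets = [{w for w in s.lower().split()
                  if w.isalpha() and w not in stopwords}
                 for s in human_summaries]
    vocab = set().union(*term_sets)
    return {t: sum(t in ts for ts in term_sets) / len(term_sets) for t in vocab}
```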
A Simple Oracle Score
• Generate extracts:
– Score sentences by the expected
percentage of abstract terms they contain.
– Discard very short or very long sentences.
– Pivoted QR to remove redundancy.
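A minimal sketch of the oracle sentence score described above: the expected fraction of a sentence's non-stop terms that appear in a human abstract, i.e., the mean of Pr(t) over the sentence's terms. The length cutoffs used to discard very short or very long sentences are illustrative assumptions, not the talk's exact thresholds.

```python
# Score a sentence by the mean oracle probability of its content terms.
def oracle_score(sentence, pr, stopwords=frozenset(),
                 min_words=8, max_words=40):
    words = sentence.split()
    if not (min_words <= len(words) <= max_words):
        return 0.0                       # discard very short/long sentences
    terms = [w.lower() for w in words
             if w.isalpha() and w.lower() not in stopwords]
    if not terms:
        return 0.0
    return sum(pr.get(t, 0.0) for t in terms) / len(terms)
```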
The Oracle Pleases Everyone!
Approximate Pr(t)
• Two bits of Information:
• Topic Description.
– Extract query phrases.
• Documents Retrieved.
– Extract terms which are indicative or give
the “signature” of the documents.
Query Terms
• Given Topic Description.
• Tag it for part of speech.
– Take any NN (noun), VB (verb), JJ
(adjective), RB (adverb), multi-word
groupings of NNP.
– E.g., train, wrecks, train wrecks, causes, prevent, events, result, actual, actual damage, trains, accidents, killed, injured.
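A hedged sketch of query-term extraction from the topic description using NLTK part-of-speech tags (NN/VB/JJ/RB words plus multi-word NNP groupings). It assumes the 'punkt' and 'averaged_perceptron_tagger' resources are installed; the tagger actually used in the talk's system may differ.

```python
# Extract query terms from a topic description by POS tag.
import nltk

def query_terms(topic_description):
    tagged = nltk.pos_tag(nltk.word_tokenize(topic_description))
    terms, proper = [], []
    for word, tag in tagged:
        if tag.startswith(('NN', 'VB', 'JJ', 'RB')) and word.isalpha():
            terms.append(word.lower())
        if tag.startswith('NNP'):
            proper.append(word)
        else:
            if len(proper) > 1:          # keep multi-word NNP groupings
                terms.append(' '.join(proper).lower())
            proper = []
    if len(proper) > 1:
        terms.append(' '.join(proper).lower())
    return terms
```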
Signature Terms
• Term: space-delimited string of characters
from {a,b,c,…,z}, after text is lower cased and
all other characters and stop words are
removed.
• Need to restrict our attention to indicative
terms (signature terms).
– Terms that occur more often than expected.
Signature Terms
Terms that occur more often than
expected
• Based on a 22 contingency table of
relevance counts.
• Log-likelihood; equivalent to mutual
information.
• Dunning 1993; Lin and Hovy 2000.
Hypothesis Testing
H0: P(C | ti) = p = P(C | ~ti)
H1: P(C | ti) = p1 ≠ p2 = P(C | ~ti)

ML estimates of p, p1, and p2:
p = (O11 + O21) / (O11 + O21 + O12 + O22)
p1 = O11 / (O11 + O12)
p2 = O21 / (O21 + O22)

2×2 contingency table of counts:
        C     ~C
 ti    O11   O12
~ti    O21   O22
Likelihood of H0 vs. H1 and
Mutual Information
L(H 0 ) b(p;O11,O11  O12 )b(p;O21,O21  O22 )

L(H1) b(p1;O11,O11  O12 )b(p2 ;O21,O21  O22 )
L(H 0 ) 
2log
 2NI(C | t), where I(C | t) is the
L(H1) 
mutual information statistic, when b is binomial.
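A hedged sketch of the signature-term statistic above: -2 log [L(H0)/L(H1)] computed from the 2×2 table, with the binomial coefficient omitted because it cancels in the ratio. A significance cutoff (e.g., about 10.83 for p < 0.001) is an assumption, not a value given in the talk; all table margins are assumed non-zero.

```python
# Dunning-style log-likelihood ratio for a 2x2 contingency table.
import math

def _log_binom(k, n, p):
    """log p^k (1-p)^(n-k); the binomial coefficient cancels in the ratio."""
    if p == 0.0:
        return 0.0 if k == 0 else float('-inf')
    if p == 1.0:
        return 0.0 if k == n else float('-inf')
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def llr(o11, o12, o21, o22):
    p = (o11 + o21) / (o11 + o12 + o21 + o22)
    p1 = o11 / (o11 + o12)
    p2 = o21 / (o21 + o22)
    log_h0 = _log_binom(o11, o11 + o12, p) + _log_binom(o21, o21 + o22, p)
    log_h1 = _log_binom(o11, o11 + o12, p1) + _log_binom(o21, o21 + o22, p2)
    return -2.0 * (log_h0 - log_h1)
```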
Example Signature Terms
accident accidents ammunition angeles avenue beach
bernardino blamed board boulevard boxcars brake
brakes braking cab car cargo cars caused cc cd
collided collision column conductor coroner crash
crew crews crossing curve derail derailed desk driver
edition emergency engineer engineers equipment
failures fe fog freight ft grade holland injured injuries
investigators killed line loaded locomotives los
maintenance mechanical metro miles nn ntsb
occurred pacific page part passenger path photo
pipeline rail railroad railroads railway runaway safety
san santa scene seal shells sheriff signals southern
speed staff station switch track tracks train trains
transportation truck weight westminster words
An Approximation of Pr(t)
• For a given data set and topic
description
– Let Q be the set of query terms.
– Let S be the set of signature terms.
• Estimate Pr(t) = (χQ(t) + χS(t)) / 2,
where χA(t) = 1 if t ∈ A and 0 otherwise.
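The approximation itself is a one-liner; here query_terms and signature_terms are assumed to be Python sets built as sketched earlier.

```python
# Pr(t) approximation: average of the two set-membership indicators.
def approx_pr(term, query_terms, signature_terms):
    return ((term in query_terms) + (term in signature_terms)) / 2.0
```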
Our Approach
• Use expected abstract word score to select candidate sentences (~2w).
• Terms as sentence features: form the m×n term-sentence matrix A = [aij], with rows indexed by terms {t1, …, tm} and columns by sentences {s1, …, sn}.
• Scaling: each column scaled to its sentence score.
– Use Pivoted QR to select sentences.
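A minimal sketch of building the term-by-sentence matrix A with each column scaled by its sentence score. The exact scaling used in the talk's system may differ, so treat the normalization below as an assumption.

```python
# Build a term-by-sentence matrix whose column norms equal sentence scores.
import numpy as np

def build_matrix(sent_terms, vocab, scores):
    """sent_terms: list of term sets, one per sentence;
    vocab: ordered term list; scores: non-negative score per sentence."""
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sent_terms)))
    for j, terms in enumerate(sent_terms):
        for t in terms:
            if t in index:
                A[index[t], j] = 1.0
        norm = np.linalg.norm(A[:, j])
        if norm > 0:
            A[:, j] *= scores[j] / norm   # column norm = sentence score
    return A
```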
Redundancy Removal
• Pivoted QR
– Choose column with maximum norm (aj)
– Subtract components along aj from
remaining columns, i.e., remaining
columns are orthogonal to the chosen
column
– Stopping criterion: chosen sentences (columns) total ~w (~2w) words
• Removes semantic redundancy
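A hedged sketch of the pivoted-QR selection loop described above: repeatedly take the largest-norm column, orthogonalize the remaining columns against it, and stop at the word budget. Using NumPy and an explicit Gram-Schmidt-style update is an implementation choice, not necessarily the talk's code.

```python
# Pivoted-QR sentence selection on a scaled term-by-sentence matrix.
import numpy as np

def pivoted_qr_select(A, sent_lengths, word_budget):
    A = A.astype(float).copy()
    chosen, total = [], 0
    while total < word_budget:
        norms = np.linalg.norm(A, axis=0)
        j = int(np.argmax(norms))
        if norms[j] < 1e-12:
            break                         # nothing informative left
        q = A[:, j] / norms[j]
        A -= np.outer(q, q @ A)           # remove the chosen direction
        chosen.append(j)
        total += sent_lengths[j]
    return chosen
```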
Results
Conclusions
• Pr(t), the oracle score, produces summaries which “please everyone.”
• A simple estimate of Pr(t) induced by
query and signature terms gives rise to
a top scoring system.
Future Work
• Better estimates for Pr(t).
– Pseudo-relevance feedback.
– LSI or similar dimension reduction tricks?
• Ordering of sentences for readability is
important. (with Dianne O’Leary)
– A 250 word summary has approximately 12
sentences.
• Two directions in linguistic preprocessing:
– Eugene Charniak’s parser. (with Bonnie Dorr and David Zajic)
– Simple rule based (POS lite). (Judith Schlesinger).
On Brevity
"I Will Be Brief. Not Nearly So
Brief As Salvador Dali, Who
Gave the World's Shortest
Speech. He Said I Will Be So
Brief I Have Already
Finished, and He Sat Down.
- Edward O. Wilson