Some interesting directions in
Automatic Summarization
Annie Louis
CIS 430
12/02/08
1
Today’s lecture
Multi-strategy summarization
 Is one method enough?
Performance Confidence Estimation
 Would be nice to have an indication of expected system
performance on an input
Evaluation without human models
 Can we come up with cheap and fast evaluation
measures?
Beyond generic summarization
 Query focused, updates, blogs, meetings, speech…
2
Relevant papers:
 Lacatusu et al. LCC's GISTexter at DUC 2006: Multi-Strategy Multi-Document Summarization. In Proceedings of the Document Understanding Workshop (DUC 2006).
 McKeown et al. Columbia Multi-Document Summarization: Approach and Evaluation. In Proceedings of the Document Understanding Conference (DUC 2001), 2001.
 Nenkova et al. Can You Summarize This? Identifying Correlates of Input Difficulty for Multi-Document Summarization. In Proceedings of ACL-08: HLT.
3
More about DUC 2002 data…
 /project/cis/nlp/tools/Summarization_Data/Inputs2002
 Newswire texts
 Has 3 categories of inputs!
4
DUC 2002 input categories
 Single event - 30 inputs
 Eg: d061- Hurricane Gilbert
 Same place, roughly same time, same actions
 Multiple distinct events – 15 inputs
 Eg: d064 – Opening of McDonald's in Russia, Canada, South Korea…
 Different places, different times, different agents
 Biographies – 15 inputs
 Eg: d065 – Dan Quayle, Bush’s nominee for vice president
 One person – one event, background info – events from the
past
Do you think a single method will do well for all?
5
Tf-idf summary - d061
Hurricane Gilbert Heads Toward Dominican Coast .
Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a
hurricane Saturday night.
Gilbert Reaches Jamaican Capital With 110 Mph Winds .
Hurricane warnings were posted for the Cayman Islands, Cuba and Haiti.
Hurricane Hits Jamaica With 115 mph Winds; Communications.
Gilbert reached Jamaica after skirting southern Puerto Rico, Haiti and the
Dominican Republic.
Gilbert was moving west-northwest at 15 mph and winds had decreased to 125
mph.
What Makes Gilbert So Strong?
With PM-Hurricane Gilbert, Bjt .
Hurricane Gilbert Heading for Jamaica With 100 MPH Winds .
Tropical Storm Gilbert
6
Tf-idf summary - d064
First McDonald's to Open in Communist Country .
Police Keep Crowds From Crashing First McDonald's .
McDonald's and Genex contribute $1 million each for the flagship
restaurant.
A Bolshoi Mac Attack in Moscow as First McDonald's Opens .
McDonald's Opens First Restaurant in China .
McDonald's hopes to open a restaurant in Beijing later.
The 500-seat McDonald's restaurant in a three-story building is
operated by McDonald's Restaurant Shenzhen Ltd., a wholly
owned subsidiary of McDonald's Hong Kong.
McDonald's Hong Kong is a 50-50 joint venture with McDonald's in
the United States.
McDonald's officials say it is not a question that
7
Tf-idf summary - d065
Tucker was fascinated by the idea, Quayle said.
But Dan Quayle's got experience, too.
Quayle's Triumph Quickly Tarnished .
Quayle's Biography Inflates State Job; Quayle Concedes Error .
Her statement was released by the Quayle campaign.
But he would go no further in describing what assignments he
would give Quayle.
``I will be a very close adviser to the president,'' Quayle said.
``You're never going to see Dan Quayle telling tales out of
It was everything Quayle had hoped for.
Quayle had said very little and he had said it very well.
There are windows into the workings of the
8
Multi-strategy summarization
 Multiple summarization modules within a single
system
 Better than a single method
 How to employ a multi-strategy system?
 Use all methods, produce multiple summaries, choose
best
 Use a router and summarize by only one specific method
9
Produce multiple summaries and
choose – LCC GISTexter
 Task - Query focused summarization
 Query is decomposed by 3 methods
 Sent to a QA system and a multi-document
summarizer
 6 different summaries
 Select the best summary
 Textual entailment + pyramid scoring
10
Route to a specific module –
Columbia’s multi-document
summarizer
 Features to classify an input as
 Single event
 Biography
 Loosely connected documents
 The result of classification is used to route the input to
one of 3 different summarizers
11
Features - Single Event
 To identify
 Time span between publication dates < 80 days
 More than 50% documents published on same day
 To summarize
 Exploit redundancy, cluster similar sentences into themes (see the sketch after this list)
 Rank themes based on size, similarity, ranking of
contained sentences by lexical chains
 Select phrases from each theme
 Generate sentences
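A minimal Python sketch of the theme-clustering step, not Columbia's actual implementation: Jaccard word overlap and the 0.3 threshold are illustrative stand-ins for the system's richer sentence similarity and lexical-chain ranking.

```python
def word_overlap(s1, s2):
    """Jaccard overlap between the word sets of two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def cluster_into_themes(sentences, threshold=0.3):
    """Greedy grouping: a sentence joins the first theme that contains
    a sufficiently similar sentence, else it starts a new theme."""
    themes = []
    for sent in sentences:
        for theme in themes:
            if any(word_overlap(sent, other) >= threshold for other in theme):
                theme.append(sent)
                break
        else:
            themes.append([sent])
    # bigger themes = content repeated across documents = more important
    return sorted(themes, key=len, reverse=True)
```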
12
Features - Biographies
 To identify
 Frequency of the most frequent capitalized word > X
(compensating for named entities)
 Frequency of personal pronouns > Y
 To summarize
 Is the target individual mentioned in the sentence?
 Is another individual found in the sentence?
 Position of the most prominent capitalized word in the sentence
13
Features – Weakly related
documents
 To identify
 Neither single event nor biographical
 To summarize
 Words likely to be used in first paragraphs, i.e. important words – learnt from corpus analysis
 Verb specificity
 Semantic themes – WordNet concepts
 Positional and length features
 More weight to recent articles
 Downweight sentences with pronouns
14
Characterizing/ Classifying inputs
 Important if you want to route to a specialized
summarizer
 Classification can be made along several lines
 Theme of input – Columbia’s summarizer
 Scientific/ News articles
 Long/ Short documents
 News articles about events/ Editorials
 Difficult/ Easy ??
15
Input difficulty and Performance
Confidence Estimation
 Some inputs are more difficult than others
– Most summarizers produce poor summaries for these
inputs
16
Input to summarizer
Some inputs are easier than others!
Average system scores obtained on different inputs for 100-word summaries
(data: DUC 2001; score range 0–4):
mean 0.55 | min 0.07 | max 1.65
17
Input difficulty &
Content coverage scores
 Content coverage score
 extent of coverage of important content
 Poor content selection → low score
 If most summaries for an input get a low score...
 Most systems could not identify important content
 “Difficult input”
18
Did system performance vary with DUC
2001 input categories?
Multi-document inputs were from 5 categories. A set of documents describing...

Cohesive / “on topic” inputs:
 Single event – the Exxon Valdez oil spill
 Subject – mad cow disease
 Biographical – Elizabeth Taylor

Non-cohesive / “multiple facets” inputs:
 Multiple distinct events – different occasions of police misconduct
 Opinion – views of the senate, public, congress, lawyers etc. on the decision by the senate to count illegal aliens in the 1990 census

Single task – generic summarization
19
Input type influenced scores obtained
 Biographical
 Single event
 Subject
are easier to summarize than
 Multiple distinct events
 Opinions
20
Cohesive inputs are easier to summarize
Scores for cohesive inputs are significantly* higher than those for non-cohesive inputs at 100, 200 and 400 words.
*One-sided t-tests, 95% significance level
Cohesive:
 Biographical
 Single event
 Subject
Non-cohesive:
 Multiple distinct events
 Opinions
21
Inputs can be easy or difficult =>
 Better summarizers ~ different methods to
summarize different inputs
 multi-strategy
 Enhancing user experience ~ system can flag
summaries that are likely to be poor in content
 low system confidence on difficult inputs
22
First step..
 What characterizes difficult inputs?
 Find useful features
 Can we identify difficult inputs with high accuracy?
 Classification task – difficult vs. easy
23
Features – Simple length-based
Smaller inputs ~ less loss of information
~ better summaries
Number of sentences
~ information to be captured in the summary
Vocabulary size
~ number of unique words
24
Features – Word distributions in input
% of words used only once
~ lexical repetition
less repetition of content ~ difficult inputs
Type-token ratio
~ lexical variation in the input
fewer types ~ easy inputs
Entropy of the input
H(X) = − Σ_{i=1}^{n} p(w_i) · log₂ p(w_i)
~ descriptive words ~ high probabilities ~ low entropy ~ easy
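A toy sketch of these three word-distribution features, assuming simple whitespace tokenization:

```python
import math
from collections import Counter

def word_distribution_features(input_text):
    """Toy versions of the three word-distribution features."""
    tokens = input_text.lower().split()
    counts = Counter(tokens)
    n_tokens, n_types = len(tokens), len(counts)
    probs = [c / n_tokens for c in counts.values()]
    return {
        "pct_words_used_once": sum(c == 1 for c in counts.values()) / n_types,
        "type_token_ratio": n_types / n_tokens,
        "entropy": -sum(p * math.log2(p) for p in probs),
    }
```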
25
Features – Document similarity and
relatedness
documents with overlapping content ~ easy input
Pair-wise cosine overlap (average, min, max)
~ similarity of the documents
cos(v₁, v₂) = (v₁ · v₂) / (‖v₁‖ ‖v₂‖)
v₁, v₂ – tf-idf weight vectors of the content words of the two documents
High cosine overlaps
 overlapping content
 easy to summarize
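A sketch of the pairwise overlap computation; raw term frequencies stand in here for the tf-idf weights of the slide:

```python
import math
from collections import Counter
from itertools import combinations

def cosine(v1, v2):
    """Cosine between two sparse vectors stored as dicts (term -> weight)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm = (math.sqrt(sum(w * w for w in v1.values()))
            * math.sqrt(sum(w * w for w in v2.values())))
    return dot / norm if norm else 0.0

def pairwise_cosine_stats(documents):
    """Average, min and max cosine over all document pairs in an input."""
    vecs = [Counter(doc.lower().split()) for doc in documents]
    sims = [cosine(a, b) for a, b in combinations(vecs, 2)]
    return sum(sims) / len(sims), min(sims), max(sims)
```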
26
Features – Document similarity and
relatedness
tightly-bound by topic ~ easy input
KL Divergence
~ distance from a large collection of random documents
KL divergence = Σ_{w ∈ Inp} p_inp(w) · log₂ [ p_inp(w) / p_coll(w) ]
coll – all documents from all tasks of DUC 2001 to 2006
Difference between 2 language models
 input & random collection
Greater divergence
 input is unlike random documents, tightly bound input
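A sketch of the divergence computation; flooring the background probability at eps is an illustrative smoothing choice, not necessarily the one used in the paper:

```python
import math
from collections import Counter

def kl_divergence(input_words, collection_words, eps=1e-9):
    """KL(input || collection) between unigram models; the background
    model is floored at eps so unseen words do not zero the ratio."""
    inp, coll = Counter(input_words), Counter(collection_words)
    n_inp, n_coll = sum(inp.values()), sum(coll.values())
    kl = 0.0
    for w, c in inp.items():
        p = c / n_inp
        q = max(coll.get(w, 0) / n_coll, eps)
        kl += p * math.log2(p / q)
    return kl
```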
27
Features – Log likelihood ratio based
more topic terms, similar topic terms ~ topic-oriented, easy
input
Number of topic signature terms
Percentage of topic signatures in the vocabulary
~ control for length of the input
Pair-wise topic signature overlap (average, min, max)
~ similarity between the topic vectors of different documents
~ cosine overlap with reduced & specific vectors
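A sketch of these derived features, assuming the signature terms were already extracted with the log-likelihood ratio test (Lin and Hovy, 2000); Jaccard overlap stands in here for the slide's cosine over reduced vectors:

```python
from itertools import combinations

def topic_signature_features(doc_word_sets, signature_terms):
    """doc_word_sets: one set of words per document in the input.
    signature_terms: precomputed topic signature terms for the input."""
    sigs = set(signature_terms)
    vocab = set().union(*doc_word_sets)
    doc_sigs = [words & sigs for words in doc_word_sets]  # signatures per doc
    overlaps = [len(a & b) / len(a | b) if a | b else 0.0
                for a, b in combinations(doc_sigs, 2)]
    return {
        "num_topic_signatures": len(sigs),
        "pct_signatures_in_vocab": len(sigs & vocab) / len(vocab),
        "sig_overlap_avg": sum(overlaps) / len(overlaps),
        "sig_overlap_min": min(overlaps),
        "sig_overlap_max": max(overlaps),
    }
```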
28
What makes some inputs easy?
Easy inputs have
 smaller vocabulary
 smaller entropy
 greater divergence from a random collection
 higher % of topic signatures in the vocabulary
 higher average cosine and topic signature overlap
29
Input difficulty hypothesis for
systems
Indicator of an input’s difficulty
 Average system coverage score
 Difficult, if most systems select poor content
Defining difficulty of inputs
 2 classes
 Above / below the “mean average system score”
 > mean score – easy
 < mean score – difficult
 This split gives two roughly equal classes
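A minimal sketch of this labeling scheme:

```python
def label_difficulty(avg_scores):
    """avg_scores: {input_id: average system coverage score}.
    Inputs at or below the mean of these averages are 'difficult'."""
    mean = sum(avg_scores.values()) / len(avg_scores)
    return {inp: "easy" if score > mean else "difficult"
            for inp, score in avg_scores.items()}
```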
30
Classification Results
Baseline performance : 50%
Test set: DUC 2002 - 04
10 fold cross validation on 192 observations
Precision and recall of difficult inputs:
Accuracy: 69.27% | Precision: 0.696 | Recall: 0.674
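A scikit-learn sketch of this evaluation setup; logistic regression and the synthetic data are placeholders (the slides do not name the classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(192, 10))    # stand-in: one feature row per input
y = rng.integers(0, 2, size=192)  # stand-in: 0 = easy, 1 = difficult

clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, X, y, cv=10).mean()  # 10-fold accuracy
print(f"{acc:.3f}")  # ~0.5 on random data; 69.27% reported on real features
```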
31
Summary Evaluation without
Human Models
 Current Evaluation Measures - Recap
 Content Coverage
 Pyramid
 Responsiveness
 ROUGE
* My work with Ani
32
Need for cheap, fast measures
 All current evaluations require human effort
 Human summaries (content overlap, pyramid, rouge)
 Manual marking of summaries (responsiveness)
 Human summaries are biased
 several summaries for the same input are needed to
remove bias (Pyramid, ROUGE)
 Can we come up with cheaper evaluation techniques
that will produce the same rankings for systems as
human evaluations?
33
Compare with input – No human
models
 Estimate closeness of summary to input
 The closer a summary is to the input, the better its content should be
 How do we verify this?
 Design some features that can reflect how close a
summary is to the input
 Rank summaries based on the value of this feature
 Compare the obtained rankings to rankings given by
humans
 Similar rankings (high correlation) – you have succeeded
34
 What features should we use?
 We want to know how well a summary reflects the input's content.
 Guesses?
35
Features - Divergence between
input and summary
Smaller divergence ~ better summary
 KL divergence input – summary
 KL divergence summary – input
 Jensen Shannon Divergence
JSD(Inp ‖ Summ) = H(½·Inp + ½·Summ) − ½·H(Inp) − ½·H(Summ)
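A sketch computing JSD between the unigram distributions of input and summary:

```python
import math
from collections import Counter

def unigram(words):
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def jsd(input_words, summary_words):
    """H(0.5*Inp + 0.5*Summ) - 0.5*H(Inp) - 0.5*H(Summ)."""
    p, q = unigram(input_words), unigram(summary_words)
    mix = {w: 0.5 * p.get(w, 0.0) + 0.5 * q.get(w, 0.0)
           for w in set(p) | set(q)}
    return entropy(mix) - 0.5 * entropy(p) - 0.5 * entropy(q)
```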
36
Features – Use of topic words from
the input
More topic words ~ better summary
 % of summary composed of topic words
 % of input’s topic words carried over to the summary
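A sketch of both features, assuming the input's topic words are precomputed (e.g. its topic signatures):

```python
def topic_word_features(summary_words, input_topic_words):
    """Both arguments are lists of tokens."""
    topics = set(input_topic_words)
    return {
        "pct_summary_is_topic_words":
            sum(w in topics for w in summary_words) / len(summary_words),
        "pct_input_topics_in_summary":
            len(topics & set(summary_words)) / len(topics),
    }
```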
37
Features – Similarity between input
and summary
More similar to the input ~ better summary
 Cosine similarity
 input – summary words
 Cosine similarity
 input’s topic signatures – summary words
38
Features - Summary Probability
Higher likelihood of summary given input ~ better
summary
 Unigram summary probability
   P(summary) = p_Inp(w₁)^n₁ · p_Inp(w₂)^n₂ · … · p_Inp(w_r)^n_r
 Multinomial summary probability
   P(summary) = [ N! / (n₁! … n_r!) ] · p_Inp(w₁)^n₁ · p_Inp(w₂)^n₂ · … · p_Inp(w_r)^n_r
where
   r – summary vocabulary size
   n_i – count in the summary of word w_i
   n₁ + … + n_r = N, the summary size
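A sketch computing both quantities in log space to avoid underflow; the floor for words unseen in the input model is an illustrative choice:

```python
import math
from collections import Counter

def log_summary_probability(summary_words, input_probs,
                            multinomial=False, floor=1e-12):
    """log2 of the summary's probability under the input's unigram model
    input_probs (word -> p_Inp(word)); with multinomial=True the
    coefficient N!/(n1!...nr!) is included (computed via lgamma)."""
    counts = Counter(summary_words)
    logp = sum(n * math.log2(max(input_probs.get(w, 0.0), floor))
               for w, n in counts.items())
    if multinomial:
        N = sum(counts.values())
        logp += (math.lgamma(N + 1)
                 - sum(math.lgamma(n + 1) for n in counts.values())
                 ) / math.log(2)
    return logp
```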
39
Analysis of features
 The value of the feature will be the score for the
summary
 Average the feature values for a particular system over
all inputs
 Compare to average human score
 Spearman (rank) correlation
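A sketch of this comparison using scipy's Spearman correlation; the dict-based interface is illustrative:

```python
from scipy.stats import spearmanr

def rank_agreement(feature_by_system, human_by_system):
    """Both dicts map system_id -> score averaged over all inputs.
    Returns Spearman's rho between the two induced system rankings."""
    systems = sorted(feature_by_system)
    rho, _ = spearmanr([feature_by_system[s] for s in systems],
                       [human_by_system[s] for s in systems])
    return rho
```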
40
Results
TAC 2008 Query focused summarization
48 inputs, 57 systems
Spearman correlation of each feature with manual scores:

Feature                              | Pyramid  | Responsiveness
JSD                                  | -0.8803  | -0.7364
% of input's topic words in summary  | -0.8741  | -0.8249
KL div summary–input                 | -0.7629  | -0.6941
Cosine overlap                       |  0.7117  |  0.6469
% of summary that is topic words     |  0.7115  |  0.6015
KL div input–summary                 | -0.6875  | -0.5850
Unigram summary probability          | -0.1879  | -0.1006
Multinomial summary probability      |  0.2224  |  0.2353
41
Evaluation without human models
 Comparison with input – correlates well with human
judgements
 Cheap, fast, unbiased
 No human effort needed
42
Other summarization tasks of
interest
 Update summaries
 The user has read a set of documents A
 Produce a summary of updates from a set B of documents
published later in time
 Query focused
 A topic statement is given to focus content selection
43
Other summarization tasks of
interest
 Blog/ Opinion Summarization
 Mine opinions, good/ bad product reviews etc
 Meeting/ Speech Summarization
 How would you summarize a brainstorming session? :)
44
What you have learnt today..
 How simple features you already know can be put to
use for interesting applications
 Beyond a simple sentence extractor engine –
customizing for inputs/ user/ task-setting is
important
 There are a lot of interesting tasks in summarization
and language processing
45