Methods for Automatic Evaluation of Sentence Extract Summaries

G.Ravindra+, N.Balakrishnan+, K.R.Ramakrishnan*
Supercomputer Education & Research Center+
Department of Electrical Engineering*
Indian Institute of Science
Bangalore-INDIA
Agenda

- Introduction to Text Summarization
  - Need for summarization, types of summaries
- Evaluating Extract Summaries
  - Challenges in manual and automatic evaluation
- Fuzzy Summary Evaluation
- Complexity Scores

What is Text Summarization

- Reductive transformation of source text to summary text by content generalization and/or selection
- Loss of information
  - What can be lost and what should not be lost
  - How much can be lost
  - What is the size of the summary
- Types of summaries
  - Extracts and abstracts
- Influence of genre on the performance of a summarization algorithm
  - Newswire stories are favorable to sentence-position-based selection
Need for Summarization

- Explosive growth in availability of digital textual data
  - Books in digital libraries, mailing-list archives, on-line news portals
- Duplication of textual segments in books
  - E.g.: 10 introductory books on quantum physics have a number of paragraphs common to all of them (syntactically different but semantically the same)
- Hand-held devices
  - Small screens and limited memory
  - Low-power devices and hence limited processing capability
  - E.g.: Streaming a book from a digital library to a hand-held device
- Production of information is faster than consumption
Types of Summaries

- Extracts
  - Text selection, e.g. paragraphs from books, sentences from editorials, phrases from e-mails
  - Application of statistical techniques
- Abstracts
  - Text selection followed by generalization
  - Need for linguistic processing, e.g. converting a sentence to a phrase
- Indicative Summaries
  - Give a general idea of the topic of discussion in the text being summarized
- Generic Summaries
  - Independent of genre
- Informational Summaries
  - Serve as a surrogate for the original text
Evaluating Extract Summaries

- Manual evaluation
  - Human judges score a summary on a well-defined scale based on well-defined criteria
  - Subject to the judge's understanding of the subject
  - Depends on the judge's opinions
    - Guidelines constrain opinions
  - Individual judges' scores are combined to generate the final score
  - Re-evaluation might result in different scores
  - Logistic problems for researchers
Automatic Evaluation

- Machine-based evaluation
  - Consistent over multiple runs
  - Fast, and avoids logistic problems
  - Suitable for researchers experimenting with new algorithms
- Flip side
  - Not as accurate as human evaluation
  - Should be used as a precursor to a detailed human evaluation
  - Must algorithmically handle various sentence constructs and linguistic variants
Fuzzy Summary Evaluation: FuSE

- Proposes the use of fuzzy union theory to quantify the similarity of two extract summaries
  - Similarity between the reference (human-generated) summary and the candidate (machine-generated) summary is evaluated
- Each sentence is a fuzzy set
  - Each sentence in the reference summary has a membership grade in every sentence of the candidate (machine-generated) summary
  - The membership grade of a reference-summary sentence in the candidate summary is the union of its membership grades across all candidate-summary sentences
  - Membership grades are used to compute an f-score value (see the sketch after this list)
  - The membership grade is computed from the Hamming distance between two sentences, based on collocations
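A minimal sketch of this style of computation, assuming word-bigram overlap as a stand-in for the collocation-based membership grade and `max` as the fuzzy union; all function names are illustrative and not taken from the original work.

```python
def bigrams(sentence):
    """Adjacent lowercase word pairs, used here as a stand-in for collocations."""
    words = sentence.lower().split()
    return set(zip(words, words[1:]))

def membership(sentence, other):
    """Illustrative membership grade in [0, 1]: fraction of `sentence`'s
    bigrams that also occur in `other`."""
    s_b, o_b = bigrams(sentence), bigrams(other)
    if not s_b:
        return 0.0
    return len(s_b & o_b) / len(s_b)

def fuse_score(reference, candidate):
    """Fuzzy precision, recall and f-score over two lists of sentences.
    The fuzzy union is taken as `max` here; FuSE itself uses Frank's s-norm."""
    def union_grade(sentence, summary):
        # Union of the sentence's membership grades across all summary sentences.
        return max((membership(sentence, s) for s in summary), default=0.0)

    recall = sum(union_grade(r, candidate) for r in reference) / len(reference)
    precision = sum(union_grade(c, reference) for c in candidate) / len(candidate)
    if precision + recall == 0.0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Example:
reference = ["hurricane gilbert devastated dominican republic and parts of cuba"]
candidate = ["tropical storm gilbert destroyed parts of havana"]
print(fuse_score(reference, candidate))
```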
Fuzzy F-score

Quantities used in the fuzzy precision, fuzzy recall and fuzzy f-score formulas:
- Candidate summary sentence set
- Reference summary sentence set
- Union function
- Membership grade of a candidate sentence in a reference sentence
Choice of Union Operator

- Propose the use of Frank's s-norm operator
  - Allows combining partial matches non-linearly
- Membership grade of a sentence in a summary is made dependent on its length
  - Automatically builds a brevity bonus into the scheme

Quantities involved in the Frank's s-norm operator:
- Damping coefficient
- Mean of non-zero membership grades for a sentence
- Sentence length
- Length of the longest sentence
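For reference, the standard Frank s-norm (t-conorm) with base $s$ is shown below; how FuSE ties the base and damping coefficient to sentence length is not recoverable from this text version, so only the generic operator is given.

```latex
% Standard Frank t-conorm (s-norm) for membership grades a, b in [0, 1]:
\[
  S_s(a, b) = 1 - \log_s\!\left( 1 + \frac{\left(s^{1-a} - 1\right)\left(s^{1-b} - 1\right)}{s - 1} \right),
  \qquad s > 0,\; s \neq 1 .
\]
% As s -> 1 this reduces to the probabilistic sum a + b - ab.
```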
Characteristics of Frank's base (figure)

Performance of FuSE for various sentence lengths (figure)
Dictionary-enhanced Fuzzy Summary Evaluation: DeFuSE

- FuSE does not understand sentence similarity based on synonymy and hypernymy
  - Identifying synonymous words makes evaluation more accurate
  - Identifying hypernymous word relationships allows "gross information" to be considered during evaluation
  - Note: very deep hypernymy trees could result in topic drift and hence improper evaluation
- Uses WordNet
Example: Use of Hypernymy

HURRICANE GILBERT DEVASTATED DOMINICAN REPUBLIC AND PARTS OF CUBA
-> (PHYSICAL PHENOMENON) GILBERT (DESTROY, RUIN) (REGION) AND PARTS OF (REGION)

TROPICAL STORM GILBERT DESTROYED PARTS OF HAVANA
-> TROPICAL (PHYSICAL PHENOMENON) GILBERT DESTROYED PARTS OF (REGION)
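A minimal sketch of the kind of WordNet-based generalization illustrated above, assuming NLTK's WordNet interface; replacing each word with a hypernym a fixed number of levels up its first sense's hypernym chain is an illustrative simplification, not the exact DeFuSE procedure.

```python
# Requires the WordNet corpus: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def hypernym_label(word, levels=1):
    """Replace a word with the name of a hypernym synset `levels` steps up
    its (first-sense) hypernym chain; fall back to the word itself if
    WordNet has no entry. Keeping `levels` small limits topic drift."""
    synsets = wn.synsets(word)
    if not synsets:
        return word
    synset = synsets[0]              # most common sense (illustrative choice)
    for _ in range(levels):
        hypernyms = synset.hypernyms()
        if not hypernyms:
            break
        synset = hypernyms[0]
    return synset.name().split('.')[0].upper()

def generalize(sentence, levels=1):
    """Map each word of a sentence to its hypernym label."""
    return ' '.join(hypernym_label(w, levels) for w in sentence.lower().split())

# Example (output depends on the installed WordNet data):
# generalize("hurricane gilbert devastated parts of cuba", levels=2)
```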
Complexity Score

- Attempts to quantify a summarization algorithm by the difficulty of generating a summary of a particular accuracy
  - Generating a 9-sentence summary from a 10-sentence document is very easy
    - An algorithm that randomly selects 9 sentences will have a worst-case accuracy of 90%
    - A complicated AI+NLP-based algorithm cannot do any better
  - If a 2-sentence summary is to be generated from a 10-sentence document, there are 45 possible candidates, of which only one is accurate (see the quick check below)
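A quick numerical check of the two cases above, using only Python's standard library; nothing here is specific to the authors' method.

```python
from math import comb

# 2-sentence summary from a 10-sentence document:
# number of possible candidate extracts, only one of which is fully accurate.
print(comb(10, 2))          # 45
print(1 / comb(10, 2))      # ~0.022 chance that a random pick is exact

# 9-sentence summary from a 10-sentence document: only 10 possible extracts,
# so any random pick is already very close to the reference.
print(comb(10, 9))          # 10
```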
Computing Complexity Score

- Probability of generating a summary of length m1 that contains l1 accurate sentences, when the human summary has h sentences and the document being summarized has n sentences
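The corresponding formula did not survive extraction; under the natural assumption that the m1 candidate sentences are drawn uniformly at random from the n document sentences, the probability described above takes the standard hypergeometric form (symbols as defined in the bullet above):

```latex
\[
  P(l_1 \mid m_1, h, n) \;=\; \frac{\binom{h}{l_1}\,\binom{n-h}{m_1 - l_1}}{\binom{n}{m_1}}
\]
```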
Complexity Score (Cont.)

- To compare two summaries of equal length, the performance of one relative to the baseline is expressed in terms of this probability
Complexity Score (Cont.)

- The complexity of generating a 10% extract with 12 correct sentences is higher than that of generating a 30% extract with 12 correct sentences (figure)
Conclusion

- Summary evaluation is as complicated as summary generation
- Fuzzy schemes are ideal for evaluating extract summaries
- Use of synonymy and hypernymy relations improves evaluation accuracy
- The complexity score is a new way of looking at summary evaluation