Text Summarization as Controlled Search
Terry COPECK
School of Information Technology & Engineering
University of Ottawa
150 Louis Pasteur, P.O. Box 450 Stn. A
Ottawa, Ontario, Canada K1N 6N5
terry@site.uottawa.ca
Nathalie JAPKOWICZ
School of Information Technology & Engineering
University of Ottawa
150 Louis Pasteur, P.O. Box 450 Stn. A
Ottawa, Ontario, Canada K1N 6N5
nat@site.uottawa.ca
Stan SZPAKOWICZ
School of Information Technology & Engineering
University of Ottawa
150 Louis Pasteur, P.O. Box 450 Stn. A
Ottawa, Ontario, Canada K1N 6N5
szpak@site.uottawa.ca
Paper ID: NAACL-2001-XXXX
Keywords: machine learning, automatic evaluation
Contact Author: Terry Copeck
Under consideration for other conferences (specify)? no
Abstract
We present a framework for text summarization based on the generate-and-test model. A
large set of summaries, generated by enumerating all plausible values of six parameters, was
assessed against the abstract included in the document. The number of summaries involved dictated the use of automated means (supervised machine learning) to make this assessment. Results suggest
the method of key phrase extraction is the most, and possibly only, important variable.
Theme Area: Evaluation and Text/Training Corpora
Wordcount: 3,800
Text Summarization as Controlled Search
Abstract
We present a framework for text
summarization based on the generate-and-test model. A large set of summaries was
generated by enumerating all plausible
values of six parameters. Quality was
assessed by measuring them against the
abstract included in the summarized
document. The large number of summaries
involved dictated the use of automated
means to make this assessment; we chose to
use supervised machine learning. Results
suggest that the method of key phrase
extraction is the most, and possibly the only,
important variable. Machine learning
identified parameters and ranges of values
within which the summary generator might
be used with diminished reliability to
summarize new texts for which no author's
abstract exists.
Introduction
Text summarization compresses a document into a smaller précis of the information it contains. Its goal
is to include the most important facts in that
précis. Summarization can be achieved by
constructing a new document from elements not
necessarily present in the original. Alternatively,
it can be seen as a search for the most appropriate
elements, usually sentences, that should be
extracted for inclusion in the summary.
This paper takes the second approach. It
describes a partially automated search engine for
summary building. The engine has two
components. The Summary Generator takes as
input a text and a number of parameters that tune
generation in a variety of ways, and produces a
single summary of the text. While each set of
parameter values is unique, it is possible that
two or more sets will produce the same
summary. The Generation Controller evaluates
the summary output in order to determine the
best parameters for use in generation.
Summary generation has been designed to
combine a number of publicly-available text
processing programs to perform each of three
main stages in its operation. Many parameter
settings are possible. Just selecting among the
available programs produces 18 possible input
combinations; given that many of these
programs accept or require additional
parameterization themselves, evaluating the
possible set of inputs manually becomes
impossible.
Generation control is achieved by supervised
machine learning that associates the parameters
of the summary generation process with
expressions of the generated summary’s quality.
The training data consist of the parameter
settings used to generate each summary, along
with a rating of that summary. The rating is
obtained by comparing automatically generated
summaries against the author’s abstract, which
we treat as the “gold standard” for the document
on the grounds that no one is better able to
summarize a document than its author. The
control process is iterative: once bad parameter
values have been identified, the controller is
applied to the remaining good values in a further
attempt to refine the choice of parameters.
While the summary generator is currently fully
automated, the controller was applied in a semi-manual way. We plan to automate the entire
system in the near future.
The remainder of this paper is organized as follows. Section 1 describes the overall architecture of our system. Sections 2 and 3 present its two main components, the Summary Generator and the Generation Controller, in greater detail. Section 4 presents our results, discussing the parameters and combinations of parameters identified as good or bad. The final section describes refinements we are planning to introduce.
[Figure 1. Overall System Architecture. The Summary Generator takes the text to be summarized and a set of parameter values, and passes each summary with its parameter values to the Generation Controller; there the Summary Assessor rates the summary against the author abstract, and the Parameter Assessor derives the best parameters from the accumulated ratings. A future feedback loop returns these parameters to the generator.]
1 Overall Architecture
The overall architecture of the system is shown
in Figure 1. Its two main components are the
Summary Generator and the Generation
Controller. The Summary Generator takes as
inputs the text to be summarized and a certain
number of parameter values. It outputs one or
more summaries. These are then fed into the
Generation Controller along with the parameter
values used to produce them. The Generation
Controller has two subcomponents. The
Summary Assessor takes the input summary
along with a “gold standard” summary—the
abstract written by the document’s author—and
issues a rating of the summary with respect to
the abstract. The Parameter Assessor takes an
accumulated set of parameter values used to
generate summaries together with their
corresponding ratings and outputs the set of best
and worst combinations of parameters and
values with respect to the ratings. We plan to implement an iterative feedback loop that would feed the output of the Parameter Assessor back into the parameter values input to both the Summary Generator and the Generation Controller, repeating the process several times in order to refine the best possible parameter values to control summarization.
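To make the dataflow in Figure 1 concrete, the sketch below shows the generate-and-test loop in Python. The function and parameter names are ours, and the three components are passed in as callables rather than implemented, so this is an outline of the control flow, not the actual system.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

def controlled_search(
    text: str,
    author_abstract: str,
    parameter_grid: Dict[str, list],
    generate_summary: Callable[..., str],                  # Summary Generator
    rate_against_abstract: Callable[[str, str], float],    # Summary Assessor
    learn_parameter_rules: Callable[[List[Tuple[dict, float]]], object],  # Parameter Assessor
):
    """Generate one summary per parameter combination, rate it against the
    author's abstract, and learn rules over the (parameters, rating) pairs."""
    records = []
    for values in product(*parameter_grid.values()):
        params = dict(zip(parameter_grid.keys(), values))
        summary = generate_summary(text, **params)
        rating = rate_against_abstract(summary, author_abstract)
        records.append((params, rating))
    return learn_parameter_rules(records)
```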
2 The Summary Generator
The Summary Generator is based on the
assumption that a summary composed of
adjacent, thematically related sentences will
convey more information than one in which each
sentence is independently chosen from the
document as a whole and, as a result, likely bears no relation to its neighbors. This assumption dictated both an abstract methodology and a design architecture for this
component: one of our objectives is to validate
this intuitive but unproven restriction of
summary extraction.
Let us first discuss the methodology. A text to
be summarized is initially a) broken down into a
series of sequences of sentences or segments
which talk about the same topic. The set of
segments is then rated in some way. We do this
by b) identifying the most important key phrases
in the document and c) ranking each segment on
its overall number of matches for these key
phrases. A requested number of sentences is
then selected from one or more of the top-ranked
segments and shown to the user in document
order as the summary.
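As a rough illustration of steps a) to c), the sketch below ranks segments by their matches for the document's top key phrases. The segmenter and key phrase extractor are passed in as interchangeable modules; all names here are ours, not the system's.

```python
from typing import Callable, List

def rank_segments(
    text: str,
    segment: Callable[[str], List[List[str]]],       # a) split the text into segments (lists of sentences)
    extract_keyphrases: Callable[[str], List[str]],   # b) key phrases for the whole document, best first
    keycount: int,
) -> List[List[str]]:
    """c) Order segments by how often the top keycount key phrases occur
    in their sentences, most matches first."""
    segments = segment(text)
    keys = extract_keyphrases(text)[:keycount]

    def hits(seg: List[str]) -> int:
        return sum(sentence.count(key) for sentence in seg for key in keys)

    return sorted(segments, key=hits, reverse=True)
```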
Because one of our objectives was to investigate the policy of taking a number of sentences from within one or more segments rather than those individually rated best, we wanted to control for other factors in the summarization process. To accomplish this we decided to design a configurable system with alternative modules for each of the three main tasks identified above and, where possible, to use for these tasks programs which are well known to the research community because they are in the public domain or are made freely available by their authors. In this way we hoped to be able to evaluate our assumption effectively.

Three text segmenters were adopted: the original Segmenter from Columbia University (Kan et al. 1998), Hearst's (1997) TextTiling program, and the C99 program (Choi 2000). Three more programs were used to pick out keywords and key phrases from the body of a text: Waikato University's Kea (Witten et al. 1999), Extractor from the Canadian NRC's Institute for Information Technology (Turney 2000), and NPSeek, a program developed locally at the University of Ottawa.
Once a list of key phrases has been assembled, the next step is to count the number of matches found in each segment. The obvious way is to find exact matches for a key in the text. However, there are other ways, and arguments for using them. To illustrate, consider the word total. Instances in text include total up the bill, the total weight and compute a grand total. While different parts of speech, each of these uses is an exact match for the key. What about totally, totals and totalled? Should they be considered to match total? If the answer is yes, what then should be done when the key rather than the match contains a suffix? Counting hits based on agreement between the root of a keyword and the root of a match is called stem matching. The Kea program suite contains modules which perform both exact and stem matching, and each of these modules was repackaged as a standalone program to perform this task in our system. The architecture of the summarization component appears in Figure 2, which shows graphically that, independent of other parameter settings, eighteen different combinations of programs can be invoked to summarize a text.

[Figure 2. Summary Generator Architecture: the text to be summarized passes through segmenting (Segmenter, TextTiling or C99), keyphrase extraction (Kea, Extractor or NPSeek) and matching (exact or stem), producing ranked segments and the summary.]
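To make the two match types concrete, here is a minimal sketch. It substitutes a crude suffix-stripping function for the stem-matching module repackaged from Kea, so the details are illustrative only.

```python
from typing import List

def stem(word: str) -> str:
    """Crude stemmer for illustration: strips a few common suffixes so that
    'totals', 'totally' and 'totalled' all reduce to 'total'."""
    for suffix in ("led", "ly", "s", "ed", "ing"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def count_hits(segment_words: List[str], key: str, matchtype: str = "exact") -> int:
    """Count matches for a single-word key in a segment, exactly or by stem."""
    if matchtype == "exact":
        return sum(1 for w in segment_words if w == key)
    return sum(1 for w in segment_words if stem(w) == stem(key))
```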
The particular segmenter, keyphrase extractor (keyphrase) and type of matching (matchtype) to use are three of the parameters required to generate a summary. Three others must also be specified: the number of sentences to include in the summary (sentcount), the number of keyphrases to use in ranking sentences and segments (keycount), and the minimum number of matches in a segment (hitcount). The last parameter is used to test the hypothesis that a summary composed of adjacent, thematically related sentences will be better than one made from individually selected sentences. The Summary Generator constructs a summary by taking the most highly rated sentences from the segment with the greatest average number of keyphrases, so long as each sentence has at least the number of instances of keyphrases specified by hitcount. When this is no longer true, the generator takes sentences from the next most highly rated segment, until the required number of sentences has been accumulated.
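The selection policy just described might be sketched as follows, assuming the segments have already been ranked and that the number of key phrase instances in each sentence is known; the names are ours.

```python
from typing import Dict, List

def select_sentences(
    ranked_segments: List[List[str]],   # best-rated segment first
    key_hits: Dict[str, int],           # sentence -> number of key phrase instances
    sentcount: int,
    hitcount: int,
) -> List[str]:
    """Take the most highly rated sentences from the best segment while each
    has at least hitcount key phrase instances; when that is no longer true,
    move to the next best segment, until sentcount sentences are gathered.
    (The final summary is shown to the user in document order.)"""
    chosen: List[str] = []
    for segment in ranked_segments:
        for sentence in sorted(segment, key=lambda s: key_hits.get(s, 0), reverse=True):
            if key_hits.get(sentence, 0) < hitcount:
                break                      # remaining sentences in this segment have fewer hits
            chosen.append(sentence)
            if len(chosen) == sentcount:
                return chosen
    return chosen  # fewer than sentcount sentences qualified
```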
3 The Generation Controller

3.1 The Summary Assessment Component

Summary assessment requires establishing how effectively a summary represents the original document. This task is generally considered hard (Goldstein et al. 1999). First of all, assessment is highly subjective: different people are very likely to rate a summary in different ways. Summary assessment is also goal-dependent: how will the summary be used? Much more knowledge of a domain can be presumed in a specialized audience than in the general public, and an effective summary must reflect this. Finally, and perhaps most important when many summaries are produced automatically, assessment is very labor-intensive: the assessor needs to read the full document and all summaries. It is difficult to recruit volunteers for such pursuits, and very costly to hire assessors.
As a result of these considerations, we focused
on automating summary assessment. The
procedure turns out to be pretty straightforward,
but the criteria and the assumptions behind them
are arbitrary. In particular, it is not clear whether
the author's abstract can really serve as the “gold
standard” for a summary; we assume that
abstracts in journal papers are of high enough
quality (Mittal et al. 1999). We work with 75
academic papers published in 1993-1996 in the
on-line Journal of Artificial Intelligence
Research. It also seems that, although it lacks both domain-specific semantic knowledge and statistics derived from a corpus, an analysis of a document based on key phrases at least relies on the statistically motivated shallow semantics embodied in key phrase extractors.
A list of keyphrases is constructed for each
paper’s abstract; a keyphrase is any unbroken
sequence of tokens between stop words (taken
from a stop list of 980 common words). To
distinguish such a sequence from the product of
key phrase extraction, we refer to it as a key
phrase in abstract, or KPIA. Sentences in the
body of a paper are ranked on their total number
of KPIAs. Rating a summary is somewhat more
complicated than rating an individual sentence,
in that it seems appropriate to assign value not
only to the absolute number of KPIAs in the
summary, but also to its degree of coverage of
the set of KPIAs. Accordingly, we construct a
set of KPIAs appearing in the summary,
assigning a weight of 1.0 to each new KPIA
which is added to the set, and 0.5 to each which
duplicates a KPIA already in the set. A
summary’s score is thus:
Score = 1.0 * KPIA_unique + 0.5 * KPIA_duplicate
The total is normalized for summary length and
KPIA coverage in the document [1] by dividing it
by the total KPIA count of an equal number of
those sentences in the document with the highest
KPIA ratings.
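A small sketch of this rating, assuming the KPIA occurrences in the summary and the per-sentence KPIA counts of the document have already been extracted; the denominator is the 'nearly best' total discussed in the next paragraph.

```python
from typing import List

def kpia_score(summary_kpias: List[str]) -> float:
    """Weight the first occurrence of each KPIA 1.0 and every duplicate 0.5."""
    seen = set()
    score = 0.0
    for kpia in summary_kpias:
        score += 1.0 if kpia not in seen else 0.5
        seen.add(kpia)
    return score

def normalized_rating(summary_kpias: List[str],
                      sentence_kpia_counts: List[int],
                      sentcount: int) -> float:
    """Divide the summary's score by the total KPIA count of the sentcount
    document sentences with the most KPIA instances."""
    best = sorted(sentence_kpia_counts, reverse=True)[:sentcount]
    denominator = float(sum(best))
    return kpia_score(summary_kpias) / denominator if denominator else 0.0
```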
Often the sentences with the highest KPIA
instance counts do provide the best possible
summary of a document. However, because our
scoring technique gives unduplicated KPIAs
twice the weight of duplicates, sometimes a
slightly higher score could be achieved by taking
sentences with the most unique, rather than the
most, KPIA instances. Computing the highest
possible score for a document which can run to
hundreds of sentences is computationally quite
intensive because it involves looking at all
possible combinations of sentcount sentences.
The ‘nearly best’ summary rating computed by simply totaling the KPIA instances for those sentences was deemed an adequate approximation for purposes of normalization.

[1] The use of an abstract has been defended on the grounds that no one is better able to summarize a document than its author; a criticism of some force is the fact that on average 24% of KPIAs do not occur in the body of the 75 texts under study (σ = 13.3).
3.2 The Parameter Assessment Component
The purpose of the Parameter Assessor is to
identify the best and worst parameters for use in
the Summary Generator so that our summary
assessment procedure gives best results. This
task can be construed as supervised machine
learning. Its input is a set of summary descriptions (including the parameters that were used to generate each summary) along with the
KPIA rating of each summary. Its output is a set of rules determining which summary
characteristics (including, again, the parameters
with which the summary was generated) yield
good or bad ratings. These rules can then be
interpreted so as to guide our summary
generation procedure, in particular, by telling us
which parameters to use and which to avoid in
future work with the same type of documents.
Because we needed rules that are easy to interpret, we chose to use the C5.0 decision tree and rule classifier, one of the state-of-the-art techniques currently available (Quinlan 1993). Our (summary description, summary rating) pairs then had to be formatted according to C5.0's requirements. In particular, we chose to represent each summary by the six parameters that were used to generate it:
- the segmenter (Segmenter, TextTiling, C99)
- the key phrase extractor (Extractor, NPSeek, Kea)
- the match type (exact or stemmed match)
- keycount, the number of key phrases used in the summary
- sentcount, the number of sentences used in the summary
- hitcount, the minimum number of matches for a sentence to be in the summary
The KPIA ratings obtained from the Summary
Assessor are continuous values and C5.0
requires discrete class values, so the ratings had
to be discretized. We did so using the k-means
clustering technique (MacQueen 1967), whose
purpose is to discover different groupings in a
set of data points and assign the same discrete
value to each element within the same grouping
but different values to elements of different
groupings. We tried dividing the KPIA values
into 2, 3, 4 and 5 groupings but ended up
working only with the division into two discrete
values because the use of additional discrete
values led to less reliable results. [2]

[2] The unreliability of results when using more than
two discrete values is caused by the fact that our
summary description is not complete (in particular,
we did not include any description of the original
document). We are planning to augment this
description in the near future and believe that such a
step will improve the results obtained when using
more than two discrete values.
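A minimal one-dimensional k-means sketch of the two-way discretization described above; the initialization and iteration count are our choices, not those used in the experiment.

```python
from typing import List

def two_means_discretize(ratings: List[float], iterations: int = 50) -> List[int]:
    """Cluster scalar ratings into two groups with 1-D k-means and return a
    0/1 label per rating (1 = the cluster with the higher centre, i.e. 'good')."""
    centres = [min(ratings), max(ratings)]          # simple initialization at the extremes
    for _ in range(iterations):
        groups: List[List[float]] = [[], []]
        for r in ratings:
            groups[abs(r - centres[0]) > abs(r - centres[1])].append(r)
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return [int(abs(r - centres[0]) > abs(r - centres[1])) for r in ratings]
```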
The C5.0 classifier was then run on the prepared
data. This experiment was conducted on 75
different documents, each summarized in 1944
different ways (all plausible combinations of the
values of the 6 parameters that we considered).
Our full data set thus contained 75 * 1944 =
145,800 (summary description, summary rating) pairs. In order to assess the quality
of the rules C5.0 produced, we divided this set
so that 75% of the data was used for training and
25% of it was used for testing. The results of this
process are reported and discussed in the
following section.
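This setup might be reproduced roughly as follows; we use scikit-learn's DecisionTreeClassifier as an accessible stand-in for C5.0 (which induces rules rather than a bare tree), so the sketch shows the procedure, not the authors' actual tooling.

```python
# 'records' is assumed to pair each summary's six parameter values with its
# discretized rating, e.g. ({'segmenter': 'c99', 'keyphrase': 'kea',
# 'matchtype': 'exact', 'keycount': 7, 'sentcount': 25, 'hitcount': 2}, 'good').
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_parameter_rules(records):
    """75%/25% split over (parameters, class) pairs and a decision tree in place
    of C5.0; returns the fitted tree and its accuracy on the held-out quarter."""
    params, labels = zip(*records)
    features = DictVectorizer(sparse=False).fit_transform(params)  # one-hot encode the categorical parameters
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.25, random_state=0)
    tree = DecisionTreeClassifier(min_samples_leaf=50)
    tree.fit(X_train, y_train)
    return tree, tree.score(X_test, y_test)
```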
4 Identification of Good and Bad Parameters
Figure 3 presents highlights of the results obtained by running C5.0 on our database of automatically generated summaries and their associated ratings to produce a division into good or bad summaries. A discussion of these results follows.

The value in parentheses near each rule declaration indicates how many of the 109,350 training examples were covered by the rule. The value in square brackets near the outcome of each rule tells us the percentage of cases in which the application of this rule was accurate. Altogether, these rules gave us a 7.8% error rate when tested on the testing set, and we therefore consider them reliable.
It is interesting that the rules with the highest
coverage and accuracy are negative: they show
combinations of parameters that yield bad
summaries (Rules 5-9 in Figure 3). In particular,
these rules advise us not to use NPSeek as a
keyphrase extractor (Rule 5), not to generate
summaries containing 20 or fewer sentences
(Rule 6), and not to rely on fewer than four
different key phrases to select best sentences and
best segments for inclusion in the summary
(Rule 7). Furthermore, we should not in any case require that sentences within a segment have more than three key phrase matches before shifting to the next best segment (Rule 9), nor more than two key phrase matches when the number of keys involved is seven or fewer (Rule 8). These rules seem useful and reasonable. Rule
6 appears excessive, most probably because the
assessment procedure gives a bad rating to a
summary that contains few of the key phrases
from the author's abstract. Rule 8, on the other
hand, seems particularly reasonable.
The positive rules (Rules 1-4 in Figure 3)
determine combinations of parameters that give
good summaries. These rules are less definitive
in terms of coverage and accuracy.

Rule 1: (cover 3298)  keyphrase = k, keycount > 3, sentcount > 20, hitcount <= 3  -> class good  [0.672]
Rule 2: (cover 3987)  keyphrase = k, keycount > 5, sentcount > 20  -> class good  [0.665]
Rule 3: (cover 5965)  keyphrase = k, sentcount > 20  -> class good  [0.591]
Rule 4: (cover 4071)  keyphrase = e, sentcount > 20, hitcount <= 3  -> class good  [0.554]
Rule 5: (cover 36200)  keyphrase = n  -> class bad  [0.989]
Rule 6: (cover 90596)  sentcount <= 20  -> class bad  [0.958]
Rule 7: (cover 17990)  keycount <= 3  -> class bad  [0.933]
Rule 8: (cover 36315)  keycount <= 7, hitcount > 2  -> class bad  [0.927]
Rule 9: (cover 36302)  hitcount > 3  -> class bad  [0.915]

Figure 3. Rules Discovered by C5.0

Still, they
suggest that Kea is the best key phrase extractor,
while Extractor can be good. All four positive
rules confirm the conclusion of Rule 6, that
summaries of no more than 20 sentences are not
good. Rules 1, 2, 4 express the general need to
keep high the number of key phrases used to
select best sentences and best segments for
inclusion in the summary (keycount), and to
keep low the number of key phrase matches
within a segment for its sentences to be selected
for inclusion in the summary (hitcount). Rule 1,
just like Rule 8, demonstrates the interplay
between keycount and hitcount.
Interestingly, two parameters do not figure in
any rule, either negative or positive: segmenter (which segmentation technique is used to segment the original text) and matchtype (whether key phrases are matched exactly or on their stems). If this
observation can be confirmed in larger
experiments, we will be able to generate
summaries without bothering to experiment with
two of the three stages in summarization. While
this conclusion is not intuitive, it may turn out
that only keyphrase extraction is important.
Conclusions and Future Work
This paper has presented a framework for text
summarization based on the generate-and-test
model. We generated a large set of summaries
by enumerating all plausible values of six
parameters, and we assessed the quality of these
summaries by measuring them against the
abstract included in the original document.
Because of the high number of automatically
generated summaries, the adequacy of
parameters and their combinations could only be
determined using automated means; we chose
supervised machine learning. The results of
learning suggest that the key phrase extraction
method is the most, and possibly the only,
important variable. The summary generator
could perhaps be used (with a certain amount of
reliability) within these ranges of parameter
values on new texts for which no author's
abstract exists.
We will need to consider at least four questions
in the near future.
- Feedback loop: We only went through a
single iteration of the Parameter Assessment
Component. We should keep the parameters
and their values that we found useful, throw
in new parameters and their values and repeat
the process. This could be done several times,
until no improvement resulted from the
addition of new parameters or new values.
- Better summary description and summary
ratings: For the time being, we consider all
documents equivalent in terms of how
summary generation should be applied. Yet,
it is possible that different documents require
different methodologies. A very simple
example is that of long versus short texts or
dense versus colloquial ones. It is very
possible that short or colloquial texts require
shorter summaries than long or dense ones.
Some description of the summarized
document could be included in the summary
description. The ratings too could be
normalized to take the size of the summary
into consideration, or altogether new
measures of summary quality could be
devised.
- Full automation: In the present system, the
linking of the various components of the
system had to be done manually. This was
time-consuming, and could make the
implementation of the feedback loop
mentioned earlier quite tedious. An important
goal for our system would then be to
automate the linkage between the different
components and let it run on its own.
- Integration of other techniques/methodologies into the Summary Generator: Our current system only
experimented with a restricted number of
methodologies and techniques. Once we had
a fully automated system, nothing would
prevent us from trying a variety of new
approaches to text summarization that could
then be assessed automatically until an
optimal one has been devised.
References
Choi, F. 2000. Advances in domain independent
linear text segmentation. To appear in Proceedings
of ANLP/NAACL-00.
Hearst, M. 1997. TextTiling: Segmenting text into
multi-paragraph subtopic passages. Computational
Linguistics 23 (1), pp. 33-64.
Goldstein, J., M. Kantrowitz, V. Mittal, and J.
Carbonell. 1999. Summarizing Text Documents:
Sentence Selection and Evaluation Metrics. In
Proceedings of ACM/SIGIR-99, pp.121-128.
Kan, M.-Y., J. Klavans and K. McKeown. 1998.
Linear Segmentation and Segment Significance. In
Proceedings of WVLC-6, pp. 197-205.
Klavans, J., K. McKeown, M.-Y. Kan and S. Lee.
1998. Resources for Evaluation of Summarization
Techniques. In Proceedings of the 1st International
Conference on Language Resources and
Evaluation.
MacQueen, J. 1967. Some methods for classification
and analysis of multivariate observations. In
Proceedings of the Fifth Berkeley Symposium on
Mathematical statistics and probability, (1), pp.
281-297.
Mittal, V., M. Kantrowitz, J. Goldstein & J.
Carbonell. 1999. Selecting Text Spans for
Document Summaries: Heuristics and Metrics. In
Proceedings of AAAI-99, pp. 467-473.
Quinlan, J.R. 1993. C4.5: Programs for Machine
Learning. Morgan Kaufmann, San Mateo, CA.
Turney, P. 2000. Learning algorithms for keyphrase
extraction. Information Retrieval, 2 (4), pp. 303-336.
Witten, I.H., G. Paynter, E. Frank, C. Gutwin and C.
Nevill-Manning. 1999. KEA: Practical automatic
keyphrase extraction. In Proceedings of DL-99, pp.
254-256.