Using the Annotated Bibliography as a Resource for Indicative Summarization

Min-Yen Kan*, Judith L. Klavans** and Kathleen R. McKeown*
{min, judith, kathy}@cs.columbia.edu
* Department of Computer Science, ** Center for Research on Information Access
Columbia University
Annotated Bibliography Entries

Our language resource of annotated bibliography entries was designed to ease the collection of the corpus as well as to make many features available for subsequent analysis for summarization and related natural language applications.

Annotated bibliography entries are indicative summaries. They:
- are longer than both card catalog summaries and snippets
- are organized around a theme, making them an ideal standard for "query-based" summaries
- contain explicit comparisons of one resource versus another
- contain prefacing overviews of the documents in the bibliography
- are rich in meta-information features

Presently:
- 1200 documents containing the phrase "annotated bibliography" were spidered
- of those, 64 documents were hand parsed, yielding 2000 entries
- of those 2000 entries, 100 of the parsed <entry> tags were further annotated with semantic tags

Selected Summary Dimensions

1. Extract versus Abstract
2. Informative versus Indicative
3. Generic versus Query-biased
4. Single document versus Multiple document

The table below compares annotated bibliography entries with other summary corpora along these dimensions.

Corpus                         | Extract vs. Abstract | Indicative vs. Informative | Generic vs. Query-based | Single vs. Multidocument | Metadata? | Uses (Corpus vs. Algorithm)
DUC                            | Both                 | Informative                | Generic                 | Both                     | Yes       | Corpus
Ziff Davis                     | Mostly Extract       | Informative                | Generic                 | Single                   | No        | Corpus
Scientific Abstracts           | Abstract             | Informative                | Generic                 | Single                   | No        | Corpus
Snippets                       | Abstract             | Indicative                 | Both                    | Single                   | Yes       | Algorithm
Card Catalog Entries           | Abstract             | Indicative                 | Generic                 | Single                   | Yes       | Corpus
Annotated Bibliography Entries | Abstract             | Both                       | Both                    | Mostly Single            | Yes       | Corpus
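
To make the four dimensions concrete, the sketch below records a corpus's position on each dimension as a small Python record and instantiates the annotated bibliography row of the table above. The class and field names are our own illustrative choices, not part of the released resource.

    from dataclasses import dataclass

    @dataclass
    class SummaryCorpusProfile:
        """One row of the comparison table above (illustrative encoding)."""
        name: str
        extract_vs_abstract: str        # "Extract", "Abstract", "Mostly Extract", or "Both"
        indicative_vs_informative: str  # "Indicative", "Informative", or "Both"
        generic_vs_query_based: str     # "Generic", "Query-based", or "Both"
        single_vs_multidocument: str    # "Single", "Multidocument", "Mostly Single", or "Both"
        has_metadata: bool
        uses: str                       # "Corpus" or "Algorithm"

    # The annotated bibliography entry row from the table:
    abe = SummaryCorpusProfile(
        name="Annotated Bibliography Entries",
        extract_vs_abstract="Abstract",
        indicative_vs_informative="Both",
        generic_vs_query_based="Both",
        single_vs_multidocument="Mostly Single",
        has_metadata=True,
        uses="Corpus",
    )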
Related Summary Types

Snippets are short indicative descriptions of web pages given by web page authors; they are often very short (e.g., entries on Yahoo! or ODP category pages). Amitay (2000) shows strategies for locating and extracting snippets and for ranking candidates by their fitness as a summary.

Scientific Summaries: a number of studies use abstracts of scientific articles as target summaries (e.g., Kupiec et al. 1995). Abstracts tend to summarize the document's topics well but make little use of metadata.

News Summaries: DUC provides a large corpus of informative summaries. Jing and McKeown (1999) use relations between source documents and their target summaries for "cut and paste" summarization.

Card Catalog Entries consist of structured fields, of which the summary is an optional field. Other types of information (such as notes, book jacket texts, or book reviews) are often substituted for summaries.

We study annotated bibliography entries as models for summaries by examining prescriptive guidelines and performing a corpus study.

Prescriptive Guidelines

We catalogued the information recommended for annotated bibliography entries by five prescriptive guidelines: Ree70, EBC98, Les01, AACC98, and Wil02. For each type of recommended information, we recorded which of the five guides mentions it.
Corpus Collection & Encoding

Each entry is encoded in XML together with its context and metadata. An example record:

<bibEntry id="id26" title="Analysis of covariance"
    url="http://www.math.yorku.ca/SCS/biblio.html" type="paper"
    domain="statistics" microCollection="Analysis of Covariance"
    offset="4">
  <beforeContext>Maxwell, S. E., Delaney, H. D., &
    O'Callaghan, M. F. (1993). Analysis of...</beforeContext>
  <entry><OVERVIEW>This <MEDIATYPES>paper</MEDIATYPES> gives a brief
    history of ANCOVA, and then discusses ANCOVA in ... contains no
    matrix algebra.</DIFFICULTY></entry>
  <parsedEntry>PROB 14659 -112.252 0 TOP -112.252 S 105.049 NP-A -8.12201
    NPB -7.82967 DT 0 This NN 0 paper ...</parsedEntry>
</bibEntry>

Attributes and fields:
- url: location of the source document
- domain: the subject or theme, at a coarser granularity than the title
- microCollection: the internal division in the page that this entry belongs to
- offset: the position of the entry on the page
- <beforeContext>: the text before the body of the entry
- <entry>: the text of the entry, marked up with the 24 semantic tags
- <parsedEntry>: the Collins '96 parse of the entry

Other fields, also optional:
- <afterContext>: text that is distinctly marked off as coming after the entry
- <macroCollection>: the division that the page represents in the set of related pages
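
As a minimal sketch of reading one such record, the code below uses Python's standard xml.etree.ElementTree. It is our own illustration, not one of the distributed utilities, and the embedded fragment is a simplified, well-formed stand-in for the example above (ampersand escaped, closing tags added).

    import xml.etree.ElementTree as ET

    RECORD = """
    <bibEntry id="id26" title="Analysis of covariance"
              url="http://www.math.yorku.ca/SCS/biblio.html" type="paper"
              domain="statistics" microCollection="Analysis of Covariance" offset="4">
      <beforeContext>Maxwell, S. E., Delaney, H. D., &amp; O'Callaghan, M. F. (1993).</beforeContext>
      <entry><OVERVIEW>This <MEDIATYPES>paper</MEDIATYPES> gives a brief history of
      ANCOVA</OVERVIEW> and <DIFFICULTY>contains no matrix algebra.</DIFFICULTY></entry>
    </bibEntry>
    """

    bib_entry = ET.fromstring(RECORD)

    # Metadata attributes (title, url, domain, microCollection, offset, ...)
    print(bib_entry.get("title"), bib_entry.get("domain"), bib_entry.get("offset"))

    # Semantic tags annotated inside <entry>, and the entry's plain text
    entry = bib_entry.find("entry")
    print([child.tag for child in entry])                # ['OVERVIEW', 'DIFFICULTY']
    print(" ".join("".join(entry.itertext()).split()))   # entry text with tags stripped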
Corpus Study

We performed a study of 100 entries, annotating them with the 24 semantic tags. The table below gives, for each tag, the number of tags in the corpus and the percentage of entries containing the tag.

Tag                             | # tags in corpus | % entries with tag
Topicality Features             |                  |
  Detail                        | 139              | 47%
  Overview                      |  72              | 64%
  Topic                         |  34              | 28%
Metadata and Other Features     |                  |
  Media Type                    |  55              | 48%
  Author/Editor                 |  43              | 27%
  Content Types/Special Feature |  41              | 29%
  Subjective Assess/Coverage    |  36              | 24%
  Authority/Authoritativeness   |  26              | 20%
  Background/Source             |  21              | 16%
  Navigation/Internal Structure |  16              | 11%
  Collection Size               |  13              | 10%
  Purpose                       |  13              | 10%
  Audience                      |  12              | 12%
  Contributor                   |  12              | 12%
  Cross-resource comparison     |  10              |  9%
  Size/Length                   |   9              |  7%
  Style                         |   8              |  6%
  Query Relevance               |   4              |  3%
  Readability                   |   4              |  3%
  Difficulty                    |   4              |  4%
  Edition/Publication Information |  3             |  3%
  Language                      |   2              |  2%
  Copyright                     |   2              |  1%
  Award/Quality/Defects         |   2              |  1%
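
The two columns measure different things: total tag occurrences in the 100 annotated entries versus the fraction of entries containing at least one instance of the tag. The sketch below, our own illustration assuming annotated entries are available as parsed <entry> elements (tag element names here are made up for the example), computes both statistics.

    from collections import Counter
    import xml.etree.ElementTree as ET

    def tag_statistics(entry_elements):
        """entry_elements: iterable of <entry> Elements with nested semantic tags."""
        occurrences = Counter()       # column 1: # tags in corpus
        entries_with_tag = Counter()  # numerator for column 2
        total_entries = 0
        for entry in entry_elements:
            total_entries += 1
            tags_here = [el.tag for el in entry.iter() if el is not entry]
            occurrences.update(tags_here)           # every instance counts
            entries_with_tag.update(set(tags_here)) # each entry counts once per tag
        percent = {t: 100.0 * n / total_entries for t, n in entries_with_tag.items()}
        return occurrences, percent

    entries = [ET.fromstring(
        "<entry><OVERVIEW>One <DETAIL>a</DETAIL> <DETAIL>b</DETAIL></OVERVIEW></entry>")]
    counts, pct = tag_statistics(entries)
    print(counts["DETAIL"], pct["DETAIL"])   # 2 occurrences, 100.0% of entries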
Corpus Availability
The corpus is available for academic and not-for-profit research, by request to:
<min@cs.columbia.edu>
An annotation guide, explaining the tagging guidelines in more detail, is also available.
Command-line and web CGI utilities are also provided to modify, insert, and extract
attributes from the corpus.
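
For readers without the distributed utilities, the stand-in below shows the kind of attribute extraction they support, using only standard Python; the corpus file name and the assumption that the root element contains <bibEntry> records are ours, not part of the release.

    import xml.etree.ElementTree as ET

    root = ET.parse("abe_corpus.xml").getroot()   # hypothetical file name
    for bib_entry in root.iter("bibEntry"):
        # Pull one attribute per entry, e.g. the source URL and its domain label.
        print(bib_entry.get("url"), bib_entry.get("domain"))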
Third International Conference on Language Resources and Evaluation, Las Palmas, Canary Islands, Spain, 29-31 May 2002