Using the Annotated Bibliography as a Resource for Indicative Summarization

Min-Yen Kan*, Judith L. Klavans** and Kathleen R. McKeown*
{min, judith, kathy}@cs.columbia.edu
Columbia University
* Department of Computer Science
** Center for Research on Information Access

Third International Conference on Language Resources and Evaluation, Las Palmas, Canary Islands, Spain, 29-31 May 2002

Selected Summary Dimensions

1. Extract versus Abstract
2. Informative versus Indicative
3. Generic versus Query-biased
4. Single document versus Multiple document

Corpus | Extract vs. Abstract | Indicative vs. Informative | Generic vs. Query-based | Single vs. Multidocument | Uses Metadata? | Corpus vs. Algorithm
DUC | Both | Informative | Generic | Both | Yes | Corpus
Ziff Davis | Mostly Extract | Informative | Generic | Single | No | Corpus
Scientific Abstracts | Abstract | Informative | Generic | Single | No | Corpus
Snippets | Abstract | Indicative | Both | Single | Yes | Algorithm
Card Catalog Entries | Abstract | Indicative | Generic | Single | Yes | Corpus
Annotated Bibliography Entries | Abstract | Both | Both | Mostly Single | Yes | Corpus

News Summaries

DUC provides a large corpus for informative summaries. Jing and McKeown (1999) use source document and target summary relations for "cut and paste" summarization.

Scientific Summaries

A number of studies have used abstracts of scientific articles as target summaries (e.g., Kupiec et al. 1995). Abstracts tend to summarize the document's topics well but make little use of metadata.

Snippets

Snippets are short indicative descriptions given by authors of web pages. They are often very short (e.g., on Yahoo! or ODP category pages). Amitay (2000) shows strategies for locating and extracting snippets and for ranking them by fitness as a summary.

Card Catalog Entries

Card catalog entries consist of structured fields, of which a summary is an optional field. Other types of information (such as notes, book jacket texts, or book reviews) are often substituted for summaries.

Annotated Bibliography Entries

Annotated bibliography entries are indicative summaries. They are:
- longer than both card catalog summaries and snippets
- organized around a theme; an ideal standard for "query-based" summaries
- explicit in comparing one resource with another
- prefaced by overviews of the documents in the bibliography
- rich in meta-information features

We study them as models for summaries by examining prescriptive guidelines and performing a corpus study.

Prescriptive Guidelines

We catalogued the information recommended by 5 prescriptive guidelines for annotated bibliography entries: Ree70, EBC98, Les01, AACC98 and Wil02.

Corpus Study

We performed a study of 100 entries, counting each semantic tag and the percentage of entries that contain it.

Feature | # tags in corpus | % entries with tag

Topicality Features
Detail | 139 | 47%
Overview | 72 | 64%
Topic | 34 | 28%

Metadata and Other Features
Media Type | 55 | 48%
Author/Editor | 43 | 27%
Content Types/Special Feature | 41 | 29%
Subjective Assess/Coverage | 36 | 24%
Authority/Authoritativeness | 26 | 20%
Background/Source | 21 | 16%
Navigation/Internal Structure | 16 | 11%
Collection Size | 13 | 10%
Purpose | 13 | 10%
Audience | 12 | 12%
Contributor | 12 | 12%
Cross-resource comparison | 10 | 9%
Size/Length | 9 | 7%
Style | 8 | 6%
Query Relevance | 4 | 3%
Readability | 4 | 3%
Difficulty | 4 | 4%
Edition/Publication Information | 3 | 3%
Language | 2 | 2%
Copyright | 2 | 1%
Award/Quality/Defects | 2 | 1%

Corpus Collection & Encoding

Our language resource of annotated bibliography entries was designed to ease the collection of the corpus as well as to make many features available for subsequent analysis for summarization and related natural language applications.

Presently:
- 1200 documents containing "annotated bibliography" were spidered
- of those, 64 documents were hand parsed, yielding 2000 entries
- of those 2000, 100 of the parsed <entry> tags were further annotated with semantic tags

Example record:

    <bibEntry id="id26" title="Analysis of covariance"
              url="http://www.math.yorku.ca/SCS/biblio.html"
              type="paper" domain="statistics"
              microCollection="Analysis of Covariance" offset="4">
      <beforeContext>Maxwell, S. E., Delaney, H. D., & O'Callaghan, M. F. (1993). Analysis of...</beforeContext>
      <entry><OVERVIEW>This <MEDIATYPES>paper</MEDIATYPES> gives a brief history of ANCOVA, and then discusses ANCOVA in ... contains no matrix algebra.</DIFFICULTY></entry>
      <parsedEntry>PROB 14659 -112.252 0 TOP -112.252 S 105.049 NP-A -8.12201 NPB -7.82967 DT 0 This NN 0 paper ...</parsedEntry>
    </bibEntry>

Attributes and fields:
- url: location of the source document
- domain: the subject or theme, at a coarser granularity than the title
- microCollection: the internal division in the page that this entry belongs to
- offset: the position of the entry on the page
- <beforeContext>: the text before the body of the entry
- <entry>: the text with the 24 semantic tags
- <parsedEntry>: Collins '96 parse of the entry

Other fields, also optional:
- <afterContext>: text that is distinctly marked off as coming after the entry
- <macroCollection>: the division that the page represents in the set of related pages

Corpus Availability

The corpus is available for academic and not-for-profit research, by request to: <min@cs.columbia.edu>

An annotation guide, explaining the annotation tagging guidelines in more detail, is also available. Command-line and web CGI utilities are also provided to modify, insert and extract attributes from the corpus.
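
As an illustration of how records in this encoding might be processed, the following is a minimal Python sketch, not the utilities distributed with the corpus. It reads bibEntry attributes and tallies, for each semantic tag, the number of occurrences and the percentage of entries containing it, mirroring the two columns of the corpus study. Everything beyond the poster's example is an assumption: the <corpus> wrapper element, the second record and its example.org URL, the DETAIL tag name, the point at which OVERVIEW closes, and the helper names entry_attributes and tag_statistics.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified sample. The first record is modeled on the poster's
# example; the <corpus> wrapper, the second record, the DETAIL tag name and
# the closing position of OVERVIEW are assumptions made for illustration.
SAMPLE = """<corpus>
  <bibEntry id="id26" title="Analysis of covariance"
            url="http://www.math.yorku.ca/SCS/biblio.html"
            type="paper" domain="statistics"
            microCollection="Analysis of Covariance" offset="4">
    <entry><OVERVIEW>This <MEDIATYPES>paper</MEDIATYPES> gives a brief
    history of ANCOVA.</OVERVIEW></entry>
  </bibEntry>
  <bibEntry id="id27" title="Example" url="http://example.org/biblio.html"
            type="book" domain="statistics"
            microCollection="Example" offset="5">
    <entry><OVERVIEW>An introductory text.</OVERVIEW>
    <DETAIL>Covers regression.</DETAIL>
    <DETAIL>Includes exercises.</DETAIL></entry>
  </bibEntry>
</corpus>"""


def entry_attributes(bib_entry):
    """Return the attributes of one bibEntry element as a plain dict."""
    return dict(bib_entry.attrib)


def tag_statistics(root):
    """Map each semantic tag to (occurrences, percent of entries containing it)."""
    tag_counts = Counter()
    entries_with_tag = Counter()
    entries = root.findall("./bibEntry/entry")
    for entry in entries:
        # Collect every element nested inside <entry>, excluding <entry> itself.
        tags = [el.tag for el in entry.iter() if el is not entry]
        tag_counts.update(tags)
        entries_with_tag.update(set(tags))  # count each tag once per entry
    total = len(entries)
    return {
        tag: (tag_counts[tag], 100.0 * entries_with_tag[tag] / total)
        for tag in tag_counts
    }


root = ET.fromstring(SAMPLE)
stats = tag_statistics(root)

for bib_entry in root.findall("bibEntry"):
    attrs = entry_attributes(bib_entry)
    print(attrs["id"], attrs["domain"], attrs["offset"])

for tag, (count, percent) in sorted(stats.items()):
    print(f"{tag}: {count} tags, in {percent:.0f}% of entries")
```

Counting a tag once per entry for the percentage while counting every occurrence for the total explains how a tag can have more occurrences than there are entries, as with Detail in the corpus study.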