Methods for Automatic Evaluation of Sentence Extract Summaries
G. Ravindra+, N. Balakrishnan+, K. R. Ramakrishnan*
+Supercomputer Education & Research Center
*Department of Electrical Engineering
Indian Institute of Science, Bangalore, India

Agenda
- Introduction to text summarization: need for summarization, types of summaries
- Evaluating extract summaries: challenges in manual and automatic evaluation
- Fuzzy summary evaluation
- Complexity scores

What is Text Summarization?
- Reductive transformation of source text to summary text by content generalization and/or selection
- Loss of information
  - What can be lost and what should not be lost
  - How much can be lost
  - What is the size of the summary
- Types of summaries
  - Extracts and abstracts
  - Influence of genre on the performance of a summarization algorithm: newswire stories favor sentence-position-based selection

Need for Summarization
- Explosive growth in the availability of digital textual data
  - Books in digital libraries, mailing-list archives, on-line news portals
- Duplication of textual segments in books
  - E.g., 10 introductory books on quantum physics have a number of paragraphs common to all of them (syntactically different but semantically the same)
- Hand-held devices
  - Small screens and limited memory
  - Low-power devices and hence limited processing capability
  - E.g., streaming a book from a digital library to a hand-held device
- Production of information is faster than consumption

Types of Summaries
- Extracts and abstracts (independent of genre)
- Indicative summaries
  - Text selection followed by generalization
  - Need for linguistic processing, e.g., converting a sentence to a phrase
- Generic summaries
  - Text selection, e.g., paragraphs from books, sentences from editorials, phrases from e-mails
  - Application of statistical techniques
  - Gives a general idea of the topic of discussion in the text being summarized
- Informational summaries
  - Serve as a surrogate for the original text

Evaluating Extract Summaries: Manual Evaluation
- Human judges score a summary on a well-defined scale, based on well-defined criteria
  - Subject to the judge's understanding of the subject
  - Depends on the judges' opinions; guidelines constrain opinions
- Individual judges' scores are combined to generate the final score
- Re-evaluation might result in different scores
- Logistic problems for researchers

Automatic Evaluation
- Machine-based evaluation
  - Consistent over multiple runs
  - Fast, and avoids logistic problems
  - Suitable for researchers experimenting with new algorithms
- Flip side
  - Not as accurate as human evaluation
  - Should be used as a precursor to a detailed human evaluation
  - Must algorithmically handle various sentence constructs and linguistic variants

Fuzzy Summary Evaluation: FuSE
- Proposes the use of fuzzy union theory to quantify the similarity of two extract summaries
- The similarity between the reference (human-generated) summary and the candidate (machine-generated) summary is evaluated
- Each sentence is a fuzzy set
  - Each sentence in the reference summary has a membership grade in every sentence of the candidate summary
  - The membership grade of a reference-summary sentence in the candidate summary is the union of its membership grades across all candidate-summary sentences
- The membership grades are used to compute an F-score value
- The membership grade is a Hamming distance between two sentences, based on collocations

Fuzzy F-score
- Defined in terms of fuzzy precision and fuzzy recall (see the sketch below)
- Computed from the candidate-summary sentence set, the reference-summary sentence set, a union function, and the membership grade of a candidate sentence in a reference sentence

Choice of Union Operator
- Propose the use of Frank's S-norm operator
- Allows partial matches to be combined nonlinearly
- The membership grade of a sentence in a summary depends on its length
- Automatically builds a brevity bonus into the scheme

Frank's S-norm Operator
- Parameters: a damping coefficient, the mean of the non-zero membership grades for a sentence, the sentence length, and the length of the longest sentence
- Figures: characteristics of Frank's base; performance of FuSE for various sentence lengths
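The slides describe the FuSE score only in words, so the following is a minimal sketch of how the pieces above could fit together. Everything here is an assumption for illustration: the membership grade is a simple word-overlap ratio (the deck specifies a collocation-based Hamming distance whose exact form is not shown), the union is the textbook Frank t-conorm with a fixed base (the deck's damping coefficient and sentence-length-dependent base are omitted), and fuzzy precision/recall are taken as averages of membership grades with the F-score as their harmonic mean.

```python
import math

def membership_grade(sentence, other):
    """Illustrative membership grade of `sentence` in `other`: fraction of its words
    that also appear in `other`. (The deck uses a collocation-based Hamming distance;
    the exact form is not given in this text version.)"""
    words = set(sentence.lower().split())
    other_words = set(other.lower().split())
    return len(words & other_words) / len(words) if words else 0.0

def frank_s_norm(a, b, s=2.0):
    """Textbook Frank t-conorm (S-norm) with base s > 0, s != 1. The deck additionally
    ties the base to sentence length (brevity bonus); that refinement is omitted."""
    return 1.0 - math.log(1.0 + (s ** (1.0 - a) - 1.0) * (s ** (1.0 - b) - 1.0) / (s - 1.0), s)

def union(grades, s=2.0):
    """Fold one sentence's membership grades in the other summary through the S-norm."""
    combined = 0.0
    for g in grades:
        combined = frank_s_norm(combined, g, s)
    return combined

def fuse_f_score(reference, candidate, s=2.0):
    """Assumed reading of the slides: fuzzy precision averages the grades of candidate
    sentences in the reference summary, fuzzy recall averages the grades of reference
    sentences in the candidate summary, and the F-score is their harmonic mean."""
    precision = sum(union((membership_grade(c, r) for r in reference), s)
                    for c in candidate) / len(candidate)
    recall = sum(union((membership_grade(r, c) for c in candidate), s)
                 for r in reference) / len(reference)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Because 0 is the identity element of the S-norm, folding grades through it behaves like a soft maximum: several partial matches can combine into a strong match, which is the nonlinear combination of partial matches the deck refers to.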
Dictionary-enhanced Fuzzy Summary Evaluation: DeFuSE
- FuSE does not capture sentence similarity based on synonymy and hypernymy
- Identifying synonymous words makes evaluation more accurate
- Identifying hypernymous word relationships allows "gross information" to be taken into account during evaluation
- Note: very deep hypernymy trees could result in topic drift and hence improper evaluation
- Uses WordNet (a minimal lookup sketch appears after the Conclusion)

Example: use of hypernymy
- "HURRICANE GILBERT DEVASTATED DOMINICAN REPUBLIC AND PARTS OF CUBA" becomes "(PHYSICAL PHENOMENON) GILBERT (DESTROY, RUIN) (REGION) AND PARTS OF (REGION)"
- "TROPICAL STORM GILBERT DESTROYED PARTS OF HAVANA" becomes "TROPICAL (PHYSICAL PHENOMENON) GILBERT DESTROYED PARTS OF (REGION)"

Complexity Score
- Attempts to quantify a summarization algorithm by the difficulty of generating a summary of a particular accuracy
- Generating a 9-sentence summary from a 10-sentence document is very easy: an algorithm that randomly selects 9 sentences has a worst-case accuracy of 90%, and a complicated AI+NLP-based algorithm cannot do much better
- If a 2-sentence summary is to be generated from a 10-sentence document, there are 45 possible candidates, of which only one is fully accurate

Computing Complexity Score
- Based on the probability of generating a summary of length m1 containing l1 accurate sentences, when the human summary has h sentences and the document being summarized has n sentences (see the probability sketch at the end of the deck)

Complexity Score (cont.)
- To compare two summaries of equal length, the performance of one is measured relative to this baseline probability
- The complexity of generating a 10% extract with 12 correct sentences is higher than that of generating a 30% extract with 12 correct sentences

Conclusion
- Summary evaluation is as complicated as summary generation
- Fuzzy schemes are ideal for evaluating extract summaries
- The use of synonymy and hypernymy relations improves evaluation accuracy
- The complexity score is a new way of looking at summary evaluation
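DeFuSE's synonymy and hypernymy handling is described only at the level of the GILBERT example above. As a rough illustration of the kind of WordNet lookup involved, the sketch below walks a bounded hypernym chain with NLTK; the sense selection, depth limit, and matching rules used by DeFuSE itself are not specified in the deck and are assumptions here.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def hypernym_chain(word, max_depth=3):
    """Return up to max_depth hypernyms of the first noun sense of `word`.
    The depth is capped because very deep hypernymy trees cause topic drift
    (as noted on the DeFuSE slide)."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return []
    chain, current = [], synsets[0]
    for _ in range(max_depth):
        hypernyms = current.hypernyms()
        if not hypernyms:
            break
        current = hypernyms[0]
        chain.append(current.lemma_names()[0])
    return chain

# e.g. hypernym_chain("hurricane", max_depth=6) climbs through terms such as
# 'cyclone' and 'windstorm' towards a general class like 'physical_phenomenon',
# which is the kind of generalization shown in the GILBERT example.
```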
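The complexity-score formula itself is not reproduced in this text version of the slides. One plausible reading of the description ("probability of generating a summary of length m1 with l1 accurate sentences when the human summary has h sentences and the document has n sentences", under random sentence selection) is a hypergeometric probability. The sketch below implements that reading and reproduces the C(10, 2) = 45 example; the exact normalization used for the complexity score and for the relative-performance comparison is an assumption.

```python
from math import comb  # Python 3.8+

def random_summary_probability(n, h, m1, l1):
    """Probability that a random m1-sentence extract from an n-sentence document
    contains exactly l1 of the h sentences in the human (reference) summary.
    Hypergeometric form; assumed reading of the slide's description."""
    return comb(h, l1) * comb(n - h, m1 - l1) / comb(n, m1)

# A 2-sentence extract from a 10-sentence document has comb(10, 2) = 45 equally
# likely candidates, so the chance of hitting the exact 2 reference sentences is 1/45.
print(comb(10, 2))                              # 45
print(random_summary_probability(10, 2, 2, 2))  # 0.0222... = 1/45
```

The smaller this probability, the more credit an algorithm deserves for reaching that accuracy, which is consistent with the deck's claim that a 10% extract with 12 correct sentences is harder to produce than a 30% extract with the same 12.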