Approaches to automatic summarization
Lecture 5

Types of summaries
– Extracts: sentences from the original document are displayed together to form a summary
– Abstracts: material is transformed: paraphrased, restructured, shortened

Extractive summarization
Each sentence is assigned a score that reflects how important and content-bearing it is.

Data-driven approaches
– Do not use any domain knowledge or external resources
– Importance "emerges" from the data
– Probabilistic models of word occurrence and sentence similarity

Sentence ranking options
Based on word probability
– S is a sentence of length n
– p_i is the probability of the i-th word in the sentence

  weight(S) = (1/n) · Σ_{i=1..n} log(p_i)

Based on word tf.idf

  weight(S) = (1/n) · Σ_{i=1..n} tf.idf_i

Centrality measures
How representative is a sentence of the overall content of a document?
– The more similar a sentence is to the rest of the document, the more representative it is

  centrality(S_i) = (1/K) · Σ_{j≠i} sim(S_i, S_j)

Data-driven approach
Unsupervised: no information about what constitutes a desirable choice.
How can supervised approaches be used?
– For example, the scientific article summarization paper from last week

Rhetorical status
What is the purpose of the sentence? To communicate:
– Background
– Aim
– Basis (related work)
How can we know which sentence serves each aim?
– Rhetorical zones
– Distribution of categories

Selecting important sentences (relevance)
How well can it be performed by people?
– Rather subjective; depends on prior knowledge and interests
– Even the same person would select 50% different sentences when performing the task at different times
– Still, judgments can be solicited from several people to mitigate the problem
For each sentence in an article, say whether it is important and interesting enough to be included in a summary.

Annotated data
80 computational linguistics articles
Can be used to train classifiers:
– Given a sentence, which rhetorical class does it belong to?
– Given a sentence, should it be included in the summary or not?

Features
Location
– Absolute location of the sentence
– Section structure: first sentence, last sentence, other
– Paragraph structure
What section the sentence appeared in
– Introduction, implementation, example, conclusion, result, evaluation, experiment, etc.
Sentence length
– Very long and very short sentences are unusual
Title word overlap
Tf.idf word content
– Binary feature
– "yes" if the sentence contains one of the 18 most important words
– "no" otherwise
Presence and type of citation
Formulaic expressions
– "in traditional approaches", "a novel method for"

Important lessons for us
Vector representation of sentences
– Dimensions can be words
– But they can also be other features!
The probability of a sentence belonging to a class can be computed.
Complex distinctions can be accurately predicted using simple features.

Problems with ML for summarization
Annotation is expensive
– Here: relevance and rhetorical status judgments
People don't agree
– So more annotators are necessary
– And/or more training of the annotators
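The two data-driven ranking scores from the earlier slides can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the toy three-sentence document, the unsmoothed unigram word probabilities, and the choice of cosine similarity over bags of words are all assumptions made here for concreteness.

```python
# Sketch of two sentence-ranking scores:
#   weight(S)       = (1/n) * sum_i log(p_i)          (word probability)
#   centrality(S_i) = (1/K) * sum_{j != i} sim(S_i, S_j)
import math
from collections import Counter

def word_prob_score(sentence, word_probs):
    """Average log word probability of a sentence (words must be in word_probs)."""
    words = sentence.lower().split()
    return sum(math.log(word_probs[w]) for w in words) / len(words)

def cosine_sim(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centrality(i, bags):
    """Average similarity of sentence i to the K other sentences."""
    sims = [cosine_sim(bags[i], bags[j]) for j in range(len(bags)) if j != i]
    return sum(sims) / len(sims)

# Toy document: three sentences, whitespace-tokenized.
doc = ["the cat sat on the mat",
       "the cat ate the fish",
       "dogs bark loudly"]

# Unsmoothed unigram probabilities estimated from the document itself.
all_words = [w for s in doc for w in s.split()]
counts = Counter(all_words)
word_probs = {w: c / len(all_words) for w, c in counts.items()}

# Rank sentences by centrality, most representative first.
bags = [Counter(s.split()) for s in doc]
ranked = sorted(range(len(doc)), key=lambda i: centrality(i, bags), reverse=True)
```

On this toy input, the off-topic third sentence shares no words with the others, so it gets centrality 0 and ranks last, which is exactly the behavior the centrality measure is meant to capture.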
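The feature-based classification idea above can be sketched as a simple feature extractor that maps each sentence to a vector. The feature set here is a simplified subset of the slides' list (location, first-sentence flag, length, title-word overlap, citation presence); the title-word set, the citation regex, and all names are illustrative assumptions, not the actual definitions from the annotated-corpus paper.

```python
# Sketch: turn a sentence into a feature vector for a
# "include in summary?" or rhetorical-class classifier.
import re

TITLE_WORDS = {"summarization", "automatic"}  # assumed title words for the toy example

def features(sentence, position, n_sentences):
    """Feature vector: [relative location, is-first, length,
    title-word overlap, has-citation]."""
    words = sentence.lower().split()
    return [
        position / n_sentences,                # absolute location in the document
        1.0 if position == 0 else 0.0,         # structural: first sentence?
        len(words),                            # sentence length
        len(set(words) & TITLE_WORDS),         # title word overlap
        1.0 if re.search(r"\(\w+, \d{4}\)", sentence) else 0.0,  # citation like (Luhn, 1958)
    ]

doc = ["Automatic summarization selects the key content of a document.",
       "Earlier systems used hand-built rules (Luhn, 1958).",
       "We evaluate on 80 articles."]

# One feature vector per sentence; these rows would be fed,
# together with gold labels, to any standard classifier.
X = [features(s, i, len(doc)) for i, s in enumerate(doc)]
```

The point of the sketch is the lesson from the slides: the vector dimensions need not be words; cheap structural features like these can feed the same classifiers.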