Approaches to automatic summarization
Lecture 5
Types of summaries

Extracts
– Sentences from the original document are displayed together to form a summary
Abstracts
– Material is transformed: paraphrased, restructured, shortened
Extractive summarization

Each sentence is assigned a score that reflects how important and contentful it is

Data-driven approaches
– Do not use any domain knowledge or external resources
– Importance “emerges” from the data
– Probabilistic models of word occurrence and sentence similarity
Sentence ranking options

Based on word probability
– S is a sentence with length n
– p_i is the probability of the i-th word in the sentence

weight(S) = \frac{1}{n} \sum_{i=1}^{n} \log(p_i)

Based on word tf.idf

weight(S) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{tf.idf}_i
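As a concrete illustration, a minimal Python sketch of both ranking options; the corpus, tokenization, and per-word tf.idf scores are illustrative assumptions, not part of the lecture.

    # Minimal sketch of the two ranking options above. The corpus,
    # tokenization, and tf.idf estimates are illustrative assumptions.
    import math
    from collections import Counter

    def unigram_probabilities(documents):
        """Estimate p(word) from a list of tokenized documents."""
        counts = Counter(tok for doc in documents for tok in doc)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def weight_by_probability(sentence, p):
        """weight(S) = (1/n) * sum_i log(p_i) for a tokenized sentence S."""
        return sum(math.log(p[w]) for w in sentence) / len(sentence)

    def weight_by_tfidf(sentence, tfidf):
        """weight(S) = (1/n) * sum_i tf.idf_i, given per-word tf.idf scores."""
        return sum(tfidf.get(w, 0.0) for w in sentence) / len(sentence)

    docs = [["the", "cat", "sat"], ["the", "dog", "ran", "fast"]]
    p = unigram_probabilities(docs)
    print(weight_by_probability(["the", "cat"], p))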
Centrality measures

How representative is a sentence of the overall content of a document?
– The more similar a sentence is to the document, the more representative it is

centrality(S_i) = \frac{1}{K} \sum_{j \neq i} \mathrm{sim}(S_i, S_j)

(K is the number of other sentences in the document)
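A small Python sketch of this score, assuming cosine similarity over bag-of-words vectors for sim (the slide leaves the similarity function unspecified):

    import math
    from collections import Counter

    def cosine(a, b):
        """Cosine similarity between two bag-of-words Counters."""
        dot = sum(cnt * b[w] for w, cnt in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def centrality(sentences, i):
        """centrality(S_i) = (1/K) * sum over j != i of sim(S_i, S_j)."""
        vecs = [Counter(s) for s in sentences]
        others = [j for j in range(len(sentences)) if j != i]
        return sum(cosine(vecs[i], vecs[j]) for j in others) / len(others)

    sents = [["cats", "purr"], ["cats", "sleep"], ["dogs", "bark"]]
    print(centrality(sents, 0))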

Data-driven approach

Unsupervised: no information about what constitutes a desirable choice

How can supervised approaches be used?
– For example, the scientific article summarization paper from last week
Rhetorical status

What is the purpose of the sentence? To communicate:
– Background
– Aim
– Basis (related work)

How can we know which sentence serves each aim?
Rhetorical zones

Distribution of categories
Selecting important sentences (relevance)

How well can it be performed by people?
– Rather subjective; depends on prior knowledge and interests
– Even the same person would select 50% different sentences if she performs the task at different times
– Still, judgments can be solicited from several people to mitigate the problem

For each sentence in an article, say whether it is important and interesting enough to be included in a summary
Annotated data

80 computational linguistics articles

Can be used to train classifiers
– Given a sentence, which rhetorical class does it belong to?
– Given a sentence, should it be included in the summary or not?
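A hedged sketch of the two classification setups; the classifier choice (logistic regression via scikit-learn) and the toy feature matrix are illustrative assumptions, not the setup used in the actual paper.

    # Illustrative sketch of the two classification tasks on toy data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy feature matrix: 4 sentences x 3 features (values are made up).
    X = np.array([[0.00, 1, 12],
                  [0.40, 0, 25],
                  [0.85, 0, 9],
                  [1.00, 1, 30]])
    y_rhetorical = np.array([0, 1, 2, 1])  # hypothetical rhetorical class ids
    y_summary = np.array([1, 0, 0, 1])     # include in the summary? (0/1)

    rhetorical_clf = LogisticRegression(max_iter=1000).fit(X, y_rhetorical)
    summary_clf = LogisticRegression(max_iter=1000).fit(X, y_summary)

    # Both classifiers also expose class probabilities per sentence.
    print(summary_clf.predict_proba(X[:1]))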
Features

Location
– Absolute location of the sentence
– Section structure: first sentence, last sentence, other
– Paragraph structure

What section the sentence appeared in
– Introduction, implementation, example, conclusion, result, evaluation, experiment, etc.

Sentence length
– Very long and very short sentences are unusual

Title word overlap

Tf.idf word content
– Binary feature
– “yes” if the sentence contains one of the 18 most important words
– “no” otherwise

Presence and type of citation

Formulaic expressions
– “in traditional approaches”, “a novel method for”
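To make this list concrete, a sketch of a per-sentence feature vector; the function name, arguments, and exact encodings are illustrative assumptions (only the 18-word tf.idf threshold comes from the slide).

    # Sketch of a per-sentence feature vector mirroring the list above.
    # Names and encodings are illustrative assumptions.
    def sentence_features(tokens, position, n_sentences,
                          title_tokens, top18_tfidf_words, has_citation):
        return {
            "abs_location": position / n_sentences,   # absolute location
            "is_first": int(position == 0),           # section structure
            "is_last": int(position == n_sentences - 1),
            "length": len(tokens),                    # sentence length
            "title_overlap": len(set(tokens) & set(title_tokens)),
            # Binary tf.idf feature: 1 iff the sentence contains one of
            # the 18 most important words.
            "tfidf_word": int(any(w in top18_tfidf_words for w in tokens)),
            "has_citation": int(has_citation),        # presence of citation
        }

    print(sentence_features(["a", "novel", "method"], 0, 20,
                            ["novel", "summarization"], {"method"}, False))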
Important lessons for us

Vector representation of sentences
– Can be words
– But can also be other features!

The probability of a sentence belonging to a class can be computed

Complex distinctions can be accurately predicted using simple features
Problems with ML for summarization

Annotation is expensive
– Here: relevance and rhetorical status judgments

People don’t agree
– So more annotators are necessary
– And/or more training of the annotators