Approaches to automatic summarization
Lecture 5
Types of summaries

Extracts
– Sentences from the original document are displayed together to form a summary
Abstracts
– Material is transformed: paraphrased, restructured, shortened
Extractive summarization

Each sentence is assigned a score that reflects how important and contentful it is

Data-driven approaches
– Do not use any domain knowledge or external resources
– Importance “emerges” from the data
– Probabilistic models of word occurrence and sentence similarity
Sentence ranking options

Based on word probability
– S is a sentence with length n
– p_i is the probability of the i-th word in the sentence

weight(S) = \frac{1}{n} \sum_{i=1}^{n} \log(p_i)

Based on word tf.idf

weight(S) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{tf.idf}_i
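As a concrete illustration, a minimal Python sketch of both ranking options; the corpus, tokenization, and per-word tf.idf scores are illustrative assumptions, not part of the lecture.

    # Minimal sketch of the two ranking options above. The corpus,
    # tokenization, and tf.idf estimates are illustrative assumptions.
    import math
    from collections import Counter

    def unigram_probabilities(documents):
        """Estimate p(word) from a list of tokenized documents."""
        counts = Counter(tok for doc in documents for tok in doc)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def weight_by_probability(sentence, p):
        """weight(S) = (1/n) * sum_i log(p_i) for a tokenized sentence S."""
        return sum(math.log(p[w]) for w in sentence) / len(sentence)

    def weight_by_tfidf(sentence, tfidf):
        """weight(S) = (1/n) * sum_i tf.idf_i, given per-word tf.idf scores."""
        return sum(tfidf.get(w, 0.0) for w in sentence) / len(sentence)

    docs = [["the", "cat", "sat"], ["the", "dog", "ran", "fast"]]
    p = unigram_probabilities(docs)
    print(weight_by_probability(["the", "cat"], p))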
Centrality measures

How representative is a sentence of the overall content of a document?
– The more similar a sentence is to the document, the more representative it is

centrality(S_i) = \frac{1}{K} \sum_{j \neq i} \mathrm{sim}(S_i, S_j)

(K is the number of other sentences in the document)
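A small Python sketch of this score, assuming cosine similarity over bag-of-words vectors for sim (the slide leaves the similarity function unspecified):

    import math
    from collections import Counter

    def cosine(a, b):
        """Cosine similarity between two bag-of-words Counters."""
        dot = sum(cnt * b[w] for w, cnt in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def centrality(sentences, i):
        """centrality(S_i) = (1/K) * sum over j != i of sim(S_i, S_j)."""
        vecs = [Counter(s) for s in sentences]
        others = [j for j in range(len(sentences)) if j != i]
        return sum(cosine(vecs[i], vecs[j]) for j in others) / len(others)

    sents = [["cats", "purr"], ["cats", "sleep"], ["dogs", "bark"]]
    print(centrality(sents, 0))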

Data-driven approach

Unsupervised: no information about what constitutes a desirable choice

How can supervised approaches be used?
– For example, the scientific article summarization paper from last week
Rhetorical status

What is the purpose of the sentence? To communicate:
– Background
– Aim
– Basis (related work)

How can we know which sentence serves each aim?
Rhetorical zones

Distribution of categories
Selecting important sentences (relevance)

How well can it be performed by people?
– Rather subjective; depends on prior knowledge and interests
– Even the same person would select 50% different sentences if she performs the task at different times
– Still, judgments can be solicited from several people to mitigate the problem

For each sentence in an article, say whether it is important and interesting enough to be included in a summary
Annotated data

80 computational linguistics articles

Can be used to train classifiers
– Given a sentence, which rhetorical class does it belong to?
– Given a sentence, should it be included in the summary or not?
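A hedged sketch of the two classification setups; the classifier choice (logistic regression via scikit-learn) and the toy feature matrix are illustrative assumptions, not the setup used in the actual paper.

    # Illustrative sketch of the two classification tasks on toy data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy feature matrix: 4 sentences x 3 features (values are made up).
    X = np.array([[0.00, 1, 12],
                  [0.40, 0, 25],
                  [0.85, 0, 9],
                  [1.00, 1, 30]])
    y_rhetorical = np.array([0, 1, 2, 1])  # hypothetical rhetorical class ids
    y_summary = np.array([1, 0, 0, 1])     # include in the summary? (0/1)

    rhetorical_clf = LogisticRegression(max_iter=1000).fit(X, y_rhetorical)
    summary_clf = LogisticRegression(max_iter=1000).fit(X, y_summary)

    # Both classifiers also expose class probabilities per sentence.
    print(summary_clf.predict_proba(X[:1]))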
Features

Location
– Absolute location of the sentence
– Section structure: first sentence, last sentence, other
– Paragraph structure

What section the sentence appeared in
– Introduction, implementation, example, conclusion, result, evaluation, experiment, etc.

Sentence length
– Very long and very short sentences are unusual

Title word overlap

Tf.idf word content
– Binary feature
– “yes” if the sentence contains one of the 18 most important words
– “no” otherwise

Presence and type of citation

Formulaic expressions
– “in traditional approaches”, “a novel method for”
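To make this list concrete, a sketch of a per-sentence feature vector; the function name, arguments, and exact encodings are illustrative assumptions (only the 18-word tf.idf threshold comes from the slide).

    # Sketch of a per-sentence feature vector mirroring the list above.
    # Names and encodings are illustrative assumptions.
    def sentence_features(tokens, position, n_sentences,
                          title_tokens, top18_tfidf_words, has_citation):
        return {
            "abs_location": position / n_sentences,   # absolute location
            "is_first": int(position == 0),           # section structure
            "is_last": int(position == n_sentences - 1),
            "length": len(tokens),                    # sentence length
            "title_overlap": len(set(tokens) & set(title_tokens)),
            # Binary tf.idf feature: 1 iff the sentence contains one of
            # the 18 most important words.
            "tfidf_word": int(any(w in top18_tfidf_words for w in tokens)),
            "has_citation": int(has_citation),        # presence of citation
        }

    print(sentence_features(["a", "novel", "method"], 0, 20,
                            ["novel", "summarization"], {"method"}, False))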
Important lessons for us

Vector representation of sentences
– Can be words
– But can also be other features!

The probability of a sentence belonging to a class can be computed

Complex distinctions can be accurately predicted using simple features
Problems with ML for summarization

Annotation is expensive
– Here: relevance and rhetorical status judgments

People don’t agree
– So more annotators are necessary
– And/or more training of the annotators