Course overview Introduction to summarization Lecture 1

Instructor: Ani Nenkova
505 Levine,
Office hours: Tuesdays 3:15-4:15 or by appointment
TA: Annie Louis
No required text
Slides/lecture notes and handouts will be given in class
Speech and Language Processing (second edition, 2007,
Prentice-Hall), by Daniel Jurafsky and James Martin
Also see
Christopher Manning and Hinrich Schütze, “Foundations of
statistical natural language processing”
Advances in Automatic Text Summarization
Edited by Inderjeet Mani and Mark T. Maybury
5 homeworks (65%)
One will be a literature overview assignment
One will be at the end of the semester, instead of
a final
You are encouraged to form teams for the
homework (programming) assignments, but
all write-ups should be individual
Midterm (20%)
Class participation (15%)
“Submit” 5 questions each week
Late submission policy
5 late days for the semester
Can be used for any assignment with no penalty
Late submissions after “late days” have been
used up will not be graded
What you will learn
A lot about summarization and natural
language techniques used in summarization
Tools and resources
Part of speech and named entity taggers, parsers,
WordNet, WEKA
Problem formalization/distributions
Distributions: Zipfian, Binomial, Multinomial
Graph representations
System comparisons
Statistical significance and statistical tests
Reading scientific articles
Part of the assigned readings
Useful skill, regardless of your future job plans
Improving writing skills
Immensely useful, regardless of your future job plans
The literature overview assignment will focus on this, but in
other assignments the way you describe your work will also
be evaluated
What is summarization?
Columbia Newsblaster
The academic version
What is the input?
News, or clusters of news
A single article or several articles on a related topic
Email and email thread
Scientific articles
Health information: patients and doctors
Meeting summarization
What is the output?
Highlight information in the input
Chunks of text or speech directly from the input, or
paraphrase and aggregate the input in novel ways
Modality: text, speech, video, graphics
Ideal stages of summarization
Input representation and understanding
Selecting important content
Generating novel text corresponding to the gist of the input
Most current systems
Use shallow analysis methods
Rather than full understanding
Work by sentence selection
Identify important sentences and piece them
together to form a summary
Data-driven approaches
Relying on features of the input documents
that can be easily computed from statistical analysis
Word statistics
Cue phrases
Section headers
Sentence position
Knowledge-based systems
Use more sophisticated natural language processing
Discourse information
Resolve anaphora, text structure
Use external lexical resources
WordNet, adjective polarity lists, opinion
Using machine learning
What are summaries useful for?
Relevance judgments
Does this document contain information I am
interested in?
Is this document worth reading?
Save time
Reduce the need to consult the full document
Multi-document summarization
Very useful for presenting and organizing
search results
Many results are very similar, and grouping
closely related documents helps cover more
event facets
Summarizing similarities and differences between documents
Scientific article summarization
Not only what the article is about, but also
how it relates to work it cites
Determine which approaches are criticized
and which are supported
Automatic genre-specific summaries are more
useful than original paper abstracts
Other uses
Document indexing for information retrieval
Automatic essay grading, topic identification
Data-driven summarization
Frequency as indicator of importance
The topic of a document will be repeated
many times
In multi-document summarization, important
content is repeated in different sources
Greedy frequency method
Compute word probability from input
Compute sentence weight as function of
word probability
Pick best sentence
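The greedy steps above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation: the function names, the regex tokenizer, and the choice of average word probability as the sentence weight are all assumptions. After each pick, the probability of every word already covered is squared, which is one simple way (used in SumBasic-style systems) to discourage picking a near-duplicate next.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphabetic tokens are a good enough word unit.
    return re.findall(r"[a-z]+", text.lower())


def sentence_weight(sentence, prob):
    # Assumption: weight = average probability of the words in the sentence.
    toks = tokenize(sentence)
    return sum(prob[w] for w in toks) / len(toks) if toks else 0.0


def greedy_frequency_summary(sentences, max_sentences=2):
    # Step 1: word probabilities estimated from the whole input.
    words = [w for s in sentences for w in tokenize(s)]
    total = len(words)
    prob = {w: c / total for w, c in Counter(words).items()}

    remaining = list(sentences)
    summary = []
    while remaining and len(summary) < max_sentences:
        # Steps 2 and 3: score every remaining sentence, pick the best one.
        best = max(remaining, key=lambda s: sentence_weight(s, prob))
        summary.append(best)
        remaining.remove(best)
        # Assumed redundancy handling: down-weight words already covered.
        for w in set(tokenize(best)):
            prob[w] = prob[w] ** 2
    return summary
```

Calling `greedy_frequency_summary(list_of_sentences, max_sentences=3)` returns the selected sentences in the order they were picked, not in document order.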
How to deal with redundancy?
Author JK Rowling has won her legal battle in a
New York court to get an unofficial Harry Potter
encyclopaedia banned from publication.
A U.S. federal judge in Manhattan has sided with
author J.K. Rowling and ruled against the
publication of a Harry Potter encyclopedia created
by a fan of the book series.
Shallow techniques not likely to work well
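To see why shallow matching struggles here, one can measure plain word overlap between the two sentences. The sketch below uses Jaccard similarity over lowercase word sets; the measure and the tokenizer are assumptions chosen for illustration, not something the slides prescribe.

```python
import re


def word_set(sentence):
    # Assumption: lowercase alphabetic tokens, no stemming or stop-word removal.
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    # Size of the shared vocabulary divided by size of the combined vocabulary.
    wa, wb = word_set(a), word_set(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


first = ("Author JK Rowling has won her legal battle in a New York court "
         "to get an unofficial Harry Potter encyclopaedia banned from publication.")
second = ("A U.S. federal judge in Manhattan has sided with author J.K. Rowling "
          "and ruled against the publication of a Harry Potter encyclopedia "
          "created by a fan of the book series.")

overlap = jaccard(first, second)
print(f"word overlap: {overlap:.2f}")
```

The two sentences report the same event, yet spelling variants ("encyclopaedia" vs. "encyclopedia", "JK" vs. "J.K.") and paraphrase ("won her legal battle" vs. "sided with") keep the surface overlap low.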
Global optimization for content
What is the best summary? vs What is the
best sentence?
Form all summaries and choose the best
What is the problem with this approach?
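One way to get a feel for the cost of forming all summaries is to count them. The sketch below assumes a summary is an unordered set of exactly `summary_size` sentences; the function name and the example sizes are illustrative.

```python
import math


def count_candidate_summaries(num_sentences, summary_size):
    # Number of distinct unordered subsets of the given size.
    return math.comb(num_sentences, summary_size)


for n in (10, 50, 100):
    print(n, "input sentences ->", count_candidate_summaries(n, 5), "candidate summaries")
```

The count grows combinatorially with input size, so exhaustive search quickly becomes impractical and some form of approximation is needed.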
Sentence clustering for theme
1. PAL was devastated by a pilots' strike in June and
by the region's currency crisis.
2. In June, PAL was embroiled in a crippling three-week
pilots' strike.
3. Tan wants to retain the 200 pilots because they
stood by him when the majority of PAL's pilots
staged a devastating strike in June.
Cluster sentences from the input into groups of similar sentences (themes)
Choose one sentence to represent a theme
Consider bigger themes as more important
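The three steps above can be sketched with a single-pass clustering over word overlap. Everything specific here is an assumption made for illustration: the Jaccard measure, the 0.5 threshold, comparing against the first sentence of each cluster, and using that first sentence as the theme's representative.

```python
import re


def tokens(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_themes(sentences, threshold=0.5):
    # Each cluster is a list of sentence indices; cluster[0] is its seed.
    clusters = []
    for i, sentence in enumerate(sentences):
        words = tokens(sentence)
        for cluster in clusters:
            if jaccard(words, tokens(sentences[cluster[0]])) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar theme found: start a new one
    # Bigger themes are treated as more important.
    clusters.sort(key=len, reverse=True)
    return clusters


def theme_summary(sentences, threshold=0.5, max_themes=2):
    # One representative sentence (the seed) per theme, largest themes first.
    clusters = cluster_themes(sentences, threshold)
    return [sentences[c[0]] for c in clusters[:max_themes]]
```

On paraphrased news sentences like the PAL examples, raw word overlap is often too low for a threshold this strict, so a real system would likely need a richer similarity measure.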
Using graph representations
Nodes: sentences, discourse entities
Edges: between similar sentences,
between related entities
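A small sketch of one possible graph representation: sentences are nodes, and an edge links two sentences whose word overlap passes a threshold. The similarity measure, the threshold, and the use of node degree as an importance score are assumptions for illustration; the slides do not commit to a specific scoring method.

```python
import re
from itertools import combinations


def tokens(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))


def build_sentence_graph(sentences, threshold=0.2):
    # Adjacency sets keyed by sentence index.
    graph = {i: set() for i in range(len(sentences))}
    bags = [tokens(s) for s in sentences]
    for i, j in combinations(range(len(sentences)), 2):
        union = bags[i] | bags[j]
        similarity = len(bags[i] & bags[j]) / len(union) if union else 0.0
        if similarity >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def rank_by_degree(graph):
    # Assumption: a sentence similar to many others is more central.
    return sorted(graph, key=lambda node: len(graph[node]), reverse=True)
```

The same structure extends to entity nodes by adding edges between related entities, which is not shown here.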
Using machine learning
Ask people to select sentences
Use these as training examples for machine learning
Each sentence is represented as a number of features
Based on the features distinguish sentences that
are appropriate for a summary and sentences that
are not
Run on new inputs
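The pipeline above (human labels, features, a classifier, then new inputs) can be sketched with a tiny perceptron in plain Python. The two features (sentence position and capped length), the perceptron itself, and all names are assumptions picked to keep the example self-contained; a course assignment would more likely use a toolkit such as WEKA with a richer feature set.

```python
def extract_features(sentence, position, total):
    # Hypothetical feature set: bias, position in document, normalized length.
    words = sentence.split()
    return [
        1.0,                                   # bias term
        1.0 - position / max(total - 1, 1),    # earlier sentences score higher
        min(len(words), 30) / 30.0,            # length, capped at 30 words
    ]


def predict(weights, features):
    # 1 = include in summary, 0 = leave out.
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0


def train_perceptron(examples, epochs=20):
    # examples: list of (feature_vector, label) pairs, labels from annotators.
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, features)
            if error:
                weights = [w + error * x for w, x in zip(weights, features)]
    return weights
```

To run on a new document, compute `extract_features` for each sentence and keep those for which `predict` returns 1.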
Information ordering
In what order to present the selected sentences?
An article with permuted sentences will not be
easy to understand
Very important for multi-document summarization
Sentences coming from different documents
Automatic summary edits
Some expressions might not be appropriate
in the new context
– Putin
– Russian Prime Minister Vladimir Putin
Discourse connectives
However, moreover, subsequently
Requires more sophisticated NLP techniques
Pinochet was placed under arrest in London Friday by
British police acting on a warrant issued by a Spanish
judge. Pinochet has immunity from prosecution in
Chile as a senator-for-life under a new constitution that
his government crafted. Pinochet was detained in the
London clinic while recovering from back surgery.
Gen. Augusto Pinochet, the former Chilean dictator,
was placed under arrest in London Friday by British
police acting on a warrant issued by a Spanish
judge. Pinochet has immunity from prosecution in
Chile as a senator-for-life under a new constitution
that his government crafted. Pinochet was detained
in the London clinic while recovering from back surgery.
Turkey has been trying to form a new government
since a coalition government led by Yilmaz collapsed
last month over allegations that he rigged the sale of
a bank. Ecevit refused even to consult with the
leader of the Virtue Party during his efforts to form a
government. Ecevit must now try to build a
government. Demirel consulted Turkey's party
leaders immediately after Ecevit gave up.
Turkey has been trying to form a new government
since a coalition government led by Prime Minister
Mesut Yilmaz collapsed last month over allegations
that he rigged the sale of a bank. Premier-designate
Bulent Ecevit refused even to consult with the leader
of the Virtue Party during his efforts to form a
government. Ecevit must now try to build a
government. President Suleyman Demirel consulted
Turkey's party leaders immediately after Ecevit gave up.
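The rewrites in the Pinochet and Turkey examples follow a simple pattern: the first mention of a person gets the full description, later mentions get the short name. The sketch below is a naive illustration of that pattern only. It assumes the short and full forms are supplied by hand in a hypothetical `full_forms` mapping and uses plain substring matching; a real system would need named entity recognition and coreference resolution.

```python
def rewrite_first_mentions(sentences, full_forms):
    # full_forms maps a short name to the full description for first mention.
    seen = set()
    rewritten = []
    for sentence in sentences:
        for short, full in full_forms.items():
            if short not in sentence:
                continue
            if short not in seen:
                # First mention: expand to the full form unless already there.
                if full not in sentence:
                    sentence = sentence.replace(short, full, 1)
                seen.add(short)
            elif full in sentence:
                # Later mention: shorten a repeated full form.
                sentence = sentence.replace(full, short)
        rewritten.append(sentence)
    return rewritten


summary = [
    "Pinochet was placed under arrest in London Friday.",
    "Pinochet has immunity from prosecution in Chile.",
]
names = {"Pinochet": "Gen. Augusto Pinochet, the former Chilean dictator,"}
for line in rewrite_first_mentions(summary, names):
    print(line)
```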