Processing of large document collections
Fall 2002, Part 2

Outline

1. Term selection: information gain
2. Character code issues
3. Text summarization
1. Term selection

- a large document collection may contain millions of words -> document vectors would have millions of dimensions
- many algorithms cannot handle such a high dimensionality of the term space (= large number of terms)
- usually only a subset of the terms is used
- how to select the terms that are used?
  - term selection (often called feature selection or dimensionality reduction) methods
Term selection: information gain

- information gain measures the (number of bits of) information obtained for category prediction by knowing the presence or absence of a term in a document
- information gain is calculated for each term, and the highest-scoring n terms are selected
Term selection: IG

- information gain for a term t:

  G(t) = -sum_{i=1..m} P(c_i) log P(c_i)
         + P(t) sum_{i=1..m} P(c_i | t) log P(c_i | t)
         + P(~t) sum_{i=1..m} P(c_i | ~t) log P(c_i | ~t)
Estimating probabilities

- 2 classes: c and ~c

  Doc 1: cat cat cat (c)
  Doc 2: cat cat cat dog (c)
  Doc 3: cat dog mouse (~c)
  Doc 4: cat cat cat dog dog dog (~c)
  Doc 5: mouse (~c)
Term selection: estimating probabilities

- P(t): probability of a term t
  - P(cat) = 4/5 ('cat' occurs in 4 docs of 5), or
  - P(cat) = 10/17 (the proportion of occurrences of 'cat' among all term occurrences)
Term selection: estimating probabilities

- P(~t): probability of the absence of t
  - P(~cat) = 1/5, or
  - P(~cat) = 7/17
Term selection: estimating probabilities

- P(ci): probability of category i
  - P(c) = 2/5 (the proportion of documents belonging to c in the collection), or
  - P(c) = 7/17 (7 of the 17 term occurrences are in documents belonging to c)
Term selection: estimating probabilities

- P(ci | t): probability of category i if t is in the document; i.e., which proportion of the documents where t occurs belong to category i
  - P(c | cat) = 2/4 (or 6/10)
  - P(~c | cat) = 2/4 (or 4/10)
  - P(c | mouse) = 0
  - P(~c | mouse) = 1
Term selection: estimating probabilities

- P(ci | ~t): probability of category i if t is not in the document; i.e., which proportion of the documents where t does not occur belong to category i
  - P(c | ~cat) = 0 (or 1/7)
  - P(c | ~dog) = 1/2 (or 6/12)
  - P(c | ~mouse) = 2/3 (or 7/15)
Term selection: estimating probabilities

- in other words, let
  - term t occur in B documents, A of which are in category c
  - category c have D documents, out of N documents in the whole collection
Term selection: estimating probabilities

- for instance,
  - P(t) = B/N
  - P(~t) = (N-B)/N
  - P(c) = D/N
  - P(c|t) = A/B
  - P(c|~t) = (D-A)/(N-B)
Term selection: IG

- information gain for a term t:

  G(t) = -sum_{i=1..m} P(c_i) log P(c_i)
         + P(t) sum_{i=1..m} P(c_i | t) log P(c_i | t)
         + P(~t) sum_{i=1..m} P(c_i | ~t) log P(c_i | ~t)

- for the example collection:
  - G(cat) = -0.40
  - G(dog) = -0.38
  - G(mouse) = -0.01
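The computation can be sketched in Python for the toy collection above. This is a sketch under stated assumptions: base-2 logarithms and document-frequency probability estimates, so the absolute values differ from the slide's figures (which evidently use a different log base or normalisation); in both cases, however, 'mouse' ranks as the most informative term, since its presence perfectly predicts class ~c.

```python
import math

# Toy collection from the slides: 2 classes, c and ~c.
docs = [
    ("cat cat cat", "c"),
    ("cat cat cat dog", "c"),
    ("cat dog mouse", "~c"),
    ("cat cat cat dog dog dog", "~c"),
    ("mouse", "~c"),
]

def plogp(p):
    """p * log2(p), with the convention 0 * log 0 = 0."""
    return p * math.log2(p) if p > 0 else 0.0

def information_gain(term):
    n = len(docs)
    classes = sorted({c for _, c in docs})
    with_t = [c for text, c in docs if term in text.split()]
    without_t = [c for text, c in docs if term not in text.split()]
    # -sum_i P(ci) log P(ci)
    g = -sum(plogp(sum(c == ci for _, c in docs) / n) for ci in classes)
    # + P(t) sum_i P(ci|t) log P(ci|t)
    if with_t:
        g += (len(with_t) / n) * sum(
            plogp(with_t.count(ci) / len(with_t)) for ci in classes)
    # + P(~t) sum_i P(ci|~t) log P(ci|~t)
    if without_t:
        g += (len(without_t) / n) * sum(
            plogp(without_t.count(ci) / len(without_t)) for ci in classes)
    return g

for term in ("cat", "dog", "mouse"):
    print(term, round(information_gain(term), 3))
```

With these estimates, G(mouse) ≈ 0.42 is the largest gain, matching the slide's ranking of 'mouse' as the best term.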
2. Character code issues

- abstract character vs. its graphical representation (glyph, font)
- abstract characters are grouped into alphabets
  - each alphabet forms the basis of the written form of a certain language or a set of languages
Character codes

- for instance
  - for English:
    - uppercase letters A-Z
    - lowercase letters a-z
    - punctuation marks
    - digits 0-9
    - common symbols: +, =
  - ideographic symbols of Chinese and Japanese
  - phonetic letters of Western languages
Some terminology

- character repertoire (merkkivalikoima)
  - a set of distinct characters, an alphabet
  - no internal presentation, ordering etc. assumed
  - usually defined by specifying names of characters and a sample presentation of the characters in visible form
  - a repertoire may contain characters which look the same (in some presentations) but are logically distinct
Some terminology

- character code (merkkikoodi)
  - a mapping which defines a one-to-one correspondence between the characters in a character repertoire and a set of nonnegative integers
  - each character is assigned a unique code position (code number, code value, code element, code point, code set value, code)
  - the set of codes often has "holes"
Some terminology

- character encoding (merkkikoodaus)
  - an algorithm for presenting characters in digital form by mapping sequences of code numbers into sequences of octets (= bytes)
  - in the simplest case, each character is mapped to an integer in the range 0-255 according to a character code, and these are used as octets
    - works only for character repertoires with at most 256 characters
    - for larger sets, more complicated encodings are needed
Character codes

- in English:
  - 26 letters in both lower- and uppercase
  - ten digits + some punctuation marks
- in Russian: Cyrillic letters
  - both could use the same set of code points (if not a bilingual document)
- in Japanese: could be over 6,000 characters
Character codes: standards

- character codes can be arbitrary, but in practice standardization is needed for interoperability (between computers, programs, ...)
- early standards were designed for English only, or for a small group of languages at a time
Character codes: standards

- ASCII
- ISO 8859 (e.g. ISO Latin 1)
- Unicode
- UTF-8, UTF-16
ASCII

- American Standard Code for Information Interchange
- a seven-bit code -> 128 code positions
- actually only 95 printable characters
  - code positions 0-31 and 127 are assigned to control characters (mostly outdated)
- the ISO 646 (1972) version of ASCII incorporated several national variants (accented letters and currency symbols)
  - e.g. @[\]{|} replaced
ASCII

- with 7 bits, the set of code points is too small for anything else than American English
- solution:
  - 8 bits brings more code points (256)
  - the ASCII character repertoire is mapped to the values 0-127
  - additional symbols are mapped to the other values
Extended ASCII

- problems:
  - different manufacturers each developed their own 8-bit extensions to ASCII
    - different character repertoires -> translation between them is not always possible
  - also, 256 code values is not enough to represent all the alphabets -> different variants for different languages
ISO 8859

- standardization of 8-bit character sets
- in the 80's the multipart standard ISO 8859 was produced
- it defines a collection of 8-bit character sets, each designed for a group of languages
- the first part: ISO 8859-1 (ISO Latin 1)
  - covers most Western European languages
  - 0-127: identical to ASCII; 128-159: (mostly) unused; 96 code values for accented letters and symbols
"Safe" ASCII subset

- due to the national variants, only the following characters can be regarded as "safe" in data transmission:
  - A-Z, a-z, 0-9
  - !"%&'()*+,-./:;<=>?
Unicode

- 256 code positions is not enough
  - for ideographically represented languages (Chinese, Japanese, ...)
  - for simultaneous use of several languages
- solution: more than one byte for each code value
  - a 16-bit character set has 65,536 code positions
Unicode

- 16-bit character set; 65,536 code positions
- not sufficient to give all the characters required by the Chinese, Japanese, and Korean scripts distinct positions
  - CJK consolidation: characters of these scripts are given the same value if they look the same
Unicode

- code values for all the characters used to write contemporary 'major' languages
  - also the classical forms of some languages
  - Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan
  - Chinese, Japanese, and Korean ideograms, and the Japanese and Korean phonetic and syllabic scripts
Unicode

- punctuation marks
- technical and mathematical symbols
- arrows
- dingbats (pointing hands, stars, ...)
- both accented letters and separate diacritical marks (accents, tildes, ...) are included, with a mechanism for building composite characters
  - can also create problems: two characters that look the same may have different code values
  - -> normalization may be necessary
Unicode

- code values for nearly 39,000 symbols are provided
- some part is reserved for an expansion method (see later)
- 6,400 code points are reserved for private use
  - they will never be assigned to any character by the standard, so they will not conflict with the standard
Unicode: encodings

- the "native" Unicode encoding is UCS-2
  - presents each code number as two consecutive octets m and n
    - code number = 256m + n (= a 2-byte integer)
  - can be inefficient
    - for text containing ISO Latin 1 characters only, the length of the Unicode-encoded sequence is twice the length of the ISO 8859-1 encoding
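The 256m + n arithmetic can be checked directly. A small sketch, using Python's "utf-16-be" codec as a stand-in for UCS-2 (the two coincide for characters in this range):

```python
# Each UCS-2 code number is two octets m and n, code number = 256*m + n.
text = "Aä"                       # 'A' = U+0041, 'ä' = U+00E4
octets = text.encode("utf-16-be")
assert list(octets) == [0x00, 0x41, 0x00, 0xE4]

# Recover the code numbers with 256*m + n:
codes = [256 * m + n for m, n in zip(octets[0::2], octets[1::2])]
assert codes == [ord("A"), ord("ä")]  # [65, 228]

# ISO 8859-1 needs only one octet per character -> UCS-2 doubles the length:
assert len(octets) == 2 * len(text.encode("iso-8859-1"))
```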
Unicode: encodings

- UTF-8
  - ASCII code values are likely to be more common in most text than any other values
    - in UTF-8 encoding, ASCII characters are sent as themselves (high-order bit 0)
    - other characters are encoded using 2-6 bytes (high-order bit set to 1)
Unicode: encodings

- UTF-16: expansion method
  - two 16-bit values are combined into a 32-bit value -> about a million additional characters available
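The variable-length behaviour of UTF-8 described above is easy to observe (note that UTF-8 as used today is capped at 4 bytes per character, although the original design allowed up to 6):

```python
# ASCII characters are encoded as themselves (one octet, high bit 0);
# other characters take several octets, all with the high bit set.
samples = {"A": 1, "ä": 2, "€": 3, "𝄞": 4}   # code points of increasing size
for ch, nbytes in samples.items():
    assert len(ch.encode("utf-8")) == nbytes

# The high-order bit distinguishes ASCII from multi-byte sequences:
assert all(b < 0x80 for b in "plain ASCII".encode("utf-8"))
assert all(b >= 0x80 for b in "ä".encode("utf-8"))
```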
Use of character codes

- try to use character codes logically
  - don't choose a character just because it looks right
- inform applications of the encoding used
  - MIME headers, XML/HTML document declarations
  - should be the responsibility of the authoring applications... but...
3. Text summarization

- "Process of distilling the most important information from a source to produce an abridged version for a particular user or task"
Text summarization

- many everyday uses:
  - headlines (from around the world)
  - outlines (notes for students)
  - minutes (of a meeting)
  - reviews (of books, movies)
  - ...
Architecture of a text summarization system

- input:
  - a single document or multiple documents
  - text, images, audio, video
  - database
Architecture of a text summarization system

- output:
  - extract or abstract
  - compression rate
    - ratio of summary length to source length
  - connected text or fragmentary
  - generic or user-focused/domain-specific
  - indicative or informative
Architecture of a text summarization system

- three phases:
  - analyzing the input text
  - transforming it into a summary representation
  - synthesizing an appropriate output form
Condensation operations

- selection of more salient (= "keskeinen", "essential") or non-redundant information
- aggregation of information (e.g. from different parts of the source, or of different linguistic descriptions)
- generalization of specific information with more general, abstract information
The level of processing

- surface level
- discourse level
Surface-level approaches

- tend to represent information in terms of shallow features
- the features are then selectively combined to yield a salience function used to extract information
Surface level

- shallow features
  - thematic features
    - presence of statistically salient terms, based on term frequency statistics
  - location
    - position in text, position in paragraph, section depth, particular sections
  - background
    - presence of terms from the title or headings in the text, or from the user's query
Surface level

- cue words and phrases
  - "in summary", "our investigation"
  - emphasizers like "important", "in particular"
  - domain-specific bonus (+) and stigma (-) terms
Discourse-level approaches

- model the global structure of the text and its relation to communicative goals
- the structure can include:
  - format of the document (e.g. hypertext markup)
  - threads of topics as they are revealed in the text
  - rhetorical structure of the text, such as argumentation or narrative structure
Classical approaches

- Luhn '58
- Edmundson '69
- general idea:
  - give a score to each sentence
  - choose the sentences with the highest score to be included in the summary
Luhn's method

- filter terms in the document using a stoplist
- terms are normalized by combining orthographically similar terms
  - differentiate, different, differently, difference -> differen
- frequencies of the combined terms are calculated and non-frequent terms are removed
  - -> "significant" terms remain
Luhn's method

- sentences are weighted using the resulting set of "significant" terms and a term density measure:
  - each sentence is divided into segments bracketed by significant terms not more than 4 non-significant terms apart
  - each segment is scored by taking the square of the number of bracketed significant terms divided by the total number of bracketed terms
Exercise (CNN News)

- let {13, computer, servers, Internet, traffic, attack, officials, said} be significant words
- "Nine of the 13 computer servers that manage global Internet traffic were crippled by a powerful electronic attack this week, officials said."
Exercise (CNN News)

- let {13, computer, servers, Internet, traffic, attack, officials, said} be significant words
- * * * [13 computer servers * * * Internet traffic] * * * * * * [attack * * officials said]
Exercise (CNN News)

- [13 computer servers * * * Internet traffic]
  - score: 5^2 / 8 = 25/8 = 3.1
- [attack * * officials said]
  - score: 3^2 / 5 = 9/5 = 1.8
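The segment scoring above can be sketched in Python. A minimal implementation of Luhn's term-density measure; the tokenizer is simplified (punctuation stripped by hand in the example):

```python
def luhn_sentence_score(words, significant, gap=4):
    """Luhn's term-density measure: bracket runs of significant words at
    most `gap` non-significant words apart, score each segment as
    (#significant)^2 / segment length, and return the best segment score."""
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    # Split wherever consecutive significant words are more than `gap` apart.
    segments, start = [], positions[0]
    for prev, cur in zip(positions, positions[1:]):
        if cur - prev - 1 > gap:
            segments.append((start, prev))
            start = cur
    segments.append((start, positions[-1]))
    best = 0.0
    for lo, hi in segments:
        sig = sum(1 for w in words[lo:hi + 1] if w in significant)
        best = max(best, sig ** 2 / (hi - lo + 1))
    return best

significant = {"13", "computer", "servers", "Internet", "traffic",
               "attack", "officials", "said"}
sentence = ("Nine of the 13 computer servers that manage global Internet "
            "traffic were crippled by a powerful electronic attack this "
            "week officials said").split()
print(luhn_sentence_score(sentence, significant))
```

This reproduces the exercise: the first segment scores 25/8 = 3.125, the second 9/5 = 1.8, and the higher value becomes the sentence score.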
Luhn's method

- the score of the highest-scoring segment is taken as the sentence score
- the highest-scoring sentences are chosen for the summary
  - a cutoff value is given
"Modern" application

- text summarization of web pages on handheld devices (Buyukkokten, Garcia-Molina, Paepcke; 2001)
- macro-level summarization
- micro-level summarization
Web page summarization

- macro-level summarization
  - the web page is partitioned into 'Semantic Textual Units' (STUs)
    - paragraphs, lists, alt texts (for images)
  - a hierarchy of STUs is identified
    - list - list item, table - table row
  - nested STUs are hidden
Web page summarization

- micro-level summarization: 5 methods tested for displaying STUs in several states
  - incremental: 1) the first line, 2) the first three lines, 3) the whole STU
  - all: the whole STU in a single state
  - keywords: 1) important keywords, 2) the first three lines, 3) the whole STU
Web page summarization

  - summary: 1) the STU's 'most significant' sentence is displayed, 2) the whole STU
  - keyword/summary: 1) keywords, 2) the STU's 'most significant' sentence, 3) the whole STU
- the combination of keywords and a summary has given the best performance for discovery tasks on web pages
Web page summarization

- extracting summary sentences
  - sentences are scored using a variant of Luhn's method:
    - words are TF*IDF weighted; given a weight cutoff value, the high-scoring words are selected as significant words
    - weight of a segment: sum of the weights of its significant words divided by the total number of words within the segment
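The TF*IDF variant can be sketched as follows. This is a sketch under assumptions: the STU partitioning itself is not reproduced, the example STUs are invented, and IDF is computed over the given segments with natural logarithms:

```python
import math
from collections import Counter

def score_segments(segments, cutoff):
    """Variant of Luhn's measure used for STUs: words are TF*IDF weighted,
    words above the weight cutoff count as significant, and a segment's
    score is the sum of its significant words' weights divided by the
    number of words in the segment."""
    n = len(segments)
    df = Counter(w for seg in segments for w in set(seg))
    scores = []
    for seg in segments:
        tf = Counter(seg)
        weights = {w: tf[w] * math.log(n / df[w]) for w in tf}
        significant = sum(wt for wt in weights.values() if wt > cutoff)
        scores.append(significant / len(seg))
    return scores

# Hypothetical STUs:
stus = [
    "the servers were crippled by the electronic attack".split(),
    "the officials said the attack was serious".split(),
    "the weather was nice".split(),
]
scores = score_segments(stus, cutoff=0.5)
```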
Edmundson's method

- extends earlier work to look at three features in addition to word frequencies:
  - cue phrases (e.g. "significant", "impossible", "hardly")
  - title and heading words
  - location
Edmundson's method

- programs to weight sentences based on each of the four features
  - weight of a sentence = the sum of the weights for the features
- the programs were evaluated by comparison against manually created extracts
- corpus-based methodology: training set and test set
  - in the training phase, the weights were manually readjusted
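The weighted sum can be sketched as a tiny linear combination. The feature weights here are hypothetical; in Edmundson's experiments they were adjusted manually against the training extracts:

```python
# Hypothetical feature weights (tuned by hand in the original method):
WEIGHTS = {"cue": 1.0, "key": 0.5, "title": 1.0, "location": 1.5}

def edmundson_score(features):
    """Weight of a sentence = weighted sum of its four feature scores."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

# s1: contains a cue word, 2 keywords, and sits in a favoured location;
# s2: keywords only.
s1 = {"cue": 1, "key": 2, "title": 0, "location": 1}
s2 = {"cue": 0, "key": 3, "title": 0, "location": 0}
print(edmundson_score(s1), edmundson_score(s2))
```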
Edmundson's method

- results:
  - the three additional features dominated the word frequency measures
  - the combination cue-title-location was the best, with location being the best individual feature
  - keywords alone was the worst
Fundamental issues

- What are the most powerful but also most general features to exploit for summarization?
- How do we combine these features?
- How can we evaluate how well we are doing?
Corpus-based approaches

- in the classical methods, various features (thematic features, title, location, cue phrase) were used to determine the salience of information for summarization
- an obvious issue: determining the relative contribution of different features to any given text summarization task
Corpus-based approaches

- the contribution is dependent on the text genre, e.g. location:
  - in newspaper stories, the leading text often contains a summary
  - in TV news, a preview segment may contain a summary of the news to come
  - in scientific text: an author-written abstract
Corpus-based approaches

- the importance of different text features for any given summarization problem can be determined by counting the occurrences of such features in text corpora
- in particular, analysis of human-generated summaries, along with their full-text sources, can be used to learn rules for summarization
Corpus-based approaches

- challenges
  - creating a suitable text corpus, designing an annotation scheme
  - ensuring that a suitable set of summaries is available
    - may already be available: scientific papers
    - if not: author, professional abstractor, judge
KPC method

- Kupiec, Pedersen, Chen (1995): A Trainable Document Summarizer
- a learning method using a corpus of abstracts written by professional human abstractors (Engineering Information Co.)
- a naive Bayesian classification method is used
KPC method: general idea

- training phase:
  - select a set of features
  - calculate the probability of each feature value appearing in a summary sentence
    - using a training corpus (e.g. originals + manual summaries)
KPC method: general idea

- when a new document is summarized:
  - for each sentence
    - find the values of the features
    - calculate the probability of this feature value combination appearing in a summary sentence
  - choose the n best-scoring sentences
KPC method: features

- sentence-length cut-off feature
  - given a threshold (e.g. 5 words), the feature is true for all sentences longer than the threshold, and false otherwise
    - F1(s) = 0, if sentence s has 5 or fewer words
    - F1(s) = 1, if sentence s has more than 5 words
KPC method: features

- paragraph feature
  - sentences in the first 10 paragraphs and the last 5 paragraphs of a document get a higher value
  - within paragraphs, paragraph-initial, paragraph-final, and paragraph-medial sentences are distinguished
KPC method: features

- paragraph feature
  - F2(s) = i, if sentence s is the first sentence in a paragraph
  - F2(s) = f, if there are at least 2 sentences in a paragraph, and s is the last one
  - F2(s) = m, if there are at least 3 sentences in a paragraph, and s is neither the first nor the last sentence
KPC method: features

- thematic word feature
  - a small number of thematic words (the most frequent content words) are selected
  - each sentence is scored as a function of the frequency of the thematic words
  - the highest-scoring sentences are selected
  - binary feature: the feature is true for a sentence if the sentence is present in the set of highest-scoring sentences
KPC method: features

- fixed-phrase feature
  - this feature is true for sentences that contain any of 26 indicator phrases (e.g. "this letter...", "In conclusion..."), or that follow a section head containing specific keywords (e.g. "results", "conclusion")
KPC method: features

- uppercase word feature
  - proper names and explanatory text for acronyms are usually important
  - the feature is computed like the thematic word feature
  - an uppercase thematic word
    - is not sentence-initial, begins with a capital letter, and must occur several times
    - its first occurrence is scored twice as much as later occurrences
Exercise (CNN news)

- sentence-length; F1: let threshold = 14
  - < 14 words: F1(s) = 0, else F1(s) = 1
- paragraph; F2:
  - i = first, f = last, m = medial
- thematic-words; F3:
  - score: how many thematic words a sentence has
  - F3(s) = 0 if score > 3, else F3(s) = 1
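Two of the exercise's features can be sketched directly (F3's direction follows the slide as given: 0 when the thematic-word count exceeds the cutoff), reusing the earlier CNN sentence:

```python
def f1_sentence_length(words, threshold=14):
    """Sentence-length cut-off feature: 1 if longer than the threshold."""
    return 1 if len(words) > threshold else 0

def f3_thematic(words, thematic, cutoff=3):
    """Thematic word feature as defined in the exercise:
    F3 = 0 if the sentence has more than `cutoff` thematic words, else 1."""
    score = sum(1 for w in words if w in thematic)
    return 0 if score > cutoff else 1

thematic = {"13", "computer", "servers", "Internet", "traffic",
            "attack", "officials", "said"}
sentence = ("Nine of the 13 computer servers that manage global Internet "
            "traffic were crippled by a powerful electronic attack this "
            "week officials said").split()

print(f1_sentence_length(sentence))        # 22 words, above the threshold
print(f3_thematic(sentence, thematic))     # 8 thematic words, above the cutoff
```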
KPC method: classifier

- for each sentence s, we compute the probability that s will be included in a summary S, given the k features Fj, j = 1..k
- the probability can be expressed using Bayes' rule:

  P(s in S | F1,...,Fk) = P(F1,...,Fk | s in S) P(s in S) / P(F1,...,Fk)
KPC method: classifier

- assuming statistical independence of the features:

  P(s in S | F1,...,Fk) = [prod_{j=1..k} P(Fj | s in S)] P(s in S) / prod_{j=1..k} P(Fj)

- P(s in S) is a constant, and P(Fj | s in S) and P(Fj) can be estimated directly from the training set by counting occurrences
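The counting-based classifier can be sketched as follows. The training data is invented for illustration; each sentence is reduced to a tuple of feature values (F1, F2, F3) and a label saying whether it appeared in the manual summary:

```python
from collections import Counter

def train(sentences, labels):
    """Estimate P(s in S), P(Fj = v | s in S), and P(Fj = v) by counting.
    Each sentence is a tuple of feature values (F1, ..., Fk)."""
    n = len(sentences)
    in_summary = [s for s, y in zip(sentences, labels) if y]
    p_s = len(in_summary) / n
    k = len(sentences[0])
    counts_given_s = [Counter(s[j] for s in in_summary) for j in range(k)]
    counts = [Counter(s[j] for s in sentences) for j in range(k)]

    def score(features):
        # Naive Bayes score: prod P(Fj|S) * P(S) / prod P(Fj)
        num, den = p_s, 1.0
        for j, v in enumerate(features):
            num *= counts_given_s[j][v] / max(len(in_summary), 1)
            den *= counts[j][v] / n
        return num / den if den else 0.0

    return score

# Hypothetical training data: (F1, F2, F3) per sentence.
sentences = [(1, "i", 1), (1, "i", 0), (0, "m", 0), (1, "f", 1), (0, "i", 0)]
labels    = [True,        False,       False,       True,        False]
score = train(sentences, labels)
```

At summarization time, `score` is evaluated for every sentence of a new document and the n best-scoring sentences are kept.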
KPC method: corpus

- the corpus was acquired from Engineering Information Co., which provides abstracts of technical articles to online information services
- the articles do not have author-written abstracts
- the abstracts were created by professional abstractors
KPC method: corpus

- 188 document/summary pairs sampled from 21 publications in the scientific/technical domain
- the summaries are mainly indicative; average length is 3 sentences
- the average number of sentences in the original documents is 86
- author, address, and bibliography were removed
KPC method: sentence matching

- the abstracts from the human abstractors are not extracts but are inspired by the original sentences
- the automatic summarization task here:
  - extract sentences that the human abstractor might have chosen to prepare the summary text (with minor modifications...)
KPC method: sentence matching

- for training, a correspondence between the manual summary sentences and the sentences in the original document needs to be obtained
- matching can be done in several ways
KPC method: sentence matching

- matching can be done in several ways:
  - a direct sentence match
    - the same sentence is found in both
  - a direct join
    - 2 or more original sentences were used to form a summary sentence
  - a summary sentence can be 'unmatchable'
  - a summary sentence (single or joined) can be 'incomplete'
KPC method: sentence matching

- matching was done in two passes
  - first, the best one-to-one sentence matches were found automatically
  - second, these matches were used as a starting point for the manual assignment of correspondences
KPC method: evaluation

- cross-validation strategy for evaluation
  - documents from a given journal were selected for testing one at a time; all other document/summary pairs were used for training
  - unmatchable and incomplete summary sentences were excluded
  - a total of 498 unique sentences
KPC method: evaluation

- two ways of evaluation:
  1. the fraction of manual summary sentences that were faithfully reproduced by the summarizer program
     - the summarizer produced the same number of sentences as were in the corresponding manual summary
     - -> 35% of summary sentences reproduced
     - 83% is the highest possible value, since unmatchable and incomplete sentences were excluded
  2. the fraction of the matchable sentences that were correctly identified by the summarizer
     - -> 42%
KPC method: evaluation

- the effect of different features was also studied
  - best combination (44%): paragraph, fixed-phrase, sentence-length
  - baseline: selecting sentences from the beginning of the document (result: 24%)
  - if 25% of the original sentences are selected: 84%
Discourse-based approaches

- discourse structure appears to play an important role in the strategies used by human abstractors and in the structure of their abstracts
- an abstract is not just a collection of sentences; it has an internal structure
  - -> an abstract should be coherent, and it should represent some of the argumentation used in the source
Discourse models

- cohesion
  - relations between words or referring expressions, which determine how tightly connected the text is
    - anaphora, ellipsis, synonymy, hypernymy (dog is-a-kind-of animal)
- coherence
  - the overall structure of a multi-sentence text in terms of macro-level relations between sentences (e.g. "although" -> contrast)
Boguraev, Kennedy (BG)

- goal: identify those phrasal units across the entire span of the document that best function as representative highlights of the document's content
- these phrasal units are called topic stamps
- a set of topic stamps is called a capsule overview
BG

- a capsule overview
  - is not a set/sequence of sentences
  - is a semi-formal (normalised) representation of the document, derived through a process of data reduction over the original text
  - is not always very readable, but still represents the flow of the narrative
    - can be combined with surrounding information to produce a more coherent presentation
Priest is charged with Pope attack

A Spanish priest was charged here today with attempting to murder the Pope. Juan Fernandez Krohn, aged 32, was arrested after a man armed with a bayonet approached the Pope while he was saying prayers at Fatima on Wednesday night.

According to the police, Fernandez told the investigators today that he trained for the past six months for the assault. He was alleged to have claimed the Pope 'looked furious' on hearing the priest's criticism of his handling of the church's affairs. If found guilty, the Spaniard faces a prison sentence of 15-20 years.
Capsule overview vs. summary

- a summary could be, e.g.
  - "A Spanish priest is charged after an unsuccessful murder attempt on the Pope"
- capsule overview:
  - A SPANISH PRIEST was charged
  - Attempting to murder the POPE
  - HE trained for the assault
  - POPE furious on hearing PRIEST'S criticisms
BG

- primary consideration: the methods should apply to any document type and source (domain independence)
- also: efficient and scalable technology
  - shallow syntactic analysis; no comprehensive parsing engine needed
BG

- based on findings on technical terms
  - technical terms have linguistic properties that can be used to find terms automatically in different domains quite reliably
  - technical terms seem to be topical
- the task of content characterization:
  - identifying phrasal units that have
    - lexico-syntactic properties similar to technical terms
    - discourse properties that signify their status as most prominent
BG: terms as content indicators

- problems
  - undergeneration
  - overgeneration
  - differentiation
Undergeneration

- a set of phrases should contain an exhaustive description of all the entities that are discussed in the text
- the set of technical terms has to be extended to include also expressions with pronouns etc.
Overgeneration

- already the set of technical terms can be large
- extensions make the information overload even worse
- solution: phrases that refer to one participant in the discourse are combined with referential links
Differentiation

- the same list of terms may be used to describe two documents, even if they, e.g., focus on different subtopics
- it is necessary to differentiate term sets not only according to their membership, but also according to the relative representativeness of the terms they contain
Term sets and coreference classes

- phrases are extracted using a phrasal grammar (e.g. a noun with modifiers)
  - also expressions with pronouns and incomplete expressions are extracted
  - using a (Lingsoft) tagger that provides information about the part of speech, number, gender, and grammatical function of tokens in a text
- solves the undergeneration problem
Term sets and coreference classes

- the phrase set has to be reduced to solve the problem of overgeneration
- -> a smaller set of expressions that uniquely identify the objects referred to in the text
- an application of anaphora resolution
  - e.g. to which noun does the pronoun 'he' refer?
Resolving coreferences

- procedure
  - moving through the text sentence by sentence and analysing the nominal expressions in each sentence from left to right
  - either an expression is identified as a new participant in the discourse, or it is taken to refer to a previously mentioned referent
Resolving coreferences

- coreference is determined by a 3-step procedure:
  - a set of candidates is collected: all nominals within a local segment of discourse
  - some candidates are eliminated due to morphological mismatches or syntactic restrictions
  - the remaining candidates are ranked according to their relative salience in the discourse
Salience factors

- sent(term) = 100 iff term is in the current sentence
- cntx(term) = 50 iff term is in the current discourse segment
- subj(term) = 80 iff term is a subject
- acc(term) = 50 iff term is a direct object
- dat(term) = 40 iff term is an indirect object
- ...
Local salience of a candidate

- the local salience of a candidate is the sum of the values of the salience factors
- the most salient candidate is selected as the antecedent
- if a coreference link cannot be established to some other expression, the nominal is taken to introduce a new referent
- -> coreference classes
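The factor-summing step can be sketched in a few lines. The factor values are those listed on the slides; which factors hold for a candidate is a hypothetical input here (in the real system it comes from the tagger and the discourse state):

```python
# Salience factor values from the slides.
SALIENCE_FACTORS = {
    "sent": 100,   # term is in the current sentence
    "cntx": 50,    # term is in the current discourse segment
    "subj": 80,    # term is a subject
    "acc": 50,     # term is a direct object
    "dat": 40,     # term is an indirect object
}

def local_salience(conditions):
    """Local salience = sum of the values of the factors that hold."""
    return sum(SALIENCE_FACTORS[f] for f in conditions)

# 'priest' as subject of the current sentence vs. 'bayonet' as a direct
# object mentioned earlier in the current segment:
priest = local_salience({"sent", "cntx", "subj"})
bayonet = local_salience({"cntx", "acc"})
```

Here `priest` scores 230 against 100 for `bayonet`, so 'priest' would be picked as the antecedent.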
Topic stamps

- in order to further reduce the referent set, some additional structure has to be imposed
  - the term set is ranked according to the salience of its members
  - i.e. the relative prominence or importance in the discourse of the entities to which they refer
  - objects in the centre of discussion have a high degree of salience
Saliency

- measured like local saliency in coreference resolution, but tries to measure the importance of unique referents in the discourse
Priest is charged with Pope attack

A Spanish priest was charged here today with attempting to murder the Pope. Juan Fernandez Krohn, aged 32, was arrested after a man armed with a bayonet approached the Pope while he was saying prayers at Fatima on Wednesday night.

According to the police, Fernandez told the investigators today that he trained for the past six months for the assault. He was alleged to have claimed the Pope 'looked furious' on hearing the priest's criticism of his handling of the church's affairs. If found guilty, the Spaniard faces a prison sentence of 15-20 years.
Saliency

- 'priest' is the primary element
  - eight references to the same actor in the body of the story
  - these references occur in important syntactic positions: 5 are subjects of main clauses, 2 are subjects of embedded clauses, 1 is a possessive
- 'Pope attack' is also important
  - 'Pope' occurs 5 times, but not in such important positions (2 are direct objects)
Discourse segments

- if the intention is to use very concise descriptions of one or two salient phrases, i.e. topic stamps, longer texts have to be broken down into smaller segments
- topically coherent, contiguous segments can be found by using a lexical similarity measure
  - assumption: the distribution of words used changes when the topic changes
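The lexical-similarity idea can be sketched as follows. A minimal sketch, not the BG system's actual segmenter: adjacent sentences are compared as bags of words with cosine similarity, and a boundary is placed where similarity drops below a threshold (the threshold and example sentences are invented):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count bags."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def segment_breaks(sentences, threshold=0.1):
    """Place a segment boundary wherever the word distributions of
    adjacent sentences are dissimilar (cosine below the threshold)."""
    bags = [Counter(s.split()) for s in sentences]
    return [i + 1 for i, (a, b) in enumerate(zip(bags, bags[1:]))
            if cosine(a, b) < threshold]

sentences = [
    "the servers were hit by an attack",
    "the attack crippled the servers",
    "basketball season opens next week",
]
print(segment_breaks(sentences))
```

With this toy input the word distribution changes before the third sentence, so a single boundary is placed there.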
BG: Summarization process

1. linguistic analysis
2. discourse segmentation
3. extended phrase analysis
4. anaphora resolution
5. calculation of discourse salience
6. topic stamp identification
7. capsule overview
Knowledge-rich approaches

- structured information can be used as the starting point for summarization
  - structured information: e.g. data and knowledge bases, which may have been produced by processing input text
- the summarizer does not have to address the linguistic complexities and variability of the input, but the structure of the input text is not available either
Knowledge-rich approaches

- there is a need for measures of salience and relevance that are dependent on the knowledge source
- addressing coherence, cohesion, and fluency becomes the entire responsibility of the generator
STREAK

- McKeown, Robin, Kukich (1995): Generating concise natural language summaries
- goal: folding information from multiple facts into a single sentence using concise linguistic constructions
STREAK

- produces summaries of basketball games
- first creates a draft of the essential facts
- then uses revision rules, constrained by the draft wording, to add in additional facts as the text allows
STREAK

- input:
  - a set of box scores for a basketball game
  - historical information (from a database)
- task:
  - summarize the highlights of the game, underscoring their significance in the light of previous games
- output:
  - a short summary: a few sentences
STREAK

- the box score input is represented as a conceptual network that expresses relations between what were the columns and rows of the table
- essential facts: the game result, its location, date, and at least one final game statistic (the most remarkable statistic of a winning team player)
STREAK

- essential facts can be obtained directly from the box score
- in addition, other potential facts:
  - other notable game statistics of individual players - from the box score
  - game result streaks (Utah recorded its fourth straight win) - historical
  - extremum performances such as maximums or minimums - historical
STREAK

- essential facts are always included
- potential facts are included if there is space
  - the decision on which potential facts to include can be based on the possibility of combining the facts with the essential information in cohesive and stylistically successful ways
STREAK

- given the facts:
  - Karl Malone scored 39 points.
  - Karl Malone's 39-point performance is equal to his season high.
- a single sentence is produced:
  - "Karl Malone tied his season high with 39 points."
Text summarization

- surface-level methods
  - "manual" features
  - corpus-based learning
- discourse-level methods
- knowledge-rich methods