Finding and Using Rhetorical-Semantic Relations in Text
Sasha Blair-Goldensohn
28 April 2005
Outline
• Background
• Relations and Definitional QA
• Exploring Statistical Techniques for
Relation Finding
• Using Mined Relations For Fun and Profit
Situating This Talk
• Various levels of textual relations (a.k.a. predicates)
– Word-level, e.g. hypernym-hyponym
• WordNet catalogs many of these
– Syntactic, e.g. verb-argument
– Propositional, e.g. agent-patient
• Wide array of work on parsers for syntactic and propositional
structure can derive relations at the sentence level
– Rhetorical, e.g. cause-effect, contrast
• Work in this domain more theoretical, no “general use” parser
• This talk
– How rhetorical-type relations can be useful for a particular task
• Interaction between rhetorical and word-level relations
– Experiments in detecting and using these relations
Motivation
• Definitional Questions
– “What/Who is X?”
• Concepts / Things / Processes: Muzak, thin layer
chromatography, Hogwarts, Aum Shinrikyo, etc.
• People: Sonia Gandhi, Neil Diamond
• Exploratory manual analysis of definitions
– Some properties consistently “good” across topics
• e.g., Superordinate, Cause-Effect, Contrast
– Other “good” properties harder to generalize
• Different for a chemical procedure (applications, process
components) vs. a cult (founder, beliefs, membership)
– Templates could be useful here for certain broad categories
(people, organizations, etc.)
– … but our focus is on a system to define any term
DefScriber: A Hybrid System
– Knowledge-driven: three predicates (a.k.a. relations):
• Genus: category information (“Shiraz is a grape.”)
• Species: differentiating the subject from other category
members (“Shiraz is used to make a popular style of red
wine…”)
– Sentences containing both Genus and Species identified by
pattern
• Non-specific Definitional (NSD): relevant information
that may be impractical to classify generally (“Reds are now
in favor in Australia, but in the 1970s white wine was more
popular.”)
– NSD sentences identified (mainly) as a function of term
concentration (sketch after this slide)
– Data-driven: statistical summarization-esque
techniques to organize NSD information
• Separate core concepts from more marginal ones
• Cluster key subtopics
• Order sentences using importance and cohesion
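A minimal sketch of the term-concentration heuristic for NSD identification (hypothetical code; DefScriber's actual feature set is richer than this):

import re

def term_concentration(sentence, term):
    # Fraction of the sentence's tokens that belong to the target term.
    # Hypothetical scoring; the real system combines this with other features.
    tokens = re.findall(r"\w+", sentence.lower())
    term_tokens = set(re.findall(r"\w+", term.lower()))
    hits = sum(1 for tok in tokens if tok in term_tokens)
    return hits / len(tokens) if tokens else 0.0

# Sentences dense in the target term are candidate NSD sentences.
print(term_concentration("Shiraz is used to make a popular style of red wine.", "Shiraz"))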
Pattern-Based Relation Identification (G-S)
[Figure: an example sentence is converted to a syntax-tree pattern, which then matches a new input sentence.]
• Original Genus-Species sentence: “The Hindu Kush is the boundary between two major plates: the Indian and Eurasian.”
• Extracted partial syntax-tree pattern: S → NP [DT? TERM] VP [FormativeVb NP (Genus) PP (PREP Species)]
• A matching input sentence: “The Hajj, or Pilgrimage to Makkah (Mecca), is the central duty of Islam.”
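A toy illustration of this kind of pattern match, using a flat regular expression over tokens rather than real syntax trees (the actual system matches partial syntax-tree patterns; all names here are hypothetical):

import re

def match_genus(sentence, term):
    # Crude stand-in for the partial syntax-tree pattern
    # [NP: DT? TERM] [VP: FormativeVb [NP: Genus] [PP: PREP Species]].
    pattern = rf"(?:the\s+)?{re.escape(term)}\b[^.]*?\b(?:is|are|was|were)\s+((?:a|an|the)\s[\w\s-]+?)(?=\s(?:of|in|that|which)\b|[.,])"
    m = re.search(pattern, sentence, re.IGNORECASE)
    return m.group(1) if m else None

print(match_genus(
    "The Hajj, or Pilgrimage to Makkah (Mecca), is the central duty of Islam.", "Hajj"))
# -> "the central duty" (the genus NP; the trailing PP "of Islam" carries species info)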
Example Output (From DUC 2004)
Who is Sonia Gandhi?
Congress President Sonia Gandhi, who married into what was once India’s
most powerful political family, is the first non-Indian since independence 50
years ago to lead the Congress. After Prime Minister Rajiv Gandhi was
assassinated in 1991, Gandhi was persuaded by the Congress to succeed
her husband to continue leading the party as the chief, but she refused. The
BJP had shrugged off the influence of the 51-year-old Sonia Gandhi when
she stepped into politics early this year, dismissing her as a “foreigner.” Sonia
Gandhi is now an Indian citizen. Gandhi, who is 51, met her husband when
she was an 18-year old student at Cambridge in London, the first time she
was away from her native Italy.
• Starting with Genus and Species information gives answer context
• Word-based chaining of concepts for cohesion
• Use of pronoun rewriting (Nenkova, 2003) to clarify initial
references and make later ones more fluid
• Contrast reads well – but we were just lucky!
• Statistical analysis (data-driven techniques) creates a definition that
proceeds from more to less central topics
– Five sentences extracted from four different documents
Some Formal Evaluations
• Survey-based evaluation (2003)
– Users rated five qualitative aspects of definitions
– Showed significant improvement over query-focused
multi-document summarization
• Automatic and manual evals in DUC 2004 “Who
is X?” task
– Best results among 22 teams in automated (ROUGE)
evaluation (significantly better than 20)
– Less distinguished in manual evaluation of coverage,
responsiveness, and quality
• Little significant difference: on average, 1.1 systems better, 2 worse
• Perhaps because it is an extractive task?
Informal Observations
• DefScriber Pros
– Robust: Data-driven approaches will provide an
answer for any topic, dynamically
• Stock answer for “Why not use Google definitions?”
– Nice answers when we find a G-S sentence and we
have some coherent threads
• Cons
– Predicate coverage for G-S only
– Data-driven techniques are limited
• Similarity-based (word-overlap)
• Use data from retrieved documents only (mod IDF)
Adding Predicates
• We want to add predicates that are consistently
useful, e.g. Cause-Effect, Contrast
– The syntax-tree pattern approach achieves high precision
(~96%) but uneven recall, and requires significant
manual effort
– An initial markup study indicates these predicates are
stated in highly varied ways, and not always explicitly,
e.g.
• “Diabetes is a disease of the endocrine system.
Symptoms can include tiredness, thirst and the need to
urinate frequently.”
• Idea: A technique to determine a relation using
word pairs, even when it is not explicitly stated
Strengthening Data-driven Techniques
• We want to strengthen our techniques, because
word-based similarity can limit us in some cases,
e.g.:
• We would like to follow:
– Tachyons are a class of particles which are able to travel faster
than the speed of light.
• With:
– By extension of this terminology, particles that travel slower
than light are called tardyons, and particles, such as photons,
that travel exactly at the speed of light are called luxons.
• but the felicitousness of this combination, due to Contrast, is
missed by a similarity-based metric
• Idea: A technique that adds relations, in addition to
similarity/identity, to a cohesion metric
Choosing an Approach
• Learning relationship content, e.g. that disease causes
symptoms, or that faster contrasts with slower
– Echihabi and Marcu (2002) use cue phrases to mine large
corpora to construct a word-pair-based classifier for four
relations including Cause and Contrast and detect these
relations across clauses or sentences
– Lapata and Lascarides (2004) use a similar approach for
sentence-internal temporal relations (Before, After, During, etc.)
using word pairs and other features like verb tenses
• As opposed to learning patterns
– Snow, Jurafsky et al. (2005) use a supervised approach to learn
patterns for the hypernymy relation based on dependency-tree paths
• e.g., “X is a Y”, “X, Y and other Z”, etc.
– Some issues including usefulness for non-explicit relations and
cohesion application (more later)
The Approach
• Begin by following Echihabi and Marcu:
– Compile a small set of cue-phrases for each relation, e.g.
• Cause: [Because X, Y], [X. As a consequence, Y], etc.
• Contrast: [X. However, Y], [X even though Y], etc.
• Baseline: choose random non-contiguous sentences from a document
– Mine a large amount of (noisy) data:
• If we find a sentence “Because [x1 x2 … xn] , [y1 y2 … ym] .”
• … we note that every cross-product pair (xi, yj) was observed in a
causal setting
• So if we find: “Because [of poaching, smuggling and related
treacheries], [tigers, rhinos and civets are endangered species].”
• … our belief that the pair (poaching, endangered) indicates a causal
relationship is increased
– Construct a naïve Bayes classifier such that, for two text spans W1 and
W2, the probability of relation rk is estimated as:
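The formula itself did not survive extraction; following Echihabi and Marcu (2002), it is presumably the naïve Bayes estimate over the word-pair cross product:

P(r_k \mid W_1, W_2) \;\propto\; P(r_k) \prod_{(w_i, w_j) \in W_1 \times W_2} P\big((w_i, w_j) \mid r_k\big)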
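A minimal sketch of this mine-then-classify pipeline, assuming a toy cue set and in-memory counts in place of the MySQL-backed models (all names hypothetical):

import math
import re
from collections import defaultdict

# Toy cue set; the real system compiles several cue phrases per relation.
CAUSE_CUES = [re.compile(r"^because\s+(?P<x>.+?),\s*(?P<y>.+)$", re.IGNORECASE)]

pair_counts = {"cause": defaultdict(int), "norel": defaultdict(int)}
totals = {"cause": 0, "norel": 0}

def mine(sentence, relation="cause"):
    # Record every cross-product word pair (xi, yj) as evidence for the relation.
    for cue in CAUSE_CUES:
        m = cue.match(sentence.strip().rstrip("."))
        if not m:
            continue
        for x in m.group("x").split():
            for y in m.group("y").split():
                pair_counts[relation][(x.lower(), y.lower())] += 1
                totals[relation] += 1

def log_score(span1, span2, relation, lam=0.01, n_pair_types=64_000_000):
    # Naive Bayes log-probability over the word-pair cross product,
    # Laplace-smoothed with parameter lam; priors omitted since the
    # classes are roughly balanced (~400k examples each).
    score = 0.0
    for x in span1.lower().split():
        for y in span2.lower().split():
            c = pair_counts[relation].get((x, y), 0)
            score += math.log((c + lam) / (totals[relation] + lam * n_pair_types))
    return score

mine("Because of poaching, tigers are endangered.")
# Classify by comparing against log_score(..., "norel").
print(log_score("poaching continues", "tigers are endangered", "cause"))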
Goals
• Attain “good” accuracy
– Not essential to exceed previous numbers
since we are concerned with application
• Apply model to address DefScriber “cons”
– Make a system that can be used in an online
setting
• Consider alternative uses for model
System Design
• Corpus: AQUAINT collection (LDC) of
approximately 20M sentences of newswire text
from 1996-2000
• Mined examples of Cause and Contrast
– Approx 407k cause
– Approx 943k contrast
– Trained system on approx 400k each, and added
400k “no relation” as baseline
• “No relation” is taken as sentence pairs from the same
document which are at least 3 sentences apart
• 64M word pairs with counts stored in a MySQL database
– Efficiency concerns
Classification Task
• Given two text spans, predict the relation
between them when cue patterns are
removed
• Used 10k held out test data for each
relation type
– Baseline for binary classifier = 50%
Smoothing
• Our data is very sparse given the possible number of
word pairs (99% of possible pairs unseen in 400k norel
sentence pairs)
• Using Laplace smoothing, we estimate the probability of a
given word pair as:
P_Lap(x, y) = (C(x, y) + λ) / (N + λB)
• Where B is the number of unseen events. But with λ = 1,
94% of the probability space goes to unseen events
• We can experiment with smaller λ
– Or estimate values empirically
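Rendered as code (a sketch; argument names are assumptions):

def p_laplace(count, n, b, lam=1.0):
    # count = C(x,y); n = N total observations; b = B from the formula above.
    return (count + lam) / (n + lam * b)

# With lam = 1 and b >> n, unseen pairs soak up most of the probability
# mass, motivating the smaller lam values explored on the next slide.
print(p_laplace(0, 400_000, 64_000_000))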
Effect of λ Parameter
[Chart: Binary Classification @ 100k Training Examples. Accuracy (0.40-0.75) as a function of the Laplace parameter λ, from 1.0 down to 0.0001, for cause vs. norel and contrast vs. norel.]
Good-Turing Smoothing
• Smooths all counts based on the ratio of
frequencies of frequencies
– Gives N1/N = .08 probability to unseen events
• Depends on choice of smoothing function for
higher frequencies where we have few
examples
• In limited experiments, performed moderately
worse than Laplace (within .05)
– May improve with more data (and effort!)
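A minimal sketch of the Good-Turing idea (assuming the simple unsmoothed frequency-of-frequencies estimator; as noted above, the N_c curve itself needs smoothing at high counts):

from collections import Counter

def good_turing(pair_counts):
    # noff[c] = N_c, the number of pair types seen exactly c times.
    noff = Counter(pair_counts.values())
    n_total = sum(pair_counts.values())
    p_unseen = noff[1] / n_total  # total mass for unseen events = N1/N

    def adjusted_count(c):
        # c* = (c+1) * N_{c+1} / N_c; breaks down when N_{c+1} = 0,
        # hence the need for a smoothing function at higher frequencies.
        return (c + 1) * noff[c + 1] / noff[c] if noff[c] else 0.0

    return p_unseen, adjusted_count

p0, adj = good_turing({("fast", "slow"): 3, ("hot", "cold"): 1, ("up", "down"): 1})
print(p0, adj(1))  # adj(1) is 0.0 here because N_2 = 0: the sparsity problem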
Stemming
• Experimented with Porter Stemmer to
address sparsity
– Improves classification accuracy marginally (<
1 percent)
• However, somewhat coarse-grained for
other tasks
– Currently using unstemmed models;
lemmatization might be better
Classification Results
[Chart: Binary Classification: Unstemmed / Laplace @ 0.01. Accuracy as a function of training examples (100k, 200k, 400k, 890k, 3882k; x-axis not to scale) for cause vs. norel and contrast vs. norel, each with a reference (“ref”) line. Labeled accuracies rise with more data, from 0.69 (cause) and 0.64 (contrast) at the low end to 0.85 and 0.80 at the high end.]
Another Task: Term Suggestion
• We can also use these models to look for pairs
of words which are most strongly linked for a
given relation, e.g. Contrast
• Using a log-likelihood measure à la Dunning (sketch below)
– Null hypothesis is that for two terms w and t, the pair
(w,t) is equally likely under the Contrast model and outside it:
H0: P(w,t | ContrastModel) = P(w,t | ¬ContrastModel)
– So given a word w, we wish to suggest the term(s) t
for which H0 is most unlikely
• Issues: Evaluation and Sparsity
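A sketch of the log-likelihood ratio test in this setting (standard Dunning formulation; the count arguments are assumptions about how the models are stored):

import math

def _ll(k, n, p):
    # Binomial log likelihood, guarding the p = 0/1 edge cases.
    p = min(max(p, 1e-12), 1 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def dunning_g2(k_rel, n_rel, k_bg, n_bg):
    # k_rel: count of pair (w,t) in the Contrast model, out of n_rel pairs;
    # k_bg: count of (w,t) outside it, out of n_bg pairs.
    p = (k_rel + k_bg) / (n_rel + n_bg)
    p1, p2 = k_rel / n_rel, k_bg / n_bg
    return 2 * (_ll(k_rel, n_rel, p1) + _ll(k_bg, n_bg, p2)
                - _ll(k_rel, n_rel, p) - _ll(k_bg, n_bg, p))

# High G2 means H0 (equal likelihood under both models) is unlikely,
# so t is a good Contrast suggestion for w.
print(dunning_g2(50, 400_000, 5, 400_000))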
Term Suggestion: an Example
• Recall our example:
• Tachyons are a class of particles which are able to travel faster than
the speed of light.
• By extension of this terminology, particles that travel slower than
light are called tardyons, and particles, such as photons, that travel
exactly at the speed of light are called luxons.
• Contrast terms above the log-likelihood threshold:
• Speed: not, still, only, speed, average, exactly, football, slower, dial,
race, faster, isn’t, efficient, strength, toughness
• Faster: buyer, perhaps, #unk#, speed
• Class: not, restroom, island, mostly, individual, down, lost, subject,
guys, only, schools
– Non-content terms: May indicate contrast language
– Noise / context-specific suggestions
– Useful terms: some antonyms, but also pseudo-coordinates, and
often the term itself – we are interested in rhetorical relevance
more than strict relation
• Seems promising, but only anecdotal evidence here
Applying to Definitional Answers
• Several potential directions for algorithm input
from relation models
– As additional weight when selecting “next” sentence
by measuring cause/contrast-ness of pairing
• Idea: encourage causal / contrast “chains” in the definition
• Could be done as classification or with term suggestions
– Use term suggestions to boost “importance” measure
at word level
• Idea: even if a sentence doesn’t seem ideal from a cohesion
perspective, it may be important enough to insert anyway if it
has strong relation links with the cluster as a whole
– “Needle in Haystack” issue
• Which terms to use as seeds for suggestion?
Contrast Chain Weighting
Idea: Use suggested terms rather than the span
classifier, since the textual regularities of adjacent
sentences may be missing
Algorithm:
1. Extract keywords K from current sent
2. For each k in K
1. Get terms T with LogLike(Contrast(t,k)) > threshold
2. For each potential next sent S, ContrastScore(S) =
WeightedOverlap(T,S)
3. Choose best next S as a function of
ContrastScore(S) and other weights
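A sketch of this algorithm (hypothetical code; extract_keywords and suggest_contrast_terms stand in for the keyword extractor and term-suggestion model above):

def contrast_chain_scores(current_sent, candidates, extract_keywords,
                          suggest_contrast_terms, threshold=10.0):
    # 1. Extract keywords K from the current sentence.
    keywords = extract_keywords(current_sent)
    # 2. Collect contrast terms T whose log-likelihood score beats the threshold.
    contrast_terms = {}
    for k in keywords:
        for t, score in suggest_contrast_terms(k):
            if score > threshold:
                contrast_terms[t] = max(score, contrast_terms.get(t, 0.0))
    # 3. ContrastScore(S) = weighted overlap between T and each candidate S.
    def weighted_overlap(sent):
        words = set(sent.lower().split())
        return sum(w for term, w in contrast_terms.items() if term in words)
    return {s: weighted_overlap(s) for s in candidates}

# The final choice of next sentence would combine these scores with
# DefScriber's other weights (importance, cohesion).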
Applying To Definitions:
“What is bankruptcy?”
Old Answer:
• There are two types of bankruptcy Chapter 7 bankruptcy and Chapter 13 bankruptcy.
• People with insufficient assets or income could still file a Chapter 7 bankruptcy, which if approved by a judge erases debts entirely after certain assets are forfeited.
• File bankruptcy petition with the clerk of the bankruptcy courts.
• Bankruptcy spawns new restaurant Jan 25, 2005 Lansdale Reporter, According to United States Bankruptcy Court documents Memphis Magic filed for Chapter 11 bankruptcy on Oct. 29 which had voluntarily ...
• Some people file bankruptcy because of the automatic stay provision, the part of the bankruptcy code that offers legal protection against bill collectors.
New Answer:
• There are two types of bankruptcy Chapter 7 bankruptcy and Chapter 13 bankruptcy.
• When a co-signer is involved in consumer debt situations, a Chapter 13 proceeding could protect the co-signer who has not also filed for bankruptcy protection.
• People with insufficient assets or income could still file a Chapter 7 bankruptcy, which if approved by a judge erases debts entirely after certain assets are forfeited.
• Just filing the bankruptcy does not breach the mortgage; failing to make payments according to the loan agreement is a breach.
• Personal debt pushes more into bankruptcy Jan 26, 2005 Manawatu Standard, The rules that apply to personal bankruptcy are similar to those that govern company bankruptcy: the slate is wiped clean after three years.
Further Uses for Model
• For coherence/cohesion in general-purpose
summarization
• For answering causal or comparative questions
– “Why did Dow-Corning go bankrupt?”
• Filter by terms that have causal relationship with bankruptcy
– “How fast is a lion?”
• Filter by terms that are contrasted with fast
• As added weight on bootstrapped data for, e.g. opinions
– If we believe term X has strong positive orientation, and we
believe X causes/contrasts reliably with Y, we can
increase/decrease our belief about the positive orientation of Y
• As general tool for applications that can accept weaker
inferences in exchange for broad coverage
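The bootstrapping idea in the last bullets, as a tiny sketch (hypothetical names; the strength value would come from the relation model's confidence):

def propagate_orientation(orientation_x, relation, strength):
    # Cause preserves orientation; Contrast flips it.
    # orientation_x in [-1, 1]; strength in [0, 1] from the relation model.
    sign = -1.0 if relation == "contrast" else 1.0
    return sign * strength * orientation_x

# If X is positive (0.9) and contrasts strongly (0.8) with Y,
# we decrease our belief that Y is positive.
print(propagate_orientation(0.9, "contrast", 0.8))  # -> -0.72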
Alternatives
• “Couldn’t you just use WordNet?”
– Certainly complementary
– WN has issues of coverage
• Number of terms, number of relations both limited
• Much more precise, but doesn’t clearly contain things like the
“contrast” between speed and strength
– Probabilities over relations
• “What about patterns?”
– Again complementary
– Issues with explicit statement of relations
– For methods like Snow et al., need training data
Issues
• Sparsity
– More effort into smoothing (class-based methods,
principled estimation for parameter-based techniques)
– Additional data, features
• Pattern inaccuracy
– Estimated at up to 15% by Echihabi – address with
syntax-aware patterns
– e.g., “I think the bond is going to pass as it is
because it's an excellent proposal,” [she said].
– Pattern-learning can discover and rank patterns, but
most methods need training data
• Evaluation
– DUC, TREC, and others!
Wrap Up
• Building a model of certain rhetorical-semantic relations seems feasible
• Validated previous work on classification
• Exploring new avenues for applying these
models to QA, summarization, and beyond
Example Run: “What is the Hajj?”
Goal-Driven
• Use definitional predicates such as Genus and Species to search for sentences conveying typical definitional information.
• Implementation combines feature-based classification and pattern recognition over syntax trees.
Data-Driven
• Adapt techniques from summarization to maximize content importance, cohesion and coverage.
• Implementation uses lexical distance for centroid-based clustering and cohesion metrics.
[Flowchart: Document Retrieval (11 Web documents, 1127 total sentences) → Predicate Identification, yielding 383 Non-specific Definitional sentences (passed to Data-Driven Analysis for clusters and ordering information) and 9 Genus-Species sentences, e.g.:
1. The Hajj, or pilgrimage to Makkah (Mecca), is the central duty of Islam.
2. The Hajj is a milestone event in a Muslim's life.
3. The hajj is one of five pillars that make up the foundation of Islam.
4. The hajj is a week-long pilgrimage that begins in the 12th month of the Islamic lunar calendar. …
Both feed Definition Creation:]
The Hajj, or pilgrimage to Makkah
[Mecca], is the central duty of Islam.
More than two million Muslims are
expected to take the Hajj this year.
Muslims must perform the hajj at
least once in their lifetime if
physically and financially able. The
Hajj is a milestone event in a
Muslim's life. The annual hajj begins
in the twelfth month of the Islamic
year (which is lunar, not solar, so
that hajj and Ramadan fall
sometimes in summer, sometimes in
winter). The Hajj is a week-long
pilgrimage that begins in the 12th
month of the Islamic lunar calendar.
Another ceremony, which was not
connected with the rites of the Ka'ba
before the rise of Islam, is the Hajj,
the annual pilgrimage to 'Arafat,
about two miles east of Mecca,
toward Mina. The hajj is one of five
pillars that make up the foundation
of Islam.