ACL 2007 Tutorial on Textual Entailment Recognition

Textual Entailment

Ido Dagan, Bar-Ilan University, Israel
Dan Roth, University of Illinois at Urbana-Champaign, USA
Fabio Massimo Zanzotto, University of Rome, Italy

ACL 2007
Outline

1. Motivation and Task Definition
2. A Skeletal Review of Textual Entailment Systems
3. Knowledge Acquisition Methods
4. Applications of Textual Entailment
5. A Textual Entailment View of Applied Semantics

I. Motivation and Task Definition
Motivation

- Text applications require semantic inference
- A common framework for applied semantics is needed, but it is still missing
- Textual entailment may provide such a framework
Desiderata for a Modeling Framework

A framework for a target level of language processing should provide:
1) A generic (feasible) module for applications
2) A unified (agreeable) paradigm for investigating language phenomena

Most semantics research is scattered:
- WSD, NER, SRL, lexical semantic relations, ... (in contrast to syntax, for example)
- The dominating approach is interpretation
Natural Language and Meaning

[Diagram: the relation between language and meaning, characterized by variability (many expressions for one meaning) and ambiguity (one expression with many meanings)]
Variability of Semantic Expression

- The Dow Jones Industrial Average closed up 255
- Dow ends up
- Dow gains 255 points
- Dow climbs 255
- Stock market hits a record high

Model variability as relations between text expressions:
- Equivalence: text1 ⇔ text2 (paraphrasing)
- Entailment: text1 ⇒ text2 (the general case)
Typical Application Inference: Entailment

Question: Who bought Overture?
Expected answer form: X bought Overture

The text "Overture's acquisition by Yahoo" entails the hypothesized answer "Yahoo bought Overture".

- Similar for IE: X acquire Y
- Similar for "semantic" IR: t: Overture was bought for ...
- Summarization (multi-document): identify redundant info
- MT evaluation (and recent ideas for MT)
- Educational applications
KRAQ'05 Workshop - KNOWLEDGE and REASONING for ANSWERING QUESTIONS (IJCAI-05)

CFP:
- Reasoning aspects:
  * information fusion
  * search criteria expansion models
  * summarization and intensional answers
  * reasoning under uncertainty or with incomplete knowledge
- Knowledge representation and integration:
  * levels of knowledge involved (e.g. ontologies, domain knowledge)
  * knowledge extraction models and techniques to optimize response accuracy

... but there are similar needs for other applications -
can entailment provide a common empirical framework?
Classical Entailment Definition

Chierchia & McConnell-Ginet (2001): a text t entails a hypothesis h if h is true in every circumstance (possible world) in which t is true.

Strict entailment doesn't account for some of the uncertainty allowed in applications.
"Almost Certain" Entailments

t: The technological triumph known as GPS ... was incubated in the mind of Ivan Getting.
h: Ivan Getting invented the GPS.
Applied Textual Entailment

A directional relation between two text fragments, Text (t) and Hypothesis (h):
t entails h (t ⇒ h) if humans reading t will infer that h is most likely true.

Operational (applied) definition:
- Human gold standard, as in NLP applications
- Assuming common background knowledge, which is indeed expected by applications
Probabilistic Interpretation

Definition: t probabilistically entails h if
  P(h is true | t) > P(h is true)
- t increases the likelihood of h being true
- equivalent to positive PMI: t provides information on h's truth

P(h is true | t) is the entailment confidence:
- the relevant entailment score for applications
- in practice, "most likely" entailment is expected
The Role of Knowledge

For textual entailment to hold we require:
  text AND knowledge ⇒ h
but knowledge should not entail h alone:
- systems are not supposed to validate h's truth regardless of t (e.g. by searching for h on the web)
PASCAL Recognizing Textual Entailment (RTE) Challenges

EU FP-6 funded PASCAL Network of Excellence, 2004-7

Bar-Ilan University, MITRE, ITC-irst and CELCT (Trento), Microsoft Research
Generic Dataset by Application Use

7 application settings in RTE-1, 4 in RTE-2/3:
- QA
- IE
- "Semantic" IR
- Comparable documents / multi-document summarization
- MT evaluation
- Reading comprehension
- Paraphrase acquisition

- Most data created from actual application output
- RTE-2/3: 800 examples in the development and test sets
- 50-50% YES/NO split
RTE Examples

1. TEXT: Regan attended a ceremony in Washington to commemorate the landings in Normandy.
   HYPOTHESIS: Washington is located in Normandy.
   TASK: IE    ENTAILMENT: False

2. TEXT: Google files for its long awaited IPO.
   HYPOTHESIS: Google goes public.
   TASK: IR    ENTAILMENT: True

3. TEXT: ...: a shootout at the Guadalajara airport in May, 1993, that killed Cardinal Juan Jesus Posadas Ocampo and six others.
   HYPOTHESIS: Cardinal Juan Jesus Posadas Ocampo died in 1993.
   TASK: QA    ENTAILMENT: True

4. TEXT: The SPD got just 21.5% of the vote in the European Parliament elections, while the conservative opposition parties polled 44.5%.
   HYPOTHESIS: The SPD is defeated by the opposition parties.
   TASK: IE    ENTAILMENT: True
Participation and Impact

Very successful challenges, worldwide:
- RTE-1: 17 groups
- RTE-2: 23 groups
- RTE-3: 25 groups
  - ~150 downloads
  - Joint workshop at ACL-07

High interest in the research community:
- Papers, conference sessions and areas, PhDs, influence on funded projects
- Textual Entailment special issue at JNLE
- ACL-07 tutorial
Methods and Approaches (RTE-2)

Measure the similarity/match between t and h (coverage of h by t):
- Lexical overlap (unigram, N-gram, subsequence)
- Lexical substitution (WordNet, statistical)
- Syntactic matching/transformations
- Lexical-syntactic variations ("paraphrases")
- Semantic role labeling and matching
- Global similarity parameters (e.g. negation, modality)
- Cross-pair similarity

Detect mismatch (for non-entailment).
Interpretation to a logic representation + logic inference.
Dominant Approach: Supervised Learning

[Pipeline: the pair (t,h) is mapped to similarity features (lexical, n-gram, syntactic, semantic, global), which form a feature vector; a classifier outputs YES/NO]

- Features model similarity and mismatch
- The classifier determines the relative weights of the information sources
- Train on the development set and on auxiliary t-h corpora
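As a minimal illustration of this pipeline, the sketch below builds a two-feature vector (word overlap and a length ratio, which are purely illustrative stand-ins for the richer lexical/syntactic/semantic/global feature sets listed above) and trains a classifier with scikit-learn. It is a toy sketch under those assumptions, not any participant's actual system.

# Toy supervised RTE pipeline: features -> feature vector -> classifier.
# Assumes scikit-learn is installed; the two features are illustrative only.
from sklearn.linear_model import LogisticRegression

def features(t, h):
    t_words, h_words = set(t.lower().split()), set(h.lower().split())
    overlap = len(t_words & h_words) / max(len(h_words), 1)  # coverage of h by t
    len_ratio = len(h_words) / max(len(t_words), 1)
    return [overlap, len_ratio]

# Tiny development set: (text, hypothesis, label), label 1 = entails.
dev = [
    ("Google files for its long awaited IPO.", "Google goes public.", 1),
    ("Regan attended a ceremony in Washington.", "Washington is located in Normandy.", 0),
]
X = [features(t, h) for t, h, _ in dev]
y = [label for _, _, label in dev]
clf = LogisticRegression().fit(X, y)

print(clf.predict([features("Yahoo took over Overture.", "Yahoo bought Overture.")]))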
RTE-2 Results

First Author (Group)        Accuracy       Average Precision
Hickl (LCC)                 75.4%          80.8%
Tatu (LCC)                  73.8%          71.3%
Zanzotto (Milan & Rome)     63.9%          64.4%
Adams (Dallas)              62.6%          62.8%
Bos (Rome & Leeds)          61.6%          66.9%
11 groups                   58.1%-60.5%
7 groups                    52.9%-55.6%

Average accuracy: 60%; median: 59%
Analysis

For the first time, methods that carry out some deeper analysis seemed (?) to outperform shallow lexical methods.
- Cf. Kevin Knight's invited talk at EACL-06, titled "Isn't Linguistic Structure Important, Asked the Engineer".

Still, most systems that do utilize deep analysis did not score significantly better than the lexical baseline.
Why?

System reports point at:
- lack of knowledge (syntactic transformation rules, paraphrases, lexical relations, etc.)
- lack of training data

It seems that the systems that coped better with these issues performed best:
- Hickl et al.: acquisition of large entailment corpora for training
- Tatu et al.: large knowledge bases (linguistic and world knowledge)
Some Suggested Research Directions

Knowledge acquisition:
- Unsupervised acquisition of linguistic and world knowledge from general corpora and the web
- Acquiring larger entailment corpora
- Manual resources and knowledge engineering

Inference:
- A principled framework for inference and for fusing information levels
- Are we happy with bags of features?
Complementary Evaluation Modes

"Seek" mode:
- Input: h and a corpus
- Output: all entailing t's in the corpus
- Captures information-seeking needs, but requires post-run annotation (TREC-style)

Entailment subtask evaluations:
- Lexical, lexical-syntactic, logical, alignment, ...

Contribution to various applications:
- QA: Harabagiu & Hickl, ACL-06
- RE: Romano et al., EACL-06
II. A Skeletal Review of Textual Entailment Systems
Textual Entailment

Text: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year.

Hypotheses (entailed / subsumed by the text?):
- Yahoo acquired Overture
- Overture is a search company
- Google is a search company
- Google owns Overture
- ...

How? Phrasal verb paraphrasing, entity matching, alignment, semantic role labeling, integration.
A General Strategy for Textual Entailment

Given a sentence T and a sentence H:
- Re-represent T and H at the lexical, syntactic and semantic levels.
- Apply transformations/rules from a knowledge base (semantic, structural and pragmatic) to produce further re-representations of T.
- Decision: find the set of transformations/features of the new representation (or use these to create a cost function) that allows embedding of H in T.
Details of the Entailment Strategy

Preprocessing:
- Multiple levels of lexical preprocessing
- Syntactic parsing
- Shallow semantic parsing
- Annotating semantic phenomena

Representation:
- Bag of words, n-grams, through tree/graph-based representations
- Logical representations

Knowledge Sources:
- Syntactic mapping rules
- Lexical resources
- Semantic-phenomena-specific modules
- RTE-specific knowledge sources
- Additional corpora / web resources

Control Strategy & Decision Making:
- Single-pass vs. iterative processing
- Strict vs. parameter-based

Justification:
- What can be said about the decision?
The Case of Shallow Lexical Approaches

Preprocessing:
- Identify stop words

Representation:
- Bag of words

Knowledge Sources:
- Shallow lexical resources, typically WordNet

Control Strategy & Decision Making:
- Single pass
- Compute similarity; use a threshold tuned on a development set (could be per task)

Justification:
- It works
Shallow Lexical Approaches (Example)

Lexical/word-based semantic overlap: a score based on matching each word in H with some word in T.
- Word similarity measure: may use WordNet
- May take account of subsequences, word order
- 'Learn' a threshold on the maximum word-based match score

Clearly, this may not appeal to what we think of as understanding, and it is easy to generate cases for which this does not work well. However, it works (surprisingly) well with respect to current evaluation metrics (data sets?).

Text: The Cassini spacecraft arrived at Titan in July, 2006.
Text: NASA's Cassini-Huygens spacecraft traveled to Saturn in 2006.
Text: The Cassini spacecraft has taken images that show rivers on Saturn's moon Titan.
Hyp: The Cassini spacecraft has reached Titan.
An Algorithm: LocalLexicalMatching

For each word in Hypothesis, Text:
- if the word matches a stopword, remove it
- if no words are left in Hypothesis or Text, return 0

numberMatched = 0;
for each word W_H in Hypothesis
  for each word W_T in Text
    HYP_LEMMAS = Lemmatize(W_H);
    TEXT_LEMMAS = Lemmatize(W_T);
    if any term in HYP_LEMMAS matches any term in TEXT_LEMMAS using LexicalCompare()   (LexicalCompare uses WordNet)
      numberMatched++;

Return: numberMatched / |HYP_LEMMAS|
An Algorithm: LocalLexicalMatching (Cont.)

LexicalCompare():
  if (LEMMA_H == LEMMA_T)
    return TRUE;
  if (SynonymOf(textWord, hypothesisWord))
    return TRUE;
  if (MemberOfDistanceFromTo(textWord, hypothesisWord) <= 3)
    return TRUE;
  if (MeronymyDistanceFromTo(textWord, hypothesisWord) <= 3)
    return TRUE;
  if (HypernymDistanceFromTo(textWord, hypothesisWord) <= 3)
    return TRUE;

LLM Performance:
  RTE-2: Dev: 63.00, Test: 60.50
  RTE-3: Dev: 67.50, Test: 65.63

Notes:
- LexicalCompare is asymmetric and makes use of a single relation type at a time
- Additional differences could be attributed to the stop word list (e.g., including aux verbs)
- Straightforward improvements such as bi-grams do not help
- More sophisticated lexical knowledge (entities; time) should help
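To make the procedure above concrete, here is a small runnable sketch of the LocalLexicalMatching idea in Python, using NLTK's WordNet interface. It only covers lemma identity and synonymy; the hypernym/meronym distance checks, the full stop-word list, tokenization and the exact lemmatization of the original algorithm are simplified, so treat it as an illustration rather than the reported system (it assumes NLTK with the WordNet data installed).

# Simplified LocalLexicalMatching: lemma match or WordNet synonymy only.
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

STOPWORDS = {"the", "a", "an", "has", "have", "is", "was", "of", "to", "in", "at"}
lemmatizer = WordNetLemmatizer()

def synonyms(word):
    # All lemma names of all synsets of the word.
    return {l.name().lower() for s in wn.synsets(word) for l in s.lemmas()}

def lexical_compare(w_t, w_h):
    if lemmatizer.lemmatize(w_t) == lemmatizer.lemmatize(w_h):
        return True
    return w_h in synonyms(w_t)

def llm_score(text, hyp):
    clean = lambda s: [w.strip(".,;:!?").lower() for w in s.split()]
    t_words = [w for w in clean(text) if w and w not in STOPWORDS]
    h_words = [w for w in clean(hyp) if w and w not in STOPWORDS]
    if not t_words or not h_words:
        return 0.0
    matched = sum(1 for w_h in h_words
                  if any(lexical_compare(w_t, w_h) for w_t in t_words))
    return matched / len(h_words)

print(llm_score("The Cassini spacecraft arrived at Titan in July, 2006.",
                "The Cassini spacecraft has reached Titan."))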
Details of the Entailment Strategy (Again)

- Preprocessing: multiple levels of lexical preprocessing; syntactic parsing; shallow semantic parsing; annotating semantic phenomena
- Representation: bag of words, n-grams, through tree/graph-based representations; logical representations
- Knowledge sources: syntactic mapping rules; lexical resources; semantic-phenomena-specific modules; RTE-specific knowledge sources; additional corpora/web resources
- Control strategy & decision making: single-pass vs. iterative processing; strict vs. parameter-based
- Justification: what can be said about the decision?
Preprocessing

Syntactic processing (only a few systems):
- Syntactic parsing (Collins; Charniak; CCG)
- Dependency parsing (+ types)

Lexical processing:
- Tokenization; lemmatization
- Phrasal verbs
- Idiom processing
- Named entities + normalization
- Date/time arguments + normalization

Semantic processing:
- Semantic role labeling
- Nominalization
- Modality/polarity/factivity
- Co-reference

Several of these annotations are often used only during decision making.
Details of the Entailment Strategy (Again)

- Preprocessing: multiple levels of lexical preprocessing; syntactic parsing; shallow semantic parsing; annotating semantic phenomena
- Representation: bag of words, n-grams, through tree/graph-based representations; logical representations
- Knowledge sources: syntactic mapping rules; lexical resources; semantic-phenomena-specific modules; RTE-specific knowledge sources; additional corpora/web resources
- Control strategy & decision making: single-pass vs. iterative processing; strict vs. parameter-based
- Justification: what can be said about the decision?
Basic Representations

[Diagram: a spectrum of representation levels, from raw text and local lexical representations, through the syntactic parse and semantic representations, up to logical forms and a full meaning representation; textual entailment can be attempted at any of these levels, with inference at the meaning-representation end]

Most approaches augment the basic structure defined by the processing level with additional annotation and make use of a tree/graph/frame-based system.
Basic Representations (Syntax)

[Figure: the syntactic parse of Hyp: "The Cassini spacecraft has reached Titan."]
Basic Representations (Shallow Semantics: Pred-Arg) [Roth & Sammons '07]

T: The government purchase of the Roanoke building, a former prison, took place in 1902.
H: The Roanoke building, which was a former prison, was bought by the government in 1902.

[Figure: predicate-argument structures. For T: PRED "take" with arguments "The govt. purchase ... prison", "place" and "in 1902", plus an embedded PRED "purchase" with argument "The Roanoke building". For H: PRED "buy" with ARG_0 "The government", ARG_1 "The Roanoke building" and AM_TMP "In 1902", plus an embedded PRED "be" with ARG_1 "The Roanoke ... prison" and ARG_2 "a former prison".]
Basic Representations (Logical Representation) [Bos & Markert]

The semantic representation language is a first-order fragment of the language used in Discourse Representation Theory, conveying argument structure with a neo-Davidsonian analysis and including the recursive DRS structure to cover negation, disjunction, and implication.
Representing Knowledge Sources

- Encoding knowledge is rather straightforward in the logical framework.
- Tree/graph-based representations may also use rule-based transformations to encode different kinds of knowledge, sometimes represented as generic or knowledge-based tree transformations.

Representing Knowledge Sources (cont.)

- In general, there is a mix of procedural and rule-based encodings of knowledge sources:
  - done by hanging more information on the parse tree or predicate-argument representation [example from LCC's system], or
  - via different frame-based annotation systems for encoding information, which are processed procedurally.
Details of the Entailment Strategy (Again)

- Preprocessing: multiple levels of lexical preprocessing; syntactic parsing; shallow semantic parsing; annotating semantic phenomena
- Representation: bag of words, n-grams, through tree/graph-based representations; logical representations
- Knowledge sources: syntactic mapping rules; lexical resources; semantic-phenomena-specific modules; RTE-specific knowledge sources; additional corpora/web resources
- Control strategy & decision making: single-pass vs. iterative processing; strict vs. parameter-based
- Justification: what can be said about the decision?
Knowledge Sources

- The knowledge sources available to the system are the most significant component in supporting TE.
- Different systems draw the line between preprocessing capabilities and knowledge resources differently.
- The way resources are handled also differs across approaches.
Enriching Preprocessing

In addition to syntactic parsing, several approaches enrich the representation with various linguistic resources:
- POS tagging
- Stemming
- Predicate-argument representation: verb predicates and nominalizations
- Entity annotation: stand-alone NERs with a variable number of classes
- Acronym handling and entity normalization: mapping mentions of the same entity expressed in different ways to a single ID
- Co-reference resolution
- Dates, times and numeric values: identification and normalization
- Identification of semantic relations: complex nominals, genitives, adjectival phrases, and adjectival clauses
- Event identification and frame construction
Lexical Resources

- Recognizing that a word or a phrase in T entails a word or a phrase in H is essential in determining textual entailment.
- WordNet is the most commonly used resource:
  - In most cases, a WordNet-based similarity measure between words is used. This is typically a symmetric relation.
  - Lexical chains over WordNet are used; in some cases, care is taken to disallow chains of specific relations.
  - Extended WordNet is used, e.g. for the derivation relation, which links verbs with their corresponding nominalized nouns.
Lexical Resources (Cont.)

Lexical paraphrasing rules:
- A number of efforts to acquire relational paraphrase rules are under way, and several systems make use of resources such as DIRT and TEASE.
- Some systems seem to have acquired paraphrase rules that occur in the RTE corpus:
  - person killed --> claimed one life
  - hand reins over to --> give starting job to
  - same-sex marriage --> gay nuptials
  - cast ballots in the election --> vote
  - dominant firm --> monopoly power
  - death toll --> kill
  - try to kill --> attack
  - lost their lives --> were killed
  - left people dead --> people were killed
Semantic Phenomena

- A large number of semantic phenomena have been identified as significant to textual entailment.
- Many of them are handled (in a restricted way) by some of the systems. Very little per-phenomenon quantification has been done, if at all.

Semantic implications of interpreting syntactic structures [Braz et al. '05; Bar-Haim et al. '07]:

- Conjunctions
  - Jake and Jill ran up the hill ⇒ Jake ran up the hill
  - Jake and Jill met on the hill ⇏ *Jake met on the hill
- Clausal modifiers
  - But celebrations were muted as many Iranians observed a Shi'ite mourning month.
    ⇒ Many Iranians observed a Shi'ite mourning month.
  - Semantic role labeling handles this phenomenon automatically.
Semantic Phenomena (Cont.)

- Relative clauses
  - The assailants fired six bullets at the car, which carried Vladimir Skobtsov.
    ⇒ The car carried Vladimir Skobtsov.
  - Semantic role labeling handles this phenomenon automatically.
- Appositives
  - Frank Robinson, a one-time manager of the Indians, has the distinction for the NL.
    ⇒ Frank Robinson is a one-time manager of the Indians.
- Passive
  - We have been approached by the investment banker.
    ⇒ The investment banker approached us.
  - Semantic role labeling handles this phenomenon automatically.
- Genitive modifier
  - Malaysia's crude palm oil output is estimated to have risen.
    ⇒ The crude palm oil output of Malaysia is estimated to have risen.
Logical Structure

- Factivity: uncovering the context in which a verb phrase is embedded
  - The terrorists tried to enter the building. ⇏ The terrorists entered the building.
- Polarity: negative markers or a negation-denoting verb (e.g. deny, refuse, fail)
  - The terrorists failed to enter the building. ⇏ The terrorists entered the building.
- Modality/Negation: dealing with modal auxiliary verbs (can, must, should), which modify verbs' meanings, and with identifying the scope of negation
- Superlatives/Comparatives/Monotonicity: inflected adjectives or adverbs
- Quantifiers, determiners and articles
Some Examples [Braz et al., IJCAI workshop '05; PARC corpus]

- T: Legally, John could drive.  H: John drove.
- T: Bush said that Khan sold centrifuges to North Korea.  H: Centrifuges were sold to North Korea.
- T: No US congressman visited Iraq until the war.  H: Some US congressmen visited Iraq before the war.
- T: The room was full of women.  H: The room was full of intelligent women.
- T: The New York Times reported that Hanssen sold FBI secrets to the Russians and could face the death penalty.  H: Hanssen sold FBI secrets to the Russians.
- T: All soldiers were killed in the ambush.  H: Many soldiers were killed in the ambush.
Details of the Entailment Strategy (Again)

- Preprocessing: multiple levels of lexical preprocessing; syntactic parsing; shallow semantic parsing; annotating semantic phenomena
- Representation: bag of words, n-grams, through tree/graph-based representations; logical representations
- Knowledge sources: syntactic mapping rules; lexical resources; semantic-phenomena-specific modules; RTE-specific knowledge sources; additional corpora/web resources
- Control strategy & decision making: single-pass vs. iterative processing; strict vs. parameter-based
- Justification: what can be said about the decision?
Control Strategy and Decision Making

Single iteration:
- Strict logical approaches are, in principle, a single-stage computation.
- The pair is processed and transformed into logical form.
- Existing theorem provers act on the pair along with the KB.

Multiple iterations:
- Graph-based algorithms are typically iterative.
- Following [Punyakanok et al. '04], transformations are applied and an entailment test is performed after each transformation.
- Transformations can be chained, but sometimes the order makes a difference. The algorithm can be greedy, or it can be more exhaustive and search for the best path found [Braz et al. '05; Bar-Haim et al. '07].
Transformation Walkthrough [Braz et al. '05]

T: The government purchase of the Roanoke building, a former prison, took place in 1902.
H: The Roanoke building, which was a former prison, was bought by the government in 1902.

Does H follow from T?
Transformation Walkthrough (1)

T: The government purchase of the Roanoke building, a former prison, took place in 1902.
H: The Roanoke building, which was a former prison, was bought by the government in 1902.

[Figure: the predicate-argument structures of T (PRED "take", with embedded PRED "purchase") and H (PRED "buy", with embedded PRED "be"), as shown in the shallow-semantics representation slide above]
Transformation Walkthrough (2)

T: The government purchase of the Roanoke building, a former prison, took place in 1902.

Phrasal Verb Rewriter ("take place" -> "occur"):
  The government purchase of the Roanoke building, a former prison, occurred in 1902.
  [PRED: occur; ARG: The govt. purchase ... prison; in 1902]

H: The Roanoke building, which was a former prison, was bought by the government in 1902.
Transformation Walkthrough (3)

T: The government purchase of the Roanoke building, a former prison, occurred in 1902.

Nominalization Promoter:
  The government purchase the Roanoke building, a former prison, in 1902.
  [PRED: purchase; ARG_0: The government; ARG_1: the Roanoke building, a former prison; AM_TMP: In 1902]
  NOTE: this depends on the earlier transformation; order is important!

H: The Roanoke building, which was a former prison, was bought by the government in 1902.
Transformation Walkthrough (4)

T: The government purchase of the Roanoke building, a former prison, occurred in 1902.

Apposition Rewriter:
  The Roanoke building be a former prison.
  [PRED: be; ARG_1: The Roanoke building; ARG_2: a former prison]

H: The Roanoke building, which was a former prison, was bought by the government in 1902.
Transformation Walkthrough (5)

T: The government purchase of the Roanoke building, a former prison, took place in 1902.
H: The Roanoke building, which was a former prison, was bought by the government in 1902.

[Figure: after the transformations, the predicate-argument structures of T and H align: the "purchase" structure (The government; The Roanoke ... prison; In 1902) matches the "buy" structure of H via a WordNet link between purchase and buy, and the apposition-derived "be" structure (The Roanoke building; a former prison) matches the relative clause in H]
Characteristics

- Multiple paths => an optimization problem
  - Shortest or highest-confidence path through the transformations
  - Order is important; may need to explore different orderings
  - Module dependencies are 'local': module B does not need access to module A's KB/inference, only its output
- If the outcome is "true", the (optimal) set of transformations and local comparisons forms a proof
Summary: Control Strategy and Decision Making

Despite their appeal, the strict logical approaches do not, as of today, work well enough.
- Bos & Markert: the strict logical approach falls significantly behind good LLMs and multiple levels of lexical pre-processing; only incorporating rather shallow features and using them in the evaluation saves this approach.
- Braz et al.: the strict graph-based representation is not doing as well as LLM.
- Tatu et al.: results show that the strict logical approach is inferior to LLMs, but when the two are put together, it produces some gain.

Using machine learning methods as a way to combine systems and multiple features has been found very useful.
Hybrid/Ensemble Approaches

Bos et al.: use a theorem prover and a model builder
- Expand models of T and H using the model builder; check the sizes of the models
- Test consistency of T and H with background knowledge
- Try to prove entailment with and without background knowledge

Tatu et al. (2006) use an ensemble approach:
- Create two logical systems and one lexical alignment system
- Combine system scores using coefficients found via search (trained on annotated data)
- Modify the coefficients for different tasks

Zanzotto et al. (2006) try to learn from a comparison of the structures of T and H for 'true' vs. 'false' entailment pairs:
- Use lexical and syntactic annotation to characterize the match between T and H for successful and unsuccessful entailment pairs
- Train a kernel SVM to distinguish between match graphs
Justification

- For most approaches, justification is given only by empirical evaluation on the (preprocessed) data.
- Logical approaches:
  - There is a proof-theoretic justification, modulo the power of the resources and the ability to map a sentence to a logical form.
- Graph/tree-based approaches:
  - There is a model-theoretic justification.
  - The approach is sound, but not complete, modulo the availability of resources.
Justifying Graph-Based Approaches [Braz et al. '05]

- R: a knowledge representation language with a well-defined syntax and semantics over a domain D.
- For text snippets s, t:
  - r_s, r_t: their representations in R
  - M(r_s), M(r_t): their model-theoretic representations
- There is a well-defined notion of subsumption in R, defined model-theoretically: for u, v ∈ R, u is subsumed by v when M(u) ⊆ M(v).
- This is not an algorithm; we need a proof theory.
Defining Semantic Entailment (2)

- The proof theory is weak; it will show r_s ⊆ r_t only when they are relatively similar syntactically.
- r ∈ R is faithful to s if M(r_s) = M(r).
- Definition: Let s, t be text snippets with representations r_s, r_t ∈ R. We say that s semantically entails t if there is a representation r ∈ R that is faithful to s, for which we can prove that r ⊆ r_t.
- Given r_s we need to generate many equivalent representations r'_s and test r'_s ⊆ r_t. This cannot be done exhaustively.
- How to generate alternative representations?
Defining Semantic Entailment (3)

- A rewrite rule (l, r) is a pair of expressions in R such that l ⊆ r.
- Given a representation r_s of s and a rule (l, r) for which r_s ⊆ l, the augmentation of r_s via (l, r) is r'_s = r_s ∧ r.

Claim: r'_s is faithful to s.
Proof: In general, since r'_s = r_s ∧ r, then M(r'_s) = M(r_s) ∩ M(r).
However, since r_s ⊆ l ⊆ r, then M(r_s) ⊆ M(r).
Consequently M(r'_s) = M(r_s), and the augmented representation is faithful to s.
Comments

- The claim suggests an algorithm for generating alternative (equivalent) representations and for semantic entailment.
- The resulting algorithm is sound, but not complete. Completeness depends on the quality of the KB of rules.
- The power of this algorithm is in the rule KB: l and r might be very different syntactically, but by satisfying model-theoretic subsumption they provide expressivity to the re-representation in a way that facilitates the overall subsumption.
Non-Entailment

- The problem of determining non-entailment is harder, mostly due to its structure.
- Most approaches determine non-entailment heuristically:
  - Set a threshold for a cost function; if the pair does not meet it, say 'no'.
  - Several approaches have identified specific features that hint at non-entailment.
- A model-theoretic approach to non-entailment has also been developed, although its effectiveness isn't clear yet.
What Are We Missing?

It is completely clear that the key missing resource is knowledge:
- Better resources translate immediately into better results.
- At this point, existing resources seem to be lacking in coverage and accuracy.
- There are not enough high-quality public resources, and no quantification of their contribution.

Some examples:
- Lexical knowledge: some cases are difficult to acquire systematically, e.g. A bought Y ⇒ A has/owns Y.
- Many of the current lexical resources are very noisy.
- Numbers and quantitative reasoning.
- Time and date; temporal reasoning.
- Robust event-based reasoning and information integration.
Textual Entailment as a Classification Task

RTE is a classification task:
- Given a pair, we need to decide whether T implies H or T does not imply H.
- We can learn a classifier from annotated examples.

What do we need?
- A learning algorithm
- A suitable feature space
Defining the Feature Space

How do we define the feature space?

T1 ⇒ H1
T1: "At the end of the year, all solid companies pay dividends."
H1: "At the end of the year, all solid insurance companies pay dividends."

Possible features:
- "Distance features": features of "some" distance between T and H
- "Entailment trigger features"
- "Pair features": the content of the T-H pair is represented

Possible representations of the sentences:
- Bag-of-words (possibly with n-grams)
- Syntactic representation
- Semantic representation
Distance Features

T ⇒ H
T: "At the end of the year, all solid companies pay dividends."
H: "At the end of the year, all solid insurance companies pay dividends."

Possible features:
- Number of words in common
- Longest common subsequence
- Longest common syntactic subtree
- ...
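The first two distance features above can be computed directly; the sketch below (word overlap and a longest-common-subsequence count over words, with naive whitespace tokenization as an illustrative simplification) shows one way to do it.

# Two simple T/H distance features: word overlap and word-level LCS length.
def word_overlap(t, h):
    return len(set(t.lower().split()) & set(h.lower().split()))

def lcs_length(t, h):
    a, b = t.lower().split(), h.lower().split()
    # Classic dynamic-programming longest common subsequence over words.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if wa == wb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

T = "At the end of the year, all solid companies pay dividends."
H = "At the end of the year, all solid insurance companies pay dividends."
print(word_overlap(T, H), lcs_length(T, H))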
Entailment Triggers

Possible features, from (de Marneffe et al., 2006):

- Polarity features: presence/absence of negative polarity contexts (not, no or few, without)
  "Oil price surged" ⇏ "Oil prices didn't grow"
- Antonymy features: presence/absence of antonymous words in T and H
  "Oil price is surging" ⇏ "Oil prices is falling down"
- Adjunct features: dropping/adding of a syntactic adjunct when moving from T to H
  "all solid companies pay dividends" ⇏ "all solid companies pay cash dividends"
- ...
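A small sketch of how two of these triggers can be detected is given below, using a fixed negation-marker list for polarity and WordNet antonym links for antonymy. The word lists and the lack of morphological normalization are illustrative simplifications (and the sketch assumes NLTK with the WordNet data installed), not de Marneffe et al.'s actual features.

# Polarity-mismatch and antonymy trigger features (illustrative sketch).
from nltk.corpus import wordnet as wn

NEGATION = {"not", "no", "never", "without", "few"}

def has_negation(s):
    toks = s.lower().split()
    return any(t in NEGATION or t.endswith("n't") for t in toks)

def polarity_mismatch(t, h):
    return has_negation(t) != has_negation(h)   # one side negated, the other not

def antonyms(word):
    return {a.name().lower()
            for s in wn.synsets(word) for l in s.lemmas() for a in l.antonyms()}

def antonymy_trigger(t, h):
    t_words, h_words = set(t.lower().split()), set(h.lower().split())
    return any(h_words & antonyms(w) for w in t_words)

print(polarity_mismatch("Oil prices surged", "Oil prices didn't grow"))
print(antonymy_trigger("Oil prices increase", "Oil prices decrease"))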
Pair Features

T ⇒ H
T: "At the end of the year, all solid companies pay dividends."
H: "At the end of the year, all solid insurance companies pay dividends."

Possible features:
- Bag-of-word spaces of T and H
- Syntactic spaces of T and H

[Figure: the bag-of-word pair space has one dimension per word of T (end_T, year_T, solid_T, companies_T, pay_T, dividends_T, ...) and one per word of H (end_H, year_H, solid_H, insurance_H, companies_H, pay_H, dividends_H, ...)]
Pair Features: What Can We Learn?

In the bag-of-word pair space we can learn things such as:
- T implies H when T contains "end" ...
- T does not imply H when H contains "end" ...

This seems to be totally irrelevant!
ML Methods in the Possible Feature Spaces

[Table: systems arranged by feature type (distance, entailment trigger, pair) and sentence representation (bag-of-words, syntactic, semantic); the systems placed include (Bos & Markert, 2006), (de Marneffe et al., 2006), (Hickl et al., 2006), (Ipken et al., 2006), (Kozareva & Montoyo, 2006), (Herrera et al., 2006), (Rodney et al., 2006), (Zanzotto & Moschitti, 2006), and others]
Effectively Using the Pair Feature Space (Zanzotto & Moschitti, 2006)

Roadmap:
- Motivation: why the pair space matters even if it seems not to
- Understanding the model with an example
  - Challenges
  - A simple example
- Defining the cross-pair similarity
Observing the Distance Feature Space (Zanzotto & Moschitti, 2006)

T1 ⇒ H1
T1: "At the end of the year, all solid companies pay dividends."
H1: "At the end of the year, all solid insurance companies pay dividends."

T1 ⇏ H2
T1: "At the end of the year, all solid companies pay dividends."
H2: "At the end of the year, all solid companies pay cash dividends."

In a distance feature space (e.g. % common words, % common syntactic dependencies), the two pairs are very likely the same point.
What Can Happen in the Pair Feature Space? (Zanzotto & Moschitti, 2006)

T1 ⇒ H1: "At the end of the year, all solid companies pay dividends." / "At the end of the year, all solid insurance companies pay dividends."
T1 ⇏ H2: "At the end of the year, all solid companies pay dividends." / "At the end of the year, all solid companies pay cash dividends."
T3 ⇒ H3: "All wild animals eat plants that have scientifically proven medicinal properties." / "All wild mountain animals eat plants that have scientifically proven medicinal properties."

(with cross-pair similarities S2 < S1)
Observations

- Some examples are difficult to exploit in the distance feature space.
- We need a space that considers both the content and the structure of textual entailment examples.

Let us explore the pair space, using the kernel trick: define the space by defining the distance K(P1, P2), e.g. K(T1 ⇒ H1, T1 ⇒ H2), instead of defining the features.
Target: Cross-Pair Similarity (Zanzotto & Moschitti, 2006)

KS((T',H'),(T'',H'')) = KT(T',T'') + KT(H',H'')

How do we build it?
- Using a syntactic interpretation of the sentences
- Using a similarity KT(T',T'') among trees: this similarity counts the number of subtrees in common between T' and T''

This is a syntactic pair feature space. Question: do we need something more?
Observing the Syntactic Pair Feature Space (Zanzotto & Moschitti, 2006)

Can we use syntactic tree similarity? Not only!
We also want to use/exploit the implied rewrite rule.

[Figure: two toy tree pairs built from the same fragments (a, b, c, d) but with different correspondences between their leaves, so the same tree similarity hides different implied rewrite rules]
Exploiting Rewrite Rules (Zanzotto & Moschitti, 2006)

To capture the textual entailment recognition rule (rewrite rule or inference rule), the cross-pair similarity measure should consider:
- the structural/syntactic similarity between, respectively, the texts and the hypotheses
- the similarity among the intra-pair relations between constituents

How do we reduce the problem to a tree similarity computation?
Exploiting Rewrite Rules (Zanzotto & Moschitti, 2006)

Intra-pair operations:
- Finding anchors
- Naming anchors with placeholders
- Propagating placeholders

Cross-pair operations:
- Matching placeholders across pairs
- Renaming placeholders
- Calculating the similarity between syntactic trees with co-indexed leaves

The initial example: is sim(H1,H3) > sim(H2,H3)?
Defining the Cross-Pair Similarity (Zanzotto & Moschitti, 2006)

The cross-pair similarity is based on the distance between syntactic trees with co-indexed leaves (the equation is reconstructed below), where:
- C is the set of all the correspondences between the anchors of (T',H') and (T'',H'')
- t(S, c) returns the parse tree of the hypothesis (or text) S where its placeholders are replaced according to the substitution c
- i is the identity substitution
- KT(t1, t2) is a function that measures the similarity between the two trees t1 and t2
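The equation itself did not survive in these notes; the following LaTeX restates the cross-pair similarity as defined in Zanzotto & Moschitti (2006), using the notation of the list above (C, t(S,c), i, and KT):

% Cross-pair similarity: best anchor correspondence maximizing the summed tree kernels
K_S\big((T',H'),(T'',H'')\big) \;=\;
  \max_{c \in C} \Big( K_T\big(t(T',i),\, t(T'',c)\big) \;+\; K_T\big(t(H',i),\, t(H'',c)\big) \Big)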
Refining the Cross-Pair Similarity (Zanzotto & Moschitti, 2006)

Controlling complexity and reducing the computational cost:
- We reduced the size of the set of anchors using the notion of chunk.
- Many subtree computations are repeated during the computation of KT(t1, t2); this can be exploited in a better dynamic programming algorithm (Moschitti & Zanzotto, 2007).

Focusing on the information within a pair that is relevant for the entailment:
- Text trees are pruned according to where the anchors attach.
BREAK (30 min)

III. Knowledge Acquisition Methods
Knowledge Acquisition for TE

What kind of knowledge do we need?

Explicit knowledge (structured knowledge bases):
- Relations among words (or concepts)
  - Symmetric: synonymy, co-hyponymy
  - Directional: hyponymy, part-of, ...
- Relations among sentence prototypes
  - Symmetric: paraphrasing
  - Directional: inference rules / rewrite rules

Implicit knowledge:
- Relations among sentences
  - Symmetric: paraphrasing examples
  - Directional: entailment examples

Acquisition of Explicit Knowledge
Acquisition of Explicit Knowledge

The questions we need to answer:
- What? What do we want to learn, and which resources do we need?
- Using what? Which principles do we have?
- How? How do we organize the "knowledge acquisition" algorithm?
Acquisition of Explicit Knowledge: What?

Types of knowledge:
- Symmetric
  - Co-hyponymy between words: cat ↔ dog
  - Synonymy between words: buy ↔ acquire
  - Sentence prototypes (paraphrasing): X bought Y ↔ X acquired Z% of Y's shares
- Directional semantic relations
  - Words: cat ⇒ animal, buy ⇒ own, wheel part-of car
  - Sentence prototypes: X acquired Z% of Y's shares ⇒ X owns Y
Acquisition of Explicit Knowledge: Using What?

Underlying hypotheses:
- Harris' Distributional Hypothesis (DH) (Harris, 1964): "Words that tend to occur in the same contexts tend to have similar meanings."
  sim(w1,w2) ≈ sim(C(w1), C(w2))
- Robison's Point-wise Assertion Patterns (PAP) (Robison, 1970): "It is possible to extract relevant semantic relations with some patterns."
  w1 is in a relation r with w2 if the context matches pattern_r(w1, w2)
Distributional Hypothesis (DH)

sim_w(w1, w2) ≈ sim_ctx(C(w1), C(w2))

Words or forms: w1 = constitute, w2 = compose
Context (feature) space: C(w1), C(w2)
Corpus as the source of contexts:
  "... sun is constituted of hydrogen ..."
  "... The Sun is composed of hydrogen ..."
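A tiny runnable sketch of this idea: represent each word by a bag of neighboring words collected from a (here, two-sentence) corpus and compare the words with cosine similarity. The corpus, window size and tokenization are illustrative choices, not those of any particular published system.

# Distributional similarity: context bags compared with cosine similarity.
from collections import Counter
from math import sqrt

corpus = [
    "the sun is constituted of hydrogen",
    "the sun is composed of hydrogen",
]

def context_vector(target, sentences, window=3):
    ctx = Counter()
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            if w == target:
                lo, hi = max(0, i - window), i + window + 1
                ctx.update(x for x in words[lo:hi] if x != target)
    return ctx

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

print(cosine(context_vector("constituted", corpus), context_vector("composed", corpus)))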
Point-wise Assertion Patterns (PAP)

w1 is in a relation r with w2 if the context matches pattern_r(w1, w2)

Relation: w1 part_of w2
Patterns: "w1 is constituted of w2", "w1 is composed of w2"
Corpus as the source of contexts:
  "... sun is constituted of hydrogen ..."
  "... The Sun is composed of hydrogen ..."
A statistical indicator S_corpus(w1, w2) selects correct vs. incorrect relations among words, e.g. part_of(sun, hydrogen).
DH and PAP Cooperate

[Figure: the distributional view (words represented by their context feature spaces) and the pattern view (point-wise assertion patterns matched in text) operate over the same corpus of contexts, e.g. "... sun is constituted of hydrogen ...", "... The Sun is composed of hydrogen ..."]
Knowledge Acquisition: Where Do Methods Differ?

On the "word" side:
- Target equivalence classes: concepts or relations
- Target forms: words or expressions

On the "context" side:
- Feature space
- Similarity function
KA4TE: A First Classification of Some Methods

[Table: methods classified by type of knowledge (symmetric vs. directional) and underlying hypothesis (Distributional Hypothesis vs. Point-wise Assertion Patterns). Methods placed in the table include: Verb Entailment (Zanzotto et al., 2006), Noun Entailment (Geffet & Dagan, 2005), Relation Pattern Learning / ESPRESSO (Pantel & Pennacchiotti, 2006), ISA patterns (Hearst, 1992), Concept Learning (Lin & Pantel, 2001a), Inference Rules / DIRT (Lin & Pantel, 2001b), and TEASE (Szpektor et al., 2004)]
Noun Entailment Relation (Geffet & Dagan, 2006)

- Type of knowledge: directional relations
- Underlying hypothesis: the distributional hypothesis
- Main idea: the distributional inclusion hypothesis

w1 ⇒ w2 if all the prominent features of w1 occur with w2 in a sufficiently large corpus.

[Figure: the prominent features I(C(w1)) of w1 are contained among the features C(w2) of w2]
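Stated compactly (this is a formalization of the slide's wording, not a quotation from the paper), the distributional inclusion criterion reads:

% I(C(w)) denotes the prominent context features of w
w_1 \Rightarrow w_2 \quad \text{if} \quad I(C(w_1)) \subseteq C(w_2)

that is, the prominent context features of w1 are (almost all) found among the observed contexts of w2.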
Verb Entailment Relations (Zanzotto, Pennacchiotti, Pazienza, 2006)

- Type of knowledge: directional (oriented) relations
- Underlying hypothesis: point-wise assertion patterns
- Main idea: win ⇒ play, because "player wins"

Relation: v1 ⇒ v2
Pattern: "agentive_nominalization(v2) v1"
Statistical indicator: S(v1, v2), point-wise mutual information
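The slide names point-wise mutual information as the statistical indicator but does not spell out the formula; a standard formulation over the pattern counts would be:

S(v_1, v_2) \;=\; \mathrm{pmi}(v_1, v_2) \;=\; \log \frac{P(v_1, v_2)}{P(v_1)\,P(v_2)}

with the joint probability estimated from occurrences of the "agentive_nominalization(v2) v1" pattern (e.g. "player wins") in the corpus. Treat this as an illustrative restatement rather than the paper's exact estimator.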
Verb Entailment Relations (Zanzotto, Pennacchiotti, Pazienza, 2006)

Understanding the idea:

Selectional restriction:
- fly(x) ⇒ has_wings(x); in general v(x) ⇒ c(x) (if x is the subject of v then x has the property c)

Agentive nominalization:
- an agentive noun is "the doer or the performer of an action v'"
- "x is a player" may be read as play(x)
- c(x) is clearly v'(x) if the property c is derived from v' by agentive nominalization

Given the expression "player wins":
- seen as a selectional restriction: win(x) ⇒ play(x)
- seen as a selectional preference: P(play(x) | win(x)) > P(play(x))
Knowledge Acquisition for TE: How?

The algorithmic nature of a DH+PAP method:
- Direct: the starting point is the target words
- Indirect: the starting point is the context feature space
- Iterative: interplay between the context feature space and the target words
Direct Algorithm

sim(w1,w2) ≈ sim(C(w1), C(w2)), or sim(w1,w2) ≈ sim(I(C(w1)), I(C(w2)))

1. Select target words wi from the corpus or from a dictionary.
2. Retrieve the contexts of each wi and represent them in the feature space as C(wi).
3. For each pair (wi, wj):
   a. Compute the similarity sim(C(wi), C(wj)) in the context space.
   b. If sim(wi, wj) = sim(C(wi), C(wj)) > t, then wi and wj belong to the same equivalence class W.
Indirect Algorithm

1. Given an equivalence class W, select relevant contexts and represent them in the feature space.
2. Retrieve the target words (w1, ..., wn) that appear in these contexts. These are likely to be words in the equivalence class W.
3. Eventually, for each wi, retrieve C(wi) from the corpus.
4. Compute the centroid I(C(W)).
5. For each wi, if sim(I(C(W)), wi) < t, eliminate wi from W.
Iterative Algorithm

1. For each word wi in the equivalence class W, retrieve its contexts C(wi) and represent them in the feature space.
2. Extract words wj that have contexts similar to C(wi).
3. Extract the contexts C(wj) of these new words.
4. For each new word wj, if sim(C(W), wj) > t, put wj in W.
Knowledge Acquisition Using DH and PAP

Direct algorithms:
- Concepts from text via clustering (Lin & Pantel, 2001)
- Inference rules, aka DIRT (Lin & Pantel, 2001)
- ...

Indirect algorithms:
- Hearst's ISA patterns (Hearst, 1992)
- Question Answering patterns (Ravichandran & Hovy, 2002)
- ...

Iterative algorithms:
- Entailment rules from the Web, aka TEASE (Szpektor et al., 2004)
- Espresso (Pantel & Pennacchiotti, 2006)
- ...
TEASE (Szpektor et al., 2004)

Type: iterative algorithm

On the "word" side:
- Target equivalence classes: fine-grained relations, e.g. prevent(X,Y)
- Target forms: verbs with arguments, e.g. a dependency template for "X call Y indictable" (subj X, obj Y, modifiers "indictable", "finally")

On the "context" side:
- Feature space: X_{filler}:mi?, Y_{filler}:mi?

Innovations with respect to research before 2004:
- The first direct algorithm for extracting rules
TEASE (Szpektor et al., 2004)

Input template: X subj-accuse-obj Y

The TEASE loop (over the Web, starting from a lexicon):
1. Sample corpus for the input template: "Paula Jones accused Clinton ...", "BBC accused Blair ...", "Sanhedrin accused St. Paul ...", ...
2. Anchor Set Extraction (ASE) yields anchor sets: {Paula Jones_subj; Clinton_obj}, {Sanhedrin_subj; St. Paul_obj}, ...
3. Sample corpus for the anchor sets: "Paula Jones called Clinton indictable ...", "St. Paul defended before the Sanhedrin ...", ...
4. Template Extraction (TE) yields templates: "X call Y indictable", "Y defend before X", ...
5. Iterate.
TEASE (Szpektor et al., 2004)

Innovations with respect to research before 2004:
- The first direct algorithm for extracting rules
- A feature selection step assesses the most informative features
- Extracted forms are clustered to obtain the most general sentence prototype of a given set of equivalent forms

[Figure: two extracted dependency trees S1 and S2 for "X call Y indictable" (one with "for harassment", one with "finally") are merged into the generalized template "X call Y indictable"]
Espresso (Pantel & Pennacchiotti, 2006)

Type: iterative algorithm

On the "word" side:
- Target equivalence classes: relations, e.g. compose(X,Y)
- Target forms: expressions, i.e. sequences of tokens: "Y is composed by X", "Y is made of X"

Innovations with respect to research before 2006:
- A measure to determine specific vs. general patterns (a ranking of the equivalent forms)
Espresso (Pantel & Pennacchiotti, 2006)

[Figure: the Espresso bootstrapping loop. Seed instances such as (leader, panel), (city, region), (oxygen, water) drive sentence retrieval and pattern induction; induced patterns (e.g. "Y is composed by X", "Y is part of X", "X, Y") are scored for reliability and ranked/selected, with a Web-based filter for overly generic patterns; the selected patterns extract and rank new instances (e.g. (tree, land), (atom, molecule), (oxygen, hydrogen), (range of information, FBI report), (artifact, exhibit)), which seed the next iteration.]
Espresso (Pantel & Pennacchiotti, 2006)

Innovations with respect to research before 2006:
- A measure to determine specific vs. general patterns (a ranking of the equivalent forms), e.g. ranking "Y is composed by X" and "Y is part of X" above the generic "X, Y"
- Both pattern and instance selection are performed
- Different use of general and specific patterns in the iterative algorithm

Acquisition of Implicit Knowledge
Acquisition of Implicit Knowledge

The questions we need to answer:
- What? What do we want to learn, and which resources do we need?
- Using what? Which principles do we have?
Acquisition of Implicit Knowledge: What?

Types of knowledge:
- Symmetric: near-synonymy between sentences
  Acme Inc. bought Goofy ltd. ↔ Acme Inc. acquired 11% of Goofy ltd.'s shares
- Directional semantic relations: entailment between sentences
  Acme Inc. acquired 11% of Goofy ltd.'s shares ⇒ Acme Inc. owns Goofy ltd.

Note: tricky NOT-entailment examples are also relevant.
Acquisition of Implicit Knowledge: Using What?

Underlying hypotheses:
- Structural and content similarity: "Sentences are similar if they share enough content"; sim(s1,s2) is judged from the relations between s1 and s2.
- A revised point-wise assertion patterns hypothesis: "Some patterns over sentences reveal relations among sentences."
A First Classification of Some Methods

[Table: methods by type of knowledge (symmetric; directional: entails / not entails) and underlying hypothesis (structural and content similarity vs. revised point-wise assertion patterns): Paraphrase Corpus (Dolan & Quirk, 2004); entailment relations among sentences (Burger & Ferro, 2005); relations among sentences (Hickl et al., 2006)]
Entailment Relations Among Sentences (Burger & Ferro, 2005)

- Type of knowledge: directional relations (entailment)
- Underlying hypothesis: revised point-wise assertion patterns
- Main idea: in headline news items, the first sentence/paragraph generally entails the title

Relation: s2 ⇒ s1
Pattern: "News item: Title(s1), First_Sentence(s2)" - this pattern works on the structure of the text.
Entailment Relations Among Sentences: Examples from the Web

Title: New York Plan for DNA Data in Most Crimes
Body: Eliot Spitzer is proposing a major expansion of New York's database of DNA samples to include people convicted of most crimes, while making it easier for prisoners to use DNA to try to establish their innocence. ...

Title: Chrysler Group to Be Sold for $7.4 Billion
Body: DaimlerChrysler confirmed today that it would sell a controlling interest in its struggling Chrysler Group to Cerberus Capital Management of New York, a private equity firm that specializes in restructuring troubled companies. ...
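As a concrete illustration of this harvesting idea, the sketch below turns (title, body) news items into (text, hypothesis) pairs by pairing the first body sentence with the title. The sample item and the naive regex sentence splitter are illustrative assumptions, not Burger & Ferro's actual pipeline.

# Harvest (text, hypothesis) entailment pairs from headline news items.
import re

news_items = [
    ("New York Plan for DNA Data in Most Crimes",
     "Eliot Spitzer is proposing a major expansion of New York's database of "
     "DNA samples to include people convicted of most crimes. It would also make "
     "it easier for prisoners to use DNA to try to establish their innocence."),
]

def harvest_entailment_pairs(items):
    pairs = []
    for title, body in items:
        first_sentence = re.split(r"(?<=[.!?])\s+", body.strip())[0]
        pairs.append({"text": first_sentence, "hypothesis": title})  # s2 => s1
    return pairs

for pair in harvest_entailment_pairs(news_items):
    print(pair["text"], "=>", pair["hypothesis"])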
Tricky Not-Entailment Relations Among Sentences (Hickl et al., 2006)

- Type of knowledge: directional relations (tricky not-entailment)
- Underlying hypothesis: revised point-wise assertion patterns
- Main ideas:
  - In a text, sentences that share a named entity generally do not entail each other.
  - Sentences connected by "on the contrary", "but", ... do not entail each other.

Relation: s1 ⇏ s2
Patterns:
- s1 and s2 are in the same text and share at least a named entity
- "s1. On the contrary, s2"
Tricky Not-Entailment: Examples from (Hickl et al., 2006)

T: One player losing a close friend is Japanese pitcher Hideki Irabu, who was befriended by Wells during spring training last year.
H: Irabu said he would take Wells out to dinner when the Yankees visit Toronto.

T: According to the professor, present methods of cleaning up oil slicks are extremely costly and are never completely efficient.
H: In contrast, he stressed, Clean Mag has a 100 percent pollution retrieval rate, is low cost and can be recycled.
Context Sensitive Paraphrasing

- He used a Phillips head to tighten the screw.
- The bank owner tightened security after a spate of local crimes.
- The Federal Reserve will aggressively tighten monetary policy.

Candidate substitutions for "tighten", proposed out of context: loosen, strengthen, step up, toughen, improve, fasten, impose, intensify, ease, beef up, simplify, curb, reduce, ...
Context Sensitive Paraphrasing

Can "speak" replace "command"?
- The general commanded [v] his troops. [speak = u] The general spoke to his troops.  YES
- The soloist commanded [v] attention. [speak = u] The soloist spoke to attention.  NO

We need to know when one word can paraphrase another, not just if.
Given a word v and its context in sentence S, and another word u:
- Can u replace v in S so that S keeps the same or an entailed meaning?
- Is the new sentence S', where u has replaced v, entailed by the previous sentence S?
Related Work

- Paraphrase generation: given a sentence or phrase, generate paraphrases of that phrase which have the same or an entailed meaning in some context [DIRT; TEASE].
- A sense disambiguation task, without naming the sense:
  - Dagan et al. '06
  - Kauchak & Barzilay (in the context of improving MT evaluation)
  - SemEval word substitution task; Pantel et al. '06
- In these cases, this was done by learning (in a supervised way) a single classifier per word u.
Context Sensitive Paraphrasing [Connor & Roth '07]

- Use a single global binary classifier f(S, v, u) -> {0, 1}
- An unsupervised, bootstrapped learning approach
  - Key: use a very large amount of unlabeled data to derive a reliable supervision signal, which is then used to train a supervised learning algorithm.
- Features are the amount of overlap between the contexts u and v have each been seen with.
- Context sensitivity is included by restricting to contexts similar to the local context S:
  - Are both u and v seen in contexts similar to S?
  - This allows running the classifier on previously unseen pairs (u, v).
IV. Applications of Textual Entailment

Relation Extraction (Romano et al., EACL-06)

- Identify different ways of expressing a target relation
  - Examples: Management Succession, Birth-Death, Mergers and Acquisitions, Protein Interaction
- Traditionally performed in a supervised manner
  - Requires dozens to hundreds of examples per relation
  - Examples should cover broad semantic variability
  - Costly - feasible???
- Little work on unsupervised approaches
Proposed Approach

Input template: X prevent Y
  -> Entailment Rule Acquisition (TEASE)
  -> Templates: X prevention for Y, X treat Y, X reduce Y
  -> Syntactic Matcher (using transformation rules)
  -> Relation instances: <sunscreen, sunburns>
Dataset

- Bunescu 2005
- Recognizing interactions between annotated protein pairs
  - 200 Medline abstracts
- Input template: X interact with Y
Manual Analysis - Results

93% of interacting protein pairs can be identified with lexical-syntactic templates.

Number of templates vs. recall (within the 93%):
  R(%):         10   20   30   40   50   60   70   80    90    100
  # templates:   2    4    6   11   21   39   73  107   141    175

Frequency of syntactic phenomena:
  transparent head  34%     relative clause  8%
  apposition        24%     co-reference     7%
  conjunction       24%     coordination     7%
  set               13%     passive form     2%
TEASE Output for "X interact with Y"

A sample of correct templates learned:
X bind to Y, X binding to Y, X activate Y, X Y interaction, X stimulate Y, X attach to Y, X couple to Y, X interaction with Y, interaction between X and Y, X trap Y, X become trapped in Y, X recruit Y, X Y complex, X associate with Y, X recognize Y, X be linked to Y, X block Y, X target Y
TEASE Potential Recall on the Training Set

  Experiment                    Recall
  input                         39%
  input + iterative             49%
  input + iterative + morph     63%

- Iterative: taking the top 5 ranked templates as input
- Morph: recognizing morphological derivations (cf. semantic role labeling vs. matching)
Performance vs. Supervised Approaches

[Figure: recall/precision of the unsupervised TEASE-based approach compared with a supervised system trained on 180 abstracts]
Textual Entailment for Question Answering

Sanda Harabagiu and Andrew Hickl (ACL-06): Methods for Using Textual Entailment in Open-Domain Question Answering

Typical QA architecture - 3 stages:
1) Question processing
2) Passage retrieval
3) Answer processing

They incorporated their RTE-2 entailment system at stages 2 & 3, for filtering and re-ranking.

Three integrated methods:
1) Test entailment between the question and the final answer; filter and re-rank by entailment score.
2) Test entailment between the question and a candidate retrieved passage; combine the entailment score into the passage ranking.
3) Test entailment between the question and Automatically Generated Questions (AGQs) created from the candidate paragraph.
   - Utilizes an earlier method for generating Q-A pairs from a paragraph
   - The correct answer should match that of an entailed AGQ

TE is relatively easy to integrate at different stages. Results: a 20% accuracy increase.
Answer Validation Exercise @ CLEF 2006-7

- Peñas et al., Journal of Logic and Computation (to appear)
- Lets textual entailment systems validate (and prioritize) the answers of QA systems participating at CLEF
- AVE participants receive:
  1) the question and answer - they need to generate a full hypothesis
  2) a supporting passage - it should entail the answer hypothesis
- Methodologically, this enables measuring the contribution of TE systems to QA performance, across many QA systems
- TE developers do not need to have a full-blown QA system
V. A Textual Entailment View of Applied Semantics

Classical Approach = Interpretation

[Diagram: language (given by nature, with its variability) is interpreted into a meaning representation stipulated by the scholar]

- Logical forms, word senses, semantic roles, named entity types, ... - scattered interpretation tasks
- Is this a feasible/suitable framework for applied semantics?
Textual Entailment = Text Mapping

[Diagram: language (given by nature, with its variability) is mapped directly to other language; meaning is assumed, by humans, rather than explicitly represented]
General Case - Inference

[Diagram: interpretation maps language to a meaning representation, over which inference operates; textual entailment operates directly at the language level]

- Entailment mapping is the actual applied goal, but also a touchstone for understanding!
- Interpretation becomes a possible means
- Varying representation levels may be investigated
Some Perspectives

Issues with semantic interpretation:
- Hard to agree on a representation language
- Costly to annotate semantic representations for training
- Difficult to obtain - is it more difficult than needed?

Textual entailment refers to texts:
- Texts are theory neutral
- Amenable to unsupervised learning
- A "proof is in the pudding" test
Entailment as an Applied Semantics Framework

The new view: formulate (all?) semantic problems as entailment tasks.
- Some semantic problems are traditionally investigated as entailment tasks
- But also...
  - Revised definitions of old problems
  - Exposing many new ones
Some Classical Entailment Problems

Monotonicity - traditionally approached via entailment:
- Given that dog ⇒ animal:
  - Upward monotone: Some dogs are nice ⇒ Some animals are nice
  - Downward monotone: No animals are nice ⇒ No dogs are nice
- Some formal approaches go via interpretation to logical form
- Natural logic avoids interpretation to FOL (cf. Stanford @ RTE-3)

Noun compound relation identification:
- a novel by Tolstoy ⇒ Tolstoy wrote a novel
- Practically an entailment task, when relations are represented lexically (rather than as interpreted semantic notions)
Revised Definition of an Old Problem: Sense Ambiguity

Classical task definition - interpretation: Word Sense Disambiguation
- What is the RIGHT set of senses?
  - Any concrete set is problematic/subjective
  - ... but WSD forces you to choose one

A lexical entailment perspective:
- Instead of identifying an explicitly stipulated sense of a word occurrence ...
- ... identify whether a word occurrence (i.e. its implicit sense) entails another word occurrence, in context
- Dagan et al. (ACL-2006)
Synonym Substitution

Source = record; Target = disc

- positive: This is anyway a stunning disc, thanks to the playing of the Moscow Virtuosi with Spivakov.
- negative: He said computer networks would not be affected and copies of information should be made on floppy discs.
- negative: Before the dead soldier was placed in the ditch his personal possessions were removed, leaving one disc on the body for identification purposes.
Unsupervised Direct: kNN-Ranking

- Test example score: the average cosine similarity of the target example with the k most similar (unlabeled) instances of the source word
- Rationale:
  - positive examples of the target will be similar to some source occurrence (of the corresponding sense)
  - negative target examples won't be similar to source examples
- Rank test examples by score
- A classification slant on language modeling
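A minimal sketch of this ranking scheme is given below, with plain bag-of-words vectors of the example sentences standing in for the original feature representation; the toy data and k value are illustrative assumptions.

# kNN-ranking: score a target example by average cosine similarity
# to its k nearest source-word examples, then rank by that score.
from collections import Counter
from math import sqrt

def bow(sentence):
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_score(target_example, source_examples, k=3):
    sims = sorted((cosine(bow(target_example), bow(s)) for s in source_examples),
                  reverse=True)[:k]
    return sum(sims) / len(sims) if sims else 0.0

source = ["the record was played on the turntable",
          "she broke the world record in the sprint"]
targets = ["a stunning disc played by the orchestra",
           "copies were made on floppy discs"]
for t in sorted(targets, key=lambda t: knn_score(t, source), reverse=True):
    print(round(knn_score(t, source), 3), t)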
Results (for Synonyms): Ranking

- kNN improves precision by 8-18% at up to 25% recall
Other Modified and New Problems

Lexical entailment vs. classical lexical semantic relationships:
- synonym ⇔ synonym
- hyponym ⇒ hypernym (but much beyond WordNet - e.g. "medical technology")
- meronym ⇐ ? ⇒ holonym - depending on the meronym type and context
  - boil on elbow ⇒ boil on arm vs. government voted ⇒ minister voted

Named entity classification - by any textual type:
- Which pickup trucks are produced by Mitsubishi?  Magnum ⇒ pickup truck

Argument mapping for nominalizations (derivations):
- X's acquisition of Y ⇒ X acquired Y
- X's acquisition by Y ⇒ Y acquired X

Transparent heads:
- sell to an IBM division ⇒ sell to IBM
- sell to an IBM competitor ⇏ sell to IBM
- ...
The Importance of Analyzing Entailment Examples

- Few systematic manual data analysis works have been reported:
  - Vanderwende et al. at the RTE-1 workshop
  - Bar-Haim et al. at the ACL-05 EMSEE workshop
  - Within Romano et al. at EACL-06
  - The Xerox PARC data set; Braz et al. at the IJCAI workshop '05
- Such analyses contribute a lot to understanding and defining entailment phenomena and sub-problems
- This should be done (and reported) much more...
Unified Evaluation Framework

Defining semantic problems as entailment problems facilitates unified evaluation schemes (vs. the current state). Possible evaluation schemes:

1) Evaluate on the general TE task, while creating corpora which focus on target sub-tasks
   - E.g. a TE dataset with many sense-matching instances
   - Measure the impact of sense-matching algorithms on TE performance

2) Define TE-oriented subtasks, and evaluate directly on the sub-task
   - E.g. a test collection manually annotated for sense matching
   - Advantages: isolates the sub-problem; researchers can investigate individual problems without needing a full-blown TE system (cf. QA research)
   - Such datasets may be derived from datasets of type (1)

This facilitates a common inference goal across semantic problems.
Summary: Textual Entailment as Goal

The essence of the textual entailment paradigm:
- Base applied semantic inference on entailment "engines" and KBs
- Formulate various semantic problems as entailment sub-tasks, at various levels of representation

Interpretation and "mapping" methods may compete/complement.

Open question: which inferences
- can be represented at the "language" level?
- require logical or specialized representation and inference (temporal, spatial, mathematical, ...)?
Textual Entailment ≈ Human Reading Comprehension

From a children's English learning book (Sela and Greenberg):

Reference text: "... The Bermuda Triangle lies in the Atlantic Ocean, off the coast of Florida. ..."
Hypothesis (True/False?): The Bermuda Triangle is near the United States.
Cautious Optimism: Approaching the Desiderata?

1) A generic (feasible) module for applications
2) A unified (agreeable) paradigm for investigating language phenomena

Thank you!