
>> Chris Brockett: Good afternoon. I'm very pleased to be able to introduce Bill
MacCartney here today. Bill is a Ph.D. student finishing up his dissertation at
Stanford, working with Chris Manning. He's been particularly involved in
inference and the recognizing textual entailment system that they've been
developing at Stanford. And today he is going to present some work that won
the best paper award at Coling this year and follow on with a discussion of the
issue of alignment in recognizing entailment and engaging with inference. With
that, I'll pass it off. Thank you.
>> Bill MacCartney: Thank you very much. And thank you all for giving me the
opportunity to talk to you today. It's an honor to be here and I hope you'll find it
interesting. I'm going to be talking about two aspects of the problem of natural
language inference today, and this is really a two-part talk, so each part of the
talk concerns a different aspect of the problem of natural language inference,
which I'll define in a moment.
The first part of the talk will be based on a paper that I presented at Coling '08,
it's called Modelling Semantic Containment and Exclusion in Natural Language
Inference, and it describes a computational model of natural logic for NLI. I'll tell
you what that is in a moment. This is not a general solution for the problem of
NLI, but it does handle an interesting subset of NLI problems.
But it depends on alignments that come from another source, and that's the
motivation for the paper that the second part of the talk is based on. This is a
paper I'm going to be presenting at EMNLP in a couple of weeks, a phrase-based
model of alignment for NLI, which directly addresses the problem of alignment for NLI
and relates it to the problem of alignment for MT. And this part of the work was
directly enabled by annotated data produced here at MSR.
Okay. So the first part of the talk is about modelling semantic containment and
exclusion in natural language inference, and this is joint work with my advisor,
Chris Manning. Natural language inference, which is also known as recognizing
textual entailment, is the problem of determining whether a premise P justifies an
inference to a hypothesis H, and this is an informal, intuitive notion of inference.
The emphasis is on short local inference steps and variability of linguistic
expression rather than long chains of formal reasoning.
So here's an example. The premise is every firm polled saw costs grow more
than expected even after adjusting for inflation, and the hypothesis is every big
company in the poll reported cost increases. And this is a valid inference. I want
to make two observations about this example. First, if the quantifier were
some instead of every, the inference would not be valid, because it could be that
only small firms saw costs grow. And second, it would be difficult or impossible to
translate these sentences fully and accurately into formal logic, and the
importance of these facts will become clear in a moment.
Natural language inference is necessary to the ultimate goal of full natural
language understanding and it can also enable more immediate applications
such as semantic search, question answering and others.
Work on natural language inference has explored a broad spectrum of
approaches, so at one end of the spectrum are approaches based on lexical or
semantic overlap, pattern based relation extraction or approximate matching of
predicate argument structure. These approaches are robust and broadly
effective but imprecise and they're easily confounded by inferences involving
negation, quantifiers, and other phenomena, including the example on the
previous slide.
At the other end of the spectrum, we have approaches based on first order logic
and theorem proving. Such approaches have the power and precision that we're
looking for, but they tend to founder on the many difficulties involved in
accurately translating natural language to first order logic.
In this work, we explore a different point on the spectrum by developing a
computational model of natural logic which I'll define in a moment. So here's the
outline for the talk. First I'll talk about the theoretical foundations of natural logic,
then I'll introduce our computational model of natural logic, the NatLog system,
then I'll describe experiments with two different data sets, the FraCaS test suite
and the RTE data and then I'll conclude the first part of the talk.
So what is natural logic? The term was introduced by Lakoff, who defined natural
logic as a logic whose vehicle of inference is natural language. That is, it
characterizes valid patterns of reasoning in terms of surface forms, and it thus
enables us to do precise reasoning while sidestepping the myriad difficulties of
full semantic interpretation.
Natural logic has a very long history stretching back to the syllogisms of Aristotle,
and it was revived in the 1980s as the monotonicity calculus of van Benthem and
Sánchez Valencia.
Also the account of implicatives and factives developed by Nairn, et al., at PARC
arguably belongs to the natural logic tradition, though it wasn't presented as
such. In this work we present a new theory of natural logic which extends the
monotonicity calculus to account for negation and exclusion and also
incorporates elements of Nairn's model of implicatives. Over the next few slides
I'll sketch this model, but at a very high level. For more details you can either see
the Coling paper or I'm actually almost finished with a new paper which describes
the theoretical foundations of this in much greater detail and if you're interested, I
can send you a draft of that.
So first we propose an inventory of seven mutually exclusive basic entailment
relations and this slide is kind of important because these relations and the
symbols that I've chosen to represent them will reappear throughout the rest of
the talk. The relations are defined by analogy with set relations, and they include
representations of both semantic containment and semantic exclusion. So the
seven relations are first, equivalence, forward entailment and reverse entailment.
These are pretty self explanatory and these are the containment relations.
And then negation or exhaustive exclusion, alternation or non exhaustive
exclusion and cover or exhaustive non exclusion.
And finally independence, which covers all other cases. These relations are
defined for expressions of every semantic type, so not only sentences but also
common nouns, adjectives, transitive and intransitive verbs, temporal and
locative modifiers, quantifiers, and so on. And there are some illustrations for
these semantic types here.
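To make the set analogy concrete, here is a minimal Python sketch, not the NatLog code; it classifies the relation between two expressions modeled by their extensions as sets, and the example sets are invented for illustration.

def basic_relation(x, y, universe):
    """Classify the relation between sets x and y within a finite universe."""
    x, y, universe = set(x), set(y), set(universe)
    if x == y:
        return "equivalence"          # x = y
    if x < y:
        return "forward entailment"   # x is a proper subset of y
    if x > y:
        return "reverse entailment"   # y is a proper subset of x
    if not (x & y):                   # disjoint
        return "negation" if x | y == universe else "alternation"
    if x | y == universe:
        return "cover"                # overlapping and exhaustive
    return "independence"             # everything else

# Example: 'fish' vs. 'human' are disjoint but not exhaustive -> alternation.
animals = {"salmon", "trout", "alice", "bob", "rex"}
print(basic_relation({"salmon", "trout"}, {"alice", "bob"}, animals))  # alternation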
Okay. I know that was very quick. But the next question is how are entailment
relations affected by semantic composition? So in other words, how do the
entailments of a compound expression depend on the entailments of its parts?
In the most common case semantic composition preserves entailment relations, so
for example eat pork entails eat meat and big bird excludes big fish. But many
semantic functions behave differently. For example refuse projects forward
entailment as reverse entailment, so that refuse to tango is entailed by
refuse to dance. And not projects exclusion as a cover relation, so that not
French and not German stand in the cover relation to each other.
In our model we categorize semantic functions according to how they project
each of the seven basic entailment relations. This is a generalization of both the
three monotonicity classes of the monotonicity calculus and the nine implication
signatures of Nairn, et al.
For example, not and refuse are alike in projecting equivalence as equivalence
and independence as independence, and they both swap forward and reverse
entailment. But whereas not projects exclusion as cover, refuse projects
exclusion as independence. So for example, refuse to
tango and refuse to waltz are independent of each other.
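Here is an illustrative sketch of projectivity signatures in Python, covering only the cases mentioned in the talk; the relation symbols and the idea of leaving unmentioned cases unspecified are my own choices, and the full NatLog typology covers every relation for each operator.

EQ, FWD, REV, NEG, ALT, COV, IND = "=", "<", ">", "^", "|", "_", "#"

PROJECTIVITY = {
    # 'not': swaps the containment relations, keeps negation, and projects
    # alternation as cover (the not-French / not-German example)
    "not":    {EQ: EQ, FWD: REV, REV: FWD, NEG: NEG, ALT: COV, IND: IND},
    # 'refuse': also swaps containments, but projects exclusion as independence
    # (refuse to tango vs. refuse to waltz)
    "refuse": {EQ: EQ, FWD: REV, REV: FWD, ALT: IND, IND: IND},
}

def project(operator, relation):
    # relations not listed for an operator are left unspecified in this sketch
    return PROJECTIVITY[operator].get(relation, "unspecified here")

# tango forward entails dance, so refuse to tango reverse entails refuse to dance
print(project("refuse", FWD))   # '>'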
A typology of projectivity allows us to determine the entailments of a compound
expression compositionally by projecting lexical entailment relations upward
through a semantic composition tree. So consider this example. If nobody can
enter without a shirt, then it follows that nobody can enter without clothes. To
explain this compositionally, assume that we have idealized semantic
composition trees, and these are plausible renderings of semantic composition
trees here, representing the compositional structure of the semantics of these
sentences.
We begin from a lexical entailment relation between shirt and clothes. So shirt
forward entails clothes, but without is a downward monotone operator, so
without a shirt is entailed by without clothes. This is then applied to enter
and then becomes the argument of can, which is upward monotone, so it
preserves the direction of things. But then it becomes an argument to nobody;
nobody is downward monotone, so we get another inversion, and we find that
nobody can enter without a shirt forward entails nobody can enter without
clothes, which is what we expect.
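Here is a minimal sketch of that projection, tracking only upward/downward monotonicity as in the simplified discussion; the list-of-operators encoding is illustrative.

FWD, REV = "<", ">"

def project_upward(lexical_relation, monotonicity_path):
    rel = lexical_relation
    for mono in monotonicity_path:            # innermost operator first
        if mono == "down":                    # downward monotone: swap < and >
            rel = REV if rel == FWD else FWD if rel == REV else rel
        # upward monotone operators preserve the relation
    return rel

# shirt < clothes; operators above it: without (down), can (up), nobody (down)
print(project_upward(FWD, ["down", "up", "down"]))   # '<' -- two inversions cancel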
Now, we come to the third element of the theory which builds on the preceding to
prove a hypothesis from a premise. Suppose we can find a sequence of edits
which transforms the premise into the hypothesis. These can be insertions,
deletions, substitutions or more complex edit operations.
We begin by determining a lexical entailment relation for each atomic edit. For
substitutions this depends on the relation between the meanings of the
substituents. Deletions ordinarily generate the forward entailment relation, but
some lexical items have special behavior. So for example deleting not generates
the negation relation. And insertions are symmetric to deletions.
Next we project the lexical entailment relation upward through a semantic
composition tree as in the previous slide to determine the entailment relation
across each atomic edit. And finally we join these atomic entailment relations
across the sequence of edits, as in Tarskian relation algebra, to obtain our final
answer.
Okay. This has been a description of the theory at a very high level. It may
have been kind of hard to follow. I'm going to show you a worked-out example in
a moment that will make things more concrete and hopefully give you a
better sense of what actually happens.
But let's switch gears and talk about what we built. The NatLog system is a
computational model of natural logic and it consists of five stages, and in the
following slides I'll talk about each of these five stages in turn.
But first to illustrate the operation of the system, I'm going to use a running
example shown here. The example is quite contrived, but it compactly exhibits
the three phenomena that I'm interested in, containment, exclusion, and
implicatives. So the premise is Jimmy Dean refused to move without blue jeans,
and the hypothesis is James Dean didn't dance without pants, and this is a valid
inference.
Okay. So the steps of the model. In the first stage we do linguistic preprocessing,
so we begin by tokenizing and parsing the input sentences using the Stanford
parser, which is a broad-coverage statistical parser trained on the Penn Treebank.
The most important task at this stage is to identify any semantic functions with
non-default projectivity and to compute their scope in order to determine the
effective projectivity at each token.
What makes this tricky is that the phrase structure trees produced by the parser
may not correspond exactly to the semantic structure of the sentence. If we had
idealized semantic composition trees, then determining the effective projectivity
would be easy. Since we don't, we use a somewhat awkward workaround. We
define categories of items with special projectivity, and for each category we
specify its default scope in phrase structure trees using a tree pattern language
called Tregex, which is similar to Tgrep and which was partly the work of Galen
Andrew, who is in the room here.
This enables us to identify the constituents over which the projectivity properties
should be applied and thereby to compute the final effective projectivity at each
token.
In the second stage we establish an alignment between the premise and the
hypothesis, and we represent this by a sequence of atomic edits over spans of
word tokens. So I've shown an alignment for our running example here. As you
can see, we use four types of edit: deletion, insertion, substitution and match.
The edits are ordered, and this ordering defines a path from the premise to the
hypothesis through intermediate forms. The ordering, however, doesn't have to
correspond to the sentence order, although it does in this example. Thus the
alignment effectively decomposes the inference problem into a sequence of
atomic inference problems, one for each atomic edit, that is, between each pair of
intermediate forms that the sentence is transformed through.
Alignment will be the subject of the second part of the talk today. Okay. The
next stage is the heart of the system. This is lexical entailment classification.
And here we try to predict an entailment relation for each atomic edit based
solely on the features of the lexical items involved, independent of the
surrounding context, such as falling under a downward monotone operator. We
do this by exploiting available resources on lexical semantics and applying
machine learning.
So our feature representation includes semantic relatedness information based
on WordNet, NomBank and other lexical resources, string and lemma similarity
scores and information about lexical categories, including special purpose
categories for quantifiers and implicatives. And we use a decision tree classifier
which is trained on about 2500 hand annotated lexical entailment problems like
the examples shown down here.
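Here is a hedged sketch of that classification stage in Python: featurize each atomic edit and predict a relation with a decision tree. The feature set and the tiny training set below are invented placeholders, not the actual NatLog features or the roughly 2500 annotated problems.

from sklearn.tree import DecisionTreeClassifier

def featurize(edit_type, wordnet_sim, lemma_sim, is_quantifier, is_implicative):
    return [
        1.0 if edit_type == "SUB" else 0.0,
        1.0 if edit_type == "DEL" else 0.0,
        1.0 if edit_type == "INS" else 0.0,
        wordnet_sim,                        # WordNet-based relatedness in [0, 1]
        lemma_sim,                          # string/lemma similarity in [0, 1]
        1.0 if is_quantifier else 0.0,
        1.0 if is_implicative else 0.0,
    ]

# toy training examples standing in for hand-annotated lexical entailment problems
X = [featurize("SUB", 0.1, 0.9, False, False),   # near-identical strings
     featurize("DEL", 0.0, 0.0, False, True),    # deleting an implicative
     featurize("SUB", 0.9, 0.1, False, False)]   # WordNet hyponym pair
y = ["equivalence", "alternation", "forward entailment"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([featurize("SUB", 0.85, 0.2, False, False)]))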
So back to the running example. I've added two new rows which show the
features generated for each edit and the lexical entailment relation predicted from
those features. So the first edit is a substitution and string similarity is high so we
predict equivalence. In the second edit we delete an implicative, refuse, and the
model knows that deleting implicatives in this category generates the
alternation relation.
The third edit inserts an auxiliary verb. Since auxiliaries are more or less
semantically vacuous, we predict equivalence. The fourth edit inserts a negation
and this generates the negation relation. The fifth edit is the substitution, and
WordNet tells us that these words are hyponyms, so we predict reverse
entailment. The sixth edit is a match, so equivalence. The seventh edit is the
deletion of a generic modifier, blue. By default this generates forward
entailment. And finally the eighth edit is a hyponym substitution, so forward
entailment.
The fourth stage is entailment projection. So I covered this earlier. It means
projecting lexical entailment relations upward by taking account of the projectivity
properties of the surrounding context. I'm going to simplify things here a bit by
only considering upward and downward monotonicity. So I've added two new
rows to the table, and the first row shows the effective monotonicity at the locus
of each edit. So everything is upward monotone until we insert negation, after
which the next two edits occur in a downward monotone context. But then
without is again downward monotone, so we get another inversion. And the last
two edits occur in an upward monotone context. I want to remind you it's not
necessarily the case that the edits happen in the linear order of the sentence; it
happens to be the case in this example.
The last row shows how the lexical entailment relations are projected into atomic
entailment relations, that is, the entailment relations across each atomic edit. So
the only interesting case is right here, where the reverse entailment relation is
inverted to forward entailment because of the downward monotone context.
Okay. The final stage is entailment joining in which we combine atomic
entailment relations one by one to obtain our final answer. So I've added a new
row. And we start at the left with equivalence and then equivalence joined with
alternation yields alternation. Alternation joined with equivalence yields
alternation again. And then alternation joined with negation yields forward
entailment.
So that one may not be quite as obvious, but it makes sense if you think about it
for a bit. For example, fish alternates with human, and human is the negation of
non human. So fish forward entails non human. After that we're just joining
forward entailment either with itself or with equivalence. So forward entailment is
preserved the whole way through. And that's our final answer. And that's the
right answer for this problem.
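Here is a minimal sketch of that joining stage, with only the join-table entries needed for the running example filled in; the symbols are the ones from earlier, and the full relation algebra has an entry for every pair of relations.

JOIN = {
    ("=", "|"): "|",    # equivalence join alternation = alternation
    ("|", "="): "|",
    ("|", "^"): "<",    # alternation join negation = forward entailment
    ("<", "="): "<",
    ("<", "<"): "<",
}

def join_all(atomic_relations):
    result = atomic_relations[0]
    for rel in atomic_relations[1:]:
        result = JOIN[(result, rel)]
    return result

# atomic entailment relations for the eight edits of the running example
print(join_all(["=", "|", "=", "^", "<", "=", "<", "<"]))   # '<', a valid inference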
Okay. In order to evaluate our system, we use the FraCaS test suite, which came out
of a mid '90s project on computational semantics. It contains 346 problems
which look like they could have come out of a textbook on formal semantics, and
FraCaS involves three way classification, so it distinguishes contradiction from
mere non entailment.
In this work, we consider only problems which contain a single premise, and I've
shown three examples here. So the first example inserts a restrictive modifier, a
lot of in a downward monotone context. The second example involves
predicates which stand in the alternation relation, large and small. And the third
involves a non-factive verb with a clausal complement. So here are the results
for a baseline classifier for our system last year and for our current system. The
columns indicate the number of problems, precision and recall for the yes class,
and accuracy.
Overall we've made good progress since last year achieving a 27 percent
reduction in error and reaching almost 90 percent in precision. What's more
interesting is the breakdown by section. So the FraCaS problems are divided
into nine sections, each focused on a different category of semantic phenomena.
In the section on quantifiers, which is both the largest and the most amenable to
natural logic, we answer all problems but one correctly. In fact, performance is
good on all the sections where we expect NatLog to have relevant expertise.
Our average accuracy on the five sections most amenable to natural logic is 87
percent.
Not surprisingly, though, we make little headway with things like [inaudible] and
ellipses, but even here precision is high. So the system rarely predicts
entailment when none exists. Of course, this doesn't constitute a proper
evaluation on unseen test data. But on the other hand, the system was never
trained on the FraCaS data, it was only trained on lexical entailment problems
and it's had no opportunity to learn [inaudible] implicit in the data. And our main
goal in testing on FraCaS is really to evaluate the representational and inferential
adequacy of our model of natural logic. And from this perspective, the results are
quite encouraging.
Since the FraCaS test is not well known, we also wanted to do an evaluation
using the familiar RTE data, which many of you have probably seen. Relative to
FraCaS the RTE problems are more natural seeming, with much longer premises
which average 35 words instead of 11. But the RTE problems are not an ideal
match to the strengths of the NatLog system.
First, RTE includes many kinds of inference which are not addressed by natural
logic such as paraphrase, temporal reasoning and relation extraction. Second, in
most RTE problems the edit distance between the premise and hypothesis is
quite large. More atomic edits mean a greater chance that prediction errors
made by the atomic entailment model will propagate via entailment joining to the
system's final output. So here are a couple of example problems.
The first example is not a good match to the strengths of the NatLog system. It's
essentially a relation extraction problem, and the NatLog system is thrown off by the
insertion of the words acts as in the hypothesis.
The second example is a much better fit for NatLog. It hinges on recognizing
that deleting a negation yields a contradiction and NatLog gets this problem right.
So here are the results on the RTE3 development and test sets for the Stanford
RTE system, which is a broad-coverage RTE system, and for NatLog. For each
system I show the percentage of problems answered yes along with precision
and recall for the yes class and accuracy. Not surprisingly, the overall accuracy
of the NatLog system is unimpressive. But NatLog achieves relatively high
precision, over 70 percent, on its yes predictions. And this suggests a strategy of
hybridizing the high-precision, low-recall NatLog system with the broad-coverage
Stanford system.
Bos and Markert pursued a similar strategy in their 2006 paper based on first order
logic and theorem proving; however, that system was able to make a positive
prediction in only four percent of cases. NatLog makes positive predictions far
more often, in about 25 percent of cases. And the results are quite satisfying. As
we hoped, hybridization yields substantial gains, so on the RTE3 test set the
hybrid system attained an accuracy four percent better than the Stanford system
alone, corresponding to an extra 32 questions out of 800 answered correctly.
So in summary, I want to emphasize that we are not proposing natural logic as --
sorry. Go ahead.
>>: [Inaudible].
>> Bill MacCartney: Yes?
>>: [Inaudible] so you said that the -- so the NatLog system was -- I thought the
strength was precision, all right, so the hybrid system [inaudible] Stanford RTE
it's the [inaudible].
>> Bill MacCartney: Yeah, that's true. I don't have a good interpretation for that.
I hadn't noticed that before. I guess I was focusing on accuracy and I don't have
a good interpretation for that.
>>: I understand you [inaudible] what is the overlap between the ones that
answered yes, where NatLog answered yes and answered correctly, versus
[inaudible] so how much overlap [inaudible].
>> Bill MacCartney: Yes. I am afraid I don't have answers at my fingertips.
That's an excellent question. Basically you'd like to see the sort of three
dimensional confusion matrix, right, you want to see correct answers -- the gold
standard versus Stanford versus NatLog for yes and no.
>>: [Inaudible].
[brief talking over].
>> Bill MacCartney: Right. Right. Yeah. I wish I had those answers at my
fingertips. Those are good questions. Yes. So one thing is clear, NatLog is not
a universal solution for natural language inference. There are a lot of kinds of
inference that are simply not addressed by natural logic and we see a lot of those
on RTE, so paraphrase, verb frame alternation, relation extraction, common
sense reasoning, also the model of inference that I described, the inference
method that I described has a weaker proof theory than first order logic. So there
are many, many important kinds of inference that can be explained with first
order logic that NatLog simply can't explain, including De Morgan's laws for
quantifiers just to give one example.
But natural logic enables precise reasoning about semantic containment,
exclusion, and implicatives while sidestepping the difficulties of full semantic
interpretation and it's therefore able to explain a broad range of such inferences
as demonstrated on the FraCaS test suite. A full solution to natural language
inference is probably ultimately going to require combining disparate reasoners
and natural logic I think is likely to be an important part of such a solution.
So that's the end of the first part of the talk. Now we'll switch gears and move on
to the second part of the talk which concerns a phrase based model of alignment
for natural language inference, and this is joint work with Michel Galley and
Chris Manning. Well, I've already introduced the NLI task, so I won't do that
again. But I do want to make an observation about the example shown here. In
order to recognize that Kennedy was killed can be inferred from JFK was
assassinated, one needs first to recognize the correspondence between
Kennedy and JFK and between killed and assassinated. Consequently most
current approaches to NLI depend implicitly or explicitly on a facility for
alignment, that is, establishing links between corresponding predicates and
entities in the premise and hypothesis.
So different systems do this in different ways. Systems that are based on
measuring lexical overlap implicitly align each word in the hypothesis to the word
in the premise to which it's most similar. In approaches which formulate natural
language inference as analogous to proof search, the alignment is implicit in the
steps of the proof. But increasingly the most successful NLI systems have made
the alignment problem explicit
and then use the alignment to drive entailment classification.
So this paper, this is the paper that I'm going to be presenting at EMNLP, and
there are three major contributions we try to make in this paper. The first is to
undertake the first systematic study of alignment for NLI. Existing NLI
aligners, including one that we previously developed at Stanford, have tended
to use idiosyncratic methods and to be poorly documented, and also to use
proprietary data. And so this work tries to remedy all three of those.
We're going to propose a new model of alignment for NLI called the MANLI
system, which uses a phrase-based alignment representation. It exploits
outside resources for information about semantic relatedness and capitalizes on
a new source of supervised training data, which came from Microsoft
Research, which has been a great help to us.
And the third thing that we're going to do is examine the relation between the
problem of alignment in natural language inference and the very similar problem
of alignment in machine translation. And in particular, the question can we just
use an off the shelf MT aligner for NLI alignment? So a little more on that last
topic. The alignment problem is very familiar in machine translation, and the MT
community has developed not only an extensive literature but also standard,
proven tools for alignment. So can an off the shelf MT aligner be usefully applied
to the NLI alignment problem?
Well, there's reason to be doubtful. The alignment problem for NLI differs from
the alignment problem for MT in several key respects. First it's monolingual,
which opens the door to utilizing abundant monolingual sources of information on
semantic relatedness. Second, it's intrinsically asymmetric. The premise is often
much longer than the hypothesis and it commonly contains clauses or phrases
which have no counterpart in the hypothesis.
In fact, even more strongly one cannot even assume approximate semantic
equivalence in NLI. This is usually a given in MT. Because NLI problems
include both valid and invalid inferences, the semantic content of the premise
and the hypothesis can diverge substantially. So NLI aligners must
accommodate frequent unaligned content. And finally little training data is
available. MT aligners typically use unsupervised training on massive amounts
of bitext, but no such data is available, certainly not in such quantities, for NLI.
NLI aligners must therefore depend on smaller amounts of supervised data
supplemented by external lexical resources. Conversely MT aligners can use
dictionaries, but they typically aren't designed to harness other sources of
information about semantic relatedness, particularly not graded, that is, scored,
information about degrees of semantic relatedness.
In the past, research on alignment for NLI has been hampered by a paucity of
high quality publicly available training data. Happily that picture has begun to
change, thanks to you guys right here at MSR. Last year MSR released a data
set containing gold standard alignments for the RTE2 development and test sets,
containing 800 problems each. The alignment representation is token based but
many to many, and thus allows implicit alignment of phrases. And I've shown an
example here. The premise goes down the rows: in most Pacific countries there
are very few women in parliament. And the hypothesis: women are poorly
represented in parliament.
Two things I want to point out about this example. First, the phrase in the
premise, in most Pacific countries there, is completely unaligned, and you
ordinarily wouldn't see this in an MT alignment. You wouldn't see such a big
chunk of the sentence be unaligned. And second, notice the implicit phrase
alignment here: very few is aligned with poorly represented. So the
representation is formally token based, but you get this ability to implicitly
represent phrases.
Each problem was independently annotated by three people, and interannotator
agreement was very high, so all three annotators agreed on 70
percent of proposed links and two out of the three agreed on more than 99
percent of proposed links, attesting to the high quality of the data.
For this work, we merged the three annotations into a single gold standard using
majority rule. Finally, I didn't put this on the slide, but following a convention
common in MT, the annotation included both sure links and possible links. In this
work we ignored the possible links and just used the sure links.
Okay. Now I'd like to tell you about a new model of alignment for natural
language inference, the MANLI system. I know it's a funny name. If you have to
say it out loud you might feel a little silly. But the system itself is very
straightforward. It has four components. It uses a phrase-based representation of
alignment and a linear feature-based scoring function. It performs decoding
using a simulated annealing strategy, and it uses a version of the averaged
perceptron for weight training.
Let me tell you about each of these components in turn. First we use a
representation of alignment which is phrase based, so we represent an alignment
by a sequence of phrase edits of four different types. An eq edit connects a
phrase in the premise with an equal (by word lemmas) phrase in the hypothesis,
and a sub edit connects a premise phrase with an unequal phrase in the
hypothesis. By the way, by phrase I don't mean -- I'm using phrase in the way it's
used in MT, just to mean a sequence of tokens, not necessarily a syntactic phrase.
A del edit covers an unaligned phrase in the premise, and an ins edit, insertion edit,
covers an unaligned phrase in the hypothesis. So for the example
that I already showed you, I've shown the translation into phrase edits, and the
only interesting thing here is the substitution of very few with poorly represented.
So this representation is intrinsically phrase based. Yes?
>>: [Inaudible] in what order is the sequence enumerated? Is it enumerated in
sort of hypothesis --
>> Bill MacCartney: Yeah. Actually I might as well have said a set of phrase
edits, because the ordering doesn't matter in this case. So in the first part of the
talk, ordering did make a difference because it affected the order of joining
atomic entailment relations, but for this model it doesn't make a difference.
The representation is constrained to be one to one at the phrase level, but it can
be many to many at the token level, and in fact, this is the chief motivation for the
phrase based representation. We can align very few with poorly represented
without being forced to make an arbitrary choice about which word goes with
which word.
Also, our scoring function can make use of lexical resources which have
information about semantic relatedness of multi word phrases not just individual
words. Finally for the purpose of model training but not for the evaluations that I'll
show you later, we converted the token based MSR data into this phrase based
representation. Yes?
>>: So for the token based alignments, how did you handle any disjoint phrases?
>> Bill MacCartney: Yeah, there were a few of those, and essentially we had --
we had to throw some away. So if I remember the statistics correctly, something
like -- so I'm talking about this conversion now, this is the conversion just for the
like -- so I'm talking about this conversion now, this is the conversion just for the
training data. This doesn't apply to the test data for the evaluation. But for the
training data, something like three quarters of the alignments were already one to
one. Something like 92 percent of the MSR alignments were either already one
to one or they were -- or the conversion to this representation was trivial. So this
will be an example of that.
They may have contained blocks, but there were no non-contiguous alignments.
There was a remaining eight percent of MSR alignments, I think that's the right
figure, something like that, which did contain non-contiguous alignments. And
we basically had to throw one of -- one of the pair away. And we had a simple
heuristic for figuring out which one to throw away, which was based on string
matching. That heuristic worked in something like six percent of those eight
percent, so three quarters of that eight percent. And we were left with a remainder
where we had to make an arbitrary choice. So we did have to make some arbitrary
choices, but since it was only for the training data, we didn't worry about it too
much.
Okay. What about scoring alignments? Well, we used a feature-based scoring
function. This is a very, very simple linear function where the score for an
alignment is just the sum of the scores of the edits it contains. This includes not
just the link edits, that is, eq and sub edits, but it also includes the
insertion and deletion edits, which correspond to unaligned material in the premise
and hypothesis.
So it's the sum of the scores of the edits it contains, and the score for an edit is
just a dot product of a weight vector and a feature vector. So we use several
types of features. First we have a group of features which represent the type of
the edit, then we have features which encode the sizes of the phrases involved in
the edit, and whether those phrases are non-constituents in the syntactic parse.
For sub edits, a very important feature represents the lexical similarity of the
substituents as a real value between zero and one, and we compute this as a
max over a number of component functions, some based on external lexical
resources. So this includes manually constructed lexical resources such as
WordNet and also automatically constructed resources such as a measure of
distributional similarity in a large corpus, Dekang Lin style.
An MT aligner is basically inducing something like distributional similarity from
massive amounts of bitext, and we're getting it from an external lexical resource
instead. We also use various measures of string and lemma similarity. Finally,
high lexical similarity doesn't necessarily mean a good match, especially if
sentences contain multiple occurrences of the same word, which happens very
commonly with function words and other little words. So to remedy this, we
introduced contextual features. There's a distortion feature, which measures
the difference between the relative positions of words within their respective
sentences, and there are also matching-neighbors features which indicate whether
the tokens before and after the aligned pair are equal or similar. And this helps
us to get those little words aligned correctly.
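Here is a minimal sketch of that linear scoring function; it reuses the PhraseEdit sketch from above, and the handful of features shown is a schematic subset of what was just described, not the actual MANLI feature set.

import numpy as np

def edit_features(edit, lexsim):
    """lexsim is a pluggable function giving semantic relatedness in [0, 1]."""
    p = edit.premise_span or (0, 0)
    h = edit.hypothesis_span or (0, 0)
    return np.array([
        1.0 if edit.kind == "EQ" else 0.0,
        1.0 if edit.kind == "SUB" else 0.0,
        1.0 if edit.kind in ("DEL", "INS") else 0.0,
        (p[1] - p[0]) + (h[1] - h[0]),                  # total phrase size
        lexsim(edit) if edit.kind == "SUB" else 0.0,    # max over similarity resources
    ])

def score_alignment(alignment, weights, lexsim):
    # score of an alignment = sum over edits of weights . features(edit)
    return sum(float(np.dot(weights, edit_features(e, lexsim))) for e in alignment)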
Decoding is made more complex by our use of a phrase-based representation.
With a token-based representation, decoding can be trivial because each token
can be aligned independently of its neighbors. With a phrase-based
representation, every aligned phrase pair has to be consistent with its neighbors
with respect to the segmentation into phrases. So the problem doesn't factor as
easily. To address this problem we use a stochastic local search based on
simulated annealing. And here's how it works.
We start with an empty alignment. That little grid represents an empty alignment
and then we generate a set of successors. So to do this, we generate every
possible edit up to a certain maximum size, an arbitrary maximum size and then
we generate a successor by adding that edit to our current alignment and
removing any other edits that conflict with that edit, that is, that involve some of
the same tokens as that edit.
Then we score the successors using our scoring function and we convert the
scores into a probability distribution. Next we smooth or sharpen that probability
distribution by raising it to a power which depends on the temperature parameter.
The temperature starts off high so that we're smoothing the distribution. And this
helps to ensure that we explore the space of possibilities. In later iterations the
temperature falls and the distributions get sharper, and this helps us to converge
on a particular answer.
Then we sample a new alignment which may or may not be the most likely one in
the distribution. Then we lower the temperature and repeat the process. And we
do this 100 times to find a good alignment. This might seem like it would be
slow, but we were clever about memoization and things like that, and
it's actually pretty fast. The average RTE problem takes about two seconds to
align using this.
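Here is a hedged sketch of that decoding loop; the cooling schedule, starting temperature, and conflict test are my own illustrative choices rather than the actual MANLI settings.

import math, random

def spans_overlap(a, b):
    return bool(a and b and a[0] < b[1] and b[0] < a[1])

def conflicts(e1, e2):
    # two edits conflict if they touch any of the same premise or hypothesis tokens
    return (spans_overlap(e1.premise_span, e2.premise_span) or
            spans_overlap(e1.hypothesis_span, e2.hypothesis_span))

def decode(candidate_edits, score_fn, iterations=100, start_temp=10.0, cooling=0.9):
    current = frozenset()                      # start with an empty alignment
    temp = start_temp
    for _ in range(iterations):
        # successors: add one candidate edit, dropping any conflicting edits
        successors = [frozenset({e for e in current if not conflicts(e, edit)} | {edit})
                      for edit in candidate_edits]
        scores = [score_fn(s) for s in successors]
        weights = [math.exp(s / temp) for s in scores]   # sharper as temp falls
        current = random.choices(successors, weights=weights, k=1)[0]
        temp *= cooling                                  # lower the temperature
    return current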
There's no guarantee of optimality, no guarantee that you'll get to the best
alignment, but we did find in experiments that the guess alignment that comes out
of this procedure scores at least as high as the gold alignment for greater than 99
percent of alignments. So that means the search is good. The scoring function
may or may not be good, but the search is good at any rate.
To tune the parameters of the model we use an adaptation of the averaged
perceptron algorithm, which has proven successful on a range of NLP tasks. So
we perform 50 training epochs, and in each epoch we iterate through the training
data, and for each problem we first find the current best guess at an alignment
using our decoder and the current weight vector. And then we update the weight
vector based on the difference between the features of the gold alignment and
our current guess alignment. So we generate a feature vector for each of those,
look at the difference between them, and the update to the weight vector is a
learning rate times that, and the learning rate falls over time.
At the end of each epoch, we normalize and store the weight vector, and the final
result is the average of the stored weight vectors, except that we omit vectors
from a fixed proportion of epochs near the beginning of the run, which tend to be
of poor quality anyway.
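Here is a minimal sketch of that training loop; the exact learning-rate schedule and burn-in fraction are assumptions standing in for the ones actually used.

import numpy as np

def train(problems, alignment_features, decode_fn, dims, epochs=50, burn_in=10):
    w = np.zeros(dims)
    stored = []
    for epoch in range(epochs):
        rate = 1.0 / (epoch + 1)                              # learning rate falls over time
        for premise, hypothesis, gold in problems:
            guess = decode_fn(premise, hypothesis, w)         # current best guess alignment
            w = w + rate * (alignment_features(gold) - alignment_features(guess))
        norm = np.linalg.norm(w)
        stored.append(w / norm if norm > 0 else w.copy())     # normalize and store
    return np.mean(stored[burn_in:], axis=0)                  # average, skipping early epochs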
Training runs on the RTE2 development set required about 20 hours. Okay.
Let's talk about evaluation. Over the next several slides I'll present evaluations of
several alignment systems on the MSR RTE alignment data. Specifically I'll look
at a baseline aligner, which I'll describe in a moment, two different MT aligners,
GIZA++ and Cross-EM, and then two aligners specifically designed for NLI: the
aligner from the Stanford RTE system and the MANLI aligner that I just
described.
To evaluate each aligner's ability to recover the gold standard alignments, we're
going to look at per-link precision, recall, and F1. In the MT community it's more
conventional to report alignment error rate, or AER, but since we're using only the
sure links from the annotation, AER is just one minus F1. Also, since we're using
the original token-based version of the MSR data for evaluation, in evaluating
MANLI we'll consider two tokens to be aligned just in case they're contained
within phrases which are aligned by MANLI.
Finally, we're also going to report the exact match rate, that is, what proportion of
the guess alignments matched the gold
exactly as a whole. Okay. First a baseline system. As a baseline we're going to
use a very simple alignment algorithm which was inspired by the lexical
entailment model of Glickman, et al., and this just involves matching each token
in the hypothesis with the token in the premise to which it's most similar
according to a lexical similarity function.
We use a very simple lexical similarity function which is based on the string edit
distance between two word lemmas. So this dist right here is just Levenshtein
string edit distance.
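Here is a hedged sketch of that baseline in Python: align each hypothesis token to the premise token whose lemma is closest under Levenshtein edit distance. The exact similarity formula below is an assumption standing in for the one on the slide.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

def baseline_align(premise_tokens, hypothesis_tokens):
    # every hypothesis token gets linked to its most similar premise token
    return {h: max(premise_tokens, key=lambda p: similarity(h, p))
            for h in hypothesis_tokens}

print(baseline_align("very few women in parliament".split(),
                     "women are poorly represented in parliament".split()))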
And I show the initial results for this. So as I described earlier, I show
precision, recall, F1 and exact match rate for the RTE2 development and test
sets. And despite how incredibly simple this model is, the recall is surprisingly
good, so it's above 80 percent. But precision is really mediocre, and F1 is not too
great, either. And this is chiefly because by design this model aligns every
hypothesis token with some premise token, and of course in the gold data many
of the hypothesis tokens are left unaligned. This model could surely be improved
by allowing it to leave some of the hypothesis tokens unaligned, but we didn't
pursue this.
Okay. Well, given the importance of alignment for natural language inference
and the availability of proven standard tools for MT alignment, the obvious
question is why can't we just use off the shelf MT aligners for NLI? I argued
earlier that this is unlikely to succeed, but to my knowledge we're the first to
investigate the matter empirically, although Bill Dolan and a couple of other people
from here at MSR had a paper four years ago where they looked at the similar
problem of using MT aligners to identify paraphrases. So similar but a bit
different.
We did experiments using the best known MT aligner, GIZA++, running it via the
Moses toolkit with default parameters. We generated asymmetric alignments in
both directions and then performed symmetrization using the well-known
intersection heuristic, and the initial results were very poor. Subjectively, when
you look at the output it looks like it's aligning most words at random, not even
aligning equal words. So if the same word appears in the hypothesis and the
premise, it usually doesn't get aligned.
This is not too surprising. GIZA++ is designed for cross-lingual use, so it doesn't
ordinarily consider word equality between the source and target sentences. So
to remedy this, we supplied GIZA++ with a lexicon, using a trick common in
MT: we supplemented the training data with additional synthetic training data
consisting of matched pairs of equal words. So this gives GIZA++ a better
chance of learning that man should align with man, for example.
This resulted in a big boost in recall and a smaller gain in precision. I'll show
you results in the next slide. But as an additional comparison, we also ran a
similar set of experiments with the Cross-EM aligner from Berkeley. So here are
the results. This is based on using the lexicon and using the intersection
heuristic. Both MT aligners do about the same on F1, so in the ballpark of 72 to
75 percent. But GIZA++ attains better precision and Cross-EM attains better
recall. Both do significantly better than the bag-of-words baseline, especially on
precision, although the bag of words actually does slightly better on recall.
We also tried using alternate symmetrization heuristics and asymmetric
alignments, but everything we tried did much worse than the intersection heuristic
on F1. Qualitatively, both MT aligners do a good job of aligning equal words
when you use them with a lexicon. That's what it's there for. But they continue
to align most other word pairs apparently at random. And this is not too
surprising. The basic problem is that the quantity of data is just far too small for
unsupervised learning of word correspondences.
So a successful NLI aligner will need to exploit supervised training data and will
also need access to additional sources of knowledge about lexical relatedness.
A better comparison is thus to an alignment system expressly designed for NLI.
So for this purpose, we use the alignment component of the Stanford RTE
system. The Stanford system represents alignments as a map from hypothesis
tokens to premise tokens. So phrase alignments are not directly representable,
although the effect can be approximated by a preprocessing step which
collapses multi-token named entities and certain collocations into single tokens.
The scoring function exploits a variety of sources of information about lexical
relatedness and also includes syntax-based features intended to promote the
alignment of similar predicate-argument structures. And decoding and learning
are handled in a similar fashion to MANLI.
So here are the results for the Stanford aligner. It outperforms the MT aligners
on F1, but recall is substantially lower than precision, and that's even after
applying a correction which generously ignores all recall errors involving
punctuation, which is systematically ignored by the Stanford system. Error
analysis reveals that the Stanford aligner does a poor job of aligning function
words. About 13 percent of the aligned pairs in the MSR data are matching
prepositions or articles, and the Stanford aligner misses about two-thirds of such
pairs. By contrast, MANLI misses only about 10 percent of such pairs. Function
words matter less in inference than nouns and verbs, but they're not irrelevant,
and because sentences often contain multiple instances of a particular function
word, matching them properly is by no means trivial.
Finally, the Stanford aligner is handicapped by its token-based alignment
representation and often fails partly or completely to align multi-word phrases
such as peace activist with protestors or hackers with non-authorized personnel.
Now here are the results for the MANLI aligner. MANLI was found to outperform
all other aligners evaluated on every measure, achieving an F1 10.5
percent higher than GIZA++ and 6.2 percent higher than Stanford, even after
applying the punctuation correction that I mentioned.
It also achieves a good balance of precision and recall, and it matched the gold
standard exactly more than 20 percent of the time. There are three factors which
seem to have contributed the most to MANLI's success. First, MANLI is able to
outperform the MT aligners principally because it's able to leverage lexical
resources to identify the similarity between pairs of words such as jail and prison
or prevent and stop or injured and wounded.
Second, MANLI's contextual features enable it to do better than the Stanford
aligner at matching function words. Third, MANLI gains a marginal advantage
because its phrase-based representation enables it to properly align phrase pairs
such as death penalty and capital punishment or abdicate and give up.
However, the phrase based representation contributed far less than we had
hoped. We did an experiment where we set MANLI's maximum phrase size to
one, thus effectively restricting it to a token-based representation. And we found
that we lost just 0.2 percent in F1. We don't interpret this to mean that phrases
are not useful. Instead we think it shows that we failed to fully exploit the
advantages of the phrase based representation, chiefly because we lack lexical
resources providing good information on the similarity of multi word phrases.
Error analysis suggests that there's ample room for improvement. A large
proportion of recall errors, maybe 40 percent, occur because the lexical
similarity function assigns too low a value to pairs of words or phrases which are
clearly similar, such as organization and agencies or bone fragility and
osteoporosis. We just don't have a lexical resource that tells us those are the
same thing, or related things.
Precision errors may be harder to reduce. These errors are dominated by
cases where we mistakenly align two equal function words, two forms of the verb
to be, two equal punctuation marks or two words or phrases of other types
having equal lemmas. Such errors often occur because the aligner is forced to
choose between nearly equivalent alternatives so these errors may be hard to
eliminate.
Okay. So those evaluations were sort of the main event. As a coda to
that, I want to look at one other thing briefly. Over the last several slides we've
evaluated the ability of the aligners to recover gold standard alignments. But
alignment is just one component of the NLI problem. So we might also look at
the impact of different aligners on the ability to recognize valid inferences.
And the question is, does a high-scoring alignment indicate a valid inference?
Well, there's more to inferential validity than just close lexical or structural
correspondence. So things like negations, modals, non-factive and implicative
verbs and other linguistic constructs can affect validity in ways hard to capture in
alignment.
Still, alignment score can be a strong predictor of inferential validity, and many
NLI systems rely entirely on some measure of
alignment quality to predict validity. If an aligner generates real-valued
alignment scores, we can use the RTE
data to test its ability to predict inferential validity using the following simple
method. So for a given RTE problem we predict yes if its alignment score
exceeds a certain threshold and no otherwise. We tune the threshold to
maximize the accuracy on the RTE2 development set, and then we measure the
performance on the RTE2 test set using the same threshold.
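Here is a minimal sketch of that simple method; the toy scores and labels are invented for illustration.

def tune_threshold(dev_scores, dev_labels):
    def accuracy(t):
        return sum((s > t) == l for s, l in zip(dev_scores, dev_labels)) / len(dev_labels)
    # pick the development-set score value that maximizes accuracy as the threshold
    return max(sorted(set(dev_scores)), key=accuracy)

def predict(test_scores, threshold):
    return ["yes" if s > threshold else "no" for s in test_scores]

# toy example: higher alignment scores tend to indicate valid inferences
t = tune_threshold([0.2, 0.4, 0.7, 0.9], [False, False, True, True])
print(predict([0.3, 0.8], t))   # ['no', 'yes']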
So here are the results for that experiment, for several NLI aligners, the top
three rows, along with some results for complete RTE systems, including the
LCC system, which was the top performer in the RTE2 competition, and an
average of all systems participating in RTE2. So I show accuracy and average
precision in predicting answers for the RTE2 development
and test sets. Average precision was supposedly the preferred metric for RTE2,
although in practice everyone seems to pay attention to accuracy not average
precision.
None of the aligners rivals the performance of the LCC system. But all achieve
respectable results, and in particular the Stanford and MANLI aligners outperform
the average RTE2 entry. So even if alignment quality doesn't determine
inferential validity, many NLI systems could be improved by harnessing a well
designed NLI aligner.
Given the extensive literature on phrase-based MT, it may be helpful to situate
our phrase-based aligner in relation to past work. Phrase-based MT
systems usually apply phrase extraction heuristics to word-aligned training data,
which stands at odds with the key assumption in phrase-based systems that
many translations are non-compositional.
More recently several authors have presented more unified phrase-based
systems that jointly align and weight phrases. But we would argue that this work
is of limited applicability to our problem. In MANLI we use phrases only when
word alignments are not appropriate and longer phrases are not needed to
achieve good alignment quality. But MT phrase alignment benefits from using
longer phrases whenever possible, since this helps to realize more dependencies
among translated words, including things like word order, agreement, and
subcategorization.
Also, MT phrase alignment systems don't model word insertions or deletions as
MANLI does. For example, in the example that I showed before, MANLI can just
skip in most Pacific countries there, whereas an MT phrase-based model would
presumably align in most Pacific countries there are to women are.
Okay. So to wrap up, I think the main ideas to take away are, first, that MT
aligners are probably not directly applicable to the NLI problem. MT aligners rely
primarily on unsupervised learning from massive amounts of bitext, which are just
not available for the NLI setting, and they rely on an assumption of semantic
equivalence between premise and hypothesis which is usually not the case in the
NLI setting.
I introduced the MANLI system, which achieves success first and foremost by
exploiting both manually and automatically constructed lexical resources, and
which also accommodates frequent unaligned phrases, which arise very often in
natural language inference. And the third take away is that the phrase-based
representation shows potential, I think, but we've sort of failed to prove it. And I
think the reason we have failed to prove it is that we need access to better
phrase-based lexical resources. That's it. Thank you very much.
[applause].
>>: I put together a phrase and I wonder which of these aligners will align it
the best? [Inaudible] by reading a book about war is one phrase. [Inaudible] John
killed many enemy soldiers. So the main verb --
>> Bill MacCartney: Can you read the second one again?
>>: Over [inaudible] over time John killed many enemy soldiers. So the
correlation is [inaudible].
>> Bill MacCartney: Yes, that's a very difficult problem. I mean, so is the
question what should the gold standard be, or is the question what will my
system --
>>: [Inaudible] the phrase [inaudible] because obviously it's aligned perfectly,
because John killed, John killed.
>> Bill MacCartney: Yes. Yes. So the words are very similar, the structure is
very different.
>>: Absolutely. So which one can catch this?
>> Bill MacCartney: I think the ones that would have the best chance of catching
-- I actually think that the Stanford aligner might have an advantage on a problem
like that, and the reason for that is the Stanford aligner is the only one of the
ones I've presented, it's the only one that takes syntax into account. It explicitly
incorporates syntactic features that essentially look at the paths through the
syntax tree, actually through a typed dependency tree, between candidate aligned
pairs, and includes that as a feature in the machine learning model for what should
be aligned with what. So the MANLI system doesn't include any syntactic
features explicitly, it's something that I think should be in there, and I wish it were
in there, but didn't have time to put it in there.
The MT aligners I would not expect to get that particular kind of problem right,
and certainly the baseline aligner would not, the baseline aligner is ignoring
structure and just matching based on words.
>>: [Inaudible].
>> Bill MacCartney: That's right.
>>: [Inaudible] and that's when you use the syntax.
>> Bill MacCartney: That's right. Yeah. Yeah. There's -- it's [inaudible].
>>: [Inaudible].
>> Bill MacCartney: Right.
>>: That briefcase.
>> Bill MacCartney: Right. Yeah. There are a lot of RTE systems that
essentially use alignment as a proxy for inferential validity and say if we have a
good alignment then it's a valid inference. That's -- that strategy is actually
surprisingly effective and it's more effective on some of the RTE data sets than
on others. So I've done experiments where I've used a model which is pretty
much like the baseline bag of words model that I described there, and tested how
well it does on the different RTE data sets at predicting the RTE answer,
inferential validity and there's tremendous variation from RTE data set to the
next. So on RTE 1, I think that system got something like 59 percent accuracy,
on RTE2, I think it got 62 or 63 percent accuracy, and then on RTE3, I think it got
67 percent accuracy. Which is very competitive with many of the systems that
people worked really hard on for months and months and months.
But more and more, I think people are recognizing that an RTE system needs to
include more than just a measure of alignment quality. Yes?
>>: So it is interesting that the phrasal matches haven't provided much benefit
yet, and I think your analysis seems spot on. It's just tough to learn those
resources.
>> Bill MacCartney: Yes.
>>: You're probably aware of some work that Chris Callison-Burch has
done in the past.
>> Bill MacCartney: Yes.
>>: [Inaudible] I wonder if you've thought of either [inaudible] phrase tables or
using live translation tables, for instance, to say here are two phrases that
translate into the same word. So if I looked up capital punishment and death
penalty and they both translate into the same foreign phrase using one of these
online translation systems.
I mean, have you thought about exploiting some of those.
>> Bill MacCartney: Yeah, we've thought about it and we're currently working on
it. I think it's a really promising idea. I thought -- I thought that was a great idea,
a great contribution of his and one of my colleagues at Stanford is currently
working on sort of reimplementing that and extracting what from our perspective
will basically be a new lexical resource or phrase based lexical resource which is
derived from pivoting through MT.
And we didn't have time to integrate it into this work, so we haven't yet reaped
any benefit from it. But we have high hopes for the future.
>>: And another easy thing to do, I just tried it again with German, capital
punishment and death penalty both go to [inaudible] in German. So, you know, if
you just want to take all the phrases in your data and shove them at a translation
system for 12 different languages and see which ones land together, I mean, you
have 12 independent feature functions.
>> Bill MacCartney: Yeah, I think it's really promising. We actually looked at
two different -- well, it's not yet finished, but we are currently working on two
different ways of trying to get phrasal equivalences. Pivoting through MT
translation tables is one of them, and the other thing we have explored is
DIRT-style paraphrases, trying to get phrasal equivalences from that. That one is
a little bit further along in its implementation. We had an undergrad student
working on it over the summer. I think he did a good job, but we didn't get very
much value from the results. So that one was a bit disappointing. But I have
higher hopes for the MT stuff. Yes?
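To make the pivoting idea concrete, here is a rough sketch: two English phrases count as likely paraphrases when they translate to the same phrase in several pivot languages, with one feature per language. The translate function below is a hypothetical stand-in for whatever MT system or phrase table is available, not a real API.

    def translate(phrase, target_lang):
        # Hypothetical hook: plug in an MT system or a phrase table lookup here.
        raise NotImplementedError

    def pivot_features(phrase_a, phrase_b, languages=("de", "fr", "es")):
        # One binary feature per pivot language: do the two phrases collide
        # on the same translation?
        return {
            "same_translation_" + lang:
                translate(phrase_a, lang) == translate(phrase_b, lang)
            for lang in languages
        }

    # e.g. pivot_features("capital punishment", "death penalty") would ideally
    # fire for German, where both phrases map to the same word.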
>>: Going back to the first half of the talk.
>> Bill MacCartney: Yes.
>>: So in order to do a system like NatLog, what are the lexical resources that
are required [inaudible] to do this, and so, like, what is the size of your
lexicon [inaudible]? Would you get more if you had more lexical [inaudible]?
>> Bill MacCartney: Let's see. That's -- there's kind of a nest of interrelated
questions there. So we used a bunch of different lexical resources. We kind of
have a collection of standard lexical resources that we use in lots of different
contexts. We use it in the main Stanford RTE system, which is different from
this, and we also use it here. So certainly we make heavy use of various lexical
resources that are based on WordNet. Some of them are very specific, like the
ones that tell us whether two words are antonyms. Others are more general
measures of semantic relatedness. So we used the Jiang-Conrath measure, which is
based on path lengths through the hyponym hierarchy in WordNet.
So we have a bunch of lexical resources that are based on well known, publicly
available, manually constructed lexical resources like that, like NomBank. Then
we also have, as I mentioned earlier, a lexical resource based on Dekang
Lin-style distributional similarity. We get a lot of mileage out of that.
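For illustration only, the two kinds of WordNet-based features mentioned here could be approximated with NLTK (the actual Stanford resources are different); this sketch assumes the NLTK wordnet and wordnet_ic data have been downloaded.

    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    brown_ic = wordnet_ic.ic("ic-brown.dat")

    def are_antonyms(w1, w2):
        # Very specific feature: are the two words WordNet antonyms?
        for syn in wn.synsets(w1):
            for lemma in syn.lemmas():
                if w2 in [a.name() for a in lemma.antonyms()]:
                    return True
        return False

    def jcn_relatedness(w1, w2):
        # More general feature: best Jiang-Conrath similarity over noun senses.
        best = 0.0
        for s1 in wn.synsets(w1, pos=wn.NOUN):
            for s2 in wn.synsets(w2, pos=wn.NOUN):
                best = max(best, s1.jcn_similarity(s2, brown_ic))
        return best

    print(are_antonyms("increase", "decrease"))   # True
    print(jcn_relatedness("firm", "company"))     # relatively high score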
For NatLog we also found it to be very important to use some lexical resources
specifically constructed for NatLog that probably would not be of general utility.
So this included things like quantifier categories. In particular, in the FraCaS
data there are a lot of problems that involve relations between different
quantifier categories: what happens when you replace a universal quantifier with
an existential quantifier, and things like that. And of course there's more than
one universal quantifier, so you need to be able to recognize that every belongs
to this universal quantifier category, and what relation does that have to other
categories?
So a lot of that was hand-crafted specifically for the purpose of NatLog,
specifically to be able to handle questions involving quantifiers. From one
perspective that doesn't scale up very well, but on the other hand, there aren't
very many quantifiers, so there's not that much scaling to do. Yes?
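A tiny sketch of what such a hand-crafted quantifier resource might look like; the category names and the relation table are my own simplification for illustration, not NatLog's actual inventory.

    # Map each quantifier to a category, and pairs of categories to the
    # entailment relation generated by substituting one for the other
    # (in an upward-monotone position, assuming non-empty restrictions).
    QUANTIFIER_CATEGORY = {
        "every": "universal", "all": "universal", "each": "universal",
        "some": "existential", "a": "existential", "several": "existential",
        "no": "negative",
    }

    CATEGORY_RELATION = {
        ("universal", "existential"): "forward entailment",  # every X ... -> some X ...
        ("existential", "universal"): "reverse entailment",
        ("universal", "negative"): "alternation",             # every X ... vs. no X ...
        ("negative", "universal"): "alternation",
    }

    def quantifier_substitution_relation(q1, q2):
        c1, c2 = QUANTIFIER_CATEGORY[q1], QUANTIFIER_CATEGORY[q2]
        if c1 == c2:
            return "equivalence"
        return CATEGORY_RELATION.get((c1, c2), "independence")

    print(quantifier_substitution_relation("every", "some"))   # forward entailment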
>>: [Inaudible] curious, you know, verbs like refuse and --
>> Bill MacCartney: Yeah.
>>: Because those are a little bit more [inaudible].
>> Bill MacCartney: A little bit, although maybe not as much as you might think.
And happily for us, some other people have already done some of the hard work
there. So another important feature that we had specifically for implicatives and
factives was the implication signature, and we leaned on some work that's been
done at PARC in this area. They've compiled lists of different verbs and verb-like
constructs according to their implication signature. So they have an account in
which there are nine different implication signatures, and they have lists of
verbs that fall into each of those. And so we relied on those.
Most of the implication signatures actually only have from maybe half a dozen to
a couple of dozen instances. Then the biggest one has a few hundred. Yes?
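A small illustrative sketch of how such a signature list might be used; the signature entries and the mapping to entailment relations are my own rough rendering of the idea, not the PARC lists themselves.

    # Each signature records what the construction implies about its complement
    # in a positive and in a negated context: "+" true, "-" false, "o" unknown.
    IMPLICATION_SIGNATURE = {
        "manage to":  ("+", "-"),   # managed to escape -> escaped
        "fail to":    ("-", "+"),   # failed to escape  -> did not escape
        "refuse to":  ("-", "o"),   # refused to dance  -> did not dance
        "attempt to": ("o", "o"),   # attempted to flee -> no commitment
    }

    def deletion_relation(verb, polarity="+"):
        # Entailment relation between "<verb> X" and plain "X"
        # in the given polarity context.
        pos, neg = IMPLICATION_SIGNATURE[verb]
        mark = pos if polarity == "+" else neg
        return {"+": "forward entailment",
                "-": "alternation",
                "o": "independence"}[mark]

    print(deletion_relation("manage to"))   # forward entailment
    print(deletion_relation("refuse to"))   # alternation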
>>: How big does the lexicon have to be to get [inaudible] in other words, what
-- presumably there will be things like adjectives like [inaudible] and so on.
[Inaudible].
>> Bill MacCartney: Yeah.
>>: How large a lexicon?
>> Bill MacCartney: I mean for that, for that particular feature, for the
implication signature feature, the total size of the set of words that it's
looking for is not more than a couple of hundred. And it is mostly verbs. I
don't think able is in there, although it should be, yeah. There's another -- I
didn't mention it here, but I have another smaller hand-crafted list of
non-intersective adjectives, which also break these rules, and I need to be able
to recognize them. So this is adjectives like fake or former or alleged. And so
I have a little list of the non-intersective adjectives. Probably somebody has
already put together a better list than the one that I have; I just didn't put
any effort into going out and finding it. But there are a number of things like
this. And then also, I mean, this was a fair amount of work: I came up with a
list of lexical entailment problems and went through and hand-annotated them.
It's actually pretty quick to annotate, but that's a lot of problems, so it took
a little while. So there was a certain amount of labor that went into that as
well.
>>: [Inaudible].
>> Bill MacCartney: Yes.
>>: [Inaudible].
>> Bill MacCartney: Yes.
>>: [Inaudible]. I'm thinking some lexical relationships, they're disjoint classes
but the ones that are disjoint [inaudible] versus disjoint [inaudible].
>> Bill MacCartney: Oh, yeah. And there were -- there were many difficult cases
when I was doing that annotation. There are many pairs of words where it's
really not clear what the right relationship is, or it may depend on context, it
may depend on topical context, or the words may just be intrinsically vague,
like system and approach, for example: what's the relationship between those
two? I mean, typically, you know, lots of nouns I put into the alternation
relation, so whether they're related like cat and dog, which are pretty clearly
related to each other but disjoint, or two words that have nothing to do with
each other like cloud and cat, those are still disjoint sets, right? But what
about system and approach? Well, maybe in some contexts those are two separate
things, and in some contexts maybe they're the same thing, and it's really just
hard to say. And there were lots of examples like that.
>>: [Inaudible].
>> Bill MacCartney: Yes.
>>: Okay.
>> Bill MacCartney: Yes. So it would have been better to get somebody else to
do it, but as a grad student I don't have many resources to call on.
>>: So [inaudible] an example is a tall basketball player.
>> Bill MacCartney: Yeah.
>>: A short basketball player who is still nevertheless taller than the community
at large.
>> Bill MacCartney: Yes. Yes.
>>: [Inaudible] you don't want to [inaudible] necessarily.
>> Bill MacCartney: Right. Yeah.
>>: Do you have any -- the Indian elephant, what's the difference between an
African elephant and an Indian elephant? An Indian elephant has small ears, for
example.
>> Bill MacCartney: But they're still big ears.
>>: But they're still big ears.
>> Bill MacCartney: Yeah. I'm afraid I don't have -- I'm afraid I don't have
anything useful to say about that and the NatLog system would probably get it
wrong. There are -- yeah. That's one of a million hard problems I think in
semantics.
>>: Also, what about the [inaudible] position, the problem of [inaudible].
>> Bill MacCartney: Yeah. So the way I approach -- so I don't know if this is a
general solution or not, but I have a solution which works at least for a few
examples that I looked at. I actually handle some of those non-intersective
adjectives as being similar to implicatives, in that you can specify the entailment
relation generated by deleting that adjective. So for example let's think about
alleged. An alleged criminal and a criminal. The entailment relation between
alleged criminal and criminal is independence because an alleged criminal might
or might not actually be a criminal and a criminal might or might not be alleged to
be a criminal. Right? Whereas former has different behavior. So a former
student and a student, those stand in the alternation relation. If you're a
former student, you are not a student anymore. Presumably. Right? So, at least
for those examples, you can categorize non-intersective adjectives according to
what entailment relation is generated by their deletion, and then that can be
propagated up the composition tree and changed along the lines that I described.
And there are a couple of examples along those lines in the FraCaS test suite
which I believe are all correctly handled by NatLog. All right. Well, thank you
very much.
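For concreteness, the treatment of non-intersective adjectives just described could be sketched along these lines; the table below is a tiny illustrative fragment rather than the actual hand-crafted list.

    # Entailment relation generated by deleting the adjective from a noun phrase.
    ADJECTIVE_DELETION_RELATION = {
        "alleged": "independence",   # an alleged criminal may or may not be a criminal
        "former":  "alternation",    # a former student is not a student
        "fake":    "alternation",    # a fake gun is not a gun
    }

    def adjective_deletion_relation(adj):
        # Ordinary intersective/restrictive adjectives default to forward
        # entailment: a red car is a car.
        return ADJECTIVE_DELETION_RELATION.get(adj, "forward entailment")

    print(adjective_deletion_relation("alleged"))   # independence
    print(adjective_deletion_relation("red"))       # forward entailment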
[applause]