>> Chris Brockett: Good afternoon. I'm very pleased to be able to welcome Bill MacCartney here today. Bill is a Ph.D. student finishing up his dissertation at Stanford, working with Chris Manning. He's been particularly involved in inference and the recognizing textual entailment system that they've been developing at Stanford. And today he is going to present some work that won the best paper award at Coling this year, and follow on with a discussion of the issue of alignment in recognizing entailment and engaging in inference. With that, I'll pass it off. Thank you.

>> Bill MacCartney: Thank you very much. And thank you all for giving me the opportunity to talk to you today. It's an honor to be here and I hope you'll find it interesting. I'm going to be talking about two aspects of the problem of natural language inference today, and this is really a two-part talk, so each part of the talk concerns a different aspect of the problem of natural language inference, which I'll define in a moment. The first part of the talk is based on a paper that I presented at Coling '08 called "Modeling Semantic Containment and Exclusion in Natural Language Inference," and it describes a computational model of natural logic for NLI; I'll tell you what that is in a moment. This is not a general solution to the problem of NLI, but it does handle an interesting subset of NLI problems. But it depends on alignments that come from another source, and that's the motivation for the paper that the second part of the talk is based on. This is a paper I'm going to be presenting at EMNLP in a couple of weeks, a phrase-based model of alignment for NLI. It directly addresses the problem of alignment for NLI and relates it to the problem of alignment for MT. And this part of the work was directly enabled by annotated data produced here at MSR.

Okay. So the first part of the talk is about modeling semantic containment and exclusion in natural language inference, and this is joint work with my advisor, Chris Manning. Natural language inference, which is also known as recognizing textual entailment, is the problem of determining whether a premise P justifies an inference to a hypothesis H, and this is an informal, intuitive notion of inference. The emphasis is on short local inference steps and variability of linguistic expression rather than long chains of formal reasoning. So here's an example. The premise is "every firm polled saw costs grow more than expected, even after adjusting for inflation," and the hypothesis is "every big company in the poll reported cost increases." And this is a valid inference. I want to make two observations about this example. First, if the quantifier were "some" instead of "every," the inference would not be valid, because it could be that only small firms saw costs grow. And second, it would be difficult or impossible to translate these sentences fully and accurately into formal logic, and the importance of these facts will become clear in a moment. Natural language inference is necessary to the ultimate goal of full natural language understanding, and it can also enable more immediate applications such as semantic search, question answering and others. Work on natural language inference has explored a broad spectrum of approaches. At one end of the spectrum are approaches based on lexical or semantic overlap, pattern-based relation extraction, or approximate matching of predicate-argument structure.
These approaches are robust and broadly effective, but imprecise, and they're easily confounded by inferences involving negation, quantifiers and other phenomena, including the example on the previous slide. At the other end of the spectrum, we have approaches based on first-order logic and theorem proving. Such approaches have the power and precision that we're looking for, but they tend to founder on the many difficulties involved in accurately translating natural language to first-order logic. In this work, we explore a different point on the spectrum by developing a computational model of natural logic, which I'll define in a moment.

So here's the outline for the talk. First I'll talk about the theoretical foundations of natural logic, then I'll introduce our computational model of natural logic, the NatLog system, then I'll describe experiments with two different data sets, the FraCaS test suite and the RTE data, and then I'll conclude the first part of the talk.

So what is natural logic? The term was introduced by Lakoff, who defined natural logic as a logic whose vehicle of inference is natural language. That is, it characterizes valid patterns of reasoning in terms of surface forms, and it thus enables us to do precise reasoning while sidestepping the myriad difficulties of full semantic interpretation. Natural logic has a very long history stretching back to the syllogisms of Aristotle, and it was revived in the 1980s as the monotonicity calculus of van Benthem and Sánchez Valencia. Also, the account of implicatives and factives developed by Nairn et al. at PARC arguably belongs to the natural logic tradition, though it wasn't presented as such. In this work we present a new theory of natural logic which extends the monotonicity calculus to account for negation and exclusion, and which also incorporates elements of Nairn's model of implicatives. Over the next few slides I'll sketch this model, but at a very high level. For more details you can either see the Coling paper, or I'm actually almost finished with a new paper which describes the theoretical foundations of this in much greater detail, and if you're interested, I can send you a draft of that.

So first we propose an inventory of seven mutually exclusive basic entailment relations, and this slide is kind of important because these relations and the symbols that I've chosen to represent them will reappear throughout the rest of the talk. The relations are defined by analogy with set relations, and they include representations of both semantic containment and semantic exclusion. So the seven relations are, first, equivalence, forward entailment and reverse entailment. These are pretty self-explanatory, and these are the containment relations. And then negation, or exhaustive exclusion; alternation, or non-exhaustive exclusion; and cover, or exhaustive non-exclusion. And finally independence, which covers all other cases. These relations are defined for expressions of every semantic type, so not only sentences but also common nouns, adjectives, transitive and intransitive verbs, temporal and locative modifiers, quantifiers, and so on. And there are some illustrations for those semantic types here. Okay. I know that was very quick. But the next question is how are entailment relations affected by semantic composition? So in other words, how do the entailments of a compound expression depend on the entailments of its parts?
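As a rough illustration (a sketch, not the NatLog implementation), this inventory of relations might be encoded as follows; the ASCII symbols and the example pairs in the comments are illustrative stand-ins for the notation used in the paper. We'll return to the composition question next.

```python
# A minimal sketch (not the NatLog implementation) of the seven basic
# entailment relations, with set-relation analogues and illustrative pairs.
from enum import Enum

class Relation(Enum):
    EQUIVALENCE  = "="   # x = y                     (couch = sofa)
    FORWARD      = "<"   # x a proper subset of y    (crow < bird)
    REVERSE      = ">"   # x a proper superset of y  (bird > crow)
    NEGATION     = "^"   # exhaustive exclusion      (human ^ non-human)
    ALTERNATION  = "|"   # non-exhaustive exclusion  (cat | dog)
    COVER        = "_"   # exhaustive non-exclusion  (animal _ non-human)
    INDEPENDENCE = "#"   # all other cases           (hungry # hippo)
```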
In the most common case, semantic composition preserves entailment relations. So, for example, eat pork entails eat meat, and big bird excludes big fish. But many semantic functions behave differently. For example, refuse projects forward entailment as reverse entailment, so that refuse to tango is entailed by refuse to dance. And not projects exclusion as a cover relation, so that not French and not German stand in the cover relation to each other. In our model we categorize semantic functions according to how they project each of the seven basic entailment relations. This is a generalization of both the three monotonicity classes of the monotonicity calculus and the nine implication signatures of Nairn et al. For example, not and refuse are alike in projecting equivalence as equivalence and independence as independence, and they both swap forward and reverse entailment. But whereas not projects exclusion as cover, refuse projects exclusion as independence. So, for example, refuse to tango and refuse to waltz are independent of each other.

A typology of projectivity allows us to determine the entailments of a compound expression compositionally, by projecting lexical entailment relations upward through a semantic composition tree. So consider this example. If nobody can enter without a shirt, then it follows that nobody can enter without clothes. To explain this compositionally, assume that we have idealized semantic composition trees, and these are plausible renderings of semantic composition trees here, representing the compositional structure of the semantics of these sentences. We begin from a lexical entailment relation between shirt and clothes. So shirt forward entails clothes, but without is a downward monotone operator, so the relation is inverted: without a shirt is entailed by without clothes. This is then applied to enter, and then it becomes the argument of can, which is upward monotone, so it preserves the direction of things; but then it becomes an argument to nobody, and nobody is downward monotone, so we get another inversion, and we find that nobody can enter without a shirt forward entails nobody can enter without clothes, which is what we expect.

Now we come to the third element of the theory, which builds on the preceding to prove a hypothesis from a premise. Suppose we can find a sequence of edits which transforms the premise into the hypothesis. These can be insertions, deletions, substitutions or more complex edit operations. We begin by determining a lexical entailment relation for each atomic edit. For substitutions this depends on the relation between the meanings of the substituents. Deletions ordinarily generate the forward entailment relation, but some lexical items have special behavior. So, for example, deleting not generates the negation relation. And insertions are symmetric to deletions. Next we project the lexical entailment relation upward through a semantic composition tree, as in the previous slide, to determine the entailment relation across each atomic edit. And finally we join these atomic entailment relations across the sequence of edits, as in Tarskian relation algebra, to obtain our final answer. Okay. This has been a description of the theory, at a very high level, and it may have been kind of hard to follow. I'm going to show you a worked-out example in a moment that will make things more concrete and hopefully give you a better sense of what actually happens. But let's switch gears and talk about what we built.
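Continuing the sketch, projection and joining might look roughly like this, reusing the Relation enum above. The projectivity maps and join entries shown cover only the cases mentioned in the talk; they are illustrative, not the full typology or join table from the paper.

```python
# A rough sketch of projection and joining (illustrative only; includes only
# the projectivity and join entries mentioned in the talk).
R = Relation

# How a semantic function maps the relation between its arguments onto the
# relation between the compound expressions.
DEFAULT = {r: r for r in R}             # most common case: relations preserved
NOT = {R.EQUIVALENCE: R.EQUIVALENCE, R.INDEPENDENCE: R.INDEPENDENCE,
       R.FORWARD: R.REVERSE, R.REVERSE: R.FORWARD,       # swaps containment
       R.NEGATION: R.NEGATION,
       R.ALTERNATION: R.COVER, R.COVER: R.ALTERNATION}   # exclusion <-> cover
REFUSE = {R.EQUIVALENCE: R.EQUIVALENCE, R.INDEPENDENCE: R.INDEPENDENCE,
          R.FORWARD: R.REVERSE, R.REVERSE: R.FORWARD,
          R.ALTERNATION: R.INDEPENDENCE}   # refuse to tango # refuse to waltz
# (remaining projectivity entries omitted)

def project(lexical_relation, operator_maps):
    """Project a lexical entailment relation upward through the semantic
    functions (innermost first) under which the edited material falls."""
    rel = lexical_relation
    for proj in operator_maps:
        rel = proj[rel]
    return rel

# A few cells of the join table used to combine atomic entailment relations
# across a sequence of edits, in the style of Tarskian relation algebra.
JOIN = {
    (R.EQUIVALENCE, R.ALTERNATION): R.ALTERNATION,
    (R.ALTERNATION, R.EQUIVALENCE): R.ALTERNATION,
    (R.ALTERNATION, R.NEGATION):    R.FORWARD,   # fish | human, human ^ non-human => fish < non-human
    (R.FORWARD,     R.EQUIVALENCE): R.FORWARD,
    (R.FORWARD,     R.FORWARD):     R.FORWARD,
}

def join_all(atomic_relations):
    """Fold the atomic relations left to right, starting from equivalence."""
    result = R.EQUIVALENCE
    for rel in atomic_relations:
        result = JOIN[(result, rel)]
    return result

# The shirt/clothes example, simplified to monotonicity: shirt < clothes,
# projected through downward-monotone "without", upward-monotone "can", and
# downward-monotone "nobody" (NOT's map doubles as a downward-monotone swap).
assert project(R.FORWARD, [NOT, DEFAULT, NOT]) == R.FORWARD
```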
The NatLog system is a computational model of natural logic, and it consists of five stages; in the following slides I'll talk about each of these five stages in turn. But first, to illustrate the operation of the system, I'm going to use a running example, shown here. The example is quite contrived, but it compactly exhibits the three phenomena that I'm interested in: containment, exclusion, and implicatives. So the premise is Jimmy Dean refused to move without blue jeans, and the hypothesis is James Dean didn't dance without pants, and this is a valid inference.

Okay. So, the steps of the model. In the first stage we do linguistic preprocessing, so we begin by tokenizing and parsing the input sentences using the Stanford parser, which is a broad-coverage statistical parser trained on the Penn Treebank. The most important task at this stage is to identify any semantic functions with non-default projectivity and to compute their scope, in order to determine the effective projectivity at each token. What makes this tricky is that the phrase structure trees produced by the parser may not correspond exactly to the semantic structure of the sentence. If we had idealized semantic composition trees, then determining the effective projectivity would be easy. Since we don't, we use a somewhat awkward workaround. We define categories of items with special projectivity, and for each category we specify its default scope in phrase structure trees using a tree pattern language called Tregex, which is similar to Tgrep and which was partly the work of Galen Andrew, who is in the room here. This enables us to identify the constituents over which the projectivity properties should be applied, and thereby to compute the final effective projectivity at each token.

In the second stage we establish an alignment between the premise and the hypothesis, and we represent this by a sequence of atomic edits over spans of word tokens. So I've shown an alignment for our running example here. As you can see, we use four types of edit: deletion, insertion, substitution and match. The edits are ordered, and this ordering defines a path from the premise to the hypothesis through intermediate forms. The ordering, however, doesn't have to correspond to the sentence order, although it does in this example. Thus the alignment effectively decomposes the inference problem into a sequence of atomic inference problems, one for each atomic edit, that is, between each pair of intermediate forms that it's transformed through. Alignment will be the subject of the second part of the talk today.

Okay. The next stage is the heart of the system. This is lexical entailment classification. And here we try to predict an entailment relation for each atomic edit based solely on the features of the lexical items involved, independent of the surrounding context, such as falling under a downward monotone operator. We do this by exploiting available resources on lexical semantics and applying machine learning. So our feature representation includes semantic relatedness information based on WordNet, NomBank and other lexical resources, string and lemma similarity scores, and information about lexical categories, including special-purpose categories for quantifiers and implicatives. And we use a decision tree classifier which is trained on about 2500 hand-annotated lexical entailment problems like the examples shown down here. So back to the running example.
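As a rough illustration of this stage, per-edit feature extraction might look something like the sketch below. The feature names, the toy word lists, and the NLTK WordNet lookups are illustrative assumptions, not NatLog's actual feature set.

```python
# A rough sketch of per-edit feature extraction for the lexical entailment
# classifier. Feature names, word lists, and the WordNet checks are
# illustrative; assumes NLTK and its WordNet data are installed.
from difflib import SequenceMatcher
from nltk.corpus import wordnet as wn

IMPLICATIVES = {"refuse", "fail", "manage"}     # toy category lists
QUANTIFIERS  = {"every", "some", "no", "most"}

def edit_features(edit_type, premise_word=None, hypothesis_word=None):
    feats = {"edit_type=" + edit_type: 1.0}
    if edit_type == "SUB":
        p, h = premise_word.lower(), hypothesis_word.lower()
        feats["string_sim"] = SequenceMatcher(None, p, h).ratio()
        p_syns, h_syns = wn.synsets(p), wn.synsets(h)
        if p_syns and h_syns:
            sim = p_syns[0].path_similarity(h_syns[0])
            feats["wn_relatedness"] = sim if sim is not None else 0.0
            # premise word a hyponym of hypothesis word -> forward entailment
            feats["wn_hyponym"] = float(
                h_syns[0] in p_syns[0].closure(lambda s: s.hypernyms()))
    elif edit_type in ("DEL", "INS"):
        # features of the single word being inserted or deleted
        w = (premise_word or hypothesis_word).lower()
        feats["is_implicative"] = float(w in IMPLICATIVES)
        feats["is_quantifier"]  = float(w in QUANTIFIERS)
    return feats

# e.g. edit_features("SUB", "jeans", "pants"): the idea is that WordNet
# relatedness and the hyponym flag fire for pairs like this.
```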
I've added two new rows which show the features generated for each edit and the lexical entailment relation predicted from those features. So the first edit is a substitution, and string similarity is high, so we predict equivalence. In the second edit we delete an implicative, refuse, and the model knows that deleting implicatives in this category generates the alternation relation. The third edit inserts an auxiliary verb. Since auxiliaries are more or less semantically vacuous, we predict equivalence. The fourth edit inserts a negation, and this generates the negation relation. The fifth edit is a substitution, and WordNet tells us that these words are hyponyms, so we predict reverse entailment. The sixth edit is a match, so equivalence. The seventh edit is the deletion of a generic modifier, blue. By default this generates forward entailment. And finally the eighth edit is a hyponym substitution, so forward entailment.

The fourth stage is entailment projection. So I covered this earlier. It means projecting lexical entailment relations upward by taking account of the projectivity properties of the surrounding context. I'm going to simplify things here a bit by only considering upward and downward monotonicity. So I've added two new rows to the table, and the first row shows the effective monotonicity at the locus of each edit. So everything is upward monotone until we insert negation, after which the next two edits occur in a downward monotone context. But then without is again downward monotone, so we get another inversion. And the last two edits occur in an upward monotone context. I want to remind you that it's not necessarily the case that the edits happen in the linear order of the sentence; it just happens to be the case in this example. The last row shows how the lexical entailment relations are projected into atomic entailment relations, that is, the entailment relations across each atomic edit. So the only interesting case is right here, where the reverse entailment relation is inverted to a forward entailment because of a downward monotone context.

Okay. The final stage is entailment joining, in which we combine atomic entailment relations one by one to obtain our final answer. So I've added a new row. And we start at the left with equivalence, and then equivalence joined with alternation yields alternation. Alternation joined with equivalence yields alternation again. And then alternation joined with negation yields forward entailment. So that one may not be quite as obvious, but it makes sense if you think about it for a bit. For example, fish alternates with human, and human is the negation of non-human. So fish forward entails non-human. After that we're just joining forward entailment either with itself or with equivalence, so forward entailment is preserved the whole way through. And that's our final answer. And that's the right answer for this problem.

Okay. In order to evaluate our system, we use the FraCaS test suite, which came out of a mid-'90s project on computational semantics. It contains 346 problems which look like they could have come out of a textbook on formal semantics, and FraCaS involves three-way classification, so it distinguishes contradiction from mere non-entailment. In this work, we consider only problems which contain a single premise, and I've shown three examples here. So the first example inserts a restrictive modifier, a lot of, in a downward monotone context. The second example involves predicates which stand in the alternation relation, large and small.
And the third involves a non-factive verb with a clausal complement. So here are the results for a baseline classifier, for our system last year, and for our current system. The columns indicate the number of problems, precision and recall for the yes class, and accuracy. Overall we've made good progress since last year, achieving a 27 percent reduction in error and reaching almost 90 percent in precision. What's more interesting is the breakdown by section. So the FraCaS problems are divided into nine sections, each focused on a different category of semantic phenomena. In the section on quantifiers, which is both the largest and the most amenable to natural logic, we answer all problems but one correctly. In fact, performance is good on all the sections where we expect NatLog to have relevant expertise. Our average accuracy on the five sections most amenable to natural logic is 87 percent. Not surprisingly, though, we make little headway with things like [inaudible] and ellipses, but even here precision is high. So the system rarely predicts entailment when none exists. Of course, this doesn't constitute a proper evaluation on unseen test data. But on the other hand, the system was never trained on the FraCaS data, it was only trained on lexical entailment problems, and it's had no opportunity to learn [inaudible] implicit in the data. And our main goal in testing on FraCaS is really to evaluate the representational and inferential adequacy of our model of natural logic. And from this perspective, the results are quite encouraging.

Since the FraCaS test suite is not well known, we also wanted to do an evaluation using the familiar RTE data, which many of you have probably seen. Relative to FraCaS, the RTE problems are more natural seeming, with much longer premises, which average 35 words instead of 11. But the RTE problems are not an ideal match to the strengths of the NatLog system. First, RTE includes many kinds of inference which are not addressed by natural logic, such as paraphrase, temporal reasoning and relation extraction. Second, in most RTE problems the edit distance between the premise and hypothesis is quite large. More atomic edits mean a greater chance that prediction errors made by the atomic entailment model will propagate via entailment joining to the system's final output. So here are a couple of example problems. The first example is not a good match to the strengths of the NatLog system. It's essentially a relation extraction problem, and the NatLog system is thrown off by the insertion of the words acts as in the hypothesis. The second example is a much better fit for NatLog. It hinges on recognizing that deleting a negation yields a contradiction, and NatLog gets this problem right.

So here are the results on the RTE3 development and test sets for the Stanford RTE system, which is a broad-coverage RTE system, and for NatLog. For each system I show the percentage of problems answered yes, along with precision and recall for the yes class, and accuracy. Not surprisingly, the overall accuracy of the NatLog system is unimpressive. But NatLog achieves relatively high precision, over 70 percent, on its yes predictions. And this suggests a strategy of hybridizing the high-precision, low-recall NatLog system with the broad-coverage Stanford system. Bos and Markert pursued a similar strategy in their 2006 paper based on first-order logic and theorem proving; however, that system was able to make a positive prediction in only four percent of cases.
NatLog makes positive predictions far more often, in about 25 percent of cases. And the results are quite satisfying. As we hoped, hybridization yields substantial gains, so on the RTE3 test set the hybrid system attained an accuracy four percent better than the Stanford system alone, corresponding to an extra 32 questions out of 800 answered correctly. So in summary, I want to emphasize that we are not proposing natural logic as -- sorry. Go ahead.

>>: [Inaudible].

>> Bill MacCartney: Yes?

>>: [Inaudible] so you said that the -- so the NatLog system was -- I thought the strength was precision, all right, so the hybrid system [inaudible] standard RTE it's the [inaudible].

>> Bill MacCartney: Yeah, that's true. I don't have a good interpretation for that. I hadn't noticed that before. I guess I was focusing on accuracy, and I don't have a good interpretation for that.

>>: I understand you [inaudible] what is the overlap between the ones where NatLog answered yes and answered correctly versus [inaudible] so how much overlap [inaudible].

>> Bill MacCartney: Yes. I'm afraid I don't have answers at my fingertips. That's an excellent question. Basically you'd like to see the sort of three-dimensional confusion matrix, right, you want to see correct answers -- the gold standard versus Stanford versus NatLog for yes and no.

>>: [Inaudible].

[brief talking over]

>> Bill MacCartney: Right. Right. Yeah. I wish I had those answers at my fingertips. Those are good questions. Yes. So one thing is clear: NatLog is not a universal solution for natural language inference. There are a lot of kinds of inference that are simply not addressed by natural logic, and we see a lot of those on RTE, so paraphrase, verb frame alternation, relation extraction, common sense reasoning. Also, the inference method that I described has a weaker proof theory than first-order logic. So there are many, many important kinds of inference that can be explained with first-order logic that NatLog simply can't explain, including De Morgan's laws for quantifiers, just to give one example. But natural logic enables precise reasoning about semantic containment, exclusion, and implicatives while sidestepping the difficulties of full semantic interpretation, and it's therefore able to explain a broad range of such inferences, as demonstrated on the FraCaS test suite. A full solution to natural language inference is probably ultimately going to require combining disparate reasoners, and natural logic, I think, is likely to be an important part of such a solution. So that's the end of the first part of the talk.

Now we'll switch gears and move on to the second part of the talk, which concerns a phrase-based model of alignment for natural language inference, and this is joint work with Michel Galley and Chris Manning. Well, I've already introduced the NLI task, so I won't do that again. But I do want to make an observation about the example shown here. In order to recognize that Kennedy was killed can be inferred from JFK was assassinated, one needs first to recognize the correspondence between Kennedy and JFK and between killed and assassinated. Consequently most current approaches to NLI depend, implicitly or explicitly, on a facility for alignment, that is, establishing links between corresponding predicates and entities in the premise and hypothesis. So different systems do this in different ways.
Systems that are based on measuring lexical overlap implicitly align each word in the hypothesis to the word in the premise to which it's most similar. In approaches which formulate natural language inference as analogous to proof search, the alignment is implicit in the steps of the proof. But increasingly, the most successful NLI systems have made the alignment problem explicit and then use the alignment to drive entailment classification. So this paper, this is the paper that I'm going to be presenting at EMNLP, and there are three major contributions we try to make in this paper. The first is to undertake the first systematic study of alignment for NLI. Existing NLI aligners, including one that we've previously developed at Stanford, have tended to use idiosyncratic methods, to be poorly documented, and also to use proprietary data, and so this work tries to remedy all three of those. Second, we're going to propose a new model of alignment for NLI called the MANLI system, which uses a phrase-based alignment representation. It exploits outside resources for information about semantic relatedness and capitalizes on a new source of supervised training data, which specifically came from Microsoft Research, which has been a great help to us. And the third thing that we're going to do is examine the relation between the problem of alignment in natural language inference and the very similar problem of alignment in machine translation. And in particular, the question: can we just use an off-the-shelf MT aligner for NLI alignment?

So, a little more on that last topic. The alignment problem is very familiar in machine translation, and the MT community has developed not only an extensive literature but also standard, proven tools for alignment. So can an off-the-shelf MT aligner be usefully applied to the NLI alignment problem? Well, there's reason to be doubtful. The alignment problem for NLI differs from the alignment problem for MT in several key respects. First, it's monolingual, which opens the door to utilizing abundant monolingual sources of information on semantic relatedness. Second, it's intrinsically asymmetric. The premise is often much longer than the hypothesis, and it commonly contains clauses or phrases which have no counterpart in the hypothesis. In fact, even more strongly, one cannot even assume approximate semantic equivalence in NLI, which is usually a given in MT. Because NLI problems include both valid and invalid inferences, the semantic content of the premise and the hypothesis can diverge substantially. So NLI aligners must accommodate frequent unaligned content. And finally, little training data is available. MT aligners typically use unsupervised training on massive amounts of bitext, but no such data is available, certainly not in such quantities, for NLI. NLI aligners must therefore depend on smaller amounts of supervised data supplemented by external lexical resources. Conversely, MT aligners can use dictionaries, but they typically aren't designed to harness other sources of information about semantic relatedness, particularly not graded, that is, scored, information about degrees of semantic relatedness.

In the past, research on alignment for NLI has been hampered by a paucity of high-quality, publicly available training data. Happily, that picture has begun to change, thanks to you guys right here at MSR. Last year MSR released a data set containing gold standard alignments for the RTE2 development and test sets, containing 800 problems each.
The alignment representation is token based but many-to-many, and thus allows implicit alignment of phrases. And I've shown an example here. The premise goes down the rows: in most Pacific countries there are very few women in parliament. And the hypothesis: women are poorly represented in parliament. Two things I want to point out about this example. First, the phrase in the premise, in most Pacific countries there, is completely unaligned, and you ordinarily wouldn't see this in an MT alignment; you wouldn't see such a big chunk of the sentence be unaligned. And second, there's the implicit phrase alignment here: very few is aligned with poorly represented. So the representation is formally token based, but you get this ability to implicitly represent phrases. Each problem was independently annotated by three people, and inter-annotator agreement was very high, so all three annotators agreed on 70 percent of proposed links and two out of the three agreed on more than 99 percent of proposed links, attesting to the high quality of the data. For this work, we merged the three annotations into a single gold standard using majority rule. Finally, I didn't put this on the slide, but following a convention common in MT, the annotation included both sure links and possible links. In this work we ignored the possible links and just used the sure links.

Okay. Now I'd like to tell you about a new model of alignment for natural language inference, the MANLI system. I know it's a funny name; if you have to say it out loud you might feel a little silly. But the system itself is very straightforward. It has four components. It uses a phrase-based representation of alignment and a linear feature-based scoring function. It performs decoding using a simulated annealing strategy, and it uses a version of the averaged perceptron for weight training. Let me tell you about each of these components in turn. First, we use a representation of alignment which is phrase based, so we represent an alignment by a sequence of phrase edits of four different types. An EQ edit connects a phrase in the premise with an equal (by word lemmas) phrase in the hypothesis, and a SUB edit connects a premise phrase with an unequal phrase in the hypothesis. By the way, I'm using phrase in the way it's used in MT, just to mean a sequence of tokens, not necessarily a syntactic phrase. A DEL edit covers an unaligned phrase in the premise, and an INS edit, an insertion edit, covers an unaligned phrase in the hypothesis. So for the example that I already showed you, I've shown the translation into phrase edits, and the only interesting thing here is the substitution of very few with poorly represented. So this representation is intrinsically phrase based. Yes?

>>: [Inaudible] in what order is the sequence enumerated? Is it enumerated in sort of hypothesis --

>> Bill MacCartney: Yeah. Actually I might as well have said a set of phrase edits, because the ordering doesn't matter in this case. So in the first part of the talk, ordering did make a difference because it affected the ordering of joining atomic entailment relations, but for this model it doesn't make a difference. The representation is constrained to be one-to-one at the phrase level, but it can be many-to-many at the token level, and in fact, this is the chief motivation for the phrase-based representation.
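As a rough sketch (hypothetical class and field names, not MANLI's actual data structures), the example alignment might be written in this phrase-edit representation roughly as follows:

```python
# A rough sketch of the phrase-based edit representation (hypothetical names;
# spans are (start, end) token offsets, end exclusive).
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PhraseEdit:
    kind: str                                     # "EQ", "SUB", "DEL", or "INS"
    premise: Optional[Tuple[int, int]] = None     # span in the premise
    hypothesis: Optional[Tuple[int, int]] = None  # span in the hypothesis

# Premise:    In most Pacific countries there are very few women in parliament
# Hypothesis: Women are poorly represented in parliament
alignment = [
    PhraseEdit("DEL", premise=(0, 5)),                       # "In most Pacific countries there"
    PhraseEdit("EQ",  premise=(5, 6),  hypothesis=(1, 2)),   # are / are
    PhraseEdit("SUB", premise=(6, 8),  hypothesis=(2, 4)),   # very few / poorly represented
    PhraseEdit("EQ",  premise=(8, 9),  hypothesis=(0, 1)),   # women / Women
    PhraseEdit("EQ",  premise=(9, 11), hypothesis=(4, 6)),   # in parliament / in parliament
]
```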
We can align very few with poorly represented without being forced to make an arbitrary choice about which word goes with which word. Also, our scoring function can make use of lexical resources which have information about the semantic relatedness of multi-word phrases, not just individual words. Finally, for the purpose of model training, but not for the evaluations that I'll show you later, we converted the token-based MSR data into this phrase-based representation. Yes?

>>: So with many-to-many token-based alignments, how did you handle any disjoint phrases?

>> Bill MacCartney: Yeah, there were a few of those, and essentially we had to throw some away. So if I remember the statistics correctly -- and I'm talking about this conversion now, this is the conversion just for the training data; it doesn't apply to the test data for the evaluation -- for the training data, something like three quarters of the alignments were already one-to-one. Something like 92 percent of the MSR alignments were either already one-to-one or the conversion to this representation was trivial. So this would be an example of that: they may have contained blocks, but there were no non-contiguous alignments. There was a remaining eight percent of MSR alignments, I think that's the right figure, something like that, which did contain non-contiguous alignments. And we basically had to throw one of the pair away. And we had a simple heuristic for figuring out which one to throw away, which was based on string matching. That heuristic worked for something like six percent of those eight percent, so three quarters of that eight percent. And we were left with a remainder where we had to make an arbitrary choice. So we did have to make some arbitrary choices, but since it was only for the training data, we didn't worry about it too much.

Okay. What about scoring alignments? Well, we use a feature-based scoring function. This is a very, very simple linear function where the score for an alignment is just the sum of the scores of the edits it contains. This includes not just the link edits, that is, EQ and SUB edits, but also the insertion and deletion edits, which correspond to unaligned material in the premise and hypothesis. So it's the sum of the scores of the edits it contains, and the score for an edit is just a dot product of a weight vector and a feature vector. So we use several types of features. First we have a group of features which represent the type of the edit, then we have features which encode the sizes of the phrases involved in the edit, and whether those phrases are non-constituents in the syntactic parse. For SUB edits, a very important feature represents the lexical similarity of the substituents as a real value between zero and one, and we compute this as a max over a number of component functions, some based on external lexical resources. So this includes manually constructed lexical resources such as WordNet, and also automatically constructed resources such as a measure of distributional similarity in a large corpus, Dekang Lin style. An MT aligner is basically inducing something like distributional similarity from massive amounts of bitext, and we're getting it from an external lexical resource instead. We also use various measures of string and lemma similarity.
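A minimal sketch of that scoring function follows, with the assumption that featurize() stands in for the real feature extractor and that the weights come out of the training procedure described shortly; the contextual features discussed next would simply be additional entries in each edit's feature vector.

```python
# A minimal sketch of the linear scoring function: the score of an alignment
# is the sum, over all of its edits, of a weight vector dotted with that
# edit's feature vector. featurize() is a stand-in for the real extractor.

def score_edit(edit, weights, featurize):
    feats = featurize(edit)                  # dict: feature name -> value
    return sum(weights.get(name, 0.0) * val for name, val in feats.items())

def score_alignment(alignment, weights, featurize):
    # DEL and INS edits are scored too, so the model can learn how much
    # leaving material unaligned should cost (or be rewarded).
    return sum(score_edit(e, weights, featurize) for e in alignment)
```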
Finally, high lexical similarity doesn't necessarily mean a good match, especially if sentences contain multiple occurrences of the same word, which happens very commonly with function words and other little words. So to remedy this, we introduced contextual features. There's a distortion feature, which measures the difference between the relative positions of words within their respective sentences, and there are also matching-neighbors features, which indicate whether the tokens before and after the aligned pair are equal or similar. And this helps us to get those little words aligned correctly.

Decoding is made more complex by our use of a phrase-based representation. With a token-based representation, decoding can be trivial because each token can be aligned independently of its neighbors. With a phrase-based representation, every aligned phrase pair has to be consistent with its neighbors with respect to the segmentation into phrases, so the problem doesn't factor as easily. To address this problem we use a stochastic local search based on simulated annealing. And here's how it works. We start with an empty alignment; that little grid represents an empty alignment. And then we generate a set of successors. To do this, we generate every possible edit up to a certain maximum size, an arbitrary maximum size, and then we generate a successor by adding that edit to our current alignment and removing any other edits that conflict with that edit, that is, that involve some of the same tokens. Then we score the successors using our scoring function, and we convert the scores into a probability distribution. Next we smooth or sharpen that probability distribution by raising it to a power which depends on the temperature parameter. The temperature starts off high, so that we're smoothing the distribution, and this helps to ensure that we explore the space of possibilities. In later iterations the temperature falls and the distributions get sharper, and this helps us to converge on a particular answer. Then we sample a new alignment, which may or may not be the most likely one in the distribution. Then we lower the temperature and repeat the process. And we do this 100 times to find a good alignment. This might seem like it would be slow, but we were clever about memoization and things like that, and it's actually pretty fast. The average RTE problem takes about two seconds to align using this. There's no guarantee of optimality, no guarantee that you'll get to the best alignment, but we did find in experiments that the guess alignment that comes out of this procedure scores at least as high as the gold alignment for greater than 99 percent of alignments. So that means the search is good. The scoring function may or may not be good, but the search is good at any rate.

To tune the parameters of the model we use an adaptation of the averaged perceptron algorithm, which has proven successful on a range of NLP tasks. So we perform 50 training epochs, and in each epoch we iterate through the training data, and for each problem we first find the current best guess at an alignment using our decoder and the current weight vector. And then we update the weight vector based on the difference between the features of the gold alignment and our current guess alignment. So we generate a feature vector for each of those, look at the difference between them, and the update to the weight vector is a learning rate times that, and the learning rate falls over time.
At the end of each epoch, we normalize and store the weight vector, and the final result is the average of the stored weight vectors, except that we omit vectors from a fixed proportion of epochs near the beginning of the run, which tend to be of poor quality anyway. And training runs on the RTE2 development set required about 20 hours.

Okay. Let's talk about evaluation. Over the next several slides I'll present evaluations of several alignment systems on the MSR RTE alignment data. Specifically I'll look at a baseline aligner, which I'll describe in a moment; two different MT aligners, GIZA++ and Cross-EM; and then two aligners specifically designed for NLI: the aligner from the Stanford RTE system and the MANLI aligner that I just described. To evaluate each aligner's ability to recover the gold standard alignments, we're going to look at per-link precision, recall, and F1. In the MT community it's more conventional to report alignment error rate, or AER, but since we're using only the sure links from the annotation, AER is just one minus F1. Also, since we're using the original token-based version of the MSR data for evaluation, in evaluating MANLI we'll consider two tokens to be aligned just in case they're contained within phrases which are aligned by MANLI. Finally, we're also going to report the exact match rate, that is, what proportion of guessed alignments matched the gold exactly, as a whole.

Okay. First, a baseline system. As a baseline we're going to use a very simple alignment algorithm which was inspired by the lexical entailment model of Glickman et al., and this just involves matching each token in the hypothesis with the token in the premise to which it's most similar according to a lexical similarity function. We use a very simple lexical similarity function which is based on the string edit distance between two word lemmas. So this dist right here is just Levenshtein string edit distance. And I show the initial results for this. So, as I described earlier, I show precision, recall, F1 and exact match rate for the RTE2 development and test sets. And despite how incredibly simple this model is, the recall is surprisingly good, so it's above 80 percent. But precision is really mediocre, and F1 is not too great, either. And this is chiefly because, by design, this model aligns every hypothesis token with some premise token, and of course in the gold data many of the hypothesis tokens are left unaligned. This model could surely be improved by allowing it to leave some of the hypothesis tokens unaligned, but we didn't pursue this.

Okay. Well, given the importance of alignment for natural language inference and the availability of proven standard tools for MT alignment, the obvious question is why can't we just use off-the-shelf MT aligners for NLI? I argued earlier that this is unlikely to succeed, but to my knowledge we're the first to investigate the matter empirically, although Bill Dolan and a couple of other people from here at MSR had a paper four years ago where they looked at the similar problem of using MT aligners to identify paraphrases. So similar, but a bit different. We did experiments using the best-known MT aligner, GIZA++, running it via the Moses toolkit with default parameters. We generated asymmetric alignments in both directions and then performed symmetrization using the well-known intersection heuristic, and the initial results were very poor.
Subjectively, when you look at the output, it looks like it's aligning most words at random, not even aligning equal words. So if the same word appears in the hypothesis and the premise, it usually doesn't get aligned. This is not too surprising. GIZA++ is designed for cross-lingual use, so it doesn't ordinarily consider word equality between the source and target sentences. So to remedy this, we supplied GIZA++ with a lexicon, using a trick common in MT: we supplemented the training data with additional synthetic training data consisting of matched pairs of equal words. This gives GIZA++ a better chance of learning that man should align with man, for example. This resulted in a big boost in recall and a smaller gain in precision. I'll show you results in the next slide. But as an additional comparison, we also ran a similar set of experiments with the Cross-EM aligner from Berkeley.

So here are the results. This is based on using the lexicon and using the intersection heuristic. Both MT aligners do about the same on F1, so in the ballpark of 72 to 75 percent, but GIZA++ attains better precision and Cross-EM attains better recall. Both do significantly better than the bag-of-words baseline, especially on precision, although the bag-of-words aligner actually does slightly better on recall. We also tried using alternate symmetrization heuristics and asymmetric alignments, but everything we tried did much worse than the intersection heuristic on F1. Qualitatively, both MT aligners do a good job of aligning equal words when you use them with a lexicon; that's what it's there for. But they continue to align most other word pairs apparently at random. And this is not too surprising. The basic problem is that the quantity of data is just far too small for unsupervised learning of word correspondences. So a successful NLI aligner will need to exploit supervised training data and will also need access to additional sources of knowledge about lexical relatedness.

A better comparison is thus to an alignment system expressly designed for NLI. So for this purpose, we use the alignment component of the Stanford RTE system. The Stanford system represents alignments as a map from hypothesis tokens to premise tokens. So phrase alignments are not directly representable, although the effect can be approximated by a preprocessing step which collapses multi-token named entities and certain collocations into single tokens. The scoring function exploits a variety of sources of information about lexical relatedness and also includes syntax-based features intended to promote the alignment of similar predicate-argument structures. And decoding and learning are handled in a similar fashion to MANLI. So here are the results for the Stanford aligner. It outperforms the MT aligners on F1, but recall is substantially lower than precision, and that's even after applying a correction which generously ignores all recall errors involving punctuation, which is systematically ignored by the Stanford system. Error analysis reveals that the Stanford aligner does a poor job of aligning function words. About 13 percent of the aligned pairs in the MSR data are matching prepositions or articles, and the Stanford aligner misses about two-thirds of such pairs. By contrast, MANLI misses only about 10 percent of such pairs.
Function words matter less in inference than nouns and verbs, but they're not irrelevant, and because sentences often contain multiple instances of a particular function word, matching them properly is by no means trivial. Finally, the Stanford aligner is handicapped by its token-based alignment representation and often fails, partly or completely, to align multi-word phrases such as peace activists with protestors, or hackers with non-authorized personnel.

Now here are the results for the MANLI aligner. MANLI was found to outperform all other aligners evaluated, on every measure, achieving F1 10.5 percent higher than GIZA++ and 6.2 percent higher than Stanford, even after applying the punctuation correction that I mentioned. It also achieves a good balance of precision and recall, and it matched the gold standard exactly more than 20 percent of the time. There are three factors which seem to have contributed the most to MANLI's success. First, MANLI is able to outperform the MT aligners principally because it's able to leverage lexical resources to identify the similarity between pairs of words such as jail and prison, or prevent and stop, or injured and wounded. Second, MANLI's contextual features enable it to do better than the Stanford aligner at matching function words. Third, MANLI gains a marginal advantage because its phrase-based representation enables it to properly align phrase pairs such as death penalty and capital punishment, or abdicate and give up. However, the phrase-based representation contributed far less than we had hoped. We did an experiment where we set MANLI's maximum phrase size to one, effectively restricting it to a token-based representation, and we found that we lost just 0.2 percent in F1. We don't interpret this to mean that phrases are not useful. Instead we think it shows that we failed to fully exploit the advantages of the phrase-based representation, chiefly because we lack lexical resources providing good information on the similarity of multi-word phrases.

Error analysis suggests that there's ample room for improvement. A large proportion of recall errors, maybe 40 percent, occur because the lexical similarity function assigns too low a value to pairs of words or phrases which are clearly similar, such as organization and agencies, or bone fragility and osteoporosis. We just don't have a lexical resource that tells us those are the same thing, or related things. Precision errors may be harder to reduce. These errors are dominated by cases where we mistakenly align two equal function words, two forms of the verb to be, two equal punctuation marks, or two words or phrases of other types having equal lemmas. Such errors often occur because the aligner is forced to choose between nearly equivalent alternatives, so these errors may be hard to eliminate.

Okay. So those evaluations were sort of the main event. As a coda to that, I want to look at one other thing briefly. Over the last several slides we've evaluated the ability of the aligners to recover gold standard alignments. But alignment is just one component of the NLI problem. So we might also look at the impact of different aligners on the ability to recognize valid inferences. And the question is, does a high-scoring alignment indicate a valid inference? Well, there's more to inferential validity than just close lexical or structural correspondence.
So things like negations, modals, non-factive and implicative verbs, and other linguistic constructs can affect validity in ways that are hard to capture in alignment. Still, alignment score can be a strong predictor of inferential validity, and many NLI systems rely entirely on some measure of alignment quality to predict validity. If an aligner generates real-valued entailment scores -- sorry, real-valued alignment scores -- we can use the RTE data to test its ability to predict inferential validity using the following simple method. For a given RTE problem, we predict yes if its alignment score exceeds a certain threshold and no otherwise. We tune the threshold to maximize accuracy on the RTE2 development set, and then we measure performance on the RTE2 test set using the same threshold. So here are results for that experiment for several NLI aligners, the top three rows, along with some results for complete RTE systems, including the LCC system, which was the top performer in the RTE2 competition, and an average of all systems participating in RTE2. So I show accuracy and average precision in predicting answers for the RTE2 development and test sets. Average precision was supposedly the preferred metric for RTE2, although in practice everyone seems to pay attention to accuracy, not average precision. None of the aligners rivals the performance of the LCC system. But all achieve respectable results, and in particular the Stanford and MANLI aligners outperform the average RTE2 entry. So even if alignment quality doesn't determine inferential validity, many NLI systems could be improved by harnessing a well-designed NLI aligner.

Given the extensive literature on phrase-based MT, it may be helpful to situate our phrase-based alignment model in relation to past work. Phrase-based MT systems usually apply phrase extraction heuristics to word-aligned training data, which stands at odds with the key assumption in phrase-based systems that many translations are non-compositional. More recently, several authors have presented more unified phrase-based systems that jointly align and weight phrases. But we would argue that this work is of limited applicability to our problem. In MANLI we use phrases only when word alignments are not appropriate, and longer phrases are not needed to achieve good alignment quality. But MT phrase alignment benefits from using longer phrases whenever possible, since this helps to capture more dependencies among translated words, including things like word order, agreement, and subcategorization. Also, MT phrase alignment systems don't model word insertions or deletions as MANLI does. For example, in the example that I showed before, MANLI can just skip in most Pacific countries there, whereas an MT phrase-based model would presumably align in most Pacific countries there are to women are.

Okay. So to wrap up, I think the main ideas to take away are, first, that MT aligners are probably not directly applicable to the NLI problem. MT aligners rely primarily on unsupervised learning from massive amounts of bitext, which are just not available in the NLI setting, and they rely on an assumption of semantic equivalence between premise and hypothesis, which is usually not the case in the NLI setting.
I introduced the MANLI system, which achieves success first and foremost by exploiting both manually and automatically constructed lexical resources, and also by accommodating the frequent unaligned phrases which arise very often in natural language inference. And the third takeaway is that the phrase-based representation shows potential, I think, but we've sort of failed to prove it. And I think the reason we have failed to prove it is that we need access to better phrase-based lexical resources. That's it. Thank you very much.

[applause]

>>: I put together phrases and I wonder which of these aligners will align the best? [Inaudible] by reading book about war is one phrase. [Inaudible] John killed many enemy soldiers. So main verb --

>> Bill MacCartney: Can you read the second one again?

>>: Over [inaudible] over time John killed many enemy soldiers. So the correlation is [inaudible].

>> Bill MacCartney: Yes, that's a very difficult problem. I mean, is the question what should the gold standard be, or is the question what will my system --

>>: [Inaudible] the phrase [inaudible] because obviously it's aligned perfectly, because John killed, John killed.

>> Bill MacCartney: Yes. Yes. So the words are very similar, the structure is very different.

>>: Absolutely. So each one can catch this.

>> Bill MacCartney: I think the ones that would have the best chance of catching -- I actually think that the Stanford aligner might have an advantage on a problem like that, and the reason for that is the Stanford aligner is the only one of the ones I've presented that takes syntax into account. It explicitly incorporates syntactic features that essentially look at the paths through the syntax tree, actually through a typed dependency tree, between candidate aligned pairs, and includes that as a feature in the machine learning model for what should be aligned with what. So the MANLI system doesn't include any syntactic features explicitly. It's something that I think should be in there, and I wish it were in there, but I didn't have time to put it in there. The MT aligners I would not expect to get that particular kind of problem right, and certainly the baseline aligner would not; the baseline aligner is ignoring structure and just matching based on words.

>>: [Inaudible].

>> Bill MacCartney: That's right.

>>: [Inaudible] and that's when you use the syntax.

>> Bill MacCartney: That's right. Yeah. Yeah. There's -- it's [inaudible].

>>: [Inaudible].

>> Bill MacCartney: Right.

>>: That briefcase.

>> Bill MacCartney: Right. Yeah. There are a lot of RTE systems that essentially use alignment as a proxy for inferential validity and say if we have a good alignment, then it's a valid inference. That strategy is actually surprisingly effective, and it's more effective on some of the RTE data sets than on others. So I've done experiments where I've used a model which is pretty much like the baseline bag-of-words model that I described there, and tested how well it does on the different RTE data sets at predicting the RTE answer, inferential validity, and there's tremendous variation from one RTE data set to the next. So on RTE1, I think that system got something like 59 percent accuracy; on RTE2, I think it got 62 or 63 percent accuracy; and then on RTE3, I think it got 67 percent accuracy, which is very competitive with many of the systems that people worked really hard on for months and months and months.
But more and more, I think people are recognizing that an RTE system needs to include more than just a measure of alignment quality. Yes?

>>: So it is interesting that the phrasal matches haven't provided much benefit yet, and I think your analysis seems spot on. It's just tough to learn those resources.

>> Bill MacCartney: Yes.

>>: You're probably aware of some work that Chris Callison-Burch has done in the past.

>> Bill MacCartney: Yes.

>>: [Inaudible] I wonder if you've thought of either [inaudible] phrase tables or using live translation tables, for instance, to say here are two phrases that translate into the same word. So if I looked up capital punishment and death penalty and they both translate into the same target phrase using one of these online translation systems. I mean, have you thought about exploiting some of those?

>> Bill MacCartney: Yeah, we've thought about it and we're currently working on it. I think it's a really promising idea. I thought that was a great idea, a great contribution of his, and one of my colleagues at Stanford is currently working on sort of reimplementing that and extracting what from our perspective will basically be a new lexical resource, or phrase-based lexical resource, which is derived from pivoting through MT. And we didn't have time to integrate it into this work, so we haven't yet reaped any benefit from it. But we have high hopes for the future.

>>: And another easy thing to do -- I just tried it again with German: capital punishment and death penalty both go to [inaudible] in German. So, you know, if you just want to take all the phrases in your data and shove them at a translation system to 12 different languages and see which ones land, I mean, you have 12 independent feature functions.

>> Bill MacCartney: Yeah. I think it's really promising. We actually looked at two different -- well, it's not yet finished, but we are currently working on two different ways of trying to get phrasal equivalences. So the MT pivoting through translation tables is one of them, and the other thing we have explored is a DIRT style, so DIRT paraphrases, trying to get phrasal equivalences from that. That one is a little bit further along in its implementation. We had an undergrad student working on it over the summer. I think he did a good job, but we didn't get very much value from the results. So that one was a bit disappointing. But I have higher hopes for the MT stuff. Yes?

>>: Going back to the first half of the talk.

>> Bill MacCartney: Yes.

>>: So in order to do a system like NatLog, what are the lexical resources that are required [inaudible] to do this, and so like what is the size of your lexicon [inaudible] would you get more if you had more lexical [inaudible]?

>> Bill MacCartney: Let's see. There's kind of a nest of interrelated questions there. So we used a bunch of different lexical resources. We kind of have a collection of standard lexical resources that we use in lots of different contexts. We use it in the main Stanford RTE system, which is different from this, and we also use it here. So certainly we make heavy use of various lexical resources that are based on WordNet. Some of them are very specific, like the ones that tell us whether two words are antonyms. Others are more general measures of semantic relatedness. So we used the Jiang-Conrath measure, which is based on path lengths through the hyponym hierarchy in WordNet.
So we have a bunch of lexical resources that are based on well-known, publicly available, manually constructed lexical resources like that, such as NomBank. Then we also have, as I mentioned earlier, a lexical resource based on Dekang Lin-style distributional similarity; we get a lot of mileage out of that. For NatLog we also found it very important to use some lexical resources specifically constructed for NatLog that probably would not be of general utility. This included things like quantifier categories. In particular, in the FraCaS data there are a lot of problems that involve relations between different quantifier categories and what happens when you replace a universal quantifier with an existential quantifier, and things like that. And of course there's more than one universal quantifier, so you need to be able to recognize that every belongs to this universal quantifier category, and what relation that category has to other categories. So a lot of that was hand-crafted specifically for NatLog, specifically to be able to handle questions involving quantifiers. From one perspective that doesn't scale up very well, but on the other hand, there are not very many quantifiers, so there's not that much scaling to do. Yes? >>: [Inaudible] curious, you know, verbs like refuse and -- >> Bill MacCartney: Yeah. >>: Because those are a little bit more [inaudible]. >> Bill MacCartney: A little bit, although maybe not as much as you might think. And happily for us, some other people have already done some of the hard work there. So another important feature that we had specifically for implicatives and factives was the implication signature, and we leaned on some work that's been done at PARC in this area. They've compiled lists of different verbs and verb-like constructions according to their implication signature. So they have an account in which there are nine different implication signatures, and they have lists of verbs that fall into each of those, and we relied on those. Most of the implication signatures actually only have from maybe half a dozen to a couple of dozen instances; the biggest one has a few hundred. Yes? >>: How big does the lexicon have to be to get [inaudible]? In other words, presumably there will be things like adjectives like [inaudible] and so on. [Inaudible]. >> Bill MacCartney: Yeah. >>: How large a lexicon? >> Bill MacCartney: I mean, for that particular feature, the implication signature feature, the total size of the set of words that it's looking for is not more than a couple of hundred, and it is mostly verbs. I don't think able is in there, although it should be, yeah. There's another one I didn't mention here: I have another, smaller hand-crafted list of non-intersective adjectives, which also break these rules, and I need to be able to recognize them. So this is adjectives like fake or former or alleged, and I have a little list of the non-intersective adjectives. Probably somebody has already put together a better list than the one I have; I just didn't put any effort into going out and finding it. But there are a number of things like this. And then also, this was a fair amount of work: I came up with a list of lexical entailment problems and went through and hand-annotated them. It's actually pretty quick to annotate, but that's a lot of problems, so it took a little while. So there was a certain amount of labor that went into that as well. >>: [Inaudible]. >> Bill MacCartney: Yes. >>: [Inaudible].
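[Editor's sketch: for concreteness, a toy illustration of how an implication-signature lexicon of the kind just described might be consulted. The particular verbs, the signature notation, and the mapping from signatures to entailment relations below are simplified assumptions, not the actual PARC lists or NatLog code.]

```python
# Only the relations needed for this sketch; plain names rather than the paper's symbols.
FORWARD, ALTERNATION, INDEPENDENCE = "forward-entailment", "alternation", "independence"

# Signature "x/y": x is what a positive context entails about the complement,
# y is what a negative context entails ('+' true, '-' false, 'o' nothing).
# Four illustrative entries; the PARC account distinguishes nine signatures.
IMPLICATION_SIGNATURES = {
    "manage":  "+/-",  # "managed to X" -> X;     "didn't manage to X" -> not X
    "fail":    "-/+",  # "failed to X" -> not X;  "didn't fail to X" -> X
    "refuse":  "-/o",  # "refused to X" -> not X; "didn't refuse to X" -> unknown
    "attempt": "o/o",  # neither context entails anything about X
}

# Relation generated by deleting the implicative ("refuse to dance" vs. "dance"),
# read off the positive-context half of the signature (a deliberate simplification).
DELETION_RELATION = {"+": FORWARD, "-": ALTERNATION, "o": INDEPENDENCE}

def relation_for_deletion(verb):
    """Entailment relation between 'VERB to X' and 'X' in an upward-monotone context."""
    signature = IMPLICATION_SIGNATURES.get(verb)
    if signature is None:
        return INDEPENDENCE
    return DELETION_RELATION[signature.split("/")[0]]

print(relation_for_deletion("refuse"))  # alternation: "refuse to dance" excludes "dance"
print(relation_for_deletion("manage"))  # forward-entailment: "manage to dance" entails "dance"
```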
>> Bill MacCartney: Yes. >>: [Inaudible]. I'm thinking some lexical relationships, they're disjoint classes, but the ones that are disjoint [inaudible] versus disjoint [inaudible]. >> Bill MacCartney: Oh, yeah. And there were many difficult cases when I was doing that annotation. There are many pairs of words where it's really not clear what the right relationship is, or it may depend on context, it may depend on topical context, or the words may be sort of intrinsically ambiguous, like system and approach, for example: what's the relationship between those two? I mean, typically, for lots of nouns, I put them into the alternation relation, whether they're related like cat and dog, which are pretty clearly related to each other but disjoint, or two words that have nothing to do with each other, like cloud and cat; those are still disjoint sets, right? But what about system and approach? Well, maybe in some contexts those are two separate things, and in some contexts maybe they're the same thing, and it's really just hard to say. And there were lots of examples like that. >>: [Inaudible]. >> Bill MacCartney: Yes. >>: Okay. >> Bill MacCartney: Yes. So it would have been better to get somebody else to do it, but as a grad student I don't have many resources to call on. >>: So [inaudible] an example is a tall basketball player. >> Bill MacCartney: Yeah. >>: A short basketball player who is still nevertheless taller than the community at large. >> Bill MacCartney: Yes. Yes. >>: [Inaudible] you don't want to [inaudible] necessarily. >> Bill MacCartney: Right. Yeah. >>: Do you have any -- the Indian elephant: what's the difference between an African elephant and an Indian elephant? An Indian elephant has small ears, for example. >> Bill MacCartney: But they're still big ears. >>: But they're still big ears. >> Bill MacCartney: Yeah. I'm afraid I don't have anything useful to say about that, and the NatLog system would probably get it wrong. That's one of a million hard problems, I think, in semantics. >>: Also, what about the [inaudible] position, the problem of [inaudible]. >> Bill MacCartney: Yeah. So I don't know if this is a general solution or not, but I have a solution which works at least for a few examples that I looked at. I actually handle some of those non-intersective adjectives as being similar to implicatives, in that you can specify the entailment relation generated by deleting that adjective. So for example, let's think about alleged: an alleged criminal and a criminal. The entailment relation between alleged criminal and criminal is independence, because an alleged criminal might or might not actually be a criminal, and a criminal might or might not be alleged to be a criminal. Right? Whereas former has different behavior. A former student and a student stand in the alternation relation: if you're a former student, you are not a student anymore. Presumably. Right? So, at least for those examples, you can categorize non-intersective adjectives according to what entailment relation is generated by their deletion, and then that can be propagated up the composition tree and changed along the lines that I described. And there are a couple of examples along those lines in the FraCaS test suite, which I believe are all correctly handled by NatLog. All right. Well, thank you very much. [applause]
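[Editor's postscript: a minimal sketch of the adjective treatment described in that last answer, in which each non-intersective adjective is tagged with the entailment relation generated by deleting it. The small lexicon below is illustrative, not the actual NatLog list.]

```python
FORWARD, ALTERNATION, INDEPENDENCE = "forward-entailment", "alternation", "independence"

# Entailment relation generated by deleting the adjective from "ADJ noun".
# Three illustrative entries; the real hand-crafted list is longer.
NON_INTERSECTIVE = {
    "alleged": INDEPENDENCE,  # an alleged criminal may or may not be a criminal
    "former":  ALTERNATION,   # a former student is not (currently) a student
    "fake":    ALTERNATION,   # a fake gun is not a gun
}

def deletion_relation(adjective):
    """Relation between 'ADJ noun' and 'noun' when the adjective is deleted."""
    # Ordinary intersective adjectives just generate forward entailment on
    # deletion: "tall man" entails "man".
    return NON_INTERSECTIVE.get(adjective, FORWARD)

print(deletion_relation("alleged"))  # independence: alleged criminal vs. criminal
print(deletion_relation("former"))   # alternation: former student vs. student
print(deletion_relation("tall"))     # forward-entailment: tall man entails man
```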