>> Geoffrey Zweig: I am happy to introduce Professor Dan Roth today. He is professor of computer science at the University of Illinois at Urbana-Champaign and the Beckman
Institute. He is also the director of the Center of Multimodal Information Access and
Synthesis there and holds positions in statistics, linguistics, and library science, as well as his main appointment in computer science. For those of you who are going to it, he is program chair for the upcoming AAAI conference. A lot of his research is focused on the intersection of NLP and machine learning, and that is what he is going to talk to us about today, specifically constraint-driven language learning.
>> Dan Roth: Thanks Geoff. Thanks for coming. As I commented, actually a lot of the projects I have been running in my group have relations to the stuff going on here, from the textual entailment and ESL projects to others. All of them are related at some level to natural language understanding, but I have chosen to talk about constraint-driven learning, partly because I think it has the broadest applicability.
I am sure that once I am done all of you will take these lessons and start using them. And also, because of the Microsoft connections, I am going to start with stuff that Scott did five years ago and finish with stuff that Ming-Wei Chang, who is going to join here in a couple of months, has been doing. This allows me to give credit to both of them and to many other people in my group, and I also have to give some credit to the funding agencies that fund this work.
So let me start by talking about the problem I want to solve. Here is an example. This is a short story of the kind given to kids in elementary school as a reading comprehension test.
This one is taken from a third-grader's test. After each passage like this, they are given four questions, which I have reformulated as statements rather than questions because I often think about this from the perspective of textual entailment, but that doesn't change anything. Unfortunately, we don't know how to answer these questions. I mean, the best programs we can write will do at best 50%, even if we relax the success criteria.
And the question is why; why is it that every third grader can do this? Once they can read, they can answer these questions. And if you start thinking about it, you see that there are a lot of problems here. So here is the first one. You see Christopher Robin,
Chris, Mr. Robin: are these the same person?
Most of you say... who votes for no? Okay, the majority is no, so I assume it is no. And if you read carefully, it is indeed no. But you have to know something in order to make that decision; it is not so simple. There are many other things. There are a lot of pronouns here. You have to decide who is "him", who is "his", who is "he". And again it is not trivial; you have to think about some of these things. There are many relations expressed here. Someone wrote a poem. Someone wrote a book. Someone has written two books. So you have to be able to understand the text at the level of who does what to whom to have a chance to answer these kinds of questions. Now, we have made a lot of progress over the last 10 or 15 years in solving each one of these problems in isolation. However, the key problems, I think, are how to put these things together. Once we develop some capability in the direction of one problem, we still haven't developed the capability to answer questions, because that has to do with integrating
this information with a lot of other things outside the story, so we can answer the questions. So this is what I call an inference problem, and this is really what I want to focus upon. I will give you two other, more isolated examples just to explain what I mean. Think about the simpler problem of just analyzing sentences at the level of who does what to whom, when and where. We call this semantic role labeling. Most of the work in semantic role labeling so far has focused on verb predicates, but really information is carried by other predicates too, like the prepositions here, the by and the of.
So if you think about the simple sentence, the touchdown scored by McCoy cemented the victory of the Eagles, and for those of you who follow the NFL, this really happened, you can think about the predicates here. Victory is a predicate, a nominalization. Scored is a verb predicate. The of in victory of is also a predicate. And figuring out the
A0 of victory has to be in agreement with the specific sense of the word of.
Similarly, the verb score has McCoy as its A0, and that has to be coherent with the specific meaning of the predicate by in this case. So again we have to solve multiple problems here, and they have to agree with each other. And if you think about it for a second, you can see that by has multiple meanings, not necessarily the one I am using in this sentence.
Here's another example, semantic parsing, which is really part of how I think about natural language processing: as a tool for communication; or maybe communication is what natural language is about, for the most part. So think about the case where you want to start with natural language text and transform it into some formal representation that you can use, for example, to access a database. Even in this simple example, where you want to turn what is the largest state that borders New York and Maryland into a formal representation, successful interpretation of the sentence involves multiple decisions. For example, you want to recognize entities like New York and Maryland, and you want to classify them. What is New York
[inaudible]? Probably because you also see the mention of Maryland, it is more likely that New York in this case is a state rather than a city. But in addition you want to compose the fragments that you extracted from the sentence into a representation, and clearly composing largest, state, and next_to in one order is different from composing them in another, so you have to think about how these predicates interlock as well.
So the bottom line is that natural language decisions are structured, or what we in NLP call structured. A global decision is a collection of local decisions, each playing an important role, with mutual dependencies among their outcomes.
So it is really important that we make coherent decisions that take these interdependencies into account, which calls for doing some joint, global inference. However, and this is one of the messages I want to send today, there is a bit of confusion in the literature between joint inference and joint learning. Even though joint global inference is essential, learning structured models requires annotating structures, and it is not so clear what the best way to learn is when you want to support global inference. Sometimes decoupling learning from inference is the best thing to do, and there are some theoretical results that justify this; and
sometimes you do want to do some kind of joint learning to take more advantage of the interdependencies among decisions.
So in the rest of the talk I want to focus on three ideas, and if you stay with me for this slide, that is really all I want to say. One message is that I want to suggest a formulation that separates the modeling, the problem formulation, from the algorithms. If you think about it, probabilistic modeling, which is so common in our community, has the same nature: the goal there is to formulate problems, and then the algorithms follow. I am going to suggest a different formulation, but it has the same kind of philosophy: separate the formulation from the algorithms.
The second idea, where I completely deviate from the nature of probabilistic modeling, is to keep the model simple. You still need to make expressive decisions, but that doesn't mean you are required to learn a complex model. Keep the learned model simple and make expressive decisions via, for example, constraints. As I said, this is a big deviation from the commonly used approaches in probabilistic modeling.
And the third message I want to send is that when we want to make expressive structured decisions, they can often be supervised by rather simple binary decisions, binary labels. You don't really need to carry the load of annotating every structure if you want to make decisions with respect to those structures. So the talk is going to be, hopefully, sending these three ideas. You can think about the first one as a modeling idea; the second one is inference, although it has some learning component; and the final one is really about learning. So that is what I want to talk about, and I am going to start with the formulation and the model we have been talking about over the last few years…
>>: [inaudible] previous one there. So I assume there can be significant disagreement on what the structure should be and how it should be annotated. Different people have different opinions on the structure and whatnot. Whereas with binary decisions many more people could agree: yes, this is correct; no, this is incorrect. Is that something…
>> Dan Roth: That has a lot to do with it. And in fact for many decisions, we eventually want to make a binary decision, but on the way we must deal with structures.
And often there is disagreement on the exact annotation or the exact labeling of the structure, but there is agreement on the final decision. I am going to give an example like that, in fact taken from a [inaudible] that was built here, the [inaudible] [inaudible]. Alignment is a very ill-defined notion, but people often agree on the final decision. So that would be an example of that, and that is something that drives why you want to be able to do this. By the way, I did not say this before: since we scheduled the talk for an hour and a half, feel free to ask questions and stop me.
The model that is going to underlie everything we talk about is the model that we call constrained conditional models. The literature often refers to it as ILP, for integer linear programming. It is basically a very simple model. All of us doing NLP are using
some variation of it in one way or another; even if you don't think you are using it, take a look at your objective function and that is what you are using. So it has two components. On the left side you have the weight vector for local models, and I think about these F(x,y) as local models, classifiers, or features; they could be an HMM/CRF or a combination of these. So this is the usual thing that you are doing all the time.
And the right side is my way to express constraints, or soft constraints, that bias this model one way or another. Again you have weights here, which you can think of as penalties for violating the constraints, and the distance function that you see is a way to encode how far your proposed decision vector Y is from a legal assignment. Once you think about it this way, you immediately come up with two kinds of computational problems. There is the question of how to solve this problem. It is an inference problem, an integer linear program, because we care about assigning values to these decision variables, which in most cases are Boolean. But I care only about the formulation; in fact, the way you solve it is your problem. You can use commercial packages, and in fact there are excellent commercial packages; in the last five years there has been a really huge improvement in what kinds of problems we can solve with commercial ILP solvers. But as far as I'm concerned you can do cutting planes, you can do dual decomposition, you can do search, whatever you want.
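For reference, here is one standard way to write the objective just described; the notation is reconstructed from the description above rather than copied from the slides:

```latex
% Constrained conditional model (CCM) objective:
%   left term:  learned local models (classifiers, HMM/CRF features, ...)
%   right term: penalty rho_k for each violated constraint C_k, where
%               d measures how far y is from an assignment satisfying C_k
\hat{y} \;=\; \operatorname*{argmax}_{y}\;
    \mathbf{w}^{\top} F(x, y)
    \;-\; \sum_{k=1}^{K} \rho_k \, d\big(y,\, \mathbf{1}_{C_k(x)}\big)
```

Setting a penalty rho_k to infinity turns C_k into a hard constraint; finite penalties give the soft constraints just mentioned.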
So that is one type of computational problem. I am not going to say a lot about it, even though there is a lot to say. Then there is the question of how to train these models. Training really means learning this objective function, and there are many questions here; a lot of thought has gone into sub-portions of this. Do we want to decouple the left side from the right side? Do we want to decompose the left side into multiple models so that learning becomes easier, or do we want to learn everything together? And there are also questions about how to exploit the structure in order to minimize the supervision. I am going to attend mostly to this last point, because there is not enough time to talk about all of these issues. So I could actually [inaudible] a little bit more with this slide, and as I said, there are inference problems, there are training problems, and there is a lot of work on joint learning versus joint inference. Motivated by the difficulty of annotation, I am going to attend mostly to training, and I am going to say something about these two kinds of training.
The outline of the rest of my talk follows the three ideas I want to focus on. I am going to start with modeling, then move to how we can keep models simple and still make expressive decisions. The second part of my talk will be devoted to indirect supervision. I am going to talk about several settings, and I think the most interesting one is the last, where we deal with indirect supervision that comes directly from the world's response to what a learning agent is doing. Let me start with some background. Pipelines are still the most common way of building NLP systems, and there are reasons for that.
Most problems are not single classification problems. We start with raw data. We perhaps do POS tagging first. We believe the results, and then we do some phrase chunking; then we believe those results and identify semantic entities. We believe these, do some other things, and so on. You can think about multiple pipelines of this kind in NLP, and in all cases we know that this is really not the right thing to do, because a pipeline is a crude approximation of what we want. Often downstream decisions are easier than decisions we have already made, and we would like to revise our earlier view once we have made a downstream decision. Still, there are some good reasons to use pipelines, because putting everything together into one bucket is probably not right either; there are engineering and scientific reasons why that may not be right. But if we want to deviate from the pipeline, how about choosing some consecutive stages and thinking about them jointly?
So I am going to give you an example taken from Scott's thesis; it is five or six years old, but I think it still delivers the point quite well. Think about the two stages of identifying entities and the relations between them together. Here is a very short sentence with only three entities: Dole, Elizabeth, and North
Carolina. Among these three entities there are quite a few relations, because we think about directed relations, but I am showing only two relations here. Assume that you have local models; this is the standard [inaudible] pipeline from
the input X to entities to relations. The local models give me conditional probability tables over the entities and over the relations between entities, and what I show here, because I don't have more room on the slide, are just very small conditional probability tables. I have three options for entities, person, location, and other, and three options for relations, spouse_of, born_in, and other.
Now the simplest way to make a decision is to take the label with the highest value in each table, and I can do that and I have a decision. But clearly, if you look at this, for example at these three boxes, you see that something is wrong, right? You have two people, and the relation between them is born_in; that doesn't make sense, and I would like to change the decision. The easiest way to change a decision here, because in terms of score it is the smallest perturbation I can make, is to change one of the entity decisions from person to location. You can look at the other three boxes, and again you see two people with the relation born_in between them.
Again, it violates the natural constraint and I want to be able to change it, and in this case I am choosing to change the relation from born_in to spouse_of.
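To make this inference step concrete, here is a minimal sketch of it as a 0-1 integer linear program in Python with the PuLP solver. The scores are hypothetical stand-ins for the conditional probability tables on the slide, and the constraint encoding (an implication a => b over 0-1 variables written as a <= b) is the standard trick, not code from the talk:

```python
# Joint entity-relation inference as a 0-1 ILP (all scores are made up).
import pulp

ENT_LABELS = ["person", "location", "other"]
REL_LABELS = ["spouse_of", "born_in", "other"]

# Hypothetical local scores from independently trained classifiers.
ent_score = {
    ("Dole", "person"): 0.6, ("Dole", "location"): 0.3, ("Dole", "other"): 0.1,
    ("Elizabeth", "person"): 0.7, ("Elizabeth", "location"): 0.1,
    ("Elizabeth", "other"): 0.2,
}
rel_score = {"spouse_of": 0.4, "born_in": 0.5, "other": 0.1}  # Dole -> Elizabeth

prob = pulp.LpProblem("entities_and_relations", pulp.LpMaximize)
e = {k: pulp.LpVariable(f"e_{k[0]}_{k[1]}", cat="Binary") for k in ent_score}
r = {k: pulp.LpVariable(f"r_{k}", cat="Binary") for k in rel_score}

# Objective: total score of the chosen labels.
prob += (pulp.lpSum(s * e[k] for k, s in ent_score.items())
         + pulp.lpSum(s * r[k] for k, s in rel_score.items()))

# Each entity and the relation get exactly one label.
for name in ("Dole", "Elizabeth"):
    prob += pulp.lpSum(e[(name, l)] for l in ENT_LABELS) == 1
prob += pulp.lpSum(r[l] for l in REL_LABELS) == 1

# born_in(x, y) needs x person and y location; spouse_of needs two persons.
prob += r["born_in"] <= e[("Dole", "person")]
prob += r["born_in"] <= e[("Elizabeth", "location")]
prob += r["spouse_of"] <= e[("Dole", "person")]
prob += r["spouse_of"] <= e[("Elizabeth", "person")]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([k for k, v in {**e, **r}.items() if v.value() == 1])
```

With these numbers, the independent argmax picks person, person, born_in; the ILP instead flips the relation to spouse_of, exactly the correction described above.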
So this all makes sense, and in fact if you do it experimentally you get significant improvements over no inference. But there are several questions. One point to make: these are non-sequential decisions. Unlike many of the decisions we make in NLP, if I put up the whole graph of relations and entities here, you would see that it is not sequential, so we have to think a little harder about how to support this kind of decision. And the key questions are how to guide this global inference, and of course why not learn jointly. I am not going to attend to the second question now; as I said, there are a lot of [inaudible] points. I am just going to make one point, which is that it is often the case that the models are learned separately. Someone gave me the entity model, and I don't want to relearn it together with the relation model; I just want to use it as is. And often the models cannot be learned on the
same data, because we don't have jointly labeled data. So there are many issues involved here, but I am going to attend mostly to how to guide this global inference. The key components are: I need to write some objective function, which is going to be linear, and I am going to write down the constraints as linear inequalities. I don't want to go into the history, but today people view this formulation as something simple, and that's good. The history has been a little more involved: there are papers as late as 10 years ago where some [inaudible] claim that it is impossible, or that they don't know how, to set up what people call metric labeling type problems as linear programs. There have been solutions that came from the theory community [inaudible] that suggested a formulation, and we actually adopted and slightly modified it.
Again I want to say that you can really think about this formulation as a general interface that allows you to easily combine domain knowledge, encoded as constraints, with data-driven statistical models, and I am going to move on to give some examples of how you can use these kinds of models. I am trying in my examples to [inaudible]. If you like HMMs/CRFs, you can think about sequence tagging as your running example. In sequence tagging we basically have a linear model, something like this; think about an HMM/CRF-like model, and everything is nice and good. We can learn; we can decode. What happens, though, if you want to add more expressive constraints? For example, suppose you want to say: I cannot have states of both type A and type B in the output sequence. You cannot express this constraint in a simple HMM/CRF model. What you can do, of course, is make the model more complex; but this formulation allows me to take the simple model on the left side, encode the constraint on the right side, and add it explicitly at decision time, for example.
If you prefer language models, let's think about an example that comes from this way of modeling: sentence compression. I am starting with a long sentence, and I am going to compress it in such a way that most of the stuff in the sentence, the meaning if you want, is still maintained. One natural way to do it, and there has been a lot of work on this, is to use a language-model-based objective. The indices here,
i, j, k, indicate that I am taking a trigram-based model, and this objective function basically says that I am going to choose the heaviest trigrams into my compressed sentence, and I am going to be happy.
Now of course this is not going to give me a good sentence. I want to add some legitimacy constraints, and one simple constraint could be: if you choose a modifier into your compressed sentence, include also its head; or if you choose a verb, include its core arguments, for example. Now, once you want to enforce these on the output, you cannot deal with the simple model as is. The learning could become a little harder, and what we are suggesting is: don't make it harder. Keep the simple learning model on the left side and enforce these linguistic constraints using the right side, at decision time.
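One plausible way to write this down, with notation that is mine rather than from the slides: let y_i be a 0-1 variable for keeping word i, and x_ijk a 0-1 variable for selecting the trigram (i, j, k), with lambda_ijk its language-model weight:

```latex
% Trigram-based compression with sketched legitimacy constraints
\max_{x,\,y}\; \sum_{i<j<k} \lambda_{ijk}\, x_{ijk}
\quad\text{s.t.}\quad
x_{ijk} \le y_i,\;\; x_{ijk} \le y_j,\;\; x_{ijk} \le y_k
\;\;\text{(a trigram may only use kept words)}

y_m \;\le\; y_{\mathrm{head}(m)}
\;\;\text{(keep a modifier only if its head is kept)}
```

The published compression formulations differ in details (for example, they also tie kept words back to chosen trigrams), but this is the shape of the objective and of one linguistic constraint.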
And the final example, which I will go over in a bit more detail, is semantic role labeling. In this case the model is a collection of independent classifiers that I learn locally, and then I put them together using global constraints. I am assuming that in this crowd most people know about semantic role labeling, so I don't need to explain it. Very briefly, it is the task of understanding sentences at the level of who does what to whom, when and where. So I want to know, in this case, that I am the leaver; my pearls are the things left; to my daughter is the beneficiary; and in my will is the location argument. You can think about it as a chunking problem where I chunk the sentence into parts and then color them based on the types. Of course I'm not going to commit to a single color but rather to a distribution over possible colors, but I don't know how to draw that. Now, there are many ways to chunk and color sentences, and some of them are not going to be legitimate. They are not legitimate because, for example, the arguments overlap, and this formulation does not allow me to have overlapping arguments; or it could be illegitimate because if A2 is present, I know that A1 must also be present; or many other linguistic constraints. I want to be able to support these kinds of things and modify my output to satisfy these constraints.
So again, this is a well-known problem in the NLP community; there is a large data set put together for it, and I am not going to dwell on that. Here is the sort of algorithmic recipe, and, you know, you can modify and tweak it, but it is going to be good enough for our purposes. You start by identifying the argument candidates, the O(N²) consecutive candidates
if you have a sentence of size N, and you want to prune them somehow. Next you want to classify them: as I said, you want to color them, or propose a distribution over colors. Finally you want to take these decisions and turn them into one coherent output, and this is the stage I am going to focus on. You want to use the estimated probability distributions that you got from your learning stage, together with structural and linguistic constraints, to infer the optimal global output.
So this is what you really have. You have a collection of possible chunks, and each one comes with a probability distribution over the colors. You can choose the top one in each case, but then you get overlapping arguments. If you instead sometimes choose one that is not the top (the gray one means non-argument), you get something coherent. So you are going to run one ILP inference here for every predicate, and this is the problem you are going to solve. For each argument candidate a_i and each type t, you set up a Boolean variable z_it that indicates whether a_i is classified as type t, and then your goal is to maximize the linear objective function, the sum of score(a_i = t) times z_it, where each z_it is 0 or 1, subject to linear constraints.
And, you know, you could really plug many things into the score, probabilities or non-probabilities; each choice has a somewhat different meaning, but you get the flexibility to choose whatever you want here. The constraints are the interesting part, and you can say a lot of things. You can start by saying that you don't want duplicate argument classes, and if you think about it, this is just a linear
inequality: I am summing, over all potential arguments, the variable that says the argument is labeled A0, and I require that there be at most one like this.
And I can say things like: if I have a reference argument R-ARGx anywhere, then I must have the ARGx somewhere in the sentence; or many other things, including, for example, unique labels. I can say that I don't have overlapping arguments. I can set up relations between the numbers of arguments. I can say that if I have an argument of this type, I cannot have an argument of that type; or if I have an argument of this type, I cannot have more than three arguments; and so on. The bottom line is that we know any Boolean rule can be encoded as a collection of linear constraints, so in fact this is a very expressive thing.
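Written out, with notation reconstructed from the description (the standard trick is that an implication a => b over 0-1 variables becomes the inequality a <= b):

```latex
% SRL inference as a 0-1 ILP
\max_{z}\; \sum_{i}\sum_{t} \mathrm{score}(a_i{=}t)\, z_{it}
\qquad\text{s.t.}\qquad
\sum_{t} z_{it} = 1 \;\;\forall i
\;\;\text{(each candidate gets exactly one label)}

\sum_{i} z_{i,\mathrm{A0}} \;\le\; 1
\;\;\text{(no duplicate A0)}
\qquad
z_{i,\mathrm{R\text{-}AX}} \;\le\; \sum_{j} z_{j,\mathrm{AX}} \;\;\forall i
\;\;\text{(a reference argument requires its referent)}
```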
Any constraint that you want to declare here, you can actually encode. And in fact we have a tool that we call LBJ, for Learning Based Java, which I am going to mention again a little later, that allows developers to encode constraints in a relatively high-level language. The tool compiles your constraints into linear inequalities, so you don't have to think about how to write them.
Okay. So the bottom line is this. You can go to the demo on my group's webpage and play with it. You have this objective function and a bunch of constraints, not that many, and it produces a pretty good system overall. Okay, so I have finished the first idea.
Hopefully I convinced you that you can separate modeling and problem formulation from algorithms; in fact, even though I formulated this as an ILP, you don't have to run an ILP solver, and indeed the demo system on my webpage runs a search rather than an
ILP solver. The problem is still an ILP.
So the next thing I want to talk about is keeping the model simple. Before I get to it, just a summary slide. There has been a lot of work over the last five years on these models, on
ILP in NLP, in many, many contexts: semantic role labeling, summarization, coreference resolution, information extraction, transliteration, and many others. People have won a number of best paper awards on ILP formulations, including at this last ACL and earlier. There is some theoretical understanding in different contexts, and you can go and look at it. There are a couple of tutorials we have given; the most recent one was last year at ACL, and you can go and look at the slides. There was a workshop at ACL
'09, and you can see a fairly recent bibliography, updated within the last year, on my webpage.
But I want to move to training. We are still dealing with structures, and it is difficult to annotate structures, so let's see what we can say about training. The running example I am going to use at the beginning is that of citations, an information extraction task.
I want to take the citation of this talk and parse it to the level of who is the author, what is the title, who is the editor, what is the book title, and so on. What you see here is the prediction of a trained HMM, basically the left side of my objective function, this linear model, and it is easy to see that it is very bad: it violates a number of very natural constraints; you don't want your output to look like that. What can you do? The standard machine learning approach is to move to higher-order models, higher-order HMMs/CRFs, basically adding more expressiveness, which means increasing the model complexity. You can do this, but what if you don't have enough data to support the increased model complexity? What options do you have? Can we keep the model simple and still make expressive decisions?
Basically, the intuition here is that these outputs make no sense, and I want to push them in a direction that makes sense without changing the model itself, but rather by pushing the model to behave better. As for examples of constraints: it is very easy to sit for five minutes and write them down, and in fact that is what we did for this model. There are ten constraints, some of them written here. Each field must be a consecutive list of words and can appear at most once in a citation. Very natural, and we all know this; you want to be able to enforce it. So these are very easy-to-express pieces of knowledge. In most cases they are not propositional; they have some kind of quantifier in them. If you do this on this specific example, without changing the model, just adding the right side that pushes the model the right way, you get a perfect output. Of course you will not always get the perfect output, but this is a nice example. So that is the lesson here: I want to learn a simple model and nevertheless make decisions with a more expressive model, and I can accomplish this without changing the learned model, by incorporating constraints that bias, or re-rank if you want, the decisions made by the simpler model.
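As an illustration of how such a piece of knowledge becomes the right-hand side of the objective, here is a minimal sketch (the field names and penalty weight are hypothetical) of the "each field is consecutive and appears at most once" constraint as a soft penalty, one possible instantiation of the distance d(y, 1_C):

```python
# Count violations of: "each field must be one consecutive run of words".
from collections import Counter

def constraint_violations(labels):
    """Number of extra (non-consecutive) runs per field, summed over fields.
    `labels` is one label per token, e.g. ["AUTHOR", "AUTHOR", "TITLE", ...];
    "O" marks tokens outside any field (an assumption of this sketch)."""
    runs = Counter()
    prev = None
    for lab in labels:
        if lab != prev:
            runs[lab] += 1
        prev = lab
    return sum(n - 1 for lab, n in runs.items() if lab != "O" and n > 1)

def constrained_score(model_score, labels, rho=2.0):
    # Left side: score of the simple (e.g. HMM) model for this labeling;
    # right side: soft penalty rho per constraint violation.
    return model_score - rho * constraint_violations(labels)
```

At decision time you would rank candidate label sequences by `constrained_score` instead of by the raw model score.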
So I am going to illustrate this. This lesson is applicable in many contexts. I am going to illustrate it very quickly in the context of semi-supervised learning.
>>: Do you have to train the models jointly; I mean, do they have to be together? For example, what if your HMM system gave you a bunch of options to choose from and all of the options were screwed up, because the HMM system, when it generated those options, was not aware of any of these constraints?
>> Dan Roth: That is a very good question. And you can think about how you want to do it. Basically you are asking whether you want to train these lambdas and these rhos together or not. And I am going to argue that you can, and that it is going to be much cheaper, to train them separately. In order to succeed in doing that, I need to make sure that not all of the outputs I get from my HMM are screwed up. Think about it as re-ranking in some sense: if all of the top K outputs are very, very bad,
I am doomed. But if somewhere down the list I do have things that make sense, then eventually, by biasing them in the right direction, I am going to get good things.
So we actually have, I am not going to talk about it here, a journal version coming up that does all these studies and looks at decoupling this completely versus learning them together, and really it depends a lot on how much data you have for your original model. That sorts out the question of whether you want to train things together or not: if you can get sensible results from your original model, the left side, just do that; it is going to be much cheaper. If you cannot, you may need to train jointly.
I am going to show a little bit of an experimental result, but I want to kind of give the flavor of the context in which I am talking about. So I am talking about the context of semi-supervised learning in this way. It is not the only place that you can do this. So in semi-supervised learning you typically start with a model. You have a bunch of unlabeled data. You train your model on whatever data you have. You go to your unlabeled data. You label it with the model. You go back and train another model. And
so on. Now, I can apply constraints here in two places. I can apply them while training, by fixing up the automatically labeled data, so that I push better labeled examples back into the model; and I can apply them at decision time, at the end. This gives a conceptually simple algorithm. I am not going to go into the details, and you can train it in many ways, as I mentioned, but the 20,000-foot view is: you start by learning your model, which means both W and rho, using whatever labeled data you have, a simple HMM for example. Then you do inference with your constraints, which fixes up your examples, and you augment your training set with those corrected examples. This is where I use the rho. I put them back into the model, now I have a corrected W, and I keep going this way.
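A conceptual sketch of that loop follows; `train` and `constrained_inference` are assumed callables (your simple model and your ILP- or search-based constrained decoder), and the interpolation step is one option the description above leaves open:

```python
# Constraint-driven self-training loop (a sketch, not the exact algorithm).
def constraint_driven_learning(train, constrained_inference, labeled,
                               unlabeled, rounds=10, gamma=0.9):
    model = train(labeled)  # learn (W, rho) from the seed labeled data
    for _ in range(rounds):
        # Label the unlabeled data with the *constrained* model, so the
        # examples pushed back into training already satisfy the constraints.
        pseudo = [(x, constrained_inference(model, x)) for x in unlabeled]
        new_model = train(labeled + pseudo)
        # Optionally stay conservative by mixing with the previous model
        # (assumes models are parameter dictionaries).
        model = {k: gamma * model[k] + (1 - gamma) * new_model[k]
                 for k in new_model}
    return model
```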
So again, I am not going to give details, but there are excellent experimental results showing the advantages of this, and in fact the same kind of ideas have been pushed further. This was originally proposed at ACL '07 and later in a few other papers, and it was followed up in very nice work by Ben Taskar and his students.
They gave it a very nice name, posterior regularization, which is a cool name, better than what we gave it. Basically it is the same idea; the key difference between what they do and what we did is that we talk about constraints and insist on each instance satisfying the constraints, whether hard or soft, whereas they deal with constraints in expectation, satisfied on average over the data set. In some contexts that is better, and in some contexts the per-instance constraints are better, but it is really the same framework. They also do their projections in a slightly different way. But basically that is what is happening here: we have this objective function, where this is the model that you learn, and this is the way to bias the model in the direction you want your output to go. And just one experimental result; again, this is the simple model on the left plus the constraints, and you can do some experiments to evaluate the value of the constraints. This simple example shows it: the blue line is the result you get without constraints with 300 examples, for the citation domain.
And the red line here is what you get when you train with 10 constraints, the constraints I showed on the previous slide. Starting with a small number of examples, around 15 to 20, you already get what you would get with 300 examples, and you can improve even more as you add a few more. Of course this graph is specific to this problem; you will see a different graph on your own domain, but that is the key idea.
So I finish with the first two ideas, hopefully delivering the message that you can learn simple models and nevertheless make expressive decisions. Yes?
>>: Isn't there one other related area of work, Markov logic networks? It seems that they group the W's and the lambdas and the rhos into a single factor, and the constraints have to be expressed in some sort of first-order logic, so in that case maybe it is sort of a special instantiation of what you are talking about; or how would you contrast what you are saying…
>> Dan Roth: I have a slide that talks about it. So conceptually it is very similar. The key difference is that I want to, and actually the second point here is the key difference. I want to learn a simple model and I want to allow you, you don't have to take this option, but I want to allow you to add your constraints and bias the model only at decision time or in the final stages of training, as opposed to learning everything together from the beginning. So that is really the key conceptual difference.
So I want to move now to the second type of training, which we call indirect supervision. The key difference is that in semi-supervised learning, which is an important learning paradigm, you still deal with structures: you still label structures. And I want to argue that in many cases, even when you care about structures, or have to deal with structures but maybe don't care about them, you can actually gain a lot from very simple binary supervision rather than the complex task of annotating structures. I am going to start with an example that actually started here, with a pair of paraphrases.
The question is: are these two sentences S1 and S2 a paraphrase of each other? The answer in this case is yes. But the question is how we think about it. Even if you think about this as a Boolean decision problem, where I have X, the pair of sentences, and Y, yes or no, in order to support this Y I really need to add some hidden representation. And this hidden representation, this collection of latent variables, is a representation that supports, that explains, why this is a positive example. I don't really care about this intermediate representation, but I must have it in order to support the decision, and in this slide I am giving a very simple hidden representation, as if it's [inaudible]-based, and really it's not [inaudible]-based, but it can still be illustrated this way. By the way, this is exactly the same type of picture you would draw if you care about textual entailment. The graphs are going to be bigger and more complicated, but conceptually it is the same kind of question. So the problem is really: given input X, learn a model from X to H and then to the Boolean decision.
So one possible way to deal with it is a two-stage approach, which is really a pipeline.
Start with X, learn the hidden representation using some heuristics, or maybe label the hidden representation somehow. In this case you have to learn an alignment, and this is exactly the point that Geoff mentioned at the beginning: there is going to be a lot of disagreement on the alignment, but not a lot of disagreement at the binary level. So you can do this: annotate the alignment, learn it, and then use it to learn Y. Or, what we propose, do joint learning: drive the learning of H from the binary labels, find the best H for this specific pair, and use it in order to predict Y.
Now, really notice that I don't care about H; I care about it only to the extent that it helps me make a good decision at the end. The real technical insight is that this is asymmetric. If X is positive, if these are paraphrases, then there must be a good explanation, a good intermediate representation that supports this. Technically, given that we are working with linear models, we can say that there exists an H such that the inner product is positive, or equivalently, that the max over all Hs is positive. However, if X is negative, there is no explanation, because if there were an explanation I would have said yes. So for all Hs this inner product will be negative, or the max over all Hs has to be negative. This intuition can be formulated as an objective function, and I am not going to go into the objective function.
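For reference, the asymmetry just described can be written directly in the notation of the linear models above (C(x) is the set of legitimate hidden representations for x; the symbols are reconstructed, not copied from the slides):

```latex
y = +1 \;\Longrightarrow\; \exists h \in \mathcal{C}(x):\; \mathbf{w}^{\top}\phi(x,h) > 0
\;\Longleftrightarrow\; \max_{h \in \mathcal{C}(x)} \mathbf{w}^{\top}\phi(x,h) > 0

y = -1 \;\Longrightarrow\; \forall h \in \mathcal{C}(x):\; \mathbf{w}^{\top}\phi(x,h) < 0
\;\Longleftrightarrow\; \max_{h \in \mathcal{C}(x)} \mathbf{w}^{\top}\phi(x,h) < 0
```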
I just want to give the intuition that this creature here is a new feature vector for the final decision. Basically, the H chosen by my max operator selects a representation for this specific instance.
So the question is why does it help me to do this max operator? And really it helps because it constrains the intermediate representation. I have to consider, think about a lot of possible intermediate representations and I do this max operator with constraints to constrain the legitimate intermediate representations that can support good predictions.
So I am actually not going to talk about the optimization; I am going to try to convey the intuition of what is really going on. I start with some initial objective function and I do inference, and the inference gives me the best intermediate representation, the best structure, for this instance. Once I have this, I can compute my features for Y and make a prediction on Y, and at the level of Y I get feedback. I take this feedback and use it to update my objective function. With the new objective function I can again do inference, get the best current structure, again generate features from it, and so on.
The algorithm doesn't really go exactly this way, because we formulate it as a structural SVM with constraints, but I think this gives some intuition for what is happening. We call this algorithm LCLR, for Learning over Constrained Latent Representations, and the ILP inference that I talked about before sits in that left box here: it allows me to restrict the possible hidden structures that I consider. And this is really part of the first point that I mentioned: I want a general formulation, I don't want to think about algorithms, and I don't want the algorithms to have to depend on the specific problem I am solving now. This allows me to do it. So I have this model here, the
LCLR model; I have a way to encode problem-specific declarative constraints, and I put them into this box and just run the algorithm. For example, in the case of paraphrasing, what is going to be in this box? I am going to model my input as two graphs, one for each sentence.
And I am going to have two types of hidden variables: variables that encode vertex mappings, this node corresponds to that node, and variables that encode edge mappings, this edge is mapped to that edge. And the constraints are very straightforward. I want to say that each vertex in G1 has to be mapped to at most one vertex in G2, that each edge can be mapped to at most one edge in the other graph, and that if I map an edge, I must also map the nodes it connects. Very simple, and I can rewrite all of this as linear inequalities; you can check that it is rather straightforward (I sketch the inequalities right after this example). And then I can put it into this box. Now, if you have another problem, the only thing you have to redo is this part; the rest of the setting is there. And indeed we have tried this model on several problems, transliteration, entailment, and paraphrasing, and it actually works very well. I want to focus on the entailment results, because we actually started with a pretty good entailment system, for people that follow entailment: sixty-five percent on
RTE-5 is one of the top three or four systems. Without doing anything to the system, no change in features or anything, just changing the way we train, we get almost 2%, which is huge in entailment. So really, this is doing something.
Notice one of the interesting things here: our intermediate representation is not fixed. Give me an instance, and I am choosing the best intermediate representation for it and computing features based on it. Give me another instance, and the intermediate representation is going to be different. There is a model that determines this.
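To make the earlier graph-matching constraints concrete, here is one way to write them as linear inequalities; the variable names are mine. Let v_ij say that vertex i in G1 maps to vertex j in G2, and e_(i,j),(k,l) say that edge (i,j) in G1 maps to edge (k,l) in G2:

```latex
\sum_{j} v_{ij} \le 1 \;\;\forall i \in G_1
\qquad
\sum_{i} v_{ij} \le 1 \;\;\forall j \in G_2
\;\;\text{(each vertex maps to at most one vertex)}

e_{(i,j),(k,l)} \le v_{ik}, \qquad e_{(i,j),(k,l)} \le v_{jl}
\;\;\text{(mapping an edge forces mapping its endpoints)}
```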
I talked about indirect supervision with a latent structure. Now I want to move to the next step and talk about cases where I actually do care about the structure. Before, I did not care about the structure; if I do care, what do I do? An example where I care about the structure is information extraction, the citation example I gave you. In this case I am going to do a reduction to the problem we just talked about, by inventing a companion binary decision problem. So think about this problem again: I care about the output. Here is the same text again. I start with X; the structure is at the hidden level, but now it is not hidden, I care about it. I am going to add a Y by inventing a companion variable with the following meaning.
Given a citation, does it have a legitimate citation parse? Yes or no, a Boolean decision.
Or think about POS tagging. Given a sentence, I want to figure out its POS structure, but I can also ask: given a word sequence, does it have a legitimate POS sequence? These decisions are very easy; they are almost free. For POS tagging, give me a sentence: it has a legitimate POS tag sequence. Scramble the words a little bit: with high probability, it doesn't. Take a citation: it has a legitimate citation parse structure. Take another piece of text from the newspaper: with high likelihood, it doesn't. So I get these labels in a very cheap way, and now I have reduced this to the problem I studied before; it has the same kind of structure. I am talking about transliteration here also, but the key thing is that all positive examples must have a good structure, and none of the negative examples can have a good structure. So now you can hopefully believe that I can use the previous algorithm, because
I am in the same setting as before.
Only that I have now invented the binary decisions; and I am claiming that this is rather easy to do in almost all settings, and in general much easier than labeling structures. The algorithm is going to combine this binary labeling with whatever structure annotation I can get. In fact, I am going to skip the details, but I can come back to them if there is time. This gives me a two-component objective function, slightly more complicated than before, where L_S is my structural loss function and L_B is my binary loss function, and I can combine them. If I don't have binary labels, which would be unfortunate because they are very easy to get, the problem reduces to standard structured learning. If I don't have structure labels, it reduces to what I had before, just [inaudible] binary supervision. But the best thing is to do both, and the key idea here is that you share the W: it is the same weight vector, the same set of parameters, for both loss functions. As for the optimization, I said I wouldn't talk about it, but let me say a couple of words. It is a non-convex optimization problem, but you can split it to handle positive and negative examples separately, and you get a
difference of two objective functions, each of them convex, and the resulting algorithm can be shown to converge, or at least the objective provably decreases.
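The combined objective would look something like the following; the notation (trade-off constants C1 and C2, structure-labeled set S, binary-labeled set B) is assumed for illustration:

```latex
% One weight vector w shared by both loss terms
\min_{\mathbf{w}}\;\; \frac{1}{2}\,\|\mathbf{w}\|^{2}
\;+\; C_1 \sum_{i \in S} L_S\big(x_i, \mathbf{y}_i; \mathbf{w}\big)
\;+\; C_2 \sum_{j \in B} L_B\big(x_j, b_j; \mathbf{w}\big)
```

Dropping the L_B term recovers plain structured learning; dropping the L_S term recovers the binary indirect supervision of the previous section.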
So what do we get? These are results for the three tasks we worked on:
information extraction, POS tagging, and phonetic alignment. This is what you get when all you have is structure labels, and this is what you get when you add the cheap, almost free binary labels. In all cases you get significant improvements in the results.
Okay. So I want to finish with another interesting setting that follows the same conceptual idea. In this case the supervision is going to be driven by interaction with the world. At the end of the day, why do we work with natural language at all? We want to place it somewhere in the world, to interact, communicate, and do something. In particular, perhaps you want to ask your friend the robot over there to get you coffee. You don't want to translate the request into some formal language, label it, annotate it, and train the robot that way. Rather, you want to be able, once you get the coffee, to tell the robot whether it is good or not, and have it learn from that. So the question is: can we do this? Can we rely on this interaction to provide supervision? More concretely, let's go back to semantic parsing, or one interpretation of semantic parsing, where we want to take sentences and interact with a database. This is the Geoquery example, the data set that Ray Mooney put together. I want to transform the sentence internally into some formal representation and then go to the database to get an answer. The standard way of dealing with this is to supervise at the level of the logical form, so the training set is pairs of sentences and logical forms.
And you can imagine that with this kind of training set you can learn mapping. But of course it is expensive, because someone needs to know how to write these logical forms.
So what we want to do is we want to use only the responses. So I am internally going to translate it to some logical form. I am not going to be supervised at that level, rather I am going to send it to the system, get the response and then someone is going to look at the response. If it was Pennsylvania they are going to tell me yes. If it was New York City then they are going to tell me no. And the question is can I use this level of supervision to train a structured predictor?
And hopefully given what I have said in the last half an hour, you can believe that this is possible to do and in fact the same kind of ideas can be used to do this. So the key question here is can we learn from this type of supervision? And the answer is yes. And
I am not going to give the details, but as I said, hopefully I have led you to this point.
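The loop looks conceptually like the sketch below. The interfaces (`best_logical_form`, `execute`, `is_good_response`, `update`) are hypothetical names for the pieces just described: constrained inference over logical forms, querying the database, and a binary-feedback update:

```python
# Response-driven learning: the only supervision is the world's yes/no
# reaction to the final answer (a sketch under assumed interfaces).
def response_driven_learning(model, sentences, execute, is_good_response,
                             epochs=5):
    for _ in range(epochs):
        for x in sentences:
            z = model.best_logical_form(x)   # constrained inference, as before
            answer = execute(z)              # e.g. run the query on the database
            if is_good_response(x, answer):  # binary feedback: right answer?
                model.update(x, z, label=+1) # reinforce the structure that worked
            else:
                model.update(x, z, label=-1) # push away from failing structures
    return model
```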
And what you can see, in one of the experimental results from last year's CoNLL, is that the best result as of that time was trained with 300 structures, and we trained with no structures, only binary supervision, for these 310 examples, and we get almost the same results. So basically it is a first step; there is a lot more to do here, but it shows that we can actually reduce the level of supervision a lot and push toward what we call response-based learning, which is much more plausible. The current work we are doing on this is actually in a different context: we are trying to learn from instructions for games, and we have a new paper at IJCAI this year that talks about how
we use response-based learning to actually understand instructions and learn how to play card games legitimately.
So that is basically what I wanted to say. I talked about constrained conditional models, which are a computational framework for global inference and really a vehicle for incorporating knowledge into structured tasks. Most of the people that use this kind of framework call it an ILP formulation, and as I said, there is a lot of work on it; you can see the tutorial for more details. I really focused on learning, on two paradigms. One of them I call constraint-driven learning, where the focus was to show you that you can start by learning simple models and bias them with expressive knowledge, with constraints, to make more expressive decisions. The second point was indirect supervision: it is often possible to learn structures, or to use structure as an intermediate level, even when you don't supervise the structures. And I am going to finish with another shameless plug for one of the software packages we have, Learning Based Java (LBJ), a modeling language, a programming language, that we are developing to let you more easily build learning systems, build constrained conditional models. It gives a natural way to incorporate learners into programs, including constraints and, through them, inference.
Thank you [applause]
>>: About the last example, I don't know how [inaudible] you can [inaudible] added supervision. For example, in the last example, [inaudible]'s process knowledge, I mean
[inaudible]; my question is how far does that [inaudible] supervision go?
>> Dan Roth: Okay. Basically what you are saying, if I understand correctly, is that sometimes this binary supervision could be noisy. But, you know, the structured supervision can also be noisy, and I think it is more likely that the structured supervision will be noisy, because it is much harder: you really must be an expert to supervise at the logical form level, and less of an expert to supervise at the binary level. So yes, we will see noise, but we know today that we have pretty good learning algorithms that can tolerate certain amounts of noise. My claim is that it is still much easier to supervise at this level. There is the technical question, of course, of how far we can push it. Of course the best thing would be to supervise at the structural level; only, I am saying that is implausible in many cases, both cognitively, if you care about that, and engineering-wise. It is sometimes very, very difficult to supervise at that level, and we need to develop the capabilities to push supervision to this other level. That is the…
>>: My question is [inaudible] action is there must be some kind of machine structure
[inaudible] structure. [inaudible] intrinsic variations of approval [inaudible].
>> Dan Roth: They do, though maybe not as tight a relationship as you would like. If I translated to a completely wrong logical form, it is unlikely that I would get Pennsylvania as the answer. I could still get it; that is noise that will get into my training, but it is very unlikely.
On the other hand, if I got Pennsylvania, I could have gotten it with the wrong logical form, but again it is more likely that I have a pretty good logical form if I got
Pennsylvania. So there is inherent noise in this process, and in fact we can show examples where the logical form is not perfect and nevertheless we got the right answer. But, you know, that is something your algorithm has to deal with.
>>: I would say the opposite, that the lack of a close relationship between your [inaudible] structure and the supervision is actually an advantage, because we don't know what the right structure is [inaudible]. My analogy would be word alignment [inaudible]: people put a lot of work into annotating [inaudible], and supervised [inaudible] are like [inaudible], and there is no correlation between how accurately you do that task and your actual translation, and the reason is that you don't know what the right [inaudible] is from the human point of view, so it is better to…
>> Dan Roth: This was one of the motivations for this work. In our work on understanding instructions for games, we downloaded from the web hundreds of instructions for FreeCell; it is a version of solitaire, if you have not played it, a better version [inaudible] [laughter]. And really my thinking was, I read the instructions; first of all, the instructions were pretty bad, written by people who have nothing better to do than write instructions for FreeCell. But also, I don't know whether your representation and my representation of the game are the same; in fact, I can bet they are not. Nevertheless, we both play legitimately. So I don't know how to supervise well at the level of the intermediate representation. Textual entailment has the same problem. We really had trouble getting annotators to tell us why this is the same as that, at least at the level of the operators that move you from the hypothesis to the text or from the text to the hypothesis. There was disagreement on that, but almost no disagreement at the binary level.
So I completely agree with that.
>>: With this binary supervision where you have a hidden layer, where is the place for adding constraints? Because you argued for that in the second part of your talk.
[inaudible].
>> Dan Roth: [inaudible] yes, in fact we do put them there. So at the very high level the constraints are, where are they? They are here. But the idea is that you really need to search a lot of hidden structures. And there is no chance that you will converge to a good hidden structure unless you have some knowledge that really constrains you. So constraints come in this inference or if you want to look at the objective function, this max operation here chooses the best hidden structure currently and the best under some constraints, so I bring in some knowledge on what is a legitimate internal representation.
So for example, in paraphrasing, I have some constraints on what an alignment may look like.
For example, in most languages, most pairs of languages, I do not allow crossing edges.
Or I have some other constraints that take into account, you know, some lexical information that I have, distance between tokens, and so on. For the logical forms I also have some knowledge about what a legitimate logical form could look like. So I am restricting, you can think about it as restricting, the search space very significantly.
>>: But in that way, are you not at odds with just saying, you know, I just don't know what the hidden structure is; I just want to make the right predictions?
>> Dan Roth: Yes. If you want, I am saying that if I really don't know anything about what the structure is, I don't know that you can solve it. We have to pour knowledge into our problems, and I think that is essential in NLP in general. I would argue, even more, that the key problem we have in NLP today is that we don't know how to do that, how to pour knowledge into our problems, and this is an instance where we figured out a way. I do not know that it is as general as I would like it to be, but it is certainly a case, or a collection of cases, where we have figured out a way to encode knowledge declaratively and have it help us improve what we do. I sometimes call it cheating: we are cheating in this response-based learning because we are incorporating knowledge, expectations about what I want the answer to look like, into the model.
>>: So an example of what did you…
>> Dan Roth: So in the paraphrase case I can tell you; it is very simple. In the paraphrase case all we have is the no-crossing constraint and some lexical matching. In the games work
at IJCAI it was a lot harder, so there is a little more knowledge that we poured in. But basically it all boils down to some lexical knowledge and locality information. You can also think of this as a kind of alignment between the text and a representation, and we want things to be rather local: if you have an SRL representation of the text, I want the pieces of one argument to stay local in my logical representation; I don't want to see one part here and another part there, and so on.
So these kinds of constraints that we are using are sensible. In fact, they are not hard constraints, because you can imagine cases where they are violated, but they gave us a lot in terms of constraining the representation space.
>> Geoffrey Zweig: Any other questions?
>>: So [inaudible] examples that you generate for the binary [inaudible] how many
[inaudible] examples [inaudible]?
>> Dan Roth: That's a good question. I don't have the graph here; actually, [inaudible] showed a nice graph. The ratio between positive and negative examples could go up to, you know, ten or something like this, and we have a very nice graph showing that we actually gain from adding more negative examples, which I think is kind of cool, because a lot of settings in multitask learning and semi-supervised learning and so on really only use positive examples, since they don't have a way to use negatives, and we do. There is a paper at ICML last year that has this graph showing the gain from negative examples.
>>: [inaudible] examples would be more informative than [inaudible]?
>> Dan Roth: Excellent, yes. I think there is a lot of room to figure out what are good negative examples. And we haven't done a lot of work on that. Completely agree.
>>: [inaudible] example of [inaudible]. So how, in a sense [inaudible] tell me anything.
>> Dan Roth: So two things I can say. One, some of you may know the work of Jason Eisner and Noah Smith on contrastive estimation; it has a similar flavor, with two key differences. One is that they are doing EM, so they do a sum and we do a max; we can discuss at length the advantages and disadvantages of each. But the key difference is that they have to pair positive examples with negatives: for each positive they have to generate the corresponding negatives. Here this is completely decoupled; you can choose negatives any way you want. But there is still the question of what good negatives are, and intuitively you are right: the good negatives are going to be those that are closest to positives. We haven't done any really thorough study of that. We have a policy for how we generate negative examples for each of the three domains we worked on, and I am sure you can do more there and develop an understanding of how to generate good negative examples.
>> Geoffrey Zweig: Let's thank the speaker. Thanks. [applause]