>> Kuansan Wang: Okay. Let's get started.

Today it is my pleasure to introduce Mr. Ming-Wei Chang from the University of Illinois at Urbana-Champaign, who is visiting us. Mr. Chang is a student of Professor Dan Roth, who is very well known in the machine learning and NLP area. Last month when I visited UIUC, Mr. Chang's work was highly touted as one of the proudest outcomes of his lab, and Mr. Chang will be giving a talk next month at the special workshop. So it's very nice that he can stop by today and give us a preview of what he's going to talk about.

So without further ado, please take over.

>> Ming-Wei Chang: Thank you.

So, hi, everyone. My name is Ming-Wei, and today I'm going to talk about structured prediction with indirect supervision.

So everybody can hear me, right? Good.

So first I'm going to talk about the motivation: what indirect supervision is, and why. Then we'll move through the other parts of the talk.

So, the first part. As computer scientists, we have always dreamed that one day we can communicate with machines. So, for example, you ask a machine: I'd like a coffee with no sugar and just a little milk.

So one way to communicate with machines is to change that human request into some meaning representation, for example, make coffee, sugar equal to zero, milk equal to 0.3, and then the machine can understand your command and maybe do something for you.

So, for example, maybe WALL-E will come over and make a cup of coffee for you.

Okay. So the thing is that we need a translator to help translate a human request into a meaning representation. So how do we do that?

The usual way to do this is to build a supervised learning model. At training time you have a lot of pairs of texts and meaning representations: you hire an annotator to write down, for each human request, the corresponding meaning representation, and you collect a lot of those pairs.

Now, you feed this training data into the training algorithm and you obtain a model, such that at testing time, when you see a human request, you can translate it into a meaning representation.

So what's the problem with this approach? The problem is that labeling the data is very expensive. Especially in this case, the annotator needs to know how to translate the human request into a meaning representation, and this is not easy.

So the question we're trying to ask is: can we build this model, can we train this model, with other types of information?

So, for example, let's say you drink this cup of coffee and you tell WALL-E it's a good cup of coffee or a bad cup of coffee. This response doesn't tell you the meaning representation directly, but it contains some information about the meaning representation. So the question we are going to ask here is: can we use responses as a supervision signal to improve the statistical model? That's the first example.

So the second example is constraints. Let's say right now you want to move somewhere and you want to rent an apartment. You go to Craigslist and there's a post saying two-bedroom condo, garage, and a new oven and stove, and so on. You want to know that the first three words tell you about the size of this apartment, the following five words are talking about the features of this apartment, then the size of the apartment again, followed by the neighborhood of this apartment, followed by the contact information of this post.

So you want to build a machine learning classifier that tells you this information.

So, by the way, if you are really into this condo, don't call this number, because this is my cell phone number [laughter].

Okay. So if you build a [inaudible] model and then you predict on this post, you'll probably get something like this. Is this correct? The answer is no. Why?

Because as humans, we can immediately point out something wrong here: a segment shouldn't stop at some word that is not meaningful. Basically the model ignored the punctuation as a [inaudible]; you should only switch states at the punctuation marks. And there are other simple constraints.

For example, a phone number should belong to contact. We all know that if we can enforce these constraints when predicting the output, it is likely the predictions will be better.

But the question we are going to ask here is different. We're going to ask: can we use constraints to improve the model itself?

So here's the motivation. This is the diagram of the usual supervised training algorithm: you have labeled data, you run the training algorithm, and you get a model.

The problem with this approach is that it is a time-consuming and expensive process, and it is worse for structured prediction, where you have many decisions to make. For example, in the meaning representation task you need to know that the first word maps to this part of the command and the second maps to another part.

And it does not use existing knowledge, and it does not use [inaudible] labeled data. So the motivation for today's talk is that we want to find some way to accept all forms of supervision. We want to develop a [inaudible] approach that can accept all forms of supervision, such that we are not limited to labeled data only.

And why? Another reason is that we want to reduce the supervision effort. This is one of the major [inaudible] in [inaudible], because there are many, many interesting new applications left unexplored, and there are many languages. The reason we don't explore those applications is that we don't have enough supervision.

So I know many of you are machine learning experts, and we all know we already have some ways to consider this problem. For example, we have supervised learning algorithms; unsupervised learning algorithms, where you just take inputs x, feed them into a training algorithm, and build a model; and semi-supervised learning algorithms.

The problem is that if you look at these three learning frameworks, you will find they have already made the assumption that supervision equals labeled data. When you say something is an unsupervised learning algorithm, you mean it didn't use any labeled data. But that framing is incomplete, because supervision does not only mean labeled data; other information can also be supervision.

So the main idea of today's talk is that we want to learn with indirect supervision. Besides the labeled examples and unlabeled examples, we want to use something else. That's what we call indirect supervision.

So what is indirect supervision? The rough definition is: supervision that does not tell you the target output directly. For example, in the meaning representation example, the response doesn't tell you the meaning representation directly, but somehow it carries some information about your output.

The constraint doesn't tell you the target output directly. The constraint doesn't tell you that the first three words are the size followed by the features, but it tells you something about what the target output should look like.

So the advantage of using indirect supervision is that we can directly use human domain knowledge to improve the model. It allows us to use supervision signals that are a lot easier to obtain, to reuse existing labeled data, and it can be combined with semi-supervised learning.

So I hope after today's talk you'll be convinced that you can use some indirect supervision in your application and it can help you to reduce the supervision effort.

So another reason is this. This is my three-month-old daughter, and obviously she's trying to say something. The scientific challenge here is that people learn to speak without direct supervision. My wife tries to teach my daughter to say things; she uses visual clues and feedback interaction with my daughter. We didn't give her 1,000 sentences of labeled data and hope that she would speak automatically.

Okay. So how many of you are familiar with structured output predictions?

Okay. Good. So let's go over this very quick.

So, a binary classification problem is that given an example, you want to know if this is a positive example or a negative example. For example, spam filtering: is this spam or not spam?

Structured output prediction is different: you want to choose one structure from all possible ones. There are two characteristics. The first is that in one example there are multiple interdependent decisions, and the second is that you usually have an exponential number of structures.

So, for example, let's take the advertisement post here. You have two-bedroom condo, garage, and this is your input. You can think of this as a [inaudible] model, and then you have an exponential number of outputs. For example, you can say everything is neighborhood, or the first word is feature followed by neighborhood, and the correct output is also among the exponential number of structures.

So in order to build a model you need the capability to find the best structure among the exponential number of structures. So this is a review, and this is the notation: we denote the input as x and the output structure as h, and our trained model is w, which is basically a linear model. We assume there's a feature vector defined over the input and output space, and the idea is that we're going to have a scoring function between input and output.

And the scoring function is simply the inner product of w and the feature vector.

And as I said, at testing time you need to be able to pick the best structure among an exponential number of structures, and this is the mathematical way of doing that.

So you want to pick the best structure among all possible structures such that the score is maximized.
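
To make that concrete, here is a minimal sketch in Python. The feature function and label set are invented for illustration, and the brute-force enumeration is only to show the definition; a real system would use Viterbi-style dynamic programming or integer linear programming instead.

```python
import itertools

LABELS = ["size", "feature", "neighborhood", "contact"]

def phi(x, h):
    """Toy feature vector: counts of (word, label) pairs."""
    feats = {}
    for word, label in zip(x, h):
        key = (word.lower(), label)
        feats[key] = feats.get(key, 0.0) + 1.0
    return feats

def score(w, x, h):
    """score(x, h) is the inner product of w and the feature vector phi(x, h)."""
    return sum(w.get(f, 0.0) * v for f, v in phi(x, h).items())

def predict(w, x):
    """Pick the best structure among all (exponentially many) label sequences."""
    return max(itertools.product(LABELS, repeat=len(x)),
               key=lambda h: score(w, x, h))
```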

So you can think of this as a [inaudible]. I just finished the motivation, and I'm going to talk about constraint-driven learning very quickly. Then we're going to talk about how we can learn a semantic classifier with constraints on latent variables, how we can use binary labeled data as indirect supervision for structured output prediction, and how we can use responses to learn meaning representations. That's the final part.

Okay. Let's go to constraint-driven learning. We have constraints, and we want to see whether we can use constraints to improve the model.

So how do you combine constraints with an HMM? One easy thing is that when you make predictions, you simply disallow assignments that violate the constraints. More generally, you penalize the assignments that violate the constraints.
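
As a minimal sketch of that idea (the model scorer is passed in as a function; the phone-number detector and the penalty value are invented for illustration):

```python
RHO = 10.0  # penalty per violated constraint; a hard constraint is RHO -> infinity

def looks_like_phone(word):
    """Crude, hypothetical stand-in for phone-number detection."""
    return word.replace("-", "").isdigit() and len(word) >= 7

def phone_in_contact(x, h):
    """Example constraint from the talk: phone-number-like tokens
    must be labeled 'contact'."""
    return all(label == "contact"
               for word, label in zip(x, h) if looks_like_phone(word))

def violations(x, h, constraints):
    """Count how many declarative constraints the assignment h violates."""
    return sum(1 for c in constraints if not c(x, h))

def constrained_score(model_score, x, h, constraints):
    """Model score (e.g., HMM log-probability) minus a penalty per violation."""
    return model_score(x, h) - RHO * violations(x, h, constraints)
```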

Okay. So think about a model, a model like an HMM. A semi-supervised learning algorithm can usually be expressed in this [inaudible].

You have a model and you use it to label the unlabeled data. After you get the newly labeled data, you feed it back to retrain your model. You get an improved model, and then you do this iteratively until it converges.

So, for example, [inaudible] self-training and other algorithms can be thrown into this [inaudible].

>>: The question is where do you get the constraints. Are you given the constraints, or do you want to learn them?

>> Ming-Wei Chang: Oh, we're given the constraints.

>>: You're given the constraints?

>> Ming-Wei Chang: A human gives us the constraints.

Okay. So the idea of constraint-driven learning is very, very simple: after we have constraints, we just put the constraints into this cycle. Okay? With constraints you get better labeled data, and the point is that once you have better labeled data, you get better feedback for your model, so your model gets better. Okay?

So that's why constraints can help you improve your model.

And the whole process, and this is not ad hoc, can be cast as an optimization procedure.
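
A hedged sketch of that cycle, with `train` and `constrained_top_k` as hypothetical stand-ins for the base learner (e.g., an HMM trainer) and for constrained inference that returns the K best assignments:

```python
def codl(labeled, unlabeled, constraints, rounds=10, top_k=5):
    """Sketch of the constraint-driven learning cycle described above."""
    model = train(labeled)  # supervised start on the small labeled set
    for _ in range(rounds):
        auto_labeled = []
        for x in unlabeled:
            # Constraints are used here, when labeling the unlabeled data,
            # but not in the retraining step below (see the Q&A that follows).
            for h in constrained_top_k(model, x, constraints, top_k):
                auto_labeled.append((x, h))
        model = train(labeled + auto_labeled)  # plain retraining
    return model
```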

So here's the result. I show this result just to make sure that you understand.

>>: Can you go back a slide just so I understand? So the learning part, the feedback into the model, at that point when you're updating the model, you ignore the constraints?

>> Ming-Wei Chang: Right.

So, this advertisement dataset. If you just use the HMM model with EM, you get 60.75 accuracy; if you have constraints and you train and test with constraints, you get 70.79. So you get a big improvement.

But if you take away the constraints at testing time, you basically go back to an HMM model. Because the constraints propagated information during the learning process, you've got a better HMM model, in fact better than your original one.

>>: [inaudible]

>> Ming-Wei Chang: Token-based accuracy: the percentage of tokens you classify correctly. This will carry the information of size, of the features [inaudible].

And when we published this paper, our number was one of the best; if you use all the data you get 82. And here is a new result we did this year on domain adaptation. You have some state-of-the-art NLP tools, but the problem is that when you apply them to other domains, you usually suffer from the domain difference and you get worse results. This is a known and very important problem.

The way we're going to solve this is by reading the annotation guideline. We have a [inaudible] trained on data in the news domain, and we're going to apply it to the biomedical domain. It so happens that on the web there's an annotation guideline for the [inaudible] data on medical documents. We just read that annotation guideline and translate the annotation [inaudible] into constraints.

So this is the baseline of the state-of-the-art tool: 86.2. And on another task, semantic role labeling, you get 58.6 on the new domain.

After adding constraints you already get a lot better. But you can also feed the constraints back into your learning process and get an even better result.

So the idea is that we can use constraints as a [inaudible] to improve your model, and it helps not only for the small [inaudible] but also for the large [inaudible] if you want to apply this technique to a new domain.

And there are many recent works that adopt similar ideas. McCallum at UMass and Taskar have a 2010 paper, and basically there's a tutorial this year talking about similar ideas. They've applied the idea to dependency parsing, word alignment, and document classification. And there's also a project at CMU called NELL, Never-Ending Language Learning, where they try to do web-scale information extraction: basically, they want to learn from reading the web that Michael Jordan is a player for the Chicago Bulls.

So they did that, and they found that if they don't put constraints into their bootstrapping learning procedure, their results are a lot worse. So they claim it is necessary to put constraints into their learning process.

So I just finished constraint-driven learning, and I'm going to jump to a second, somewhat unrelated task, and then I'm going to jump back to structured output prediction.

So let's say we're trying to learn a deep semantic classifier. What do I mean by a deep semantic classifier? Paraphrase identification. Let's say you have two sentences. The first sentence is: Alan will first face murder charges, Bob said.

And the second sentence is Bob said Alan will be charged with murder.

So the question is: are these two sentences paraphrases of each other? And the answer is yes. But why? You know the answer is yes because you form an alignment internally in your brain, such that you know the information in the first sentence is carried by the second sentence and vice versa.

So in order to make this decision you need an intermediate representation for this problem. This is just an example; the real intermediate representation is a lot more complicated. But for now, just think of the intermediate representation as an alignment.

So the problem of interest here is that we want a binary classification output, we want to say yes or no, but we need some intermediate representation, and it requires a structure that justifies the label.

Okay. So one important thing is that in the training data, and in the testing data, unfortunately you don't have the intermediate representations. All you have is two sentences and the answer, yes or no. So they are latent.

So I've talked about two types of problems. The first is structured output learning: the input is x, the output is a structure. In the second type, the input is x and the output is binary, but the decision rests on hidden structures. I promise you I will go back to structured output prediction, but let's stay on this problem for a second.

Okay. So how do people currently handle this [inaudible] identification problem? Basically they use a two-stage approach. In stage one, people generate the intermediate representation, and then they fix it; maybe they use some heuristic, maybe some other model.

And in the second stage, you build a model based on the features you extract from your input and the intermediate representation.

So once you fix the first stage, basically the second stage is a classical machine learning binary classification problem.

But the problem is that you don't know if the intermediate representation you choose is good for the second stage or not. Maybe you can come up with some alignment, but maybe that alignment is not good for your binary classification.

So we want to fix that.

And the second problem is that if you want to apply the same framework to three different problems, you need three different procedures for finding the intermediate representations, and you don't have a unified framework.

So our framework, called LCLR, but it's just a name, jointly learns the intermediate representations and the labels. For example: we have x as the input, h as the intermediate representation, these are the features, and these are the binary labels. We want to see if we can use the binary output label as feedback, to see if we can find a better intermediate representation for the output [inaudible].

And the second property is that we want to use constraint-based inference for the intermediate representations. Basically we want to make it easy to inject knowledge about the latent variables, and easy to generalize to other tasks.

Okay. So this is probably one of the most important slides here, so let's go slowly.

I keep telling you that we should do joint learning between the intermediate representation and the binary output. But who says that is a good approach? So here is the justification, the intuition.

So let's talk about this example. This is paraphrase identification: two sentences, and you want to know if they are paraphrases of each other. And only positive examples have good intermediate representations. Why? Because only if they are paraphrases of each other do you know there is an alignment that shows they carry the same information.

And if they are not paraphrases of each other, you cannot find an alignment that justifies that they carry the same information. So there's a connection between the output and the intermediate representation.

So we want to formalize this idea. Let's say x is the sentence pair, h is the alignment structure between these two sentences, and the weight vector is w.

Okay. So let's say you have two sentence pairs: the first one is positive, the second one is negative, okay? I say that if two sentences are paraphrases of each other, there must exist a good explanation that justifies the positive label.

And if they're not, then no explanation is good enough to justify the positive label.

Okay?

So the way we formulate this idea is: for a positive pair, there exists an alignment such that the score is good enough; and for a negative pair, no explanation is good enough, so no matter what alignment you try, the score is not going to be good enough.
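
In symbols, this is a hedged reconstruction of that formulation (the zero threshold stands in for whatever margin the actual slides use): for a pair x with label y and alignments h,

```latex
y = +1 \;\Rightarrow\; \exists h : \mathbf{w}^{\top}\phi(x,h) \ge 0,
\qquad
y = -1 \;\Rightarrow\; \forall h : \mathbf{w}^{\top}\phi(x,h) < 0.
```

Equivalently, the prediction is the sign of \(\max_h \mathbf{w}^{\top}\phi(x,h)\), which is exactly the "check only the best point" argument coming up next.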

Any problems here?

Okay. So I just copied the information from the last page to here. Let's see what it means on the hyperplane, the geometric interpretation.

We all know that in a classical binary problem you have one point that is positive and one point that is negative, and you just draw a hyperplane and say: okay, this is a good hyperplane because it [inaudible] the positive example from the negative example.

But it is a different story here, because our feature vector is over the input and the hidden variable, and we don't know the hidden variable.

So basically, because we don't know the hidden variable, we just put all possible hidden variables here. For this sentence pair we put all possible alignments here, and this is the set of feature vectors of all possible alignments.

And we can do the same thing for the negative data: for the second example, for each possible alignment we put a point in the space, and together they form a set.

So I'm going to say this is a good weight vector, because it implements the intuition. Why is that? When we look at a positive example, we say there must exist one point on the positive side, right? And we can find one point, one blue point, on the positive side, so this is good.

For the red set, the constraint says all of the points need to be on the negative side, and all the points are on the negative side here. So that's why we also satisfy the second constraint.

So this weight vector implements the intuition we mentioned on the last slide, but if you want to verify it this way, you need to go over all possible points in the set. In fact you don't need to do that: all you need to do is find the best point.

This is the best point, because it is the farthest point along this direction, since the weight vector is [inaudible] here. This is a [inaudible] point, and we only need to know this point is on the positive side, because "there exists a point on the positive side" is equivalent to saying "the best point is on the positive side."

And "all points are on the negative side" is equivalent to saying "the best point is on the negative side." So for a negative example we also only need to check this point and see that it is on the negative side.

So I've said a lot, but what I really want to say is the final sentence.

Although we are trying to do a binary classification task, in this problem, in order to make a prediction, you first need to find the best structure, the best alignment, and then you see if this alignment is good enough or not to make your final prediction. So it's kind of a strange combination: you are doing binary classification, but you need to find a structure in order to make the prediction.
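
A minimal sketch of that prediction rule; `all_alignments` and `phi` are hypothetical stand-ins for the real constrained inference (an ILP in this work) and feature extraction:

```python
def dot(w, feats):
    """Inner product between a sparse weight vector and sparse features."""
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def classify(w, x, all_alignments, phi):
    """Find the best hidden alignment first, then threshold its score."""
    best = max(dot(w, phi(x, h)) for h in all_alignments(x))
    return +1 if best >= 0 else -1
```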

So any problems here?

Okay. So we also use integer linear programming here, and the important point is that it generalizes to all the tasks. In the experiments we used one learning framework and one inference framework; for different problems, all we do is plug in different constraints and different features.

So basically you have a declarative framework: you define your problem using constraints and features, then you plug it into this learning framework, and it generates a weight vector for you.

I'm going to skip these details.

>>: So you made this decision that the alignment between the two sentences is a good signal for paraphrase. It could be others, right? So you can look at something grammatical, like a parse, and make sure the parses are aligned or something --

>> Ming-Wei Chang: In fact what we are trying to do is align the parses.

>>: So I guess what I'm saying is there's all these different potential structures that you can [inaudible]. Don't people today generate these different kinds of hypotheses and use those as features for the binary decision task?

>> Ming-Wei Chang: Yeah. The problem is that -- so, for example, one of our tasks is [inaudible]. In that task the source is a paragraph and your target sentence is one short sentence, and you want to know whether this paragraph implies this short sentence or not. However, let's say the short sentence is George Bush is smart, okay? But maybe in the source paragraph you have five sentences, and they all have George Bush, and maybe some of them have smart. So if you just find one alignment, you don't know whether this alignment can support your final decision or not.

Even though you use parse [inaudible], even though you use other information, you do [inaudible] recognition, you do everything perfectly, you still need to find a correct alignment to support your final decisions. Is that clear?

>>: So you made a point earlier on, I think -- following up on Patrick's question, you made a point earlier on that one of the distinct approaches you take is that you actually put some constraints in here?

>> Ming-Wei Chang: Yeah.

>>: And so where are the constraints in the paraphrasing example?

>> Ming-Wei Chang: Oh, okay. So I skipped this one, but let's go back to it.

So basically right now we're using very simple constraints, but we can think of other constraints. For example, you have a dependency tree for the first sentence and a dependency tree for the second sentence, so you say: I'm going to align not only the nodes but the dependency edges as well.

>>: [inaudible]

>> Ming-Wei Chang: The traditional way is you find an alignment, then you just fix the alignment, then you go from there. Now, what I'm trying to --

>>: My understanding is you would create the alignment and you'd get some kind of score for that in the distance and you'd use that as a feature for your classification task, and you'd also use one that compares the difference between two [inaudible]. So you'd have this collection of features from these different structures and then you'd let the binary --

>> Ming-Wei Chang: Okay. So say you [inaudible] three alignments and they all have different scores?

>>: [inaudible]

>> Ming-Wei Chang: Okay. Yes. Okay. So right now what people do in this task, as far as I know, is they generate one alignment and then they just fix it. Maybe they use the score of this alignment or not; I'm not so sure. But the problem is, my point is, once you commit to one alignment, there's no way to go back. Right?

So that's the point here. The point is that we want to see if there's a way we can revise that alignment. You can think of this like a supervised EM: we want to do the EM procedure, but we want this EM procedure to be guided by your output, not just run freely. I'm not so sure if this explanation is good enough.

>>: [inaudible]

>> Ming-Wei Chang: Yeah, we can talk about it later.

Okay. So this is how we do it, and I'm going to go a bit quicker here, okay, because we don't have a lot of time.

So basically this is how you do logistic regression and support vector machines: you write down the formulation and you put in this decision function. Depending on what kind of [inaudible] function you use, this can be SVM or logistic regression or something else.

Okay. Now, as I told you, our decision involves an inference process: you need to find the best structure and see if the best structure is good enough. So basically your decision function is [inaudible] programming here, and you want to see whether the result of your alignment is good enough or not.

And you just put it here. However, now it's pretty complicated, because inside we have a complicated inference problem.

So this is not a [inaudible] SVM, because inside the decision function you need to solve the max problem. Okay?

And this also affects the features: if you choose a different alignment, you're going to generate different features, and that will impact your learning result. That's why this is not a [inaudible] situation or SVM.

And there are many related learning frameworks. I will be very happy to talk about the related learning framework, but I don't have time, so if you want to know about related framework, I'm very happy to talk to you offline.

Okay. So how do you optimize this function? This is not a standard formulation of SVM, so you cannot apply a regular package to it. And there's no shortcut: if you write a wrapper that calls SVM multiple times and hope it optimizes the function, it doesn't work that way.

So basically our solution is an optimization algorithm that does an EM-like procedure with other machine learning and optimization tricks. These are interesting but not that important, so if you are interested you can ask me later.

And it's simple to implement, and it supports parallelization of the inference procedure.
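
A heavily hedged sketch of the flavor of that procedure; every helper here is a hypothetical stand-in, and the real algorithm adds optimization tricks (caching, cutting planes, and so on) that are omitted:

```python
def train_lclr(examples, init_weights, best_structure, solve_convex, rounds=10):
    """EM-like alternating optimization for latent-structure binary learning."""
    w = init_weights()
    for _ in range(rounds):
        # E-like step: commit each positive example to its best structure
        # under the current w; negatives keep the max inside their loss.
        positives = [(x, best_structure(w, x)) for x, y in examples if y == +1]
        negatives = [x for x, y in examples if y == -1]
        # M-like step: solve the resulting convex sub-problem for w.
        w = solve_convex(w, positives, negatives)
    return w
```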

Okay. Let's talk about the experimental results. We have three tasks here. One is transliteration: is named entity B a transliteration of A? Then textual entailment, and paraphrase identification.

So one goal of the experiments is to see whether our method works. But in fact our framework has two components: one is the ILP component and the other is the joint learning component.

So in order to make sure that we didn't cheat, we used exactly the same features and exactly the same constraints for the two approaches. The only difference is that in the two-stage approach we used a domain-dependent heuristic to find the alignment, while our approach finds the intermediate representation automatically.

So this is transliteration: on [inaudible], past work got 89.4. If you just fix your alignment and then classify, you get 85.7, but if you allow your alignment to move, you get a much better result.

This is the textual entailment task. The median of the TAC systems is 61.5. Our two-stage result is already good, I think because we added some constraints to capture the dependency between the two sentences. And our LCLR is even better; a two percent gain here is very difficult to get.

And here is the paraphrase result. Here's one interesting experiment: in order to find alignments for paraphrase, we tried two different heuristics.

So Wup is a scoring function coming from [inaudible], and you find the alignment by matching each word to the most similar one on the other side.

And WNSIM is a scoring function that we developed in our group, and my friend told me this is a much better scoring function for this task.

So if you use Wup and you fix the alignment, you get 72.3 accuracy. But if you let the learning framework change the alignment, you get a much better result. However, if you are using a very good heuristic from the beginning, you get 63, and then you get 63.2, so you only get a little improvement here.

So the point here is that if you know the correct alignment from the beginning, then you might not need to do this; but if you don't know it, then it is a good idea to let the learning algorithm derive a good alignment for your task.

So in this part I talked about how you can address binary output problems with latent structures, and whether we can find the best latent structures for the binary problem.

Okay. So that finishes this part of the talk, and now we're going to move back to structured output prediction problems.

So our goal is the same: given that supervising structures is time-consuming and requires expertise, how can we reduce the supervision effort for structured output problems?

And the research question here is: is it possible to use additional, cheap sources of supervision? Note that this is a structured output problem, so we are going back to structured output problems, but there's a connection, and we will see it immediately.

So let's say you have a task: given a car image, you want to know where the body, windows, and wheels are. And here they are. This is a structured output prediction problem because there are multiple decisions, the wheels and the body, and they are interdependent, because the wheels are usually at the bottom of the body.

So you can take a supervised approach: you collect some labeled data and you build a machine learning model. But this is expensive. You can also take a semi-supervised learning approach, where you collect a lot of car images and then do semi-supervised learning.

But the question is: what about the flower? Can you use the flower to give you some hint, to help you learn how to recognize car parts?

Okay. This is what we call invalid data. And here's what we're trying to do: we collect these two kinds of data together, we say these are valid data and those are invalid data, so we have binary data here, and we will see if this data can help you do structured output prediction.

Okay. Here's my daughter again. And she needs some help from you, and she is interested in indirect supervision.

Okay. So the first question: here on the left-hand side you have a car image, and you want to know where the body, windows, and wheels are.

On the right-hand side you want to know is there a car in this image. So my question is, is there a connection between these two problems?

>>: [inaudible]

>> Ming-Wei Chang: Yeah, very close.

So only a car image can contain car parts in the right position. And a non-car image cannot have car parts in the right position. So here's another example.

So on the left-hand side, you want to know, given an English named entity and its Hebrew transliteration, what is the phonetic alignment? A phonetic alignment says, for example, that the sound of the first English character corresponds to the sound of two Hebrew characters. Hebrew is written right to left, by the way.

And on the right-hand side the problem is: given one English [inaudible] and one Hebrew entity, are these two [inaudible] a transliteration pair?

On the right-hand side the answer is no, and this is the alignment.

So is there any connection between these two problems? And the answer is yes.

Because only transliteration pairs can have a good phonetic alignment. If a non-transliteration pair had a good alignment, something would be wrong, because if they have a good phonetic alignment then they are transliterations.

So the key intuition here is that for many structured output prediction tasks, you can find a companion binary task, which is a decision problem: predicting whether the input possesses a good structure or not. And why is this important?

Because, we're going to argue, binary labels are a lot easier to obtain than structured labeled data. And the question is how we can exploit the relationship between these two problems.

So first let me convince you that binary labels are very easy to obtain. In this example, for the phonetic alignment we hired a Hebrew [inaudible]. He sat there for a day, went over the data set, and annotated the alignments.

On the right-hand side, the binary labeled data set was collected by an American student who doesn't know any Hebrew. The reason he could collect the binary labeled data is that he went to Wikipedia: for each English page he found the corresponding Hebrew page, and he generated a positive example from that and generated negative examples by [inaudible] pairing the [inaudible].

Okay. So as I said, the binary task is: does this example possess good structures or not?

So how can we formalize this idea? Okay. If this is a positive example, then there must exist a good structure that justifies the positive label.

And if this is a negative example, it means that no matter how hard you try, you cannot find a structure that is good enough.

If you paid attention to the second part of the talk, this looks familiar, because we are going to use the same intuition here, to see if we can use binary labeled data to help learn structures.

Okay. So here is the geometric interpretation again. For a given example, these are the feature vectors for all possible structures, and they form a set. Let's say you use your favorite structured prediction learning model on the labeled data and you find a weight vector here. This point is your prediction, because it is the farthest along this direction; however, the gold structure is over there.

So the question is why the binary labeled data can help you correct your output even though you don't know where the gold label is. This is unlabeled data, so you don't know the gold structure is in that corner; you only know the prediction you make.

So let's say you add negative data. Once you put negative data here, this weight vector doesn't satisfy the constraints I mentioned: one positive point needs to be on the positive side, which is okay, but all the negative points need to be on the negative side.

So if we have negative data, in order to satisfy that second constraint we need to rotate the hyperplane. And once we rotate it, if we're lucky, we make the correct prediction again.

So I've mentioned two things. In the second part of the talk, I said you want to find the hidden structure to help you determine the binary output. Now I'm going to reverse the direction: can the binary output problem help you find a better structure?

So in the second part the main focus was binary output problems, and we didn't have any labeled data for the structures. Now the main focus is on structured output problems, and we want to see whether binary output problems can be helpful or not.

So in this case we have both structured labeled data and the binary labeled data.

So here is our formulation. The first term is the regularization [inaudible], the second is the direct supervision term, corresponding to the structured labeled data, and the third term corresponds to the binary labeled data.

And the point is that we share the same weight vector between the direct supervision and the indirect supervision, so we can see whether the indirect supervision helps us get a better weight vector or not.
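
As a hedged reconstruction, the objective has roughly this shape (the exact loss functions \(L_S\) and \(L_B\) on the slide may differ; \(L_B\) is built on \(\max_h \mathbf{w}^{\top}\phi(x,h)\) as in the previous part):

```latex
\min_{\mathbf{w}} \;
\tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\;+\; C_{1} \sum_{i \in \text{structured}} L_{S}(\mathbf{w}; x_{i}, h_{i})
\;+\; C_{2} \sum_{j \in \text{binary}} L_{B}(\mathbf{w}; x_{j}, y_{j})
```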

The optimization is similar to the one we used for LCLR, but now we need to support structured SVM, because if you don't have the third term, you go back to structured output SVM. If you don't know what structured output SVM is, it is the counterpart of CRF; sorry, I didn't explain that.

So what I'm saying is that it's a learning algorithm that can help you find the structures, and our learning code is online.

>>: Is C1 equal to C2 here?

>>: Yeah.

>>: [inaudible]

>> Ming-Wei Chang: So we have three tasks: phonetic alignment, part-of-speech tagging, and information extraction. For the transliteration pairs, as I told you, we collect the positive data from Wikipedia and we generate negative [inaudible] by shuffling the pairs.

For part-of-speech tagging, we get the positive data from real English sentences and we [inaudible] data by just shuffling the words.
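
A toy illustration of that kind of binary data generation; this mirrors the idea (real sentences are valid, shuffled ones are not) but is not the authors' exact procedure:

```python
import random

def make_binary_data(sentences, seed=0):
    """Turn token lists into (example, +1/-1) pairs by shuffling."""
    rng = random.Random(seed)
    data = []
    for tokens in sentences:
        data.append((tokens, +1))    # a real sentence is a valid example
        shuffled = list(tokens)
        rng.shuffle(shuffled)        # destroying word order destroys structure
        data.append((shuffled, -1))  # a shuffled sentence is an invalid example
    return data
```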

So this is the figure of our results. This result is [inaudible] because in this figure we use very little structured labeled data and a lot of binary labeled data.

But in this setting we can get a significant improvement by using the binary labeled data, because the binary labeled data helps you find better structures.

And here is a figure showing that if we fix the positive examples and increase the number of negative examples along the way, then, as we can see, adding more and more negative examples gives better and better results.

And this is the crucial difference between discriminative semi-supervised learning frameworks and our framework, because usually in a discriminative semi-supervised learning framework people don't use negative data.

So what I'm trying to say is that it is possible to use binary labeled data as indirect supervision to help structured output prediction, and our framework [inaudible] can gain from both direct supervision and indirect supervision.

So one interesting question is: what happens if you don't have any structured labeled data at all? Basically you can then compare this framework to EM.

And we have preliminary results, and we surprisingly found that without any labeled data our results are comparable to EM, sometimes better than EM.

Okay. So here is the last part of my talk: we want to use the world's response as indirect supervision.

So let's go back to the first example, where you are asking WALL-E for coffee, and you tell it whether the coffee is good or bad. And the question is: can we use responses as a supervision signal?

So here is the real problem we are trying to attack; it is a problem called semantic parsing. The input is: what is the largest state that borders Texas? And the output is a meaning representation, some logical expression like largest(state(next_to(texas))), such that at run time you can translate this logical expression into a SQL query and query the database.

So in our experiment we have a database, we have a translator that translates the logical expression into a SQL query, and we want to see if we can get the answer.
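
To give a feel for that translation step, here is a purely illustrative pairing of a logical expression with a SQL query; the table and column names are invented, not the actual Geoquery schema:

```python
LOGICAL_FORM = "largest(state(next_to(texas)))"

# Hypothetical relational schema: states(name, area), borders(state1, state2).
SQL = """
SELECT s.name
FROM states AS s
JOIN borders AS b ON b.state1 = s.name
WHERE b.state2 = 'texas'
ORDER BY s.area DESC
LIMIT 1;
"""
```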

Okay. So we are not the first to work on this problem, but all of the past work assumes feedback at the meaning representation level: in the training data they have pairs of questions and meaning representations, and these are very expensive to collect.

So what we're trying to do is use indirect supervision, responses as indirect supervision, because we think we can get responses much more easily.

So what we're trying to do is use no labeled meaning representations, zero labeled meaning representations. We use the answer. The answer is [inaudible] here.

And here's what we call response-driven learning. We assume there's a teacher, and the teacher just stands there, and you let your model run. For each human request, the model will try to translate it into some logical representation, we will query the database with this representation, and then we will see whether the answer matches the teacher's answer or not.

And if the answer is correct, we give a positive one; if the answer is wrong, we give a negative one.

And the question is: can we use just this feedback to learn how to translate human requests into meaning representations?

So here is the diagram again. You have a human request as input, you try to predict a meaning representation, you apply it to your database and you get the answer, you compare your answer, and then you tell your model whether it is correct or not.

So here is how we evaluate. Oh, sorry, I forgot to tell you.

So why do we believe this will work?

So the answer is that the intuition is very simple: if you get the answer correct, it doesn't guarantee, but makes it very likely, that your meaning representation is correct. So we work straight from there.

So our algorithm is very simple: we let the model run, and until the model [inaudible], we keep it running. But once it hits the correct answer, we collect that meaning representation, and we use these meaning representations to improve our model, such that next time we will find more correct answers from our database.
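
A hedged sketch of that loop; `parse`, `execute`, and `retrain` are hypothetical stand-ins for the semantic parser, the database query engine, and the learner:

```python
def response_driven_learning(model, questions, gold_answers,
                             parse, execute, retrain, rounds=20):
    """Train a semantic parser from answer feedback only."""
    for _ in range(rounds):
        positives, negatives = [], []
        for q, gold in zip(questions, gold_answers):
            mr = parse(model, q)           # predict a meaning representation
            if execute(mr) == gold:
                positives.append((q, mr))  # response: +1, keep this parse
            else:
                negatives.append((q, mr))  # response: -1
        model = retrain(model, positives, negatives)
    return model
```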

>>: You don't have any constraints at all?

>> Ming-Wei Chang: We have a lot of constraints. We have constraints about how you -- if you just let it run and you don't have any constraints, it is basically not possible to find a correct logical representation.

So this is the Geoquery domain, a limited domain. Otherwise it would not be possible to do this. Okay.

So the constraints are: first of all, your output logical form needs to be valid. Each of these predicates can only map to one logical predicate, and the Geoquery grammar has rules about what can go inside largest. So you can put state inside largest, or river inside largest, but you cannot put people inside largest.

>>: Are they all finite sets?

>> Ming-Wei Chang: They're all finite sets.

>>: Okay.

>> Ming-Wei Chang: Well, they are finite sets, but they are very large finite sets.

>>: Okay.

>> Ming-Wei Chang: Right. So I think this problem came from a [inaudible] project many years ago about trying to book air tickets online automatically, translating human requests. But that was also a very limited domain.

So I think the biggest problem if you want to apply this to a large domain is how to define the meaning representation, because I don't know how to define a meaning representation in that setting.

So here is the result. We have 250 training question-and-answer pairs; we don't use any meaning representations. And at testing time we have 250 queries and answers, and we measure accuracy by the percentage of questions for which we get the answer correct.

So if you use a supervised model, which gets all of the meaning representations, you get 87.6 accuracy at training time and 80.4 at testing time.

With our model we get 82.4 and 73.2.

Okay, it is worse than the supervised algorithm, but you need to remember we used zero meaning representations; we just let the model run. So this result is not so bad.

So although supervised models range from 60 percent to 85 percent accuracy --

>>: Can you give me a rough idea how complicated this problem is? Because your predicates all have rules, right? So you could just enumerate all the possible combinations --

>> Ming-Wei Chang: It is not possible. So we --

>>: [inaudible]

>> Ming-Wei Chang: Yeah, okay. But I can tell you we formulate this problem as an integer linear program, and we have thousands of variables.

>>: Right. But --

>> Ming-Wei Chang: Okay. So theoretically you are right, you could just keep trying until it hits, but -- but --

>>: I just want to ask you, just have a rough idea, what is the complexity of the problem?

>> Ming-Wei Chang: What is the complexity of the problem?

>>: Yeah, [inaudible]

>> Ming-Wei Chang: I would say -- I'm not an expert on this, because I implemented it with my friend, but I think he told me that for one human query, even with all the constraints, you can still find 20 to 30 possible valid queries. So it's not that easy.

>>: Okay.

>> Ming-Wei Chang: But I'm not so sure the number is correct.

>>: So part of the point of what you're trying to do is to try to make it easier to gather the supervision variables, right?

>> Ming-Wei Chang: Right.

>>: But in this domain do you think it's actually easier to provide the right answer than to provide the semantic parse?

>> Ming-Wei Chang: This is a very good question. I think for some cases it is not easy. For example: what is the largest state bordering Nevada? I have no idea.

But for some questions it is easy for a human to do that.

Okay?

So the interesting point here is, and we haven't done this experiment, by the way, that if you start with those easy questions, it could generalize to the hard questions as well. We think that's an important next step, but we don't know how to evaluate this yet.

>>: Okay.

>>: Suppose instead of a binary [inaudible] you had a task which gave you a positive label, a negative label, or a don't-know label? I was wondering how that knowledge would affect your model and how that might change the characteristics of the solutions.

>> Ming-Wei Chang: That is a very good question, and the answer is I don't know [laughter]. But I know this may have some relationship to reinforcement learning. You can broaden your question and say: I'm going to give you real-valued feedback, your answer is 90 percent correct. Maybe you would need to change the model to handle that; right now our model doesn't take advantage of any feedback other than binary.

>>: It's a little different than that, though, because if the don't know case gets answered [inaudible]

>> Ming-Wei Chang: Right. But you can just treat it as a don't-know label. That's what we would do right now.

>>: [inaudible]

>> Ming-Wei Chang: Yeah.

>>: [inaudible]

>> Ming-Wei Chang: Right.

>>: And from your formulation it seems that you can still go through the [inaudible].

>> Ming-Wei Chang: So right now our formulation doesn't allow you to do that, but we're trying to think whether there is any way you can get positive feedback along the way.

So right now we only query the database after you finish translating the human query into a SQL query, right? But we are thinking about whether it is possible to query the database before you finish the transformation, to see if the partial translation is good enough or not.

>>: [inaudible] but I'm asking the fundamental problem from the relation problems. Let's say that you asked me for the [inaudible] yes or no, but you know [inaudible]

>> Ming-Wei Chang: Right now it would suffer, because right now it just trusts the feedback completely. But I think it is possible to adjust for that; I just don't know how well it would work.

>>: Okay.

>> Ming-Wei Chang: So, to recap. The first thing we talked about is constraint-driven learning: can we use constraints to improve a statistical model? The second thing is finding latent structures with constraints for binary output problems. In the third part we tried to see whether we can use binary supervision signals to improve structured output prediction learning. And in the final part we tried to learn with the world's responses.

And here at the end of my talk, I want to encourage you to use indirect supervision. There are many exciting new directions. One direction I'm particularly interested in is using existing labeled data as the [inaudible] for other tasks; maybe you want to build a better interactive learning algorithm, or use other forms of indirect supervision. And we can think of many other ways to do indirect supervision.

Thank you.

[applause]

>> Ming-Wei Chang: So I'm happy to take any other questions [laughter].

>>: So given that you solved this optimization problem with constraints, is it useful to do something [inaudible]

>> Ming-Wei Chang: We only did sensitivity analysis on very small problems, and it is clear some constraints are a lot more important than others, but we haven't performed a serious analysis on [inaudible]. These are good questions.

>>: The way you encoded the constraints [inaudible]

>> Ming-Wei Chang: No, you can also encode them as soft constraints.

>>: [inaudible]

>> Ming-Wei Chang: Right. But the constraint can be hard or soft, yeah.

>> Kuansan Wang: All right. Let's thank the speaker.

[applause]
