
>> Alias: Okay. Hi everyone. We're going to start with the last session of talks, three talks, and
I'm sure you still have some mental energy left to listen and, mainly, to ask questions at the
end. So the first talk is unsupervised dependency parsing with transferring distribution via
parallel guidance and entropy regularization, and the presenter is Max.
>> Xuezhe Ma: My name is Xuezhe Ma, and I'm from the Department of Linguistics, University
of Washington. Our work is about dependency parsing for resource-poor languages which have
no labeled training data. Our idea is to use parallel and unlabeled data to transfer
distributions from the parsing model of a resource-rich language and [inaudible]. This
is the outline. First, I will introduce our task; then I will describe our approach and show our
experiments; and at last I will give our conclusion and discuss some possible future work.
Let's see what our task is. Here is an example of a dependency tree. It is projective,
meaning that there are no crossing edges in the tree. Sometimes labels are used on
the edges to represent the dependency types, but in our work we do not use edge labels. In
our scenario, we assume that we have three kinds of resources. First, we have a resource-rich
language with a labeled monolingual Treebank; we refer to this language as the source
language. Second, we have parallel text between the resource-rich and the resource-poor
languages. Third, we have unlabeled text for the resource-poor languages. And we expect to develop a
monolingual parser.
Here, monolingual means the test data are in only one language; they do not have to be bilingual text.
This slide lists some related methods. We have no time to go into the details, but we
compare our system with all these existing systems. In the following I will describe our
approach. Our method can be summarized as estimating the transferring distribution from an
English parsing model. Here the transferring distribution is the distribution projected from an
English parsing model; I will describe it in detail later. By using the transferring
distribution we can transfer cross-lingual knowledge between the target and the source
languages. The parsing model we use is the edge-factored parsing model, which is a kind of log-linear
model. Here x is an input sentence, y is a valid parse, Y(x) is the set of all possible
dependency parses for input x, f are the feature functions, and the lambdas are the parameters
of the parsing model that we need to learn.
The weight function for an edge e first sums over all features belonging to the edge e with
the current parameters lambda, and then takes the exponential. Using the weight
function, the conditional distribution can be written as the product of each edge's weight,
where Z(x) is a normalization factor.
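In symbols, the edge-factored model just described has roughly this shape (the notation is reconstructed from the spoken description, not copied from the slides):

```latex
w_\lambda(e) = \exp\!\Big(\sum_j \lambda_j f_j(e, x)\Big),
\qquad
P_\lambda(y \mid x) = \frac{1}{Z(x)} \prod_{e \in y} w_\lambda(e),
\qquad
Z(x) = \sum_{y' \in \mathcal{Y}(x)} \prod_{e \in y'} w_\lambda(e)
```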
In the supervised case, a common method for training is to minimize the negative log-likelihood
function with respect to a set of labeled training data. But in our scenario we do not
have labeled training data, so the easiest way to solve this problem is to minimize the expected
negative log-likelihood function. Here the tilde-P is the transferring distribution that
reflects our uncertainty about the true labels.
As mentioned, we assume that we have a set of parallel data P and a set of unlabeled sentences U.
So we can decompose our objective function into two terms, K_P and K_U, where K_P is the
contribution of the parallel data and K_U is the contribution of the unlabeled data. And here we
introduce a parameter gamma as a tradeoff between K_P and K_U.
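A hedged reconstruction of the objective from this description (the exact weighting convention is an assumption; the talk only says gamma trades off the two terms):

```latex
K(\lambda) = K_P(\lambda) + \gamma\, K_U(\lambda),
\qquad
K_D(\lambda) = -\sum_{x \in D} \sum_{y \in \mathcal{Y}(x)} \tilde{P}(y \mid x)\, \log P_\lambda(y \mid x),
\quad D \in \{P, U\}
```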
The next problem we need to solve is how to define the transferring distribution. For
unlabeled data we take the transferring distribution to be the current distribution of our
parsing model, which is P_lambda. This is natural since we know nothing about the distribution
of the unlabeled data. For parallel data we define the transferring distribution by defining
the transferring weight function. Here tilde-W is the transferring weight function, W_E is the
weight function of the English parsing model P_lambda^E, and e_t is an edge in the
target parse. If e_t is aligned to an edge e_s on the source side, then its transferring weight
equals the weight of its corresponding edge e_s in the English parsing model; otherwise, the
transferring weight for e_t is the weight of its delexicalized form.
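Here is a minimal Python sketch of that definition. Every name in it (aligned_source_edge, delexicalize, the weight-function signatures) is a hypothetical stand-in, not the authors' code:

```python
def transfer_weight(e_t, alignment, w_english, target_sentence):
    """Transferring weight for a target-side edge e_t.

    If e_t is aligned to an edge e_s in the source (English)
    parse, reuse the English model's weight for e_s; otherwise
    back off to the delexicalized form of e_t (part-of-speech
    tags only, lexical items dropped).
    """
    e_s = aligned_source_edge(e_t, alignment)  # hypothetical; None if unaligned
    if e_s is not None:
        return w_english(e_s)
    return w_english(delexicalize(e_t, target_sentence))  # hypothetical helper
```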
Here is an example to interpret the definition of the transferring distribution. Suppose that
we have a pair of sentences, where German is the target language and the lines between the
two sentences denote the word alignments. Since the word alignments are generated
automatically, they are not perfect. Now suppose that there are two edges in the target
parse that, according to the word alignments, are aligned to two corresponding edges in the
English sentence. Then, according to our definition, their transferring weight equals the weight
of their corresponding edges in the English parsing model. Here we use different colors to
distinguish the different edges in the different equations. But the two red edges are
not aligned to any edges on the English side, so for these two edges the transferring weight
function equals the weight of their delexicalized form. For example, for this edge, we keep
only the part-of-speech tags and drop the other lexical information.
The optimization algorithm we use to optimize our objective function is limited-memory BFGS,
and to calculate the objective function and its gradient we use the inside-outside algorithm,
whose complexity is O(n^3) per sentence.
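As a sketch, plugging the objective and its inside-outside gradient into an off-the-shelf L-BFGS implementation might look like this. The expected_nll function and num_features are hypothetical placeholders for the K_P plus gamma times K_U computation, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def objective_and_grad(lmbda, parallel_data, unlabeled_data, gamma):
    # Hypothetical: compute K_P + gamma * K_U and its gradient.
    # The edge marginals needed for both come from the O(n^3)
    # inside-outside computation over each sentence.
    k_p, grad_p = expected_nll(lmbda, parallel_data)   # placeholder
    k_u, grad_u = expected_nll(lmbda, unlabeled_data)  # placeholder
    return k_p + gamma * k_u, grad_p + gamma * grad_u

result = minimize(objective_and_grad, x0=np.zeros(num_features),
                  args=(parallel_data, unlabeled_data, gamma),
                  jac=True, method="L-BFGS-B")
lambda_hat = result.x
```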
Now let me show the results of our experiments. The Treebanks used in our experiments are
the Google Universal Treebanks, both version 1 and version 2, and the Treebanks from the
CoNLL shared tasks. For parallel text, we used the Europarl Corpus version 7 for European
languages and the Kaist Corpus for Korean. The unlabeled data comes from the training portion
of each Treebank. For word alignments we used GIZA++, and for part-of-speech tags we used
the universal part-of-speech tag set proposed in Petrov et al. 2011, produced with the Stanford
POS tagger.
This slide lists the systems that we ran in our experiments for comparison. DTP is the direct
transfer parser proposed in McDonald et al. 2011. We re-implemented it, and the results of our
re-implementation are marked as DTP with a tag. PTP is the projected transfer parser; we also
re-implemented it, and its results are marked as PTP with a tag. -U and +U are two
versions of our approach: -U trains on only parallel data, and +U trains on both parallel and
unlabeled data. OR is the fully supervised parser, which can be regarded as an oracle.
Here are the results on the Google Universal Treebanks version 1. The left three
columns are the baseline systems, the two columns in the middle are the two versions of our
approach, and OR is the oracle. You can see that our approach significantly outperforms all the
baseline systems across all five target languages. And here are the results on the Google
Universal Treebanks version 2; we can see the same trend as in the results on version 1.
We also ran experiments on different amounts of parallel text. This figure shows the UAS on
German. Here the X axis is the number of parallel sentences and the Y axis is UAS; the
blue bars are the results of the projected transfer parser, and the orange and green bars are the
two versions of our approach. For the other languages we obtained the same observations. We also ran
experiments on the Treebanks from the CoNLL shared tasks. Here are the results for each target
language. Here DMV is the unsupervised dependency parsing model, the four
columns in the middle are the baseline systems, here is our result, and OR is the oracle. You
can see that our approach achieves the best results across all eight target languages.
For the experiments on the CoNLL Treebanks we also compared our systems with more
baselines. We compared with the weakly supervised systems: the phylogenetic grammar
induction model, the unsupervised parsing model with non-parallel multilingual guidance, and
the posterior regularization approach. They all report UAS on sentences of length 10 or less,
excluding punctuation, from the CoNLL shared-task Treebanks. Here the four columns in the
middle are those systems, and next to them is our approach. We can see that our approach
achieves the best results on most of the target languages.
Conclusion and future work. In our work we proposed an approach for dependency parsing for
resource-poor languages which do not have labeled training data. Our approach can be
used in a completely monolingual setting, and it achieves better parsing
performance on three data sets.
For future work, first, we can extend our approach to non-projective parsing by replacing the
Shannon entropy with Renyi entropy. Another direction is to extend the transferring weight to
[inaudible] paths by defining the transferring weight to involve multiple edges. And
another is to replace the parallel text with editing data. So, that's it. Thank you.
>> Alias: Any questions? Actually, I have one question. You mentioned at the beginning this
parameter gamma that is trading off between the two probabilities. So I was wondering how
you...
>> Xuezhe Ma: [inaudible] gamma?
>> Alias: How you frame that or how you set the parameter.
>> Xuezhe Ma: Yes. For the parameter gamma, since we assume that we have no labeled
data, we do not want to tune the gamma using a development set. So in our work
we chose the gamma according to the ratio between the number of tokens of the
parallel data and the unlabeled data. So, for example, if the parallel data has 10,000 tokens and
the unlabeled data has 10 times the tokens of the parallel data, then we choose gamma to be
0.1.
>>: Very interesting work. I was wondering if you happened to look into the amount of
improvement you get as a function of the specific target languages you used, in terms of how
free their word order is compared to, say, English, which you probably used. Is there any correlation
between some characteristics of the target languages, like free word order or not, or the
extent of the free word order they have compared to English?
>> Xuezhe Ma: I do not understand the meaning of free order.
>>: So in English, for example, the word order is very rigid. This is basically how you know
subject versus object, because the subject will come before the verb and the object will come
after. In other languages you might have other cues like morphological inflections and so on.
Maybe it doesn't matter for your approach, maybe it does. I was just curious to see if you
looked at that.
>> Xuezhe Ma: I think, you know, we have the experiments on Korean. I think Korean and
English have different word orders. And we can see our results on Korean here. For Korean
we can see that if you use the direct transfer parser then the parsing accuracy is
pretty low, but if we use our approach we get a significant improvement. So I think the
effect is mainly through the quality of the word alignments, so maybe less accurate alignments
may lower our accuracy. But with our approach it does not seem to be a big factor.
>> Alias: A super quick one.
>>: So if you make the word alignment worse, would the Ryan McDonald model do better?
Because they have a subsequent training step that doesn't entirely trust the word alignment,
right, if I remember correctly. But in your case...
>> Xuezhe Ma: In the McDonald work, [inaudible] parser.
>>: Yeah.
>> Xuezhe Ma: I think, I have not thought about this, but you know McDonald's work is a kind of
weakly supervised learning. So if the word alignment were worse then, yes, of course it would
lower the parsing accuracy, but I don't think that this would be a bigger factor for
our approach.
>> Alias: So let's thank our speaker again. So the second paper for this session is graph-based
posterior regularization for semi-supervised structured prediction, and it's presented by Luheng
He.
>> Luheng He: Can you hear me? Yes? It's working?
>> Alias: Is the mic on?
>> Luheng He: It's on. Okay. Hi. I'm Luheng. I'm a student at UDub. Today I'm going to tell
you about how to use graph propagation to help semi-supervised part-of-speech tagging, and
especially how we did it by simply optimizing a joint objective function. This is joint work with
Jennifer Gillenwater and Ben Taskar, and this work was originally presented at CoNLL
2013.
Here is an overview of my talk today. First, I'm going to talk about structured prediction, in
particular the conditional random field with its application to part-of-speech tagging. Then I will
talk about graph propagation, which is an alternative approach to part-of-speech tagging; it's a
semi-supervised, instance-based method. And finally I will talk about how we successfully
combined these two very different methods into a single joint objective function that can be
efficiently optimized.
So I guess most of you are familiar with part-of-speech tagging. Basically, for each word in a
sentence we want to assign to it a syntactic category such as a determiner or a verb. So for each
instance we have the input X, which is simply the sentence, and the output structure Y is the
sequence of part-of-speech tags. The conditional random field is a standard technique for doing
part-of-speech tagging, and its tagging prediction is based on the feature function F. This is defined on
each of the local factors in the output structure Y; it usually contains the current tag Y_t, the
previous tag Y_{t-1}, and the input X. We also have a set of parameters theta, which
are the feature weights. So with theta and the feature function F we can compute the potential
scores for each of the local factors; therefore we can model the conditional distribution over
all possible tagging sequences given an input sequence X. In particular, P_theta of Y given X is
the conditional probability of a particular tagging sequence Y given an input X. It is computed
by taking a product of all the local factor scores and normalizing to sum up to one. Therefore,
learning the conditional random field model can be formulated as an optimization problem
where we want to choose the best parameters theta to minimize the negative log-likelihood of
all the labeled sentences X_1 through X_L.
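In symbols, the chain CRF just described is (standard notation, reconstructed rather than copied from the slides):

```latex
p_\theta(Y \mid X) = \frac{1}{Z_\theta(X)} \prod_{t=1}^{|X|} \exp\big(\theta \cdot f(Y_t, Y_{t-1}, X)\big),
\qquad
\min_\theta \; -\sum_{i=1}^{L} \log p_\theta(Y_i \mid X_i)
```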
Different from the conditional random field, graph propagation is an instance-based method;
it tries to predict one tag at a time without considering any structural information. In our
particular setting each input is a trigram in a sentence and the output is the part-of-speech tag of
the center word in the trigram. The basic assumption is that if two words frequently co-occur
in similar contexts, then they should be assigned similar part-of-speech tags.
To apply this assumption we start by building a sparse K-nearest-neighbor graph using all the
trigrams that occur in the corpus as nodes, and for each pair of trigrams we compute the
distributional similarity using the words that are in the context of the trigram and inside the
trigram. For trigrams that are similar enough we add an edge between them, with edge
scores representing the similarity scores. If a trigram ever occurred in a labeled sentence, we
consider the corresponding node labeled; otherwise it's unlabeled. Then we can propagate
information from the labeled nodes to the unlabeled nodes, and this gives us a tagging
distribution over the center word for each unlabeled trigram. Formally, this tagging
distribution is determined by choosing the one that minimizes its difference with its neighbors.
A final step is to sum over the entire graph to get a quadratic penalty term; we call it a
Graph Laplacian Regularizer. This regularizer forces the tagging distributions for similar trigrams
to agree with each other. So again, graph propagation can be formulated as another
optimization problem where we want to choose the best tagging distributions to minimize this
Graph Laplacian Regularizer term.
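The Graph Laplacian Regularizer described above is, in its usual form (assuming w_{ij} are the similarity scores and q_i is the tagging distribution of node i):

```latex
\mathcal{R}(q) = \sum_{(i,j) \in E} w_{ij}\, \lVert q_i - q_j \rVert^2,
\qquad
\min_q \; \mathcal{R}(q)
```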
Now take a step back. We talked about two very different methods to do part-of-speech tagging.
The first one is the conditional random field, which is a supervised method that learns from
labeled data; it models the conditional distribution over all possible taggings given an input
sequence and is parameterized by theta. And we also talked about graph propagation, which
builds a K-nearest-neighbor similarity graph from labeled and unlabeled data and propagates
information across that similarity graph. It models the tagging distribution of the center word for
each unlabeled trigram. In the next few slides I'll talk about how we can combine these two
methods.
There is some prior work that tries to do this. The work most related to ours is by
Subramanya et al., EMNLP 2010. They used an iterative method: they first trained a
conditional random field on the labeled data, they used the CRF predictions to initialize the
graph propagation, and then used the graph propagation results to provide additional training
data for the CRF, and this goes on for many iterations. This work was quite successful, but
it also raises several questions, such as: is it optimizing some joint objective? Or, is it
guaranteed to converge? These questions actually motivated our work to build a joint
framework that has nice guarantees.
A first step to do this is to introduce a set of auxiliary variables Q. We can consider Q a copy of
the conditional distribution P_theta. In particular, Q_i(Y) models the conditional probability of
a particular tagging sequence Y given the input X_i, and we make two further assumptions on the
conditional distribution Q: first, because Q is a probability distribution, it should be
normalized to sum up to one for each sentence; and second, we assume that Q can be
decomposed exactly the same way as P_theta is. So we can write each Q_i(Y) as the
product of a bunch of local factor scores; we call them R. Later we'll see that these
assumptions are crucial to our efficient optimization.
Now with the auxiliary distribution Q we can write our joint objective function. It consists of three
parts. The first part is the Graph Laplacian Regularizer, defined on the distribution Q, where
the tagging distribution for each trigram is computed by normalizing Q with respect to each
trigram and each tag. The second part is simply the objective function of the conditional random
field defined over P_theta. If we optimized the first two terms separately, we would get two
very different distributions that capture different information that's helpful for part-of-speech
tagging. So we have the final term, which minimizes the KL divergence between the two
distributions Q and P_theta. Basically, this term forces Q and P_theta to agree with each other,
so that the final prediction can benefit from both the graph propagation information and the
conditional random field information.
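Putting the three parts together, the joint objective has roughly this shape (the trade-off constants alpha and beta are assumptions; the talk does not give the exact weighting):

```latex
\min_{\theta,\, q} \;\; \alpha\, \mathcal{R}(q) \;-\; \sum_{i=1}^{L} \log p_\theta(Y_i \mid X_i) \;+\; \beta \sum_{i} \mathrm{KL}\big(q_i \,\Vert\, p_\theta(\cdot \mid X_i)\big)
```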
So the final question that remains is how we can optimize this joint objective efficiently. We
have two sets of parameters, Q and theta. Theta, remember, are the feature weights in the
conditional random field. They are completely unconstrained, so they can take any positive or
negative value, so optimizing for theta is very easy: we just do a straightforward gradient
descent method. However, updating Q is more complicated, for two reasons. The first is that Q
is constrained to sum up to one. And second, there's no compact representation for Q.
Remember that for each sentence X_i, the number of all possible taggings Y equals the
number of tags raised to the power of the sentence length. So basically there is an exponential
number of components in the distribution Q, and it's simply impossible to memorize or update
that many variables.
However, we still want to do this. Remember we made the assumption that Q can be
decomposed as a product of local factors R. So it's natural to think: what if, instead of updating
those Q variables, we maintain and update those R variables? There are far fewer of those.
But consider: if we do an additive gradient update on Q, then the resulting Q-prime will no
longer be representable as a product of local factors. Therefore, we need to use an algorithm
called exponentiated gradient descent. Instead of an additive update we do a multiplicative
update, which means we take the gradient term, take its exponential, and multiply it into the
original Q, resulting in the updated Q. When written in this form we can see that both the
original Q and the gradient term can be decomposed into a set of local factors. We skip the
details here, but basically we can write the gradient term as a bunch of local factors and
distribute those local factors onto each of the R variables. This means that instead of updating
all the Q variables we only need to maintain and update the R variables, and then compute Q
using a forward-backward algorithm.
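A minimal sketch of the multiplicative update on the local factors, with everything here hypothetical except the update rule itself: instead of q minus eta times the gradient, each local factor r is multiplied by the exponential of minus eta times its share of the gradient, which keeps Q a product of factors:

```python
import numpy as np

def eg_update(r, local_grads, eta=0.1):
    """Exponentiated-gradient step on the local factor scores.

    r           -- dict mapping a local factor id, e.g. (t, y_prev, y_cur),
                   to its current score (log-domain would be more stable)
    local_grads -- the objective's gradient distributed onto the
                   factors (hypothetical; one value per factor id)
    """
    return {fac: score * np.exp(-eta * local_grads[fac])
            for fac, score in r.items()}

# After the update, Q is renormalized implicitly by running
# forward-backward over the new factors to get Z and the marginals.
```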
To summarize, we have two sets of conditional distributions, Q and P_theta, and we have this
joint objective function. We can optimize it using an EM-style algorithm where at the M step we
update theta, and at the E step we update each local factor in Q, which are the R variables, and then
we can normalize Q by doing a forward-backward pass. This algorithm is guaranteed to
converge to a local optimum.
Now we come to the experiments. We tested our algorithm on 10 different languages from the
CoNLL data set, and we tested in a weakly supervised setting, which means for each language
we used only 100 labeled examples. We used the universal part-of-speech tag set, so it's the same
tag set for all the languages, and for the base model we used a second-order conditional
random field.
Here are our results. On the X axis are the 10 different languages, starting from English, and the
rightmost bar is the average over all the languages. The bars show part-of-speech
tagging errors, so the lower the better. The first baseline we're comparing against is the graph
propagation result, which comes from simply minimizing the first term in our objective function.
The second baseline is computed by learning a conditional random field from the graph
propagation result for one single iteration. It has some improvements, but not much. And the
third baseline is simply training a conditional random field using the 100 labeled examples.
It actually outperforms the previous baselines for some languages. This tells us two things: one,
that structural information is really important; the other, that graph propagation provides
additional information as well as additional noise.
So finally, we have our joint objective, which combines the power of the previous two methods.
It outperforms the previous baselines for all languages consistently, with a 28 percent
relative error reduction on average.
As a conclusion, we wanted to use graph propagation to help weakly supervised structured
prediction. In particular, we formulated the graph propagation method as a regularizer term
and encoded it in a joint objective function, and we also proposed an efficient algorithm for
optimizing this joint objective. It is interesting to think about what other kinds of
posterior constraints we could use to help structured prediction: we can use any constraint as
long as it can be written as a convex and differentiable regularizer term. Our code can be found
at the link below; you can try it out and play with it. Thank you for your attention, and I'm happy
to answer questions.
>> Alias: Any questions?
>>: So am I right in thinking that you are forcing the graph to do a prediction on every single
position in the chain?
>> Luheng He: Not exactly. It's all the trigrams. So if two positions have the same trigram,
they are combined into the same node.
>>: But in your joint objective could you use the CRF for known words? I guess another way to
ask the question is that the graph in the Subramanya work could be sparser. It doesn't have to
predict on everything, right, because they take the average. They sort of do self-training in
order to combine the predictions of the graph and the CRF. But as you have a full joint model...
>> Luheng He: Yeah, but the graph regularizer is still based on the tagging distribution for each
node, and the node is marginalized and normalized, basically taking the average over all of the
labeled data. Am I understanding it right?
>>: I mean, you could use the Subramanya method without forcing the graph to label every single
word in your training data, right? You could use it for, let's say, unknown words. But in this one
you have to do it on all, every single...
>> Luheng He: Yeah. The graph contains information about all the tokens.
>>: Okay. I just wanted to clarify. It’s just a question about the model. I just wanted to
understand it better.
>> Luheng He: Thank you.
>>: So I was just wondering, how fast do the EM steps converge?
>> Luheng He: We ran it for 20 iterations.
>>: Is that different by languages?
>> Luheng He: For larger languages it's slower, of course.
>> Alias: We have time for a quick question; otherwise, let's thank our speaker again. All right.
Okay. Our next and final talk for today, we kept the best for the end, is multi-metric
optimization using ensemble tuning, and Anoop Sarkar is presenting.
>> Anoop Sarkar: So you know that two out of the three co-chairs are my students, because
they put me as the last talk. This is their way of getting revenge on me. So I'm going to talk
about this work that was done primarily by my student, Baskaran, who couldn't be here
today. So I'm presenting on his behalf and also on behalf of Kevin, who is in Japan. He's a
collaborator in this work, and it's about multi-metric optimization. So I'm going to start with a
caricature of how we do discriminative training in MT, and this has two objectives. If you
already know how it works, then it will introduce you to the caricature I'm going to use to
explain what we did; and if you don't know how MT tuning works, this will give you a distorted
view of what it does, and at least enough to hopefully understand our contribution.
So we start with a French sentence there, and an English sentence here, which is the translation
of the French sentence. This is a machine translation decoder which is going to produce what the
machine translation system currently thinks is the translation of that French sentence. So you
can get an n-best output for this French sentence, and we are trying to tune the parameters
of the model. What we want to do is use these n-best lists in order to train the
parameters of the model; so we have a bunch of knobs that we want to turn in order to make
sure that the translations we get in that n-best list match the actual translation of F, right?
We might get something wrong, and we want to tune these knobs so that in the next
iteration we use a better weight vector, and that better weight vector goes back into the decoder and
hopefully we get better.
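As a sketch, the caricature looks like this in code. Every name here (decode_nbest, optimize_weights) is a hypothetical placeholder, not a real decoder API:

```python
def tune(weights, dev_french, dev_english, iterations=10):
    """Caricature of MT discriminative tuning: decode an n-best
    list for each dev sentence, pick weights that make the n-best
    lists match the references better, repeat with the improved
    weights."""
    for _ in range(iterations):
        nbest = [decode_nbest(f, weights) for f in dev_french]   # placeholder
        weights = optimize_weights(nbest, dev_english, weights)  # placeholder
    return weights
```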
So this is our tuning loop. We are training our system to produce better machine translation
outputs. So the question is how to choose this weight vector. Well, this weight vector is
something that's going to match an extrinsic score, so a loss function of some kind. And that loss
is based on some metric, some way of saying that this is a good translation or this is a bad
translation. Machine translation losses are notoriously difficult because there's more than one
way to translate a sentence, so you don't get a loss that's as nice as you usually do in machine
learning.
So let's look at this weight optimization step. We want to say: if the model likes it and the
metric likes it, that's great. Our model score is good, and it also performs well according to the
metric. If our model score is really high and our metric says this is bad, then we should be
changing our weights. In this case maybe they're both good; in some cases the model score is
bad and the metric says it's bad, and that's good too, right? We are correctly predicting that it
should be bad.
So here's one particular metric that's called Bleu. It basically matches n-grams in the
output with the reference. On the x-axis is the model score, so you can see that this guy and
this guy are kind of evil twins, right? This one has a high model score and is really good on our
metric, the higher the better, and this one has a really good model score but is terrible on the
metric. So what we really want to do is say: here are two things that are getting high model
scores; prefer this guy over this guy. So you can replace this complicated machine translation
loss function with something that is basically binary classification. This is the idea behind PRO.
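PRO reduces tuning to binary classification over hypothesis pairs. A rough sketch under the usual formulation (the sampling scheme and the assumption that each hypothesis carries a .features list and that metric(h) returns a sentence-level score are mine, not the exact algorithm from the talk):

```python
import random

def pro_training_pairs(nbest, metric, n_pairs=50):
    """Sample hypothesis pairs from an n-best list. Each pair where
    the metric is not tied becomes a classification example: the
    feature-vector difference, labeled by which hypothesis the
    metric (e.g. sentence-level Bleu) prefers."""
    pairs = []
    for _ in range(n_pairs):
        a, b = random.sample(nbest, 2)
        if metric(a) == metric(b):
            continue
        better, worse = (a, b) if metric(a) > metric(b) else (b, a)
        diff = [x - y for x, y in zip(better.features, worse.features)]
        pairs.append((diff, +1))
        pairs.append(([-d for d in diff], -1))
    return pairs  # feed to any binary classifier to get new weights
```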
So the question is, is this metric the most suitable one? It's happening less often for some
reason, but every time you go to an ACL or an EACL, somebody in the audience will get
up and say, I hate the Bleu score. When is the Bleu score going to end? And lots of people
have tried many, many replacements for the Bleu score. You can take your pick; there's a
whole bunch of different metrics, and each of them can be used to choose a weight vector that
matches one of these extrinsic scores. But when you look at every single paper on machine
translation, we optimize using Bleu. It's the Highlander of machine translation metrics:
there can be only one. It has killed everything else. And it is true, it does work better. So
people have tried other metrics; it just seems to be the best one. But in this work we want to
say maybe you don't have to choose one. Maybe we can be sort of anti-Highlander and have
many people live.
So in this work we're actually going to look at different metrics and see how we can use them
all in order to tell us what's a good translation and what's a bad translation, all at the same time.
So we have the Bleu score, which does n-gram matching. Then there's Meteor, [inaudible]
favorite one, and it's like Bleu, except Bleu essentially captures recall only in a funny way, while
Meteor captures precision and recall directly and also has some synonym and stem matching.
There's one that you might not have heard of that's called RIBES, and RIBES is really good at
figuring out whether things are out of order or not. So it's very popular for Japanese-English
translation, because it measures out-of-order matches and does it using Spearman's correlation
coefficient. And finally, there's a good old one from the speech recognition days, which is TER.
This is translation error rate: insertions, deletions, and block shifts.
Okay. So these are all different reasonable ways to measure how different your hypothesis is
from the reference. What we're going to do is use them all at the same time. So let's define
what that means. We want to find the weight vector that's good according to multiple metrics.
So this is a formal way to say that: find one W that simultaneously optimizes
many different objectives. Each metric is treated as an objective O_i over the hypotheses
produced, and G is the way to combine them.
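In symbols, the multi-metric objective sketched above is something like (the argmax convention is an assumption; the talk only names G and the per-metric objectives):

```latex
w^{*} = \arg\max_{w} \; G\big(O_1(w),\, O_2(w),\, \ldots,\, O_k(w)\big)
```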
So one way to do it is simply, at the point when you choose a weight vector, to take two
things, like TER minus Bleu divided by two. Just combine them together, and now you have a
linear combination of two different metrics. This has been done before, and it's going to be
our baseline. One problem with this is that if you want to combine them and you want to
say, I actually want more weight on the Bleu score and less weight on the error rate, you
have to do that by hand. There's no automatic way to do it. So we're going to address that
issue, and we're going to do things a bit differently.
So what is a reasonable way to combine different metrics? Well, one reasonable way is the
notion of Pareto efficiency, which defines optimality with respect to different metrics:
you look at objective A and objective B and look at this frontier. The points down here
don't matter; they're worse according to both metrics. The points on the
frontier are the ones we should be looking at. So this is the idea we are going to use. There's
previous work by Kevin that modifies PRO to push the points toward the frontier, but
one problem with that previous work was there was no way to take advice from
all the different objectives at the same time. And that's another problem we solve in this work.
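For intuition, here is a small self-contained sketch that extracts the Pareto frontier from a set of (metric A, metric B) points, higher being better on both axes:

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point, where
    p is dominated if some other point is at least as good on
    both metrics and strictly better on at least one."""
    frontier = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p
                        for q in points)
        if not dominated:
            frontier.append(p)
    return frontier

print(pareto_frontier([(1, 5), (3, 3), (2, 2), (5, 1), (2, 4)]))
# -> [(1, 5), (3, 3), (5, 1), (2, 4)]   ((2, 2) is dominated by (3, 3))
```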
We introduced four different ways of doing this, and I'm just going to present our main result
due to lack of time, and I'm going to explain to you what it does using my caricature of tuning.
We call it ensemble tuning. The idea is the same as before: you get your n-best list, nothing
different here. Now if you have, for example, Bleu as one of your objectives, you do Bleu weight
optimization, so you turn the knob so that you get the best weight vector that optimizes Bleu;
and let's say the other metric is Meteor, you do that as well, and you get another weight vector.
So you get two different weight vectors, W1 and W2, and they both have different views on
what the weight vector should be. And now what we do is combine these using meta-weights.
The meta-weights say how much should I trust Bleu and how much should I trust Meteor, and
the meta-weights are trained so that the points we pick as a combination of these two lie on
the Pareto frontier. So it will reward things that are on the Pareto frontier, and everything
under that frontier will get penalized.
Now there's one issue here, which is that when I combine these two I still want one model. I
don't want a Bleu model and a Meteor model, because now I have two problems, not one;
before I had at least only one problem. So I need a way to combine the predictions of the Bleu
model with the Meteor model in this case. So how do we deal with these multiple
components? That's the second problem we solve: you don't have to pick and choose. The
usual way people use Pareto optimality is they say, well, somebody else is going to pick one
of these depending on the ordering you want. So if you really want Bleu scores you'll pick this
one; if you really want Meteor scores maybe you'll pick this one. We don't actually have to
make that choice. We combine the weights using ensemble decoding, and the idea behind
ensemble decoding is very simple. We have weight one and weight two, one from Bleu and one
from Meteor, and whenever we make a prediction we actually combine them together while
we're decoding.
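A sketch of the score combination inside ensemble decoding, assuming the simplest mixture operation. The system supports several combination modes, so this linear mixture is just one plausible instance, and all names here are hypothetical:

```python
import numpy as np

def ensemble_score(features, weight_vectors, meta_weights):
    """Score a partial hypothesis under several tuned weight
    vectors (e.g. one from Bleu tuning, one from Meteor tuning)
    and mix the per-model scores with the meta-weights that were
    trained to reward points on the Pareto frontier."""
    per_model = [np.dot(w, features) for w in weight_vectors]
    return float(np.dot(meta_weights, per_model))

# e.g. two models, trusting Bleu's weights 60/40 over Meteor's:
# ensemble_score(f, [w_bleu, w_meteor], [0.6, 0.4])
```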
So basically, we can take as many weight vectors as we want and produce something that will
be a combination of these. And we just do that all the way up until we get a translation, and
each translation is going to be a combination of these two weight vectors. Okay. So we
implemented this ensemble decoding idea, as well as the Pareto weight training, in our favorite
decoder, which we wrote and which is available on github if you want to use it, if you're brave
enough. We used fairly standard large-scale [inaudible] Arabic-English and Chinese-English, and
this is going to be two-dimensional. So this one is RIBES and Bleu; I'll show you results for other
metrics, but let's just look at Bleu on one axis and RIBES, which is another MT metric, on the
other axis, plus LC. So Bleu by itself is over there, and RIBES by itself is over there. You can see
that if you optimize towards Bleu you'll do better on Bleu; if you optimize towards RIBES you do
better on RIBES; and the LC points are the linear combination points. This is linear combination
plus ensemble. And that point up on the right is Bleu, RIBES, and TER altogether. And this point
over here is Bleu and RIBES together. So you can see that what we want is points that are in the
upper right quadrant, right? Things that are good according to both metrics, and ideally
better than doing either one individually.
Here is Chinese-English, and you can see that as you add more metrics you can do a bit better.
It's better on Bleu than just doing Bleu itself. In this case it's not as good as doing
Meteor by itself, but you can see that Meteor scores really badly on Bleu. So the question is,
are those points up there good because you get a good Bleu score? Can you always improve
the Bleu score? Here we actually show that if you have a single objective that's just Bleu, it
gets better; if you have two metrics, Bleu and RIBES does better; and this is Meteor and RIBES;
Bleu, Meteor, TER are both better than just doing Bleu. And then we did some crazy stuff over
here, like B3, which is Bleu with three-grams instead of four-grams, and so on. So you can invent
as many metrics as you want. And you can see that, in fact, it does better.
We also tried to look at whether those points in the upper right quadrant are actually
better. Do they actually read better? So we did a human evaluation, which was a post-editing
task, and we saw a six percent gain in post-editing error. So it was six percent easier to post-edit
the multiple-metric output than the single-metric output. And that's it.
>> Alias: Do you have any questions?
>>: I just wondered if you found any language differences in preferring one metric over the other?
>> Anoop Sarkar: So Kevin has done work on Japanese-English and RIBES helps more for
Japanese-English just because of the reordering, and we haven't done anything other than
Arabic-English and Chinese-English yet.
>>: Do you use a combination of [inaudible]?
>> Anoop Sarkar: It's conceivable that you could have some metrics that help more with
languages that are quite different. I don't know. If you had a really good metric for
morphologically more complex languages, which doesn't exist, but if you had one, then we
could use that in order to improve the Bleu score.
>>: Right. So in that case, when you combine multiple metrics, you'd have to take that
into consideration.
>> Anoop Sarkar: Yeah, yeah. But I mean it's an interesting challenge at least. It's not just
mindless. I think that's an interesting problem to have.
>>: I didn't 100 percent understand why you needed to combine the weights on the fly during
inference as opposed to combining them off-line into a single model.
>> Anoop Sarkar: Oh, yeah. It does better. That's what we claim. I agree it's not a given, but we
have compared it to linear combination off-line; and ensemble decoding allows you to do
different things, so it allows you to do, for example, switching, which you can...
>>: Oh, I see. So you're not just, okay.
>> Anoop Sarkar: It allows you to do different kinds of [inaudible]. So we have different ways of
combining the models that you can sort of turn off and on in the decoder.
>>: So you're learning combination methods.
>> Anoop Sarkar: So [inaudible] thesis is on that. We have a lot of papers on ensemble
decoding as our hammer that improves over linear combinations.
>>: So I'm surprised that when you optimize both Bleu and Meteor, for example, that you can
do better on Bleu than just optimizing on Bleu alone.
>> Anoop Sarkar: It’s not surprising.
>>: It's not surprising?
>> Anoop Sarkar: Well, you can think of Bleu as being very, very sensitive to length, but
actually maybe it shouldn't be. In some cases it might penalize a translation really badly because
you hit the brevity penalty, and that leads you into a bad space.
>>: So I can see why optimizing both metrics at the same time would do better for humans, but
just on the Bleu metric alone...
>> Anoop Sarkar: It just stops Bleu from making a mistake. I mean, that's a story you could tell.
I don't know. We have tried to analyze it in different ways. Baskaran has done some things
where he on purpose damages one metric and then sees what you can recover
from the other metric. It's kind of interesting to see, but those are still artificial [inaudible].
There's an interesting paper by the BBN guys, I think, about putting together a
bunch of different features to sort of get Bleu on the right track, as it were, and this might be
what's happening here as well.
>>: Thanks.
>> Alias: So let's thank our speaker again. And I think now we have some closing remarks.
>> Anne Clifton: Hi. I'm Anne Clifton. I'm a PhD student at Simon Fraser with Anoop and one
of your co-chairs today. So just really quickly I wanted to thank all of you for participating today
and making this the most successful Northwest NLP to date. And in particular, I would like to
take a moment to thank our hosts here at Microsoft, particularly Will Lewis who’s set all this up
for us. So I hope you’ll join me. So that's all. Thanks for coming. I hope to see you in 2016.
>>: Before you leave I do want to mention that actually I'm getting almost too much credit
here. Anne, Anoop, Maryam, and Yashar[phonetic] did an enormous amount of work and this
honestly is the most successful, by number, NWNLP that we've had. If we continue at this rate
we're going to have to move to a much bigger venue next time. So hopefully we don't double
yet again next time. Let's try to keep it around 300 next time; maybe it will be okay. Thank you
everyone.