>> Alias: Okay. Hi everyone. We're going to start with the last session of talks, three talks, and I'm sure you still have some mental energy left to listen and, mainly, to ask questions at the end. So the first talk is unsupervised dependency parsing with transferring distribution via parallel guidance and entropy regularization. And the presenter is Xuezhe Ma. >> Xuezhe Ma: My name is Xuezhe Ma, and I'm from the Department of Linguistics, University of Washington. Our work is about dependency parsing for resource-poor languages which have no labeled training data. And our idea is to use parallel and unlabeled data to transfer distributions from the parsing of a resource-rich language and [inaudible] work [inaudible]. This is the outline. First, I will introduce our task, and then I will describe our approach and show our experiments, and at last I will give our conclusions and discuss some possible future work. Let's see what our [inaudible] is. Here is an example of a dependency tree. It is projective, meaning that there are no crossing edges in this tree. Sometimes we can use labels on the edges to represent the dependency types, but in our work we do not use edge labels. In our scenario, we assume that we have three kinds of resources. First, we have a resource-rich language with a labeled monolingual Treebank. We refer to this language as the source language. Second, we have parallel text between the resource-rich and the resource-poor languages. Third, we have unlabeled text for the resource-poor languages. And we expect to develop a monolingual parser. Here, [inaudible] the test data are in only one language. They do not have to be bilingual text. And this [inaudible] some related methods. We have no time to go into the details, but we compare our system with all these existing systems. In the following I will describe our approach. Our method can be summarized as estimating the transferring distribution from an English parsing model. Here the transferring distribution is the distribution projected from an English parsing model. I will describe this in more detail later. By using the transferring distribution we can transfer cross-lingual knowledge between the target and the source languages. The parsing model we use is the edge-factored parsing model, which is a kind of log-linear model. Here x is an input sentence, y is a valid parse, Y(x) is the set of all possible dependency parses for input x, f are the feature functions, and lambda are the parameters of the parsing model that we need to learn. The weight function for an edge e first sums over all features belonging to the edge e with the current parameters lambda, and then takes the exponential. By using the weight function, the conditional distribution can be written as the product of each edge's weight, where Z(x) is a normalization factor. In the supervised case a common method for parameter estimation is to minimize the negative log-likelihood function with respect to a set of labeled training data. But in our scenario we do not have labeled training data. So the easiest way to solve this problem is to minimize the expected negative log-likelihood function, where P-tilde is the transferring distribution that reflects our beliefs and uncertainty about the true labels. As mentioned, we assume that we have a set of parallel data P and a set of unlabeled sentences U.
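As a rough illustration of the edge-factored model just described, here is a minimal Python sketch; the data structures and feature representation are hypothetical stand-ins rather than the authors' code, and in practice the normalizer Z(x) is computed with dynamic programming instead of enumeration:

```python
import math

def edge_weight(edge_features, lam):
    # w(e) = exp(sum of parameter values for the features firing on edge e)
    return math.exp(sum(lam.get(f, 0.0) for f in edge_features))

def parse_score(parse, feats_of, lam):
    # Unnormalized score of a parse: the product of its edge weights.
    score = 1.0
    for edge in parse:
        score *= edge_weight(feats_of(edge), lam)
    return score

def parse_prob(parse, all_parses, feats_of, lam):
    # P_lambda(y | x) = score(y) / Z(x), where Z(x) sums over Y(x).
    # Enumerating Y(x) is exponential and shown only for illustration;
    # the talk uses the inside-outside algorithm instead.
    z = sum(parse_score(y, feats_of, lam) for y in all_parses)
    return parse_score(parse, feats_of, lam) / z
```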
So we can decompose our objective function into two terms, K_P and K_U, where K_P is the contribution of the parallel data and K_U is the contribution of the unlabeled data. And here we introduce a parameter gamma as a tradeoff between K_P and K_U. The next problem we need to solve is how to define the transferring distribution. For the unlabeled data we take the transferring distribution to be the actual current distribution of our parsing model, which is P_lambda. This is reasonable, since we know nothing else about the distribution of the unlabeled data. But for parallel data we define the transferring distribution by defining a transferring weight function. Here W-tilde is the transferring weight function, W_E is the weight function of the English parsing model P_lambda_E, and e_t is an edge in the target parse. So if e_t is aligned to an edge e_s on the source side, then its transferring weight equals the weight of its corresponding edge e_s in the English parsing model; otherwise the transferring weight for e_t equals the weight of its delexicalized form. So here is an example to interpret the definition of the transferring distribution. Suppose that we have a pair of sentences, where German is the target language, and the lines between the two sentences denote the word alignments. Since the word alignments are generated automatically, they are not perfect. Now suppose that there are two edges in the target parse that, according to the word alignments, are aligned to two corresponding edges in the English sentence. So according to our definition, their transferring weights equal the weights of their corresponding edges in the English parsing model. Here we use different colors to distinguish the different edges in the different equations. But the two red edges are not aligned to any edges on the English side, so for these two edges the transferring weight function equals the weight of their delexicalized form. For example, for this edge, we keep only the part-of-speech tags and drop the other lexical information. The optimization algorithm we use to optimize our objective function is limited-memory BFGS, and to calculate the objective function and its gradient we use the inside-outside algorithm, whose complexity is O(n^3). Next I will show the results of our experiments. The Treebanks used in our experiments are the Google Universal Treebanks, both version 1 and version 2, and the Treebanks from the CoNLL shared tasks. For parallel text we used the Europarl Corpus version 7 for European languages and the Kaist Corpus for Korean. The unlabeled data comes from the training portion of each Treebank. For word alignments we used GIZA++, and for part-of-speech tags we used the universal part-of-speech tag set proposed by Petrov et al. 2011, tagged with the Stanford POS tagger. These are the systems we ran in our experiments for comparison. DTP is the direct transfer parser proposed in McDonald et al. 2011. We re-implemented it, and the results of our re-implementation are marked as such. PTP is the projected transfer parser, and we also re-implemented it and marked it likewise. The −U and +U systems are two versions of our approach: −U trains on only parallel data, and +U trains on both parallel and unlabeled data.
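Before the results, here is a minimal sketch of the transferring weight function defined above; the edge record and the two weight functions are hypothetical placeholders, not the authors' implementation:

```python
from collections import namedtuple

# A target-side edge: head/dependent token indices plus their POS tags.
Edge = namedtuple("Edge", "head dep head_pos dep_pos")

def transfer_weight(e_t, align, english_weight, delex_weight):
    """Transferring weight for target edge e_t.

    If both endpoints of e_t are aligned to English words, reuse the
    English parsing model's weight for the corresponding English edge;
    otherwise back off to the weight of the delexicalized form of e_t
    (POS tags only, lexical information dropped)."""
    head_s = align.get(e_t.head)  # aligned English position, or None
    dep_s = align.get(e_t.dep)
    if head_s is not None and dep_s is not None:
        return english_weight(head_s, dep_s)
    return delex_weight(e_t.head_pos, e_t.dep_pos)
```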
Oracle is the fully supervised parser, which can be regarded as an upper bound. Here are the results on the Google Universal Treebanks version 1. The left three columns are the baseline systems, the two columns in the middle are the two versions of our approach, and then there is the oracle. You can see that our approach is significantly better than all the baseline systems across all five target languages. And here are the results on the Google Universal Treebanks version 2, and we see the same thing as with the results on version 1. We also ran experiments on different amounts of parallel text. This figure shows the UAS on German. Here the X axis is the number of parallel sentences and the Y axis is UAS. The blue bars are the results of the projected transfer parser, and the orange and green bars are the two versions of our approach. For other languages we obtained the same observations. We also ran experiments on the Treebanks from the CoNLL shared tasks. Here are the results for each target language. Here DMV is the unsupervised dependency parsing model, the four columns in the middle are the baseline systems, here is our result, and Oracle is the oracle. You can see that our system achieves the best results across all eight target languages. For the experiments on the CoNLL Treebanks we also compared our system with more systems. We compared with weakly supervised systems: the phylogenetic grammar induction model, the unsupervised parsing model with non-parallel multilingual guidance, and the posterior regularization approach. These are all evaluated with UAS on sentences of length 10 or less, excluding punctuation, from the CoNLL shared-task Treebanks. Here the four columns in the middle are these comparison systems, and next to them is our approach. We can see that our approach achieves the best results on most of the target languages. Conclusion and future work. In our work we proposed an approach to dependency parsing for resource-poor languages which do not have labeled training data, and our approach can be used in a completely monolingual setting. Our approach achieved better parsing performance on three data sets. For future work, first, we can extend our approach to non-projective parsing by replacing the Shannon entropy with Renyi entropy. Another direction is to extend the transferring weight to higher-order parts by defining transferring weights that involve multiple edges. And another direction is to replace the parallel text with editing data. So, that's it. Thank you. >> Alias: Any questions? Actually, I have one question. You mentioned at the beginning this parameter gamma that is trading off between the two probabilities. So I was wondering how you>> Xuezhe Ma: [inaudible] gamma? >> Alias: How you frame it, or how you set the parameter. >> Xuezhe Ma: Yes. For the parameter gamma, since we assume that we have no labeled data, we do not want to tune the gamma using a development set. So in our work we chose the gamma according to the ratio between the number of tokens of the parallel data and the unlabeled data. So, for example, if the parallel data has 10,000 tokens and the unlabeled data has, for example, 10 times the parallel data, then we choose gamma = 0.1. >>: Very interesting work.
I was wondering if you happened to look at the amount of improvement you get as a function of the specific target languages you used, in terms of how free their word order is compared to, say, English, which you probably used. Is there any correlation with some characteristics of the target languages, like free word order or not, or the extent of the free word order they have compared to English? >> Xuezhe Ma: I do not understand the meaning of free order. >>: So in English, for example, the word order is very rigid. This is how you know, basically, subject versus object, because the subject will come before the verb and the object will come after. In other languages you might have other cues, like morphological inflections and so on. Maybe it doesn't matter for your approach, maybe it does. I was just curious to see if you looked at that. >> Xuezhe Ma: I think, you know, we have the experiments on Korean. I think Korean and English have different word orders. And we can see our results on Korean here. So for Korean we can see that if you use the direct transfer parser then the parsing accuracy is pretty low, but if we use our approach we get a significant improvement. So I think the effect of word-order differences shows up in the quality of the alignments, so worse alignments may lower our accuracy. But with our approach it does not seem to be a big factor. >> Alias: A super quick one. >>: So if you make the word alignment worse, would the Ryan McDonald model do better? Because they have a subsequent training step that doesn't entirely trust the word alignment, right, if I remember correctly. But in your case>> Xuezhe Ma: In the McDonald work, the [inaudible] parser. >>: Yeah. >> Xuezhe Ma: I think, I have not thought about this, but you know McDonald's work is a kind of weakly supervised learning. So if the word alignment were worse then yes, of course, it would lower the parsing accuracy, but I don't think that this would be a bigger factor for our approach. >> Alias: So let's thank our speaker again. So the second paper for this session is graph-based posterior regularization for semi-supervised structured prediction. And it's presented by Luheng He. >> Luheng He: Can you hear me? Yes? It's working? >> Alias: Is the mic on? >> Luheng He: It's on. Okay. Hi. I'm Luheng. I'm a student at UDub. Today I'm going to tell you about how to use graph propagation to help semi-supervised part-of-speech tagging, and especially how we did it by simply optimizing a joint objective function. This is joint work with Jennifer Gillenwater and Ben Taskar, and this work was originally presented at CoNLL 2013. Here is an overview of my talk today. First, I'm going to talk about structured prediction, in particular conditional random fields with their application to part-of-speech tagging. And then I will talk about graph propagation, which is an alternative approach to part-of-speech tagging; it's a semi-supervised, instance-based method. And finally I will talk about how we successfully combined these two very different methods into a single joint objective function that can be efficiently optimized. So I guess most of you are familiar with part-of-speech tagging. Basically for each word in a sentence we want to assign to it a syntactic category, such as a determiner or a verb. So for each instance we have the input X, which is simply the sentence, and the output structure Y, which is the sequence of part-of-speech tags.
The conditional random field is a standard technique for doing part-of-speech tagging, and its tagging prediction is based on the feature function F. This is defined on each of the local factors in the output structure Y. It usually contains the current tag Y_t, the previous tag Y_{t-1}, and the input X. And we also have a set of parameters theta, which are the feature weights. So with theta and the feature function F we can compute the potential scores for each of the local factors, and therefore we can model the conditional distribution over all possible tagging sequences given an input sequence X. In particular, P_theta(Y | X) is the conditional probability of a particular tagging sequence Y given an input X. It is computed by taking the product of all the local factor scores and normalizing to sum up to one. Therefore, learning the conditional random field model can be formulated as an optimization problem where we want to choose the best parameters theta to minimize the negative log-likelihood of all the labeled sentences X_1 through X_L. Different from the conditional random field, graph propagation is an instance-based method, and it tries to predict one tag at a time without considering any structural information. In our particular setting each input is a trigram in a sentence and the output is the part-of-speech tag of the center word in the trigram. The basic assumption is that if two words frequently co-occur in similar contexts, then they should be assigned similar part-of-speech tags. So to apply this assumption we start by building a sparse K-nearest-neighbor graph using all the trigrams that occur in the corpus as nodes, and for each pair of trigrams we compute the distributional similarity using the words that are in the context of the trigram and inside the trigram. And for trigrams that are similar enough we add an edge between them, with edge scores representing the similarity scores. So if a trigram ever occurred in a labeled sentence we consider the corresponding node as labeled; otherwise it's unlabeled. Then we can propagate information from those labeled nodes to the unlabeled nodes, and this gives us a tagging distribution over the center word for each unlabeled trigram. Formally, this tagging distribution is determined by choosing the one that minimizes its difference with its neighbors. A final step is to sum over the entire graph and write a quadratic penalty term. We call it a Graph Laplacian Regularizer. This regularizer forces the tagging distributions for similar trigrams to agree with each other. So again, graph propagation can be formulated as another optimization problem where we want to choose the best tagging distribution to minimize this Graph Laplacian Regularizer term. Now take a step back. We talked about two very different methods to do part-of-speech tagging. The first one is the conditional random field, which is a supervised method that learns from labeled data; it models the conditional distribution over all possible taggings given an input sequence, and it's parameterized by theta. And we also talked about graph propagation, which builds a K-nearest-neighbor similarity graph from labeled and unlabeled data and propagates information across that similarity graph. It models the tagging distribution of the center word for each unlabeled trigram. In the next few slides I'll talk about how we can combine these two methods. There is some prior work that tries to do this. The work that's most related to ours is by Subramanya et al., EMNLP 2010.
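A rough sketch of the Graph Laplacian Regularizer just described, with hypothetical data structures rather than the paper's code: it sums, over every edge of the similarity graph, the similarity-weighted squared difference between the two endpoints' tagging distributions.

```python
import numpy as np

def graph_laplacian_regularizer(q, edges):
    """q: dict mapping trigram id -> tag distribution (numpy array).
    edges: list of (u, v, sim) with sim the similarity score.
    Returns sum over edges of sim * ||q[u] - q[v]||^2, which is small
    exactly when similar trigrams have similar tagging distributions."""
    total = 0.0
    for u, v, sim in edges:
        diff = q[u] - q[v]
        total += sim * float(diff @ diff)
    return total
```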
Subramanya et al. used an iterating method. They first trained a conditional random field from the labeled data, then used the CRF prediction to initialize the graph propagation, and then used the graph propagation results to provide additional training data for the CRF, and this goes on for many iterations. So this work is [inaudible] successful, but it also raises several questions, such as: is it optimizing some joint objective? Or, is it guaranteed to converge? And these questions actually motivated our work to build a joint framework that has nice guarantees. A first step to do this is to introduce a set of auxiliary variables Q. You can consider Q as a copy of the conditional distribution P_theta. In particular, Q_i(Y) models the conditional probability of a particular tagging sequence Y given the input X_i, and we make two further assumptions about the conditional distribution Q. One, because Q is a probability distribution, it should be normalized to sum up to one for each sentence. And second, we assume that Q can be decomposed exactly the same way as P_theta is, so we can write each Q_i(Y) as the product of a bunch of local factor scores. We call them R, and later we'll see that these assumptions are crucial for our efficient optimization. Now with the auxiliary distribution Q we can write our joint objective function. It consists of three parts. The first part is the Graph Laplacian Regularizer, which is defined on the distribution Q, and the tagging distribution for each trigram is computed by normalizing Q with respect to each trigram and each tag. The second part is simply the objective function of the conditional random field, defined over P_theta. If we optimized the first two terms separately we would get two very different distributions that capture different information that's helpful for part-of-speech tagging. So we have the final term, which minimizes the KL divergence between the two distributions, Q and P_theta. Basically this term forces Q and P_theta to agree with each other, so that the final prediction can benefit from both the graph propagation information and the conditional random field information. So the final question that remains is how we can optimize this joint objective efficiently. We have two sets of parameters, Q and theta. Theta, remember, are the feature weights in the conditional random field. They are completely unconstrained, so they can take any positive or negative value, so optimizing for theta is very easy: we just do a straightforward gradient descent method. However, updating Q is more complicated, for two reasons. The first one is that Q is constrained to sum up to one. And second, there's no compact representation for Q. Remember that for each sentence X_i the number of all possible taggings Y equals the number of tags raised to the power of the sentence length. So basically there is an exponential number of components in the distribution Q, and it's simply impossible to memorize or update that many variables. However, we still want to do this. Remember we made the assumption that Q can be decomposed as a product of local factors R. So it's natural to think: what if, instead of updating those Q variables, we maintain and update those R variables, of which there are far fewer? But consider: if we do an additive gradient update on Q, then the resulting Q prime will no longer decompose as a product of local factors. Therefore, we need to use an algorithm called exponentiated gradient descent.
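A minimal sketch of the exponentiated gradient idea described next (hypothetical shapes, not the paper's implementation): instead of adding the gradient to Q, each local factor is multiplied by the exponential of its share of the gradient, so the product form is preserved.

```python
import numpy as np

def eg_update(r_factors, grad_factors, eta=0.1):
    """Exponentiated gradient step on the local factors R.

    r_factors: list of positive numpy arrays, the local factor scores.
    grad_factors: the objective's gradient, decomposed into one array
    per local factor (same shapes as r_factors).
    The multiplicative update r * exp(-eta * grad) keeps every factor
    positive, so Q still decomposes as a product of local factors;
    Q itself is then renormalized with forward-backward."""
    return [r * np.exp(-eta * g) for r, g in zip(r_factors, grad_factors)]
```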
So instead of an additive gradient we do a multiplicative update, which means we take the gradient term, take its exponential, and multiply it into the original Q, giving the updated Q. When written in this form we can see that both the original Q and the gradient term can be decomposed into a set of local factors. We skip the details here, but basically we can write the gradient term as a bunch of local factors and distribute those local factors onto each of the R variables. This means that instead of updating all the Q variables we only need to maintain and update those R variables, and then compute Q using a forward-backward algorithm. So to summarize, we have two sets of conditional distributions, Q and P_theta, and we have this joint objective function. We can optimize it using an EM-style algorithm where at the M step we update theta, and at the E step we update each local factor in Q, which are the R variables, and then we normalize Q by doing a forward-backward pass. And this algorithm is guaranteed to converge to a local optimum. Now we come to the experiments. We tested our algorithm on 10 different languages from the CoNLL data sets, and we tested in a weakly supervised setting, which means for each language we use only 100 labeled examples. We used the universal part-of-speech tag set, so it's the same tag set for all the languages, and for the base model we used a second-order conditional random field. Here are our results. On the X axis are the 10 different languages, starting from English, and the rightmost bar is the average over all the languages. The bars show part-of-speech tagging errors, so the lower the better. The first baseline we're comparing against is the graph propagation result, which comes from simply minimizing the first term in our objective function. And the second baseline is computed by training a conditional random field from the graph propagation result for one single iteration. It has some improvement, but not much. And the third baseline is simply training a conditional random field using the 100 labeled examples. It actually outperforms the previous baselines for some languages. That tells us two things: one is that structural information is really important; the other is that graph propagation provides additional information as well as additional noise. So finally, we have our joint objective, which combines the power of the previous two methods. It outperforms the previous baselines for all languages consistently. It has a 28 percent relative error reduction on average. So as a conclusion, we want to use graph propagation to help weakly supervised structured prediction. In particular, we formulated this graph propagation method as a regularizer term and encoded it in the joint objective function, and we also proposed an efficient algorithm for optimizing this joint objective. It is interesting to think about what other kinds of posterior constraints we could use to help structured prediction. We can use any constraint as long as it can be written as a convex and differentiable regularizer term. Our code can be found at the link below. You can try it out and play with it. Thank you for your attention, and I'm happy to answer questions. >> Alias: Any questions? >>: So am I right in thinking that you are forcing the graph to do a prediction on every single I in the chain? >> Luheng He: Not exactly. It's all the trigrams. So if two positions I have the same trigram, they are combined into the same node.
>>: But in your joint objective could you use the CRF for known words? I guess another way to ask the question: the graph in the Subramanya work could be sparser. It doesn't have to predict on everything, right, because they take the average. They sort of do self-training in order to combine the prediction of the graph and the CRF. But since you have a full joint model>> Luheng He: Yeah, but the graph regularizer is still based on the tagging distribution for each node, and the node is marginalized and normalized, basically taking the average over all of the labeled data. Am I understanding it right? >>: I mean, you could use the Subramanya method without forcing the graph to label every single word in your training data, right? You could use it for, let's say, unknown words. But in this one you have to do it on all, every single>> Luheng He: Yeah. The graph contains information about all the tokens. >>: Okay. I just wanted to clarify. It's just a question about the model. I just wanted to understand it better. >> Luheng He: Thank you. >>: So I was just wondering, how fast do the EM steps converge? >> Luheng He: We ran it for 20 iterations. >>: Is that different by language? >> Luheng He: For larger languages it's slower, of course. >> Alias: You have time for a quick question; otherwise let's thank our speaker again. All right. Okay. Our next and final talk for today, we kept the best for the end, is multi-metric optimization using ensemble tuning, and Anoop Sarkar is presenting. >> Anoop Sarkar: So you know that two out of the three co-chairs are my students, because they put me as the last talk. This is their way of getting revenge on me. So I'm going to talk about this work that was done primarily by my student, Baskaran, who couldn't be here today. So I'm presenting on his behalf, and also on behalf of Kevin, who is in Japan. He's a collaborator on this work, and it's about multi-metric optimization. So I'm going to start with a caricature of how we do discriminative training in MT, and this has two objectives. If you already know how it works, then it will introduce you to the caricature I'm going to use to explain what we did; and if you don't know how MT tuning works, this will give you a distorted view of what it does, and at least enough to hopefully understand our contribution. So we start with a French sentence there, and an English sentence here, which is the translation of the French sentence. This is a machine translation decoder, which is going to produce what the machine translation system currently thinks is the translation of that French sentence. So you can get an n-best output for this French sentence, and we are trying to tune the parameters of the model. So what we want to do is to use these n-best lists in order to train the parameters of the model; we have a bunch of knobs that we want to turn in order to make sure that the translations we get in that n-best list match the actual translation of F, right? So we might get something wrong, and we want to tune these knobs so that in the next iteration we use a better weight, and that better weight vector goes back into the decoder, and hopefully we get better. So this is our tuning loop. We are training our system to produce better machine translation outputs. So the question is how to choose this weight. Well, this weight is something that's going to match an extrinsic score, so a loss function of some kind. And that loss is based on some metric.
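A caricature of that tuning loop in code, to make the moving parts concrete; all function names here are hypothetical placeholders, not a real decoder API:

```python
def tune(weights, source_sents, references, decode, optimize, iterations=10):
    """MT tuning caricature: decode n-best lists with the current
    weights, then pick new weights so that hypotheses scoring high
    under the model also score high under the extrinsic metric."""
    for _ in range(iterations):
        # n-best hypotheses for every source sentence under current weights
        nbests = [decode(f, weights) for f in source_sents]
        # choose new weights against the metric-based loss
        weights = optimize(weights, nbests, references)
    return weights
```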
A metric is some way of saying this is a good translation or this is a bad translation, and machine translation is notoriously a bit difficult because there's more than one way to translate a sentence, so you don't get a loss that's quite as nice as you usually do in machine learning. So let's look at this weight optimization step. We want to say: if the model likes it and the metric likes it, that's great. Our model score is good, and it also performs well according to the metric. If our model score is really high and our metric says this is bad, then we should be changing our weights. In this case maybe they're both good; in some cases the model score is bad and the metric says it's bad, and that's good too, right? We are correctly predicting that this output should be bad. So here's one particular metric, called Bleu. It basically matches n-grams in the output with the reference. And this is, on the x-axis, the model score, so you can see that this guy and this guy are kind of evil twins, right? This one has a high model score and is really good on our metric, so the higher the better, and this one has a really good model score but is terrible on the metric. So what we really want to do is say: here are two things that are getting high model scores; prefer this guy over this guy. So you can kind of replace this complicated machine translation loss function with something that is basically binary classification. This is the idea behind PRO. So the question is: is this metric the most suitable one? It's happening less often for some reason, but every time you go to an ACL or an EACL, somebody in the audience will get up and say, I hate the Bleu score. When is the Bleu score going to end? And lots of people have tried many, many replacements for the Bleu score. You can take your pick. There's a whole bunch of different metrics, so each of them can be used to choose a weight vector that matches one of these extrinsic scores. But when you look at every single paper on machine translation, we optimize using Bleu. It's the Highlander of machine translation metrics; there can be only one. It has killed everything else. And it is true; it does work better. So people have tried other metrics; Bleu just seems to be the best one. But in this work we want to say maybe you don't have to choose one. Maybe we can be sort of anti-Highlander and have many people live. So in this work we're actually going to look at different metrics and see how we can use them all, in order to tell us what's a good translation and what's a bad translation, all at the same time. So we have the Bleu score, with its n-gram matches. Then there's Meteor, [inaudible] favorite one. And it's like Bleu; so Bleu essentially captures recall in a funny way, while Meteor captures precision and recall and also has some synonym and stem matching. There's one that you might not have heard of called RIBES, and RIBES is really good at figuring out whether things are out of order or not. So it's very popular for Japanese-English translation, because it measures out-of-order matches and does it using Spearman's correlation coefficient. And finally, there's a good old one from the speech recognition days, which is TER. This is translation error rate: insertions, deletions, and block shifts. Okay. So these are all different reasonable ways to measure how different your hypothesis is from the reference. So what we're going to do is use them all at the same time. So let's define what that means.
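As an aside, a minimal sketch of the PRO-style reduction to binary classification mentioned above, with hypothetical types; the real PRO algorithm samples hypothesis pairs and trains a classifier on their feature differences:

```python
import random

def pro_pairs(nbest, metric, n_samples=50, min_gap=0.05):
    """Turn an n-best list into binary classification examples.

    nbest: list of (features, hypothesis) for one source sentence.
    metric: maps a hypothesis to its extrinsic score (e.g. sentence Bleu).
    Returns (better_features, worse_features) pairs: a classifier is
    trained to rank the metric-preferred hypothesis above the other."""
    pairs = []
    for _ in range(n_samples):
        (fa, ha), (fb, hb) = random.sample(nbest, 2)
        ga, gb = metric(ha), metric(hb)
        if abs(ga - gb) < min_gap:
            continue  # skip pairs the metric can't distinguish
        pairs.append((fa, fb) if ga > gb else (fb, fa))
    return pairs
```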
We want to find the weight vector that's good according to multiple metrics. So this is kind of a [inaudible] way to say that: find one W that simultaneously optimizes many different objectives. Each metric is treated as an objective over the hypotheses, and G is the way to combine them. So one way to do it is, at the point when you choose a weight vector, to simply take two things, like (TER minus Bleu) divided by two. Just combine them together, and now you have a linear combination of two different metrics. And this has been done before; this is going to be our baseline. One problem with this is that if you want to combine them and you want to say, I want to actually put more weight on the Bleu score and less weight on the error rate, you have to do that by hand. There's no automatic way to do it. So we're going to address that issue, and we're going to do things a bit differently. So what is a reasonable way to combine different metrics? Well, one reasonable way is the notion of Pareto efficiency, which says, if you want optimality with respect to different metrics, then you should look at objective A and objective B and look at this Frontier. So these points here don't matter; they're worse according to both metrics. The points on the Frontier are the ones we should be looking at. So this is the idea we are going to use. There's previous work by Kevin that actually just modifies PRO to push the points toward the Frontier. One problem with that previous work was that there was no way to actually take advice from all the different objectives at the same time. And that's another problem we solve in this work. We introduced four different ways of doing it, and I'm just going to present our main result, due to lack of time, and I'm going to explain what it does using my caricature of tuning. We call it ensemble tuning. And the idea is the same as before. You get your n-best list, nothing different here. Now if you have, for example, Bleu as one of your objectives, you do Bleu weight optimization; you turn the knob so that you get the best weight that is going to optimize Bleu. And let's say the other metric is Meteor; you do that as well, and you get another weight vector. So you get two different weight vectors, W_1 and W_2, and they both have different views on what the weight vector should be. And now what we do is we combine these using meta-weights. The meta-weights say how much should I trust Bleu and how much should I trust Meteor, and the meta-weights are trained so that the points we pick, which are combinations of these two, lie on the Pareto Frontier. So it will reward things that are on the Pareto Frontier, and everything under that Frontier will get penalized. Now there's one issue here, which is that when I combine these two I still want one model. I don't want a Bleu model and a Meteor model, because now I have two problems, not one. Before, I at least had one problem. So I need a way to combine the predictions of the Bleu model with the Meteor model in this case. So how do we deal with these multiple components? That's the second problem we solve: you don't have to pick and choose. The usual way people use Pareto optimality is to say, well, somebody else is going to pick one of these points depending on the order you want.
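A small sketch of the Pareto idea, for concreteness (assuming higher is better for every metric; this is an illustration, not the paper's training procedure): a point is on the frontier if no other point dominates it, i.e. is at least as good on every metric and strictly better on at least one.

```python
def dominates(a, b):
    # a dominates b: a >= b on every metric and > b on at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(points):
    """points: list of metric-score tuples, e.g. (bleu, ribes).
    Returns the non-dominated points, the candidates worth rewarding."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Example: only (0.32, 0.81) and (0.30, 0.84) survive.
print(pareto_frontier([(0.32, 0.81), (0.30, 0.84), (0.29, 0.80)]))
```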
So if you really want Bleu scores, you will pick one point; if you really want Meteor scores, maybe you'll pick another one. But we don't actually have to make that choice. We combine the weights using ensemble decoding, and the idea behind ensemble decoding is very simple. We have weight one and weight two, one for Bleu and one for Meteor, and whenever we make a prediction we actually combine them together to make the prediction while we're decoding. So basically, we can take as many weight vectors as we want and produce something that will be the combination of these. And we just do that all the way up until we get a translation, and each translation is going to be a combination of these two weight vectors. Okay. So we implemented this ensemble decoding idea, as well as the Pareto weight training, in our favorite decoder, which we wrote, and which is available on GitHub if you want to use it, if you're brave enough. We used fairly standard large-scale [inaudible] Arabic-English and Chinese-English, and this is going to be two-dimensional. So this one is RIBES and Bleu, and I'll show you results for other metrics, but let's just look at Bleu on one axis and RIBES, which is another MT metric, on the other axis, and LC. So Bleu by itself is over there, RIBES by itself is over there. You can see that if you optimize towards Bleu you do better on Bleu, and if you optimize on RIBES you do better on RIBES. Those LC points are the linear combination points. This is linear combination plus ensemble. And that point up on the right is Bleu, RIBES, and TER altogether. And this point over here is Bleu and RIBES together. So you can see that what we want is points that are in the upper right quadrant, right? Things that are good according to both metrics, and ideally better than doing either one or the other individually. Here is Chinese-English, and you can see that as you add more metrics you can get a bit better. It's better on Bleu than just doing Bleu itself. In this case it's not as good as doing Meteor by itself, but you can see that Meteor scores really badly on Bleu. So the question is, are those points up there good? Do you get a good Bleu score? Can you always improve the Bleu score? So here we actually show that with a single objective that's just Bleu, it gets better; with two metrics, Bleu and RIBES does better; and Meteor, RIBES and Bleu, Meteor, TER are both better than just doing Bleu. And then we did some crazy stuff over here, like B3, which is Bleu with three-grams instead of four-grams, and so on. So you can invent as many metrics as you want. And you can see that, in fact, it does better. We also tried to look at whether those points in the upper right quadrant are actually better. Do they actually read better? So we did a human evaluation, which was a post-editing task, and we saw a six percent gain in post-editing error. So it was six percent easier to post-edit with multiple metrics than with one metric. And that's it. >> Alias: Do you have any questions? >>: I just wondered if you found any language differences in preferring one metric over another? >> Anoop Sarkar: So Kevin has done work on Japanese-English, and RIBES helps more for Japanese-English just because of the reordering, and we haven't done anything other than Arabic-English and Chinese-English yet. >>: Do you use a combination of [inaudible]? >> Anoop Sarkar: It's conceivable that you could have some metrics that help more with languages that are quite different. I don't know.
If you had a really good metric for morphologically more complex languages, which doesn't exist, but if you had one, then we could use that in order to improve the Bleu score. >>: Right. So in that case, when you combine multiple metrics, you have to take that into consideration. >> Anoop Sarkar: Yeah, yeah. But I mean, it's an interesting challenge at least. It's not just mindless. I think that's an interesting problem to have. >>: I didn't 100 percent understand why you needed to combine the weights on the fly during inference, as opposed to combining them off-line into a single model. >> Anoop Sarkar: Oh, yeah. It does better. That's what we claim. I agree it's not a given, but we have compared it to linear combination off-line; and ensemble decoding allows you to do different things, so it allows you to, for example, do switching, which you can>>: Oh, I see. So you're not just, okay. >> Anoop Sarkar: It allows you to do different kinds of [inaudible]. So we have different ways of combining the models that you can sort of turn off and on in the decoder. >>: So you're learning combination methods. >> Anoop Sarkar: So [inaudible] thesis is on that. We have a lot of papers on ensemble decoding as our hammer that improves over linear combinations. >>: So I'm surprised that when you optimize both Bleu and Meteor, for example, you can do better on Bleu than just optimizing on Bleu alone. >> Anoop Sarkar: It's not surprising. >>: It's not surprising? >> Anoop Sarkar: Well, you can think of Bleu as being very, very sensitive to length, when actually maybe it shouldn't be. In some cases it might penalize a translation really badly because you hit the brevity penalty, and that leads you into a bad space. >>: So I can see why optimizing both metrics at the same time would do better for humans, but just on the Bleu metric alone>> Anoop Sarkar: It just stops Bleu from making a mistake. I mean, that's a story you could tell. I don't know. We have tried to analyze it in different ways. Baskaran has done some experiments where he on purpose damages one metric and then sees what you can recover from the other metric. It's kind of interesting to see, but those are still artificial [inaudible]. There's an interesting paper by the BBN guys, I think, about putting together a bunch of different features to sort of get Bleu on the right track, as it were, and this might be what's happening here as well. >>: Thanks. >> Alias: So let's thank our speaker again. And I think now we have some closing remarks. >> Anne Clifton: Hi. I'm Anne Clifton. I'm a PhD student at Simon Fraser with Anoop and one of your co-chairs today. So just really quickly, I wanted to thank all of you for participating today and making this the most successful Northwest NLP to date. And in particular, I would like to take a moment to thank our hosts here at Microsoft, particularly Will Lewis, who set all this up for us. So I hope you'll join me. So that's all. Thanks for coming. I hope to see you in 2016. >>: Before you leave, I do want to mention that I'm actually getting almost too much credit here. Anne, Anoop, Maryam, and Yashar[phonetic] did an enormous amount of work, and this honestly is the most successful, by numbers, NWNLP that we've had. If we continue at this rate we're going to have to move to a much bigger venue next time. So hopefully we don't double yet again next time. Let's try to keep it around 300 next time; maybe it will be okay. Thank you everyone.