>> Michael Auli: So today we have Liang Huang here as a speaker, and Liang is currently an assistant
professor at the City University of New York. Before, he was at ISI, and I think before that he was at
UPenn.
>> Liang Huang: I was at Google.
>> Michael Auli: At Google, as well, very briefly. Yes. And Liang is very well known for his work on
large-scale discriminative training, which he will be talking about today, but also on parsing and on
efficient algorithms for machine translation. So without further ado, please join me in giving Liang a
warm welcome.
>> Liang Huang: Thank you. Thank you, Michael. Thanks, everybody, for coming to my talk on a
Friday morning, and I was giving another talk at UW yesterday on parsing and machine learning, and this
talk today will be on the application of those algorithms for translation. And the title sounds very
technical, like max-violation perceptron, blah, blah, blah for scalable blah, blah, blah, but the real kind of
more intuitive version of the title says Large-Scale Lexicalized Discriminative Training for Machine
Translation is finally made successful for the very first time. That last thing I think is the kind of take-home message, the take-home version of the talk. And before I talk about anything technical, I always have a
lot of jokes for translation, and this time I will show these jokes in a way that you can actually tell what
kind of technology is behind it and what kind of error is behind it. So this first one is clearly an ATM
machine in China, but the sign reads Help Oneself Terminating Machine. But if you look at the gloss, it's
actually not that bad. If you read Chinese, so [indiscernible] came in, it's not that bad. So it's self-help
terminal device. But it's not something that you can help yourself to terminate yourself, so this means
that translation has to be done in context, and ideally, it should be done with understanding on the source
language. So this is basically a word-based translation, and I think from a very old online website or
whatever. That's even before phrase-based translation. Maybe it's rule based. I don't know. But the next
few examples I think are phrase-based examples from either Bing Translator or Google Translation, so
seating reserved for consumption of McDonald's guest only. This is a typical PP problem on the source
side. This is clearly not human translation, right? Nobody is going to be that creative, and please check
out the cashier. That's apparently a phrase-based translation, with very typical phrase-boundary errors, and tons
of stuff like xiaoxin, be careful. In China, people always ask you to do these dangerous activities
carefully in case you have to do it, right? So slip carefully, fall into water carefully, blah, blah, blah. And
if you try them on Google or Bing, you get roughly speaking the same. You get carefully slip for this
one. You get fall into water carefully, something like that. So my rule of survival is that if you don't read
Chinese and you go to China, if you see something like X carefully, where X is a verb phrase, just don't
do it. You'll be fine. And this problem is more interesting. So why is it click here to visit? Do you guys
know why? Because it's trained on web text, and you have tons of click this button, click this link, to
enter something. And this is in a museum, in a Chinese museum, so it should be like go here or follow
this direction, but it's a domain adaptation problem. So that's very typical. And you can actually see -- from these jokes, you can actually debug what's going on behind the scenes. This one, actually, I couldn't
figure out what kind of problem it is, but it's very funny, explosive dog. I couldn't figure out what's the
technology behind it. But my all-time favorite must be this one, translate server error. I guess most
people are already very familiar with. That's the cafeteria in China, and unfortunately maybe the Bing
Translator server was down that day and he didn't know that, and he just put it up. It's becoming the most
famous cafeteria in China, called Translate server error cafeteria. But I like these examples not just
because they are funny, but also because they are the best evidence that MT technology is used in people's
daily lives, because look at this and look at these. Look at these list examples and these examples, and
this one and that one, there's no way that a human being can translate these things, regardless of how bad
his English is. There's no way that he could be that creative. So this is clearly machine translation, and
you can see a lot of problems with machine translation that we work on, like domain adaptation, like
language model, like the syntax and semantics of the source side and stuff like that. So, really, machine
translation is becoming more and more useful and involved in people's daily lives, but its quality is not
good enough, so what can we do? Sure.
>>: I think there's a counterexample. Are you familiar with a book called English as She is Spoke? It's a
guide on how to translate from Portuguese into English, written by a guy who only spoke French and
Portuguese, so humans can make really, really crazy translations if they don't really speak the language.
>> Liang Huang: Okay, maybe there are more levels.
>>: You should check it. There are some great translation examples from it.
>> Liang Huang: Sure, sure, sure. It may be very funny. Maybe next time I will have those examples.
But if you look at these examples, they all involve the Chinese word xiaoxin, which means either be
aware of, be cautious or be careful. But, really, it has to be done in context, so that you know the
syntactic category of the phrase after xiaoxin: if it's a noun phrase, it's like beware of dog, but
if it's a verb phrase, it should be be careful not to do something. So really, you should know the syntactic
category of this word, of this phrase. So for translation, you need context for rule selection. So how do you
encode this knowledge in our translation systems, like, say, phrase-based translation? We often use some
features of the context, like, say, if the next phrase is a noun phrase, then this xiaoxin should be be aware of.
Otherwise, it should be be careful not to. This knowledge can be encoded as context-sensitive features to
guide our rule selection, but how do you train a system with so many features, because you can imagine
you have a lot more features. You have very rich features like what if the next noun has the as the
beginning of the noun and has dog as the head word and blah, blah, blah. You can have millions of
features like this, right? So we have to do this discriminative training with so many features, and
discriminative training has been a difficult task, a central problem in machine translation, and it started
with MERT from more than 10 years ago and then Percy Liang did standard perceptron to train it on a
training set, which is much larger than a dev set, which I think is a good direction, but it failed miserably.
It didn't work out. But then people completely abandoned this line of work and switched back to the dev
set, and you have MIRA, you have PRO, you have HOLS, you have many others. You have Michel
Galley's work on regularized MERT and other variations of MERT, which works better than MERT, for
sure, but they're all trained on the dev set. And if you just train it on the dev set, you cannot afford to
have many features, because the dev set is really too small, like 1,000 sentences. If you see a feature
combination only once in these 1,000 sentences, it's very unlikely that you'll see it again on the test set.
So you've got a data sparseness problem here, so really we should get back to that direction to train it on a
training set, so that you can have millions of features and so on and so forth. But it's been so hard that
nobody is following up on that line. So finally, after seven years, we did it successfully, using a different
version, using kind of a specialized perceptron, which is designed for search-error problems, designed for
a problem with heavy search errors, because MT is all about search error. In search, like phrase-based
translation and syntax-based decoding, the search space is just humongous. And you have to use beam
search, you have to use pruning, you have to use a lot of approximate search methods to make it tractable,
but those methods unfortunately introduce a lot of search errors. And our learning algorithms, like the
perceptron, blah, blah, blah, they don't deal with search errors that well. They are not designed to handle
search errors. So my work, why it succeeded, is because we are the first ones to learn to accommodate
search errors. In a sense, we want to live with search errors. We cannot get rid of search errors, because
we use the same search, same decoding algorithm in both training and testing, and your same beam size,
for example. So the search quality is fixed. So search for us is fixed. You cannot even improve it,
because unless you can increase the beam size or use a much better search algorithm, the search quality
stays the same. The only thing you can do is change the learning to accommodate the search, to be robust
to the search errors. So that's our contribution, so we changed the learning algorithm to adapt to search
errors so that we can train our stuff on something really fast but really bad search, like phrase-based
translation. Really bad search. It's almost like greedy search. Okay, that's kind of the general idea. And
why doesn't the standard perceptron work out well? It's because, as I said, the theory is based on exact
search. It assumes that your search is kind of perfect, but MT has such a huge search space, and as I said,
full updates like perceptron, MIRA, PRO, they all do full updates in the sense that they always update on
full sequence. It doesn't deal with search errors. So what we should do, we should have some kind of
thing, mechanism, to address the search errors in the middle. Question?
>>: What if people had -- I mean, so you're assuming that the problem really is a huge amount of search
error. There are people who have claimed that if you run with very wide beams, etc., etc., you see that in
fact --
>> Liang Huang: You get better search quality.
>>: But search error is not necessarily such a huge issue.
>> Liang Huang: I will convince you at the end of this talk. We have statistics, we have plots to
convince you that even if your beam size increases by a lot, it doesn't help.
>>: What about the noisy training data? So one of the reasons why distributive training might have a
problem is you're trying to fit this training data, and the training data can be ridiculously bad.
>> Liang Huang: Yes, so I have another kind of small method to address that problem, which is forced
decoding. I will talk about that in a minute, but that's the kind of byproduct. The main idea is to address
search errors, and the argument is that the original complexity of phrase-based translation, for example, is
exponential, 2 to the N, N-squared, something like that. And you shrink it into linear time beam search.
You pay a huge cost. You've got the speed, but you sacrifice the search quality by introducing a huge
amount of search errors. Now, you double the beam size, it doesn't help that much. You make the beam size
10 times bigger, it doesn't help that much. The complexity is really 2 to the N. You can't fix it unless your
beam size grows exponentially. Otherwise, if it's constant, it just doesn't help that much. That's the difference
between your very easy search algorithm, search problem, like part-of-speech tagging. There, you
increase the beam size, you almost get it perfectly correct. Machine translation, it's just impossible. Even
if you do syntax based, you have cubic time and you shrink it into linear time, it just doesn't help you that
much. So my argument is just you can't fix that much search errors, even if you have a very large beam,
at least for phrase based. Okay. Okay, then our whole point is we want to use some partial updates or
prefix updates up to the point of search errors. Not all the way to the end, because if you just do full
updates, it just doesn't address the problem in the middle, so really you should focus on somewhere in the
middle, where the search is so bad. That's our intuition. And then we used forced decoding as a guidance
to update towards. That we will talk about in detail in a minute. And the end result is that we scale to a
very large portion of the training data, and we can use more than 20 million sparse features. I think that's
the largest size in the literature in the online learning fashion, and we've got more than two points of
BLEU over MERT and PRO, so that's the final result. So let's see, how can I deliver that story? So I'll
first discuss structured classification with latent variables as a model to train MT, because in MT we have
input and output in the parallel text, but we don't have the derivations annotated. How do you get from
the input to output? It's completely hidden, and that's the latent variable, and we used forced decoding to
address that, so we will use phrase-based translation as an example and use forced decoding to compute
latent variables. And a central piece of this talk is about how do you learn to accommodate a huge
amount of search errors. That's the new learning algorithm, and we use some new update strategies
like early update and max-violation update, and we designed some rich feature sets to kind of learn from
the data to be context sensitive, and we have experiments to come. Okay, so the whole story of structured
learning, I would start with structured perceptron, because that's by far the simplest algorithm for
structured prediction. It's much simpler than CRF or structured SVM and stuff like that. So if you
understand structured perceptron, it's enough. So this is extended from binary classification and binary
perceptron, so structured classification is just like you have input, and the output could be millions of
classes. You can imagine it's still classification, but just the output is so many classes. And it looks like
the exact same architecture with ordinary perceptron, except that this box currently is much harder than
this box, here, because it used to be just two classes or 10 classes, and it's trivial. Now, you have
exponentially many classes for each input, so we often use dynamic programming, CKY or whatever, like
phrase based, you have dynamic programming, but they are still too slow. They are still too slow. So
what we often do is to -- because this is exponentially large, we often have an inexact or approximate
inference box or bad search box, like greedy search, beam search, to replace the exact inference, which
we cannot afford. And that will have a detrimental effect on the learning part, because the learning is
really not designed to handle inexact search. Okay, so there are two challenges here to apply this story to
MT. One is the inference box is too slow, so we have to do approximate, but another problem is that the
correct derivation is also hidden, which is the latent-variable part, which I'll talk about in the next slide.
Okay, so how do you get from the input to the output that's hidden? Okay, so we have to kind of extend
perceptron a little bit to introduce latent variables, to handle latent variables, and that's actually found in
previous work by Percy Liang and other people. So let's say we have this training example in our training
data [indiscernible] is Chinese input, and this is the man bit the dog. And at training time, during online
learning, the perceptron, you try to decode or try to translate this input using your current model. And in
the full search space, you find the highest-scoring derivation according to your current model which will
lead to a translation, the dog bit the man, which is different from the reference translation, and you realize
that, oh, I made a mistake. I should update. And how do you update? Well, normally speaking, you have
something to update towards, the positive signal here, but currently you don't have a positive signal,
because you don't know the derivation. Now, what do you do? Because there are millions of ways that
you can translate this input to this output. Which one should you prefer? Well, the simplest thing to do is
you prefer the one that is scored highest by the current model. So you do a forced decoding, so that space
is a much smaller subset of the original full space, and every single derivation in the forced-decoding
subspace produces exactly the same reference translation, and you just search for the one that is highest in
score according to the current model, so that's our positive signal. So we just do an update to reward that
derivation, the positive derivation, and penalize the wrong derivation. So that's just a reward correct and
penalize wrong. That's just like normal perceptron, except that this part, you have to do another decoding,
which is called forced decoding, which is on a much smaller space, and this decoding is the original
decoding or unconstrained decoding, or the real translation decoding. Okay, so that's the main idea. The
problem, though, is that we cannot afford to do a full search. Now, we can only do a beam search here,
which is very narrow and very likely the correct translation, the highest-scoring correct translation falls
off the beam very early on, very easily. In that case, if we just update that way, it just doesn't work. So
because there are search errors here, you should really address some search errors here, so the next time
the model would guide the search so that the blue guy, the blue dotted derivation, would not fall off
that early, would actually survive the search much longer. So, likely, if the blue derivation survived the
whole search, then you would be able to produce a correct translation. So that's the whole idea, to address
the problem of the beam search and the errors introduced by beam search, and how would you do it? You
cannot do a full update. You cannot do standard perceptron, so we have to do something new. But before
I talk about that learning part, I'll first give you the kind of brief intro about phrase-based translation.
Sure.
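As a rough illustration of the update just described, here is a minimal Python sketch of one latent-variable perceptron step. The hooks beam_decode, forced_decode, phi and output are assumed placeholders for the decoder and feature extractor, not the speaker's actual implementation, and this is the plain full-sequence version before the prefix-update refinements discussed later in the talk.

    def latent_perceptron_update(w, source, reference,
                                 beam_decode, forced_decode, phi, output):
        """One full-sequence latent-variable perceptron step: decode with the
        current model, force-decode toward the reference, and if they disagree,
        reward the best correct derivation and penalize the model's 1-best.
        w             : dict mapping feature name -> weight (modified in place)
        beam_decode   : (source, w) -> model's best derivation in the full space
        forced_decode : (source, reference, w) -> best derivation reproducing
                        the reference exactly, or None if the pair is unreachable
        phi           : derivation -> dict of feature counts
        output        : derivation -> the translation it produces
        """
        best = beam_decode(source, w)               # 1-best in the full search space
        gold = forced_decode(source, reference, w)  # best correct derivation, or None
        if gold is None:                            # sentence pair not reachable: skip
            return False
        if output(best) == reference:               # already correct: no update
            return False
        for f, v in phi(gold).items():              # reward the correct derivation
            w[f] = w.get(f, 0.0) + v
        for f, v in phi(best).items():              # penalize the incorrect 1-best
            w[f] = w.get(f, 0.0) - v
        return True

The prefix-update variants discussed later change only where this comparison between the correct and the incorrect derivation is made, not the update rule itself.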
>>: Do we have such error in forced decoding, as well?
>> Liang Huang: You could, you could, but right now we do an exact search for forced decoding because
the search space is much smaller, because it's constrained to produce the exact output. So for now, I think we
don't need much pruning here. Maybe a little bit, tiny little bit, but here you use tons of searches.
>>: Do you leave one out in your forced decoding space?
>> Liang Huang: You could. You could just use the rules extracted from other sentences. We do that for
small data sets. For large data sets, it's not that important, so we just leave out the one-count rules.
Otherwise, it would memorize the sentence too easily. But we could have some search errors here. I
think that's okay. That's okay, as well. Very good question, so any other questions? Good. Now, I will
do a brief intro about the searching in phrase-based decoding, which I think for this audience, I don't need
to do those slides, but I just quickly want to go through it. So phrase-based decoding, you have states like
this, which just says no words are covered, and then you can cover the first word, and then you can cover
these last three words and can jump around, and that's why it's 2 to the N in exact search. At least the
state space is 2 to the N, right? It's just like traveling salesman. You have to cover each word once and
only once, and so on and so forth. And that's one derivation, and you can have other derivations, and you
have a graph, and there are many paths from the beginning to the end. That's why it's exponential. Now,
that's not the full story. The full story has language model in it, so you have to split each state in the
original space by adding the last word being translated, if you have a bigram model, so Bush -- and
these three states used to be the same state. Now, they are three different states, because they have
different last words being translated. So that you can add the language model cost when they extend to the
full translation and so on and so forth, so you have a lot more states than before, after you introduce the
language model. The space gets even bigger, but either way, it's at least exponential. So to make it
tractable, we use beam search to make it linear time, so that's why it has a huge amount of search errors.
So at each step, you allow like, say, five guys. The beam size may be five or 10, and this five is all these
guys are covering one word, all of these guys are covering two words and so on and so forth. That's
what we're doing in practice. Now, from decoding to forced decoding, you are basically trying to say,
what if I have a much smaller space, constrained by the constraint that you have to produce exact output?
Now, it's much easier to search, because you can suddenly prune away all these guys which kind of
violates this constraint, so you can only have talks here. If you have meeting or talk, which is not found
in reference, just completely delete them away. Then you only have one derivation in this space, but you
could have millions of derivations in the forced decoding space, so actually we store a small lattice. So
the full space is a big lattice. Now, the constrained version, forced decoding, is a much smaller lattice,
but it's still lattice, so you still have exponentially many stuff, like Bush held talks, or Bush held talks in
one rule and so on and so forth. You still have millions of paths in the forced decoding lattice. Okay, this
is assuming that you could produce the exact output. What if you could not? What if you don't have even
one derivation that could produce the output? That actually happens a lot, so I give you this example.
Phrase-based translation has a limit of distortion, distortion limit, where it says you cannot jump too far.
You cannot jump more than four steps in one jump, so that's to make it tractable. So this sentence pair is
perfect. It's like United Nations sent 50 observers to monitor, but then there is a big jump -- to monitor
what? To monitor the first election, which is mentioned last on the Chinese side, and then you go back
and jump back and stuff like that, but this jump is too far. It's five steps, five words, so it's disallowed by
our distortion limit of four, for example, and this whole sentence is not reachable, is not reproducible in
the sense that we cannot even have one derivation that can produce the output. What can we do? But this
sentence is really perfect, perfect translation. It's not bad translation or whatever. This is very literal
translation, so it's a shame that we cannot use the whole data. For now, we have a hack that we just use
the prefix, which is perfectly fine, which uses a prefix pair that United Nations sent 50 observers to
monitor, but not a full sentence. That would help a little bit. That can recover some of the data. Right.
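A reachability check of the kind these statistics are based on can be sketched as a forced decoder that only keeps hypotheses whose output so far matches the reference. The representation below (coverage set, end of last phrase, number of reference words produced), the simplified distortion test, and the phrase_table layout are assumptions for illustration; a real implementation would store the surviving derivations in a lattice, as described above, rather than just test reachability.

    def is_reachable(src_words, ref_words, phrase_table, d_limit=6, max_phrase_len=7):
        """Sketch of forced-decoding reachability for phrase-based MT: does at
        least one derivation, respecting the distortion and phrase-length
        limits, reproduce the reference exactly?
        phrase_table: dict mapping a tuple of source words to a list of
        tuples of target words (an assumed, simplified format)."""
        hyps = {(frozenset(), 0, 0)}          # (covered positions, last end, #ref words)
        while hyps:
            new_hyps = set()
            for covered, last_end, e_len in hyps:
                if len(covered) == len(src_words) and e_len == len(ref_words):
                    return True               # one full correct derivation exists
                for i in range(len(src_words)):
                    if i in covered or abs(i - last_end) > d_limit:
                        continue              # already translated / jump too far
                    for j in range(i + 1, min(i + max_phrase_len, len(src_words)) + 1):
                        if any(k in covered for k in range(i, j)):
                            break             # phrase would overlap covered words
                        for eng in phrase_table.get(tuple(src_words[i:j]), []):
                            # keep the hypothesis only if its output extends the reference
                            if tuple(ref_words[e_len:e_len + len(eng)]) == tuple(eng):
                                new_hyps.add((covered | frozenset(range(i, j)),
                                              j, e_len + len(eng)))
            hyps = new_hyps
        return False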
Okay, so here are the statistics for how many sentences are reachable, or in other words have at least one
derivation that is correct. It turns out that it is not that many. In the Chinese-English data
set, most sentences cannot be reproduced, are not reachable. It depends, of course, on the distortion limit,
so if the distortion limit is zero, which means that you have to do monotone translation, then you get only
very short sentences covered. If you increase it to six, which is what we use in our experiments, you have about 20%
maybe, and this ratio drops very quickly as the sentences get longer and longer. Chris?
>>: And the reason that you can't achieve sentences even with unlimited distortion is because you also
have a limit on your phrase length?
>> Liang Huang: That's right, that's right.
>>: You can also do this plot with --
>> Liang Huang: Phrase length limits. That's a very good point, yes. Because if your alignment is
wrong and has garbage-collection kind of behavior, then you have to extract huge phrase pairs, which is
disallowed by the phrase-length limit. And so for a lot of sentences, you actually cannot -- it's not reachable,
even if you have unlimited distortion. So this curve, we cannot afford to run it for any longer sentences,
because it's too slow, and that we would have to use, like you said, beam search, even for forced
decoding. For now, because our distortion limit is a constant, it's pretty fast so that we don't need to
bother. But it's a shame that for longer sentences, longer than 30 words, the vast majority of sentences are
not reachable. It's mostly on the short side. But we also argue that forced decoding has a byproduct,
that we can use it as a data selection kind of module, in the sense that we would prune away those non-literal translations, those translations that have part of the English side not mentioned on the Chinese or
part of the Chinese not mentioned on the English, or just the wrong translation, or just kind of noisy
translations -- they would just be gone by this kind of pruning, filtering. So those that remained, that survived the
test, are often those you can argue are easy to translate, easier to translate. But they are also more faithful,
in a sense. Yes, question.
>>: So how much they have survived?
>> Liang Huang: Yes, how much they have survived. So let's see, how much they have survived? If you
have a small data set, then full sentence reachability is only 20%. If a larger data set, it's about one-third.
>>: Is this after significant pruning? You mentioned you do significant pruning.
>> Liang Huang: Yes, it's after the significant pruning, I think. But the reason why on the small data
set it's much worse than on the larger data set is because the word alignments are trained on the small and large data, also. The
word alignment quality is much worse on the small data set, and when you get larger data, word
alignment is improved. But, anyways, it's only like one-third of the sentences are fully reachable, and
they are short sentences. And because they are relatively short sentences, the number of words is actually
much smaller: although it's one-third of the sentences, they only represent 13% of the words. So we added back
the prefixes, the prefix pairs like this. There are some prefix pairs for those unreachable sentences, but
partially reachable sentences. Then we can recover a lot more, so we can have about one-third of the
words used in the training. So, finally, we use this part of the training data, like only one-third of the
words in the training data. Yes?
>>: So can this filtering also severely distort the reordering model?
>> Liang Huang: Yes, exactly. That's a very good point. So they would most often just favor a short
distortion, more like monotone translation, because they cannot even do anything very long, right? So
that's the bigger point. So most of the translation we saw, it will be very easy to translate in the sense that
they are more like monotone. So for a perfect translation example like this one, which is a really, really
good translation pair, we just cannot afford it. That's a shame. But if we use Hiero or other syntax-based methods, it's a perfect example. It's a textbook example for Hiero of this kind of behavior, really a
textbook example. And I think maybe Michel's other work on phrase-based translation with syntactic
distortion or something like that, jumps, with the reduced-style jumps, can handle this sentence, but I'm
not sure. But I think it's better than just a distortion limit. The distortion limit is just too crude. Long-distance reorderings are so common between English and Chinese, but not between Spanish and English.
So if you look at this curve for Spanish-English, it's very different. It's very interesting. So they are not
sensitive to the distortion limit, so even if the distortion limit is zero, it's still very good, not too much
different from the distortion limit of six, because translation -- these two languages are really just almost
the same word order, except for the local swapping, and local swapping is handled by the phrases
themselves. So it's like you can have reorderings within phrases, but you don't really need long-distance
reorderings between large phrases. And at the 20-years-of-MT workshop at EMNLP, I think Peter
Brown said the reason why IBM's model succeeded was because French and English are basically the
same language. And that is true, I think, for Spanish and English. So, really, you don't see too much
interesting stuff going on, unlike Chinese, and that's why Chinese is a much more interesting language to
work on, and we also even tried it on Japanese. And could you guys guess how much the reachability is
for Japanese? It's so low. It's worse than 10% or 5%. We could not even use it, so we ended up not
reporting those results, but that's interesting. For Spanish, it's like this. Anyways, it's more than 50%
covered for Spanish. So this is how many derivations there are on average for each sentence, if they are
reachable, depending on the distortion limit, but either way it's exponential, so it's not like just a
few numbers, but actually they are huge numbers of derivations, of correct derivations. These are the
latent variables, but they are packed in the lattice, basically. So if you just use N best list, I guess it
doesn't work that well, because really you have just too many possible derivations. Okay, any other
questions before we move on to the learning part? So here's the central part of this talk, is how we can
not fix search errors, but accommodate search errors, because you cannot really improve search quality, under
my assumptions. You can only learn to live with search errors, to kind of compensate for the search errors or
reduce the bad effects of search errors. Okay, so let's look back at this picture. So that's how we do the
updates. The problem is the correct translation, the correct derivation, falls off the beam very early on, so
if you just do it kind of full sequence update, it doesn't work very well. That's well known. That's why
Percy Liang's work didn't work out well. That's the main reason, I think, and we have data to support
that. And now, the search errors could be like the gold derivations fall off the beam. Like, for example,
this is the gold derivation lattice, and you can imagine that somewhere here, the gold state, the correct
state like this state, falls off the beam. They didn't make it into the top four, because their model score is
not good enough. In a sense, the model has a problem here, that it should have had this guy survive in the
search. It should have scored this guy higher up, and another possibility is that this guy, this state, is
merged by an equivalent state, which on the signature is exactly the same but has a different derivation.
But this is the wrong derivation, and our correct derivation is being merged, so that's another case of
search error, and so on and so forth. So in a sense we should address the search errors in the middle of
the search by some other update method, by some prefix update method, rather than wait until the very
end. If you wait until the very end, you don't see the signal of where is the problem. Okay, so fixing
search error, we have two methods. One is a relatively old method called early update. It works, but
doesn't work that well. So the idea is very intuitive. So you have a beam, and let's forget about latent
variable for now. Let's assume there is only one correct translation. Let's say we are doing part-of-speech
tagging or just parsing, which there is a unique answer, unique derivation. Then, what if the correct
derivation falls off the beam at, say, step seven, now what do you do? You lost the positive signal, so
what previous people, Collins and Roark, said is that you stop and update right here and forget about the
rest of the sentence. Just update on the prefix. So that's called an early update, and why does early update work? So
actually, early update does work for incremental parsing and others, those kind of beam search tasks.
Most of the incremental parsing papers, following that paper, use early update. And they were found to
be much better than standard update, but why? Do you guys know why it works? Actually, nobody
knows why it works, and I proved why it works in one of my earlier papers. Two years ago, I found a
notion called violation. Then I can prove that early update guarantees that each update is a violation.
A violation basically means that the correct prefix scores lower than the incorrect prefix, which should not
happen, because in a perfect model, the correct one should score better than anything incorrect, so that's our separability
assumption, and anything violating that is a violation, and early update makes sure that each update is a
violation, whereas the standard update is not guaranteeing that, because it's very likely that, at the end of
the search, the correct derivation actually scores higher than anything in the beam, although in the middle
of the search it doesn't survive. But that means the model as a whole does prefer the correct one, so
in a sense, if you have exact search, the model would return the correct derivation. It just doesn't survive
the beam if your beam is too small. So who is to blame, search or model? Is it model error or search
error? In a sense, it's kind of the model is correct, if you have perfect search, but our notion of model
error, it's kind of dependent on search, so it's search-specific model error, because you have to live with
this particular search quality in both training and testing, so if you make a mistake, it's really still kind of
being misled by the model. So the model should kind of lead the search so that the correct guy doesn't
fall off and survives all the way to the end. So in a sense, it's still model's problem and you should fix the
model to guide the search. Although I know the search is really bad, it will stay as bad as it is, still, your
model should guide it toward something as good as you can go, so that's our intuition. So I proved that,
as long as each update is made on a violation, then you still can have convergence. We have the
same theorem, the same guarantee, the same generalization bounds as the perceptron, like bounds on the
number of updates and stuff like that. This is intuitive, because if the model points up, which is in this
case that here is the best one in a beam, like each step, and if you fall off a beam, it's because you would
score lower. And each update should point downwards, because you made a mistake and you should
have the negative feedback to pull you back to fix the problem. So early updates are correct, the full
update is wrong. It's not always wrong, but it could likely be wrong if this guy goes up. And our
statistics will show that this actually happens a lot of times. Nobody actually took the pains to really see
how many times you have this situation, but in ours, it was true that more than half the time you have this
situation. That's why if you just do a normal perceptron, standard perceptron, it just doesn't work out of
the box. You have to treat the search as a white box instead of as a black box. So that's my point.
Search and learning should be mingled together. Chris?
>>: I'm curious. At some point, are you going to describe how this relates to the Searn work?
>> Liang Huang: Yes, Searn and LaSO.
>>: Because it seems like the intuitions are really similar, although the details of the algorithm may be
substantially different.
>> Liang Huang: Yes, LaSO, I'll show it as a special instance of this framework. Searn is still -- I'm
not quite sure about Searn. LaSO, I'm pretty sure. LaSO is the precursor of Searn. So this is good,
but how does it extend to latent variables? That's our first question. Because MT has latent variables, so
you can't just use a unique correct derivation; you have many derivations. Now you can do something like, if
you have a lattice, imagine you have many, many correct derivations. I just drew two as an example. Then
somebody falls off the beam very early on, or somebody stayed in the beam, but at some step, at some
point, maybe at step 10, everybody falls off the beam. So at this point, you can be sure that there's no
way to recover a correct translation. You can be sure that it's already impossible to reach the reference.
At this point, just say stop and make an update, because that's where my hope drops and so that's the new
definition of early update, extended. You can still prove it's a violation guaranteed, and so on and so
forth, and stop decoding and forget the rest of sentence. Right, okay, so early update works okay. It
works okay. It's much better than the standard update. It just hasn't been applied to translation. It's been
applied to parsing, mostly, but it has a big problem that it learns very slowly. It converges much slower
than the standard update. It converges higher, but it converges much slower, and it is intuitive why it is
the case, because it only updates on a prefix, very small prefix. So you get the first word wrong, you stop.
You get the third action wrong, you stop in a sense. You just skip too many. You did not take advantage
of the rest of the sentence. So the updates are relatively short, and that's why you need more iterations to
learn more stuff. So I proposed in a previous work of mine another update method called max-violation,
which is also very intuitive, and the idea is to update at a place where the mistake is the maximum. And
the mistake is defined as the amount of violations. It's basically at step say, whatever, five or 10, the
difference between the best correct derivation and the best incorrect derivation in the beam, if that
distance is the maximum across all different steps, then that place is the maximum violation place to
update towards. It is intuitively the largest amount of violation or the worst mistake. If you have to fix
one mistake, fix the worst mistake or biggest mistake, so that's called max-violation. And it must be to
the right of early update, and usually very much to the right: roughly speaking, in practice, I found it
mostly like 70% of the sentence, and then gradually it would increase to 80%, 90%, but not 100%. So it
updates on a much longer prefix, and also because the amount of violation is bigger, you can show but
not prove -- you can show in the perceptron proof that mathematically that makes a lot of sense, that
convergence should be faster, and it is. In all of our experiments on parsing, tagging, all kinds of parsing,
we found that max-violation is always more than three times faster than early update, to reach the same
level of accuracy. And also, if you let it run for longer, it always converged a little bit higher than early
update, so it's better and faster. Mostly, it's because it's faster, more than three times faster than early
update. Because for translation, it's just way too slow to train on a training set. The data is too big. If
you use early update, it will cost you weeks, but if you use max-violation, you can do much faster. Okay.
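In code, once the decoder records at every step both the best correct prefix (from the forced-decoding lattice) and the best item in the regular beam, early update and max-violation differ only in how the update position is chosen. A hedged sketch follows; the three per-step arrays are assumed inputs produced by the decoder, not part of the actual system.

    def pick_update_position(gold_prefix_score, beam_best_score, gold_in_beam,
                             strategy="max_violation"):
        """Choose the prefix length at which to make the perceptron update.
        gold_prefix_score[i] : score of the best correct (forced-lattice) prefix at step i
        beam_best_score[i]   : score of the best item in the regular beam at step i
        gold_in_beam[i]      : whether some correct prefix is still inside the beam at step i
        Returns a step index, or None if no update is needed."""
        n = len(beam_best_score)
        if strategy == "early":
            # early update: stop at the first step where every correct prefix
            # has fallen off (or been merged out of) the beam
            for i in range(1, n):
                if not gold_in_beam[i]:
                    return i
            return None                     # gold survived the whole search
        # max-violation: the step where the best-in-beam outscores the best
        # correct prefix by the largest margin
        violations = [beam_best_score[i] - gold_prefix_score[i] for i in range(1, n)]
        if not violations:
            return None
        i_star = max(range(len(violations)), key=violations.__getitem__)
        return i_star + 1 if violations[i_star] > 0 else None

The actual update at the chosen step then rewards the best correct prefix of that length and penalizes the best prefix in the beam at that step, just as in the full-sequence sketch earlier.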
So, anyways, all of these instances are kind of special instances of my framework called violation-fixing
perceptron, which is a framework designed to handle search errors. And these guys, you can prove they
all converge in this framework, because they all point downwards. The updates are in the reverse
direction of the current model, so that product is negative, but the standard update is wrong, because it
points up, in a sense that they reinforce the error, but not fixing the error. So as long as your update
points down, it's going to converge. And LASSO, I can show it's a very simple special case in this
framework. So you can prove a lot of theorems in this framework. You can propose a lot of other update
methods, and I proposed many others, but it turns out that max-violation always works the best,
consistently, over all methods. It's also very easy to define. Actually, when you have latent variables, it's
actually a lot easier to define max-violation. Early update, it's actually harder to implement. Okay,
anyways, that's the comparison between nonlatent variable and latent variable. Okay, but it's the same
idea, just extended. So here's the roadmap of techniques. It started off on structured perceptron. Then,
on one side people extended to handle latent variables. Then on the other side, people extend it to handle
inexact search. Part of that is my work, and then we just combine it in this work, this latent variable
perceptron with inexact search. That's what we do. And we tried it for phrase-based translation, and we
are trying it for many other tasks, like parsing, semantics, transliteration, all kinds of stuff. And I argue
that it can largely replace part of EM, because part of EM is still dealing with partially observed data, like
weakly supervised. You have input and output but not a derivation. And then this framework can largely
replace that kind of application of EM. It cannot replace all applications of EM, when even the output is
hidden. But if the output is known but just the derivation is hidden, then this framework has a lot of
application, because you can define all kinds of features and EM cannot.
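The condition behind this framework is simple to state in code: an update is only made on a violation, that is, when the current model scores the penalized incorrect prefix at least as high as the rewarded correct prefix, so the update direction "points down" against the current weights. A tiny sketch, with feature vectors as plain dicts (names are illustrative):

    def dot(w, feats):
        # sparse dot product between a weight dict and a feature-count dict
        return sum(w.get(f, 0.0) * v for f, v in feats.items())

    def is_valid_update(w, gold_prefix_feats, wrong_prefix_feats):
        """Violation-fixing condition: the incorrect prefix must score at
        least as high as the correct one under the current model, so the
        update direction has a non-positive dot product with the weights."""
        return dot(w, wrong_prefix_feats) >= dot(w, gold_prefix_feats)

Counting how often a standard full-sequence update fails this test is exactly the invalid-update statistic shown later in the talk (about 60% of updates with a beam of 30).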
Okay, so let's get back to the experiments. The features, we have tons of rich features, but that's kind of a
relatively boring part. We have dense features, we have rule ID features, we have -- the most important
thing is the WordEdges features. That's basically kind of a lexicalized translation window, so
let's say we are translating this rule, R2, right now, and we have already covered R1, Bush. So the first
and last word of the Chinese side of the rule, the
first and last word on the English side, and the boundaries, outside the boundary, like the neighboring
words on the Chinese side -- that is static information. You can use as much as you want, and you use all
kinds of combinations of this information, and all kinds of combos. Nonlocal features are more interesting.
You can have rule bigrams that captures the interplay between rules, and it will be more important if you
have minimum translation rules, like what you guys did -- Chris, you did, those minimum translation
units. And also the current rule and last two words generated and all kinds of combinations of these. So
actually, we used only a very small amount of nonlocal features. It's less than like 0.3% of our feature
instances, but it helps a lot. It helps almost one BLEU point. The majority of our features is
WordEdges, the static WordEdges, the local WordEdges features; 99% of our features are
WordEdges features. They help a lot, but nonlocal features also help a lot, even though they're a very tiny
fraction of the features. Okay, our experiments. We have three data sets, Chinese-English, small, large
and Spanish-English large, but they are not really large. They should be called medium, because they're
[indiscernible] kind of scale. We cannot afford to train even bigger data sets, because it just takes us too
long, because this is one of the first works to train really on the training set, and these are the reachability
curves I already showed. On Chinese, it's not that great. On Spanish, it's a lot. Even for sentence level,
it's more than half. That's why we don't even bother to add prefixes, because it's good enough. So even
though it's not that good reachability ratio, it's still a lot bigger than the dev set, so for small data, it's 10
times bigger. For the data set that we care about, it's more than 100 times bigger. For Spanish, three
times bigger. So yes, so we will report results on these three data sets. Sure, sure.
>>: In the first case, your 30K is the original training data, and then once what's reachable is one-tenth of
that, so that's only about 6K sentences, then?
>> Liang Huang: You mean this number?
>>: Yes, after 20%, so you're only left with -- oh, but you add the prefix back.
>> Liang Huang: Yes, and the prefix back. Otherwise, it's too small. Yes. Okay, so here is probably the
most interesting results slide. So we compared different update methods, different variations of
perceptron, standard perceptron, max-violation perceptron, early update, and there's another one called
local update, which I'll explain. But, first of all, standard perceptron is really bad. It actually goes down,
and it's much worse than MERT, which is -- Percy Liang reported roughly the same thing. Then Percy
Liang had another variant called local update, which is -- the local update is more like Percy Liang's local update,
and the standard update is more like Percy Liang's bold update. So local update is like updating towards the best
translation in the N-best list at the final stage, final step. So in the final bin, there is some translation that
scores lower than the 1-best in the current model, but according to sentence-level BLEU, it's actually
closer to the reference, so it updates towards that. So you can show that it still converges, because the
update still points down in the framework, but the amount of violation is too small, so that it doesn't
work that well, but it's actually much better than the standard update, which is consistent with what Percy
Liang found. This local update should be better than standard update, but it's not too much. It learns too
slowly. Our interesting curves are max-violation and early update. So max-violation is really fast and
good, and this part is after it starts overfitting on the dev. This BLEU curve is on the held-out dev set. So
early update, if you let it run for much longer time, maybe after three to five times more, it will reach
somewhere close to but not as high as max-violation's highest point, in our experience, but we just could
not afford to run it for any longer, because it's just way too slow, so we have to stop here. But either way,
max-violation is a lot better, about two points better than MERT baseline here.
>>: So I'm curious on the huge difference in the starting point on the left. Is that after one iteration?
>> Liang Huang: Yes, this is first iteration's data. First iteration. It's possible that standard update peaks
at something like half of an iteration. If I just took half of the data, it actually already is overfitting and
it drops, so it's possible there is some peak somewhere in the half, less than one. Yes. But this kind of
behavior is well justified in parsing, as well. We see this kind of behavior in parsing, as well. It's kind of
shocking. If you first look at it, first of all, why is the perceptron update so much worse just on the first iteration,
and secondly, why does it drop down? It's because of this. You have tons of search errors, and most of the
updates are invalid, so this curve says how many updates are indeed invalid updates, invalid updates
meaning that you go up and you're reinforcing the error. How many times in standard perceptron do you
make those bad updates and you're not even aware of that? Well, actually, if your beam is one, which is
greedy search, then most of the time. If your beam is 30, which is basically these experiments, you still
have about 60% of updates being wrong, and if people just blindly tried perceptron, they are not even
aware that most of the updates are wrong, and they're reinforcing the error. You would rather just
skip them; even that is better, but people never look at the statistics, how many updates are actually
wrong. I showed you that it's more than half. And the beam size doesn't help that much, as I said. If you
doubled the beam size, it's just going to be a little bit lower, just because the search space is exponential.
Really, the full search space is exponential, and if you just increase the beam size, it doesn't help that
much. It doesn't help that much. So there is no way to fix search errors that much, to improve search
quality that much. This is all due to search errors, right? Otherwise, there was no such behavior of
invalid updates. If there is exact search, there is 0%, whereas in tagging you can see this curve goes down
to zero very quickly. With beam size five, you see no search errors in tagging, because the search is so
simple. In parsing, it's almost like this. It's slightly better than this. In translation, the search is really
hard. That's the whole point, the take-home message. Okay, and then we have to scale it up, so we use a
parallelized perceptron, which is another paper of mine, by my student Kai, who has an offer from
Carnegie to do the internship this summer. So it's much better than, much faster than Ryan McDonald's
in terms of parameter mixing. McDonald's work doesn't have much speedup, actually, maybe two times,
three times, but we have like seven times speedup, sevenfold speedup, if you use 24 CPUs on one
machine. And if you just use six CPUs, you have about four times speedup. That's a lot. That makes our
work finally kind of tractable on a large data set. Otherwise, it's not even runnable. Okay, so then we
compare the feature contributions.
>>: How long?
>> Liang Huang: How long? The final one, on a full data set, it took about 30 hours, using 24 CPUs.
>>: And that's on the biggest data set, like the 150?
>> Liang Huang: Yes, it's like FBIS scale. It's not that big. Otherwise -- right now, we have more
machines, so we can probably run more experiments, but at that time we didn't have a very big machine.
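For reference, the parameter-mixing baseline that the parallelized perceptron above is compared against (McDonald-style iterative parameter mixing) can be sketched as below. The speaker's own parallelization is different and much faster, so treat this only as an illustration of the general idea; train_epoch_on is an assumed hook, and in practice each call would run in its own process.

    from collections import Counter

    def iterative_parameter_mixing(shards, train_epoch_on, num_epochs):
        """Sketch of iterative parameter mixing: each data shard runs one
        perceptron epoch starting from the shared weights, and the resulting
        per-shard weight vectors are averaged into the new shared weights.
        train_epoch_on(shard, w) -> weight Counter after one pass over shard."""
        w = Counter()
        for _ in range(num_epochs):
            per_shard = [train_epoch_on(shard, Counter(w)) for shard in shards]
            mixed = Counter()
            for w_k in per_shard:               # uniform mixing of shard weights
                for f, v in w_k.items():
                    mixed[f] += v / len(per_shard)
            w = mixed
        return w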
So dense features, only 11 features, you're about two points worse than MERT. Now, you add rule IDs,
improve one BLEU point with just rule ID features. The most improvement comes from WordEdges,
which most of the features actually are in WordEdges. You got more than two points improvement, then
you finally beat MERT, and then the final icing on the cake is nonlocal features, only 0.3%, but that gives
you almost one point.
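To give a flavor of what such lexicalized templates can look like, here is an illustrative sketch of WordEdges-style features for one rule application, plus one nonlocal template. The template names, field layout, and the rule object are invented for illustration and are not the actual feature set used in the experiments.

    def word_edge_features(rule, src_words, src_start, src_end, prev_eng):
        """Sketch of lexicalized WordEdges-style features for one rule
        application, conjoining boundary words of the source span, its
        English side, and the neighboring source words.
        rule      : object with .eng (list of English words) and .rule_id
        src_words : source sentence as a list of words
        src_start, src_end : source span [src_start, src_end) the rule covers
        prev_eng  : last two English words already generated (padded)"""
        c_first, c_last = src_words[src_start], src_words[src_end - 1]
        c_left = src_words[src_start - 1] if src_start > 0 else "<s>"
        c_right = src_words[src_end] if src_end < len(src_words) else "</s>"
        e_first, e_last = rule.eng[0], rule.eng[-1]
        return {
            # local, "static" templates: boundary words of the rule and its context
            "cfirst=%s|efirst=%s" % (c_first, e_first): 1,
            "clast=%s|elast=%s" % (c_last, e_last): 1,
            "cleft=%s|efirst=%s" % (c_left, e_first): 1,
            "cright=%s|elast=%s" % (c_right, e_last): 1,
            # a nonlocal template: rule identity conjoined with the last two
            # English words generated so far (rule-bigram flavored)
            "rule=%s|lm2=%s_%s" % (rule.rule_id, prev_eng[0], prev_eng[1]): 1,
        }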
>>: So why is it that you can't actually reach more with the dense features?
>> Liang Huang: Yes, there are two reasons. That was the same question? Great. So Michael asked me
the same question, actually, before this talk, and I think there are at least two reasons. First of all,
perceptron is not designed to handle those features with different scales. So often people use AdaGrad or
something like second-order perceptron, which is better at handling features with different scales.
Perceptron is well known to be best at sparse features, just the minimum sparse features. We didn't do
anything special here. We just used perceptron. We didn't even bother using AdaGrad. Actually, after
that paper, some of my students used AdaGrad. Sometimes, it helps a little bit, sometimes not. It's not
that consistent to us, but it's often helping a little bit, especially if you have a mix of dense features and
some sparse features. But it's not going to help that much. And the second reason is that we trained on a
training set. This is reporting the dev set BLEU, but by training on the training set, so it's not that
comparable to MERT, which is trained on the dev set.
>>: Don't you think that there's potentially some difference because you've introduced a sort of loss?
You're trying to optimize this single derivation.
>> Liang Huang: Zero, one loss.
>>: As often as possible.
>> Liang Huang: Not single derivation, but single reference.
>>: That's right, that's right. Any derivation in the class that leads to that reference.
>> Liang Huang: Yes, but zero, one loss on that --
>>: But you might be better off getting this one word wrong so that you can get everything right on a
subsequent sentence.
>> Liang Huang: That's right. That's exactly right.
>>: You've lost this original BLEU by picking perceptron instead. Do you think there's some difference
because of that loss?
>> Liang Huang: Totally, totally. For simplicity, we just do zero, one loss, because you have to get
exactly -- you cannot even be one word wrong. But otherwise, you have to have some sentence-level
BLEU, and we don't have anything like that. We're just very clean. Right now, my postdoc is trying to
work on something closer to that direction that is trying to say what if the sentence is not reachable, first
of all? Because here, you assume every sentence is reachable, so you are restricted to the small subset,
and what if a lot of them are not reachable? Then you just try to be as close as possible to the reference,
but you don't need it to be exactly correct.
>>: Going back to [Chan's] question at the beginning, by picking sentences that are reachable, have you
selected a biased subset?
>> Liang Huang: I think we do.
>>: It would be interesting to see what your BLEU score is on the reachable dev sentences versus the full
set, because maybe you're doing really, really well on those, but you're having issues with the
ones that are less --
>> Liang Huang: Yes. My hypothesis is that for short sentences you actually got even more
improvement, but for long sentences it doesn't have much, because it doesn't have much signal from
longer sentences. We're highly biased towards easy ones, but that's unfortunate. Another student of mine
is trying it on Hiero, where the reachability is very high. Because with Hiero, you can do all kinds of long-distance reorderings and reachability can be more than 80% of the training data. Okay, anyways, so these
are the individual contributions of -- or cumulative contributions of features, and these are comparing
with MERT and PRO, so this is our max-violation, and MERT is not very stable. It jumps. And PRO is
more stable and it gets a little bit better with medium-scale features. And the final result on big data, it
took 47 hours? Wait, I think 47 hours is like 15 iterations, but the peak arrives around 35 hours, I
think. But, anyways, it finished within two days, so it's not that bad, but we cannot afford to run it for a
week, I think, 23 million features. So if you just have MERT -- so we have two systems. Cubit is very
similar to Moses, my own system in Python. With 11 features, we got roughly the same on dev and test.
With PRO, we got slightly better. PRO is slightly better, but with PRO, on more features, like medium-scale features, we got a lot better. But those 3,000 features are very hard to engineer, so you have to be
very careful not to be too specific, not to be too sparse, and I think to hand-engineer this feature set is
extremely hard and not a general approach, but we don't engineer features at all. We just throw in all
kinds of features. We don't even do any feature selection or whatever. But PRO doesn't do well with a
larger amount of features. That's kind of well known.
>>: So why not run PRO on the entire training set?
>> Liang Huang: That's too slow. Nobody has actually reported that, running PRO.
>>: They have, but it took quite a lot of time.
>> Liang Huang: Okay, does it work well?
>>: There wasn't a lot of improvement, less than one BLEU point, as I recall. One of Stefan Riezler's
students and Chris [indiscernible].
>> Liang Huang: Okay, so they did do something like online style of PRO, the perceptron style of PRO.
It's not real PRO, but perceptron-ized PRO or something like that. I think I know the paper that you
mentioned.
>>: But again, it wasn't my intent [indiscernible].
>> Liang Huang: Right. But this is all trained on dev. Dev, it would have quickly overfit. You would
imagine that. And max-violation on the training set, 23 million features, you got 2.3 improvement on dev
and two points improvement on test over MERT. That's considered a lot. Okay, then sorry, questions? If
you move on to Spanish, Spanish, you only have one reference in the standard data sets, but our
improvements are, roughly speaking, one point, 1.3 and 1.1, and as a kind of a common wisdom in our
field, if you have one BLEU point improvement in single-reference BLEU, it's roughly speaking equivalent
to two points of improvement in four-reference BLEU. It's not exact, but it's roughly speaking. So our
results are consistent with the Chinese improvement of two points of BLEU, partly because the
reachability ratio is a lot higher for Spanish, so we can use a larger percentage of the data for Spanish.
Okay, so to conclude, I presented a very simple, very clean method. It doesn't use any sentence-level
BLEU, like hope or fear or loss-augmented decoding, anything like that. There's no loss. There's just
zero, one loss, as Chris said. Very simple, and scaled to a large portion of the training set and able to
incorporate millions of features, no need to define anything else, no learning rate or parameters to tune.
It's just the perceptron with a constant learning rate of one. And no initialization parameters. It's always
zero to start with, and a big improvement in BLEU over MERT and PRO. And the three most
important ingredients that made it work are, first of all, most importantly, learning to accommodate search errors, which the violation-fixing perceptron is designed to do, and max-violation works the best. And then to
handle the latent variable, I used the forced-decoding lattice, which previous work argued is not a good idea, because it's too rigid, and some forced-decoding derivations just get lucky using bad rules: when you're lucky, you produce the exact reference translation, but with the wrong rules. So he argued that his approach didn't work out because it can sometimes reach the reference translation with bad rules. But personally, I think that argument is hard to make, because you cannot really tell which derivations are right and which are wrong. As long as they produce the reference, I think they are okay, so you can only use the model to choose which one to update towards. And I argue the reason why his approach didn't work out is all
because of search errors, so that's it. These two are his curves. The most important thing is, if you just
use the standard perceptron, then because it doesn't deal with search errors, you just get very bad performance, because most of your updates are invalid without you even noticing. Okay, that's the most important take-home message. And our learning framework works best when your search makes a lot of search errors; if your search is mostly correct, like in tagging, you don't need to bother with our method, you can just use the standard perceptron. But if your search is hard and makes tons of search errors, then you have to use our method. Otherwise, the results are just too bad. Okay, then
also, we have parallelized the perceptron to scale it up to big data. And the roadmap again: latent-variable perceptron with inexact search, which we hope becomes a very general technique, largely replacing EM. Any questions?
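As a rough companion to the summary above, here is a minimal sketch of one max-violation update with a latent forced-decoding derivation, under the assumptions stated in the docstring; the helper functions are invented placeholders, not the actual system's API.

```python
from collections import defaultdict

def score(feature_dict, weights):
    """Dot product between a sparse feature dict and the weight vector."""
    return sum(weights[f] * v for f, v in feature_dict.items())

def max_violation_update(weights, sentence, reference):
    """One max-violation update for a single sentence pair (a sketch).

    Assumed helpers (invented for illustration):
      beam_best_prefix(sent, i, w)        -- highest-scoring partial derivation
                                             at step i in the regular beam search
      forced_best_prefix(sent, ref, i, w) -- highest-scoring partial derivation
                                             at step i in the forced-decoding lattice
      feats(derivation)                   -- sparse feature dict of a partial derivation
    """
    num_steps = len(sentence)          # assume one step per source word covered
    worst = None                       # (violation, y_plus, y_minus)
    for i in range(1, num_steps + 1):
        y_minus = beam_best_prefix(sentence, i, weights)
        y_plus = forced_best_prefix(sentence, reference, i, weights)
        violation = score(feats(y_minus), weights) - score(feats(y_plus), weights)
        if worst is None or violation > worst[0]:
            worst = (violation, y_plus, y_minus)
    if worst is not None and worst[0] > 0:     # update only where a real violation exists
        _, y_plus, y_minus = worst
        for f, v in feats(y_plus).items():
            weights[f] += v                    # learning rate fixed at 1
        for f, v in feats(y_minus).items():
            weights[f] -= v

weights = defaultdict(float)   # zero initialization, as in the talk
```

In practice the decoder would produce all beam and forced-decoding prefixes in a single pass rather than re-decoding per step as this sketch does, and the parallelization mentioned above would presumably average weights across shards of the training data rather than running this loop serially.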
>>: So I thought I mentioned this earlier, but your last slide maybe had it [indiscernible]. Within the lattice, within the forced decoding lattice, you have a lot of different derivations.
>> Liang Huang: That's right.
>>: Which one are you updating towards?
>> Liang Huang: If you're updating at step five, you choose the highest-scoring one at step five. If
you update at step three, you update towards the highest-scoring one at step three. So it could be different
ones. So this one is different from this one.
>>: So towards the best derivation?
>> Liang Huang: Up to that point. Up to that point.
>>: The highest-scoring derivation up to that point.
>> Liang Huang: The highest-scoring derivation, prefix derivation up to that point.
>>: Okay, and so you're not worried about this bad phrase thing?
>> Liang Huang: Yes, I just don't bother, because I cannot tell. I cannot just say, hey, this part of the
derivation, you don't use that. There's no way that you can select.
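In code, the point just made might look like the following tiny sketch, reusing score() and feats() from the earlier sketch; the lattice accessor is an invented placeholder. The positive update target at a step is simply whichever reference-producing prefix the current model scores highest, with no filtering of "bad" rules.

```python
def update_target(forced_lattice_prefixes, step, weights):
    """Pick the derivation to update towards at a given step (a sketch).

    forced_lattice_prefixes(step) is assumed to yield every partial derivation
    of length `step` in the forced-decoding lattice. All of them produce a
    prefix of the reference, so the current model simply picks its favorite;
    no attempt is made to filter out derivations that happen to use bad rules.
    """
    return max(forced_lattice_prefixes(step),
               key=lambda d: score(feats(d), weights))
```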
>>: So my other question has to do with -- so you made a strong point about search error and --
>> Liang Huang: Yes.
>>: So then I'm a little surprised about the fact that it's doing just as well in Spanish as compared to
Chinese, because some of the other data you showed about completion statistics, etc., would seem to
indicate that search error is much less of a problem in Spanish.
>> Liang Huang: No, Spanish has the same thing.
>>: Well, if it's mostly monotone, then if you set a reasonable, medium-sized distortion limit of four or five, your search space comes down very drastically, without much impact on reachability at all. So wouldn't that imply that the search space is much smaller? The useful --
>> Liang Huang: It is useful. The interesting search space is much smaller. I think that's correct.
>>: So search error is not as much of an issue.
>> Liang Huang: Maybe, maybe. We didn't do an analysis on Spanish. We didn't draw these curves on Spanish, actually, but I think we tried the standard perceptron on it, and it just doesn't work well, either, but maybe the difference is not that big and doesn't go down that much. But you are probably right that the interesting part of the search space is smaller on the Spanish side. So if your language is mostly -- if the language is totally monotone, then the complexity is actually linear time.
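A back-of-the-envelope way to see this, with invented numbers and a deliberately crude state count, is to compare how many coverage vectors a phrase-based decoder can reach with and without a distortion limit:

```python
def coverage_states(n, d=None):
    """Rough count of phrase-based coverage states (ignoring language-model state).

    With no distortion limit, any subset of the n source words can be covered,
    so there are 2**n coverage vectors. With a limit d, every position more than
    d words behind the frontier must already be covered, so coverage is roughly
    a fully covered prefix plus an arbitrary pattern over the next d positions:
    about n * 2**d states, i.e. linear in n. Only a crude illustration of why a
    small limit, or fully monotone decoding, shrinks the interesting search space.
    """
    if d is None:
        return 2 ** n
    return sum(min(2 ** d, 2 ** (n - i)) for i in range(n + 1))

# For a 30-word sentence:
#   coverage_states(30)      -> 1,073,741,824
#   coverage_states(30, 5)   ->           863
```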
>>: And I think people who have previously asserted that search error is not a big deal have been
working with things like English and French and things like that.
>> Liang Huang: Yes, that's right. Probably the case.
>>: So I have one more question about the data bias issue. So, for anybody, have you tried using MERT, but only on a data set that contains only the reachable sentences?
>> Liang Huang: Yes, that's a good point. That's a good point. So MERT is usually trained on a dev set that has four references, because on the test sets you also have four references. I don't know if people have trained MERT on a training set or part of the training set. I don't know. I think one of my students may have tried training MERT on the reachable subset, just to do a fair comparison. But it doesn't -- because the domains of dev and test are usually very similar, but on a training set, the domain is often pretty far away from dev and test. We do have a disadvantage by training on the training set. We don't use the dev set at all, except for telling us when to stop, just for preventing [indiscernible]. Most people use the dev set in a much more interesting way, but I think from a machine learning point of view, the dev set is supposed to just be held out. You shouldn't tune your parameters on the dev set. For some reason, most MT research has been training on the dev set, for scalability reasons, I think.
>>: This other potential baseline, in some systems, we actually ship a discriminative model that's trained
on millions of features, but it's trained in a less-sophisticated way, right? Like, what we do is we look at
each component of a rule and we try to optimize the likelihood of the correct translation from the training
data, build a large-scale discriminative model just by optimizing log likelihood, and then throw that in as an
additional feature?
>> Liang Huang: A lot of people tried that, yes.
>>: I mean, it helps, right?
>> Liang Huang: Yes, it does help.
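A sketch of the kind of baseline described in the question above, with made-up feature templates and scikit-learn standing in for whatever large-scale trainer such systems actually use: a lexical model trained by (regularized) log likelihood whose summed log probability becomes one extra dense feature in the decoder.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train_lexical_model(examples):
    """examples: list of (context_feature_dict, correct_target_word) pairs
    harvested from word-aligned training data (format invented for illustration)."""
    vec = DictVectorizer()
    X = vec.fit_transform([feats for feats, _ in examples])
    y = [target for _, target in examples]
    model = LogisticRegression(max_iter=1000)   # maximizes the penalized log likelihood
    model.fit(X, y)
    return vec, model

def lexical_model_feature(vec, model, candidate_alignments):
    """Sum of log P(target word | source context) over a candidate translation's
    aligned words; this single number is what gets added as one dense feature."""
    total = 0.0
    classes = list(model.classes_)
    for context_feats, tgt_word in candidate_alignments:
        if tgt_word not in classes:             # unseen target word: skip (or back off)
            continue
        log_probs = model.predict_log_proba(vec.transform([context_feats]))[0]
        total += log_probs[classes.index(tgt_word)]
    return total
```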
>>: And of course it's much less satisfying, but in Spanish, do you think most of your gain is coming
from just learning these simple lexical features?
>> Liang Huang: It might be. I should do more analysis, like drawing these curves for Spanish, but I guess the shapes will be similar; just the differences will be kind of shrunk down for Spanish, I guess, but I don't know. But my guess is that if you just use the standard perceptron, it's not going to give that much improvement, for sure. Yes, it will be interesting to draw this curve, especially the number of invalid updates on Spanish. It might be very different. It's a very good question. I should look into that.
>>: You said you tried it on Japanese?
>> Liang Huang: The reachability ratio is too low, like less than 10%. I just cannot use much data.
That's unfortunate, because the distortion is just huge in Japanese. Okay, thank you very much.
>> Michael Auli: Let's just thank our speaker.
>> Liang Huang: Thank you.